opennlp.tools.tokenize.TokenizerME

All Implemented Interfaces: opennlp.tools.ml.Probabilistic, opennlp.tools.tokenize.Tokenizer

public class TokenizerME extends Object implements opennlp.tools.ml.Probabilistic

A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.

This implementation needs a statistical model to tokenize text; the model reproduces the tokenization observed in the training data used to create it. The TokenizerModel class encapsulates that model and provides methods to create it from its binary representation.

A tokenizer instance is not thread-safe. Each thread must instantiate its own tokenizer; all of them can share a single TokenizerModel instance to save memory.
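The threading rule above can be sketched as follows. This is only an illustration, not part of the OpenNLP API: the class name PerThreadTokenizer and the load helper are hypothetical, and the model file path is whatever the caller supplies. One TokenizerModel is loaded once, and each thread lazily gets its own TokenizerME through a ThreadLocal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Hypothetical helper class, shown only to illustrate the threading rule.
public class PerThreadTokenizer {

    // One TokenizerME per thread, created on first use in that thread.
    private final ThreadLocal<Tokenizer> tokenizer;

    public PerThreadTokenizer(TokenizerModel model) {
        // The single model instance is shared by every per-thread tokenizer.
        this.tokenizer = ThreadLocal.withInitial(() -> new TokenizerME(model));
    }

    // Safe to call from any thread: each thread uses its own tokenizer.
    public String[] tokenize(String text) {
        return tokenizer.get().tokenize(text);
    }

    // Loads the binary model once; the path is supplied by the caller.
    public static PerThreadTokenizer load(Path modelFile) throws IOException {
        try (InputStream in = Files.newInputStream(modelFile)) {
            return new PerThreadTokenizer(new TokenizerModel(in));
        }
    }
}
```

With a fixed thread pool, this creates at most one tokenizer per worker thread while the model is held in memory only once.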

To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method can be used.
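A minimal training sketch follows, under stated assumptions. The class name TrainTokenizerSketch, the "<SPLIT>" separator string, and the toy sentences are illustrative choices; the sentences are repeated only so that their features occur often enough to survive the trainer's frequency cutoff. Real training data would normally be read from a file and be far larger.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

// Hypothetical example class; not part of OpenNLP.
public class TrainTokenizerSketch {

    public static TokenizerModel trainToyModel() throws IOException {
        // "<SPLIT>" marks token boundaries that are not indicated by whitespace.
        List<TokenSample> samples = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            samples.add(TokenSample.parse("Hello<SPLIT>, world<SPLIT>!", "<SPLIT>"));
            samples.add(TokenSample.parse("It works<SPLIT>.", "<SPLIT>"));
        }

        // Language code "en", no abbreviation dictionary, no alphanumeric
        // optimization, default alphanumeric pattern.
        TokenizerFactory factory = new TokenizerFactory("en", null, false, null);
        TrainingParameters params = TrainingParameters.defaultParams();

        try (ObjectStream<TokenSample> stream =
                 ObjectStreamUtils.createObjectStream(samples)) {
            return TokenizerME.train(stream, factory, params);
        }
    }
}
```

The returned TokenizerModel can be passed directly to the TokenizerME(TokenizerModel) constructor.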

Sample usage:

    InputStream modelIn;
    ...
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");
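When character offsets or confidence values are needed, tokenizePos(String) and probs() can be combined. The sketch below assumes an already loaded TokenizerModel; the class name SpanProbabilitySketch and the output format are illustrative only.

```java
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

// Hypothetical example class; not part of OpenNLP.
public class SpanProbabilitySketch {

    // Returns one line per token: covered text, character offsets, probability.
    public static String[] describe(TokenizerModel model, String text) {
        TokenizerME tokenizer = new TokenizerME(model);

        Span[] spans = tokenizer.tokenizePos(text);
        // probs() refers to the tokenizePos call made just above.
        double[] probs = tokenizer.probs();

        String[] lines = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            lines[i] = String.format("%s [%d,%d) p=%.3f",
                spans[i].getCoveredText(text),
                spans[i].getStart(),
                spans[i].getEnd(),
                probs[i]);
        }
        return lines;
    }
}
```

Because probs() reflects the most recent tokenizePos(String) call, read it before tokenizing the next string.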

Field Summary

Fields

Modifier and Type

Field

Description

static final String

NO_SPLIT

Constant indicating no token split.

static final String

SPLIT

Constant indicating a token split.
Constructor Summary

Constructors

Constructor

Description

TokenizerME(String language)

Initializes a TokenizerME by downloading a default model.

TokenizerME(TokenizerModel model)

Instantiates a TokenizerME with an existing TokenizerModel.

TokenizerME(TokenizerModel model, Dictionary abbDict)

Instantiates a TokenizerME with an existing TokenizerModel and an abbreviation Dictionary.
Method Summary

Modifier and Type

Method

Description

double[]

getTokenProbabilities()

Deprecated, for removal: This API element is subject to removal in a future version.
Use probs() instead.

double[]

probs()

Returns the probabilities for the token sequence determined by the most recent call to tokenizePos(String).

void

setKeepNewLines(boolean arg0)

String[]

tokenize(String arg0)

opennlp.tools.util.Span[]

tokenizePos(String d)

Tokenizes the string.

static TokenizerModel

train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams)

Trains a model for the TokenizerME.

boolean

useAlphaNumericOptimization()

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- SPLIT
  public static final String SPLIT
  
  Constant indicating a token split.
  
  See Also:
  
  Constant Field Values
- NO_SPLIT
  public static final String NO_SPLIT
  
  Constant indicating no token split.
  
  See Also:
  
  Constant Field Values
Constructor Details
- TokenizerME
  
  public TokenizerME(String language) throws IOException
  
  Initializes a TokenizerME by downloading a default model.
  
  Parameters:
  
  language - The language of the tokenizer.
  
  Throws:
  
  IOException - Thrown if the model cannot be downloaded or saved.
- TokenizerME
  
  public TokenizerME(TokenizerModel model)
  
  Instantiates a TokenizerME with an existing TokenizerModel.
  
  Parameters:
  
  model - The TokenizerModel to be used.
- TokenizerME
  
  public TokenizerME(TokenizerModel model, Dictionary abbDict)
  
  Instantiates a TokenizerME with an existing TokenizerModel and an abbreviation Dictionary.
  
  Parameters:
  
  model - The TokenizerModel to be used.
  
  abbDict - The Dictionary to be used. It must fit the language of the model.
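The two-argument constructor can be exercised as sketched below. This assumes, based on the parameter name abbDict, that the Dictionary lists abbreviations of the model's language; the class name AbbreviationTokenizerSketch and the create helper are hypothetical.

```java
import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.StringList;

// Hypothetical example class; not part of OpenNLP.
public class AbbreviationTokenizerSketch {

    // Builds a tokenizer from a model plus a list of abbreviations.
    public static Tokenizer create(TokenizerModel model, String... abbreviations) {
        Dictionary abbDict = new Dictionary();
        for (String abbreviation : abbreviations) {
            // Each dictionary entry is a StringList holding one abbreviation.
            abbDict.put(new StringList(abbreviation));
        }
        return new TokenizerME(model, abbDict);
    }
}
```

A call such as create(model, "Dr.", "e.g.") would then supply English abbreviations; the entries should match the language the model was trained on.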
Method Details
- probs
  
  public double[] probs()
  
  Returns the probabilities for the token sequence determined by the most recent call to tokenizePos(String).
  
  Specified by:
  
  probs in interface opennlp.tools.ml.Probabilistic
  
  Returns:
  
  An array with one probability for each token produced by the most recent call to tokenizePos(String). If not applicable, an empty array is returned.
- getTokenProbabilities
  
  @Deprecated(forRemoval=true, since="2.5.5") public double[] getTokenProbabilities()
  
  Deprecated, for removal: This API element is subject to removal in a future version.
  Use probs() instead.
  
  Returns:
  
  The probabilities associated with the most recent call to tokenizePos(String). If not applicable, an empty array is returned.
- tokenizePos
  
  public opennlp.tools.util.Span[] tokenizePos(String d)
  
  Tokenizes the string.
  
  Specified by:
  
  tokenizePos in interface opennlp.tools.tokenize.Tokenizer
  
  Parameters:
  
  d - The string to be tokenized.
  
  Returns:
  
  A Span array containing individual tokens as elements.
- train
  
  public static TokenizerModel train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams) throws IOException
  
  Trains a model for the TokenizerME.
  
  Parameters:
  
  samples - The samples used for the training.
  
  factory - A TokenizerFactory to get resources from.
  
  mlParams - The machine learning train parameters.
  
  Returns:
  
  A trained TokenizerModel.
  
  Throws:
  
  IOException - Thrown if an IO operation on the temporary file created during training fails, or if reading from the ObjectStream fails.
- useAlphaNumericOptimization
  
  public boolean useAlphaNumericOptimization()
  
  Returns:
  
  true if the tokenizer uses alphanumeric optimization, false otherwise.
- tokenize
  
  public String[] tokenize(String arg0)
  
  Specified by:
  
  tokenize in interface opennlp.tools.tokenize.Tokenizer
- setKeepNewLines
  
  public void setKeepNewLines(boolean arg0)
