Class TokenizerME

java.lang.Object
opennlp.tools.tokenize.TokenizerME
All Implemented Interfaces:
Tokenizer

public class TokenizerME extends Object
A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.

This implementation needs a statistical model to tokenize text. The tokenizer reproduces the tokenization observed in the training data used to create the model. The TokenizerModel class encapsulates that model and provides methods to create it from the binary representation.

A tokenizer instance is not thread-safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.
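The one-tokenizer-per-thread rule can be sketched with a ThreadLocal. The example below is a minimal, library-free illustration and not OpenNLP code: SharedModel and StatefulTokenizer are hypothetical stand-ins that play the roles of TokenizerModel and TokenizerME, and the whitespace split is only a placeholder for the real tokenization logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadTokenizerSketch {

    // Hypothetical stand-in for an immutable model that is safe to share
    // across threads (plays the role of TokenizerModel).
    static final class SharedModel {
        final String delimiterRegex;
        SharedModel(String delimiterRegex) { this.delimiterRegex = delimiterRegex; }
    }

    // Hypothetical stand-in for a stateful tokenizer (plays the role of
    // TokenizerME). The mutable field is why one instance per thread is needed.
    static final class StatefulTokenizer {
        private final SharedModel model;
        private int lastTokenCount;
        StatefulTokenizer(SharedModel model) { this.model = model; }
        String[] tokenize(String s) {
            String[] tokens = s.trim().split(model.delimiterRegex);
            lastTokenCount = tokens.length;
            return tokens;
        }
        int lastTokenCount() { return lastTokenCount; }
    }

    // One model instance for the whole application.
    static final SharedModel MODEL = new SharedModel("\\s+");

    // Each thread lazily creates its own tokenizer around the shared model.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(MODEL));

    // Tokenizes every text on a worker thread and sums the token counts.
    static int countTokensConcurrently(List<String> texts, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String text : texts) {
                results.add(pool.submit(() -> TOKENIZER.get().tokenize(text).length));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countTokensConcurrently(Arrays.asList("a b c", "d e"), 2));
    }
}
```

With the real classes, the same shape would presumably apply: load one TokenizerModel, then let the ThreadLocal initializer construct a TokenizerME from it for each thread.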

To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method can be used.

Sample usage:

InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");

See Also:
  • Field Details

  • Constructor Details

  • Method Details

    • getTokenProbabilities

      public double[] getTokenProbabilities()
      Returns:
      The probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String). If not applicable, an empty array is returned.
    • tokenizePos

      public Span[] tokenizePos(String d)
      Tokenizes the string.
      Parameters:
      d - The string to be tokenized.
      Returns:
      A Span array containing individual tokens as elements.
    • train

      public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
      Trains a model for the TokenizerME.
      Parameters:
      samples - The samples used for the training.
      factory - A TokenizerFactory to get resources from.
      mlParams - The machine learning train parameters.
      Returns:
      A trained TokenizerModel.
      Throws:
      IOException - Thrown if an IO operation on the temp file created during training fails, or if reading from the ObjectStream fails.
    • useAlphaNumericOptimization

      public boolean useAlphaNumericOptimization()
      Returns:
      true if the tokenizer uses alphanumeric optimization, false otherwise.
    • tokenize

      public String[] tokenize(String s)
      Description copied from interface: Tokenizer
      Splits a string into its atomic parts.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      s - The string to be tokenized.
      Returns:
      The String[] with the individual tokens as the array elements.
    • setKeepNewLines

      public void setKeepNewLines(boolean keepNewLines)
      Switches whether to keep new lines or not.
      Parameters:
      keepNewLines - True if new lines are kept, false otherwise.
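For tokenizePos(String), it may help to see how offset spans relate to token strings. The sketch below is library-free and rests on stated assumptions: its Span class is a hypothetical stand-in for opennlp.tools.util.Span (start inclusive, end exclusive), and whitespaceSpans is a simplified whitespace splitter, not the maximum entropy logic of TokenizerME.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanOffsetsSketch {

    // Hypothetical stand-in for opennlp.tools.util.Span:
    // start is inclusive, end is exclusive.
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
        String coveredText(String text) { return text.substring(start, end); }
    }

    // Simplified illustration only: yields one span per run of
    // non-whitespace characters in the input.
    static Span[] whitespaceSpans(String d) {
        List<Span> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < d.length(); i++) {
            boolean whitespace = Character.isWhitespace(d.charAt(i));
            if (!whitespace && tokenStart < 0) {
                tokenStart = i;
            } else if (whitespace && tokenStart >= 0) {
                spans.add(new Span(tokenStart, i));
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {
            spans.add(new Span(tokenStart, d.length()));
        }
        return spans.toArray(new Span[0]);
    }

    // Recovers the token strings from the spans and the original text.
    static String[] spansToStrings(Span[] spans, String text) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = spans[i].coveredText(text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A sentence  here.";
        for (Span span : whitespaceSpans(text)) {
            System.out.println("[" + span.start + ".." + span.end + ") "
                    + span.coveredText(text));
        }
    }
}
```

Working with spans rather than strings keeps the link to the original text, which is useful when character offsets are needed, for example to highlight tokens or align them with annotations.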