opennlp.tools.tokenize
Class TokenizerME

java.lang.Object
  extended by opennlp.tools.tokenize.TokenizerME
All Implemented Interfaces:
Tokenizer

public class TokenizerME
extends Object

A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage.

This tokenizer needs a statistical model to tokenize text; it reproduces the tokenization observed in the training data used to create the model. The TokenizerModel class encapsulates the model and provides methods to create it from its binary representation.

A tokenizer instance is not thread-safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.
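One way to follow the threading rule above is a ThreadLocal that lazily builds one tokenizer per thread from a single shared model. The sketch below is deliberately generic so that it compiles without OpenNLP on the classpath; the PerThread class and its names are hypothetical and not part of the library.

```java
import java.util.function.Function;

/**
 * Hypothetical helper (not part of OpenNLP): hands each thread its own
 * instance of T, all of them built from one shared model of type M.
 */
final class PerThread<M, T> {

    private final ThreadLocal<T> local;

    PerThread(M sharedModel, Function<M, T> factory) {
        // The factory runs at most once per thread, on that thread's first get().
        this.local = ThreadLocal.withInitial(() -> factory.apply(sharedModel));
    }

    /** Returns the calling thread's own instance. */
    T get() {
        return local.get();
    }
}

// With OpenNLP available, usage would presumably look like this
// (shown as a comment because it is not compiled here):
//
//   PerThread<TokenizerModel, Tokenizer> tokenizers =
//       new PerThread<>(model, TokenizerME::new);
//   String[] tokens = tokenizers.get().tokenize("A sentence to be tokenized.");
```

The model is loaded once and passed to every per-thread constructor call, which matches the memory-sharing advice above.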

To train a new model, the train(String, ObjectStream, boolean, TrainingParameters) method can be used.

Sample usage:

InputStream modelIn;

...

TokenizerModel model = new TokenizerModel(modelIn);

Tokenizer tokenizer = new TokenizerME(model);

String tokens[] = tokenizer.tokenize("A sentence to be tokenized.");

See Also:
Tokenizer, TokenizerModel, TokenSample

Field Summary
static Pattern alphaNumeric
          Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
static String NO_SPLIT
          Constant indicating no token split.
static String SPLIT
          Constant indicating a token split.
 
Constructor Summary
TokenizerME(TokenizerModel model)
           
TokenizerME(TokenizerModel model, Factory factory)
          Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
 
Method Summary
 double[] getTokenProbabilities()
          Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String).
 String[] tokenize(String s)
          Splits a string into its atomic parts
 Span[] tokenizePos(String d)
          Tokenizes the string.
static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams)
          Trains a model for the TokenizerME.
static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization)
          Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, int cutoff, int iterations)
          Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams)
          Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams)
          Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
 boolean useAlphaNumericOptimization()
          Returns the value of the alpha-numeric optimization flag.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SPLIT

public static final String SPLIT
Constant indicating a token split.

See Also:
Constant Field Values

NO_SPLIT

public static final String NO_SPLIT
Constant indicating no token split.

See Also:
Constant Field Values

alphaNumeric

@Deprecated
public static final Pattern alphaNumeric
Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
Alpha-Numeric Pattern

Constructor Detail

TokenizerME

public TokenizerME(TokenizerModel model)

TokenizerME

public TokenizerME(TokenizerModel model,
                   Factory factory)
Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.

Method Detail

getTokenProbabilities

public double[] getTokenProbabilities()
Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String).

Returns:
probability for each token returned for the most recent call to tokenize. If not applicable an empty array is returned.
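As an illustration of how the returned array might be used, the sketch below flags tokens whose probability falls under a threshold. It works on plain String[] and double[] arrays, the return types of tokenize and getTokenProbabilities, and assumes the two arrays are parallel (one probability per token). The class name, the threshold, and the sample values are hypothetical; the numbers are hand-written, not model output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP). */
final class LowConfidenceTokens {

    /**
     * Returns the tokens whose probability is below the threshold.
     * If the probability array is empty or does not line up with the
     * tokens, nothing is flagged.
     */
    static List<String> below(String[] tokens, double[] probs, double threshold) {
        List<String> flagged = new ArrayList<>();
        if (probs.length != tokens.length) {
            return flagged;
        }
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for tokenizer.tokenize(...) and
        // tokenizer.getTokenProbabilities().
        String[] tokens = { "Mr.", "Smith", "." };
        double[] probs = { 0.62, 0.99, 0.97 };
        System.out.println(below(tokens, probs, 0.9)); // prints [Mr.]
    }
}
```

Because the probabilities belong to the most recent call, they should be read before the same tokenizer instance is asked to tokenize another string.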

tokenizePos

public Span[] tokenizePos(String d)
Tokenizes the string.

Parameters:
d - The string to be tokenized.
Returns:
A span array containing individual tokens as elements.
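Each returned span identifies a token by its character offsets in the input rather than by a copied string. The sketch below shows how such offsets map back to token text. It uses a small stand-in class (Offsets) instead of opennlp.tools.util.Span so that it compiles on its own, and it assumes the start offset is inclusive and the end offset exclusive. The offsets in the example are hand-written, not model output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration (not part of OpenNLP). */
final class SpanSketch {

    /** Stand-in for a span: start is inclusive, end is exclusive (assumed). */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Recovers the token strings covered by the given offsets. */
    static List<String> coveredText(String text, List<Offsets> spans) {
        List<String> tokens = new ArrayList<>();
        for (Offsets s : spans) {
            tokens.add(text.substring(s.start, s.end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A sentence.";
        List<Offsets> spans = Arrays.asList(
            new Offsets(0, 1), new Offsets(2, 10), new Offsets(10, 11));
        System.out.println(coveredText(text, spans)); // prints [A, sentence, .]
    }
}
```

Offsets are useful when the caller needs to highlight or annotate the original text, since the position of each token is preserved.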

train

public static TokenizerModel train(ObjectStream<TokenSample> samples,
                                   TokenizerFactory factory,
                                   TrainingParameters mlParams)
                            throws IOException
Trains a model for the TokenizerME.

Parameters:
samples - the samples used for the training.
factory - a TokenizerFactory to get resources from
mlParams - the machine learning train parameters
Returns:
the trained TokenizerModel
Throws:
IOException - if an I/O operation on the temp file created during training fails, or if reading from the ObjectStream fails.

train

public static TokenizerModel train(String languageCode,
                                   ObjectStream<TokenSample> samples,
                                   boolean useAlphaNumericOptimization,
                                   TrainingParameters mlParams)
                            throws IOException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory

Trains a model for the TokenizerME.

Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true, alphanumerics are skipped
mlParams - the machine learning train parameters
Returns:
the trained TokenizerModel
Throws:
IOException - if an I/O operation on the temp file created during training fails, or if reading from the ObjectStream fails.

train

public static TokenizerModel train(String languageCode,
                                   ObjectStream<TokenSample> samples,
                                   Dictionary abbreviations,
                                   boolean useAlphaNumericOptimization,
                                   TrainingParameters mlParams)
                            throws IOException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory

Trains a model for the TokenizerME.

Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
abbreviations - an abbreviations dictionary
useAlphaNumericOptimization - if true, alphanumerics are skipped
mlParams - the machine learning train parameters
Returns:
the trained TokenizerModel
Throws:
IOException - if an I/O operation on the temp file created during training fails, or if reading from the ObjectStream fails.

train

@Deprecated
public static TokenizerModel train(String languageCode,
                                   ObjectStream<TokenSample> samples,
                                   boolean useAlphaNumericOptimization,
                                   int cutoff,
                                   int iterations)
                            throws IOException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory

Trains a model for the TokenizerME.

Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true, alphanumerics are skipped
cutoff - number of times a feature must be seen to be considered
iterations - number of iterations to train the maxent model
Returns:
the trained TokenizerModel
Throws:
IOException - if an I/O operation on the temp file created during training fails, or if reading from the ObjectStream fails.

train

public static TokenizerModel train(String languageCode,
                                   ObjectStream<TokenSample> samples,
                                   boolean useAlphaNumericOptimization)
                            throws IOException,
                                   ObjectStreamException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory

Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.

Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true, alphanumerics are skipped
Returns:
the trained TokenizerModel
Throws:
IOException - if an I/O operation on the temp file created during training fails.
ObjectStreamException - if reading from the ObjectStream fails.

useAlphaNumericOptimization

public boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.

Returns:
true if the tokenizer should use alpha-numeric optimization, false otherwise.

tokenize

public String[] tokenize(String s)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.

Specified by:
tokenize in interface Tokenizer
Parameters:
s - The string to be tokenized.
Returns:
The String[] with the individual tokens as the array elements.


Copyright © 2013 The Apache Software Foundation. All Rights Reserved.