TokenizerME (Apache OpenNLP Tools 1.6.0 API)

java.lang.Object
- opennlp.tools.tokenize.TokenizerME

All Implemented Interfaces:

Tokenizer
```
public class TokenizerME
extends Object
```
A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.
This tokenizer needs a statistical model to tokenize a text which reproduces the tokenization observed in the training data used to create the model. The TokenizerModel class encapsulates the model and provides methods to create it from the binary representation.
A tokenizer instance is not thread safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.
To train a new model, use the train(ObjectStream, TokenizerFactory, TrainingParameters) method. The train(String, ObjectStream, boolean, TrainingParameters) overload is deprecated.
Sample usage:

    InputStream modelIn;
    ...
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");
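
Because a tokenizer instance is not thread safe while the model can be shared, one common arrangement is to keep a single model and hand each thread its own tokenizer. The following is a minimal sketch of that pattern using only the JDK; the `PerThread` class is a hypothetical helper and is not part of OpenNLP.

```java
import java.util.function.Supplier;

/** Hypothetical helper: gives each thread its own instance of a non-thread-safe object. */
final class PerThread<T> {
    private final ThreadLocal<T> local;

    /** The factory runs once per thread, on that thread's first call to get(). */
    PerThread(Supplier<? extends T> factory) {
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own instance, creating it on first use. */
    T get() {
        return local.get();
    }
}
```

With OpenNLP on the classpath, this could be used as `new PerThread<Tokenizer>(() -> new TokenizerME(model))`, so that every thread builds its own `TokenizerME` from the one shared `TokenizerModel`.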

See Also:
Tokenizer, TokenizerModel, TokenSample

Field Summary

Fields
Modifier and Type	Field and Description
`static Pattern`	`alphaNumeric` Deprecated. As of release 1.5.2, replaced by `Factory.getAlphanumeric(String)`
`static String`	`NO_SPLIT` Constant indicates no token split.
`static String`	`SPLIT` Constant indicates a token split.

Constructor Summary

Constructors
Constructor and Description
`TokenizerME(TokenizerModel model)`
`TokenizerME(TokenizerModel model, Factory factory)` Deprecated. Use `TokenizerFactory` to extend the Tokenizer functionality.

Method Summary

Methods
Modifier and Type	Method and Description
`double[]`	`getTokenProbabilities()` Returns the probabilities associated with the most recent calls to `AbstractTokenizer.tokenize(String)` or `tokenizePos(String)`.
`String[]`	`tokenize(String s)` Splits a string into its atomic parts
`Span[]`	`tokenizePos(String d)` Tokenizes the string.
`static TokenizerModel`	`train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams)` Trains a model for the `TokenizerME`.
`static TokenizerModel`	`train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization)` Deprecated. Use `train(ObjectStream, TokenizerFactory, TrainingParameters)` and pass in a `TokenizerFactory`
`static TokenizerModel`	`train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams)` Deprecated. Use `train(ObjectStream, TokenizerFactory, TrainingParameters)` and pass in a `TokenizerFactory`
`static TokenizerModel`	`train(String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams)` Deprecated. Use `train(ObjectStream, TokenizerFactory, TrainingParameters)` and pass in a `TokenizerFactory`
`boolean`	`useAlphaNumericOptimization()` Returns the value of the alpha-numeric optimization flag.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - SPLIT
```
public static final String SPLIT
```
    Constant indicates a token split.
    
    See Also:
    Constant Field Values
  - NO_SPLIT
```
public static final String NO_SPLIT
```
    Constant indicates no token split.
    
    See Also:
    Constant Field Values
  - alphaNumeric
```
@Deprecated
public static final Pattern alphaNumeric
```
    Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
    
    Alpha-Numeric Pattern
- Constructor Detail
  - TokenizerME
```
public TokenizerME(TokenizerModel model)
```
  - TokenizerME
```
public TokenizerME(TokenizerModel model,
           Factory factory)
```
    Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
- Method Detail
  - getTokenProbabilities
```
public double[] getTokenProbabilities()
```
    Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String).
    
    Returns:
    probability for each token returned for the most recent call to tokenize. If not applicable an empty array is returned.
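
    Since one probability is returned per token of the most recent call, the two arrays can be read in parallel. The sketch below shows one way to pick out low-confidence tokens; the `LowConfidenceTokens` class and the threshold are hypothetical and not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pairs tokens with their probabilities by index. */
final class LowConfidenceTokens {

    /** Returns every tokens[i] whose probs[i] is strictly below the threshold. */
    static List<String> below(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per token");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                result.add(tokens[i]);
            }
        }
        return result;
    }
}
```

    In practice the inputs would come from `tokenizer.tokenize(text)` followed by `tokenizer.getTokenProbabilities()` on the same instance, with no other tokenize call in between.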
  - tokenizePos
```
public Span[] tokenizePos(String d)
```
    Tokenizes the string.
    
    Parameters:
    d - The string to be tokenized.
    
    Returns:
    A span array containing individual tokens as elements.
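
    Each span identifies a token by its character offsets in the input rather than by a copy of its text. The sketch below recovers the token strings from such offsets. The `Offsets` class is a hypothetical stand-in for `Span`, and it assumes the end offset is exclusive.

```java
/** Sketch: turning character-offset spans back into token strings. */
final class SpanSketch {

    /** Hypothetical stand-in for a span: offsets [start, end), end exclusive (assumed). */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the substring of text covered by each span, in order. */
    static String[] coveredText(String text, Offsets[] spans) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }
}
```

    Working with offsets keeps the link between each token and its position in the original text, which `tokenize(String)` alone does not preserve.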
  - train
```
public static TokenizerModel train(ObjectStream<TokenSample> samples,
                   TokenizerFactory factory,
                   TrainingParameters mlParams)
                            throws IOException
```
    Trains a model for the TokenizerME.
    
    Parameters:
    samples - the samples used for the training.
    factory - a TokenizerFactory to get resources from
    mlParams - the machine learning train parameters
    
    Returns:
    the trained TokenizerModel
    
    Throws:
    
    IOException - if an I/O error occurs on the temporary file created during training, or if reading from the ObjectStream fails.
  - train
```
public static TokenizerModel train(String languageCode,
                   ObjectStream<TokenSample> samples,
                   boolean useAlphaNumericOptimization,
                   TrainingParameters mlParams)
                            throws IOException
```
    Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
    
    Trains a model for the TokenizerME.
    
    Parameters:
    languageCode - the language of the natural text
    samples - the samples used for the training.
    useAlphaNumericOptimization - if true, alphanumerics are skipped
    mlParams - the machine learning train parameters
    
    Returns:
    the trained TokenizerModel
    
    Throws:
    
    IOException - if an I/O error occurs on the temporary file created during training, or if reading from the ObjectStream fails.
  - train
```
public static TokenizerModel train(String languageCode,
                   ObjectStream<TokenSample> samples,
                   Dictionary abbreviations,
                   boolean useAlphaNumericOptimization,
                   TrainingParameters mlParams)
                            throws IOException
```
    Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
    
    Trains a model for the TokenizerME.
    
    Parameters:
    languageCode - the language of the natural text
    samples - the samples used for the training.
    abbreviations - an abbreviations dictionary
    useAlphaNumericOptimization - if true, alphanumerics are skipped
    mlParams - the machine learning train parameters
    
    Returns:
    the trained TokenizerModel
    
    Throws:
    
    IOException - if an I/O error occurs on the temporary file created during training, or if reading from the ObjectStream fails.
  - train
```
public static TokenizerModel train(String languageCode,
                   ObjectStream<TokenSample> samples,
                   boolean useAlphaNumericOptimization)
                            throws IOException,
                                   ObjectStreamException
```
    Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory
    
    Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.
    
    Parameters:
    languageCode - the language of the natural text
    samples - the samples used for the training.
    useAlphaNumericOptimization - if true, alphanumerics are skipped
    
    Returns:
    the trained TokenizerModel
    
    Throws:
    
    IOException - if an I/O error occurs on the temporary file created during training.
    
    ObjectStreamException - if reading from the ObjectStream fails.
  - useAlphaNumericOptimization
```
public boolean useAlphaNumericOptimization()
```
    Returns the value of the alpha-numeric optimization flag.
    
    Returns:
    true if the tokenizer should use alpha-numeric optimization, false otherwise.
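
    The parameter description says only that alphanumerics are skipped when this flag is true. The following sketch illustrates what such a shortcut could look like: a whitespace-delimited chunk made up solely of letters and digits is kept whole, and only other chunks are examined for split points. The class name, method name, and pattern here are assumptions for illustration, not the library's code.

```java
import java.util.regex.Pattern;

/** Sketch of an alphanumeric shortcut; names and pattern are assumptions. */
final class AlphaNumericShortcut {

    // Assumed pattern: one or more ASCII letters or digits and nothing else.
    static final Pattern ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    /**
     * Returns true if the chunk should be examined for split points,
     * false if it can be emitted as a single token without further checks.
     */
    static boolean needsSplitCheck(String chunk, boolean useAlphaNumericOptimization) {
        if (useAlphaNumericOptimization && ALPHANUMERIC.matcher(chunk).matches()) {
            return false; // purely alphanumeric: skipped
        }
        return true;
    }
}
```

    Under these assumptions, a chunk such as `hello` is skipped when the flag is on, while `don't` or `end.` would still be checked.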
  - tokenize
```
public String[] tokenize(String s)
```
    Description copied from interface: Tokenizer
    
    Splits a string into its atomic parts
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    s - The string to be tokenized.
    
    Returns:
    The String[] with the individual tokens as the array elements.
