Class TokenizerME
java.lang.Object
opennlp.tools.tokenize.TokenizerME
- All Implemented Interfaces:
opennlp.tools.ml.Probabilistic, opennlp.tools.tokenize.Tokenizer
A Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage: http://www.cis.upenn.edu/~jcreynar.
This implementation needs a statistical model to tokenize text. The tokenizer reproduces
the tokenization observed in the training data used to create the model.
The TokenizerModel class encapsulates that model and provides
methods to create it from its binary representation.
A tokenizer instance is not thread-safe. Each thread must instantiate
its own tokenizer. All of these tokenizers can share a single TokenizerModel
instance to save memory.
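The one-tokenizer-per-thread rule can be sketched with a ThreadLocal. The block below is a minimal illustration that avoids any OpenNLP dependency: Model and StatefulTokenizer are hypothetical stand-ins for TokenizerModel and TokenizerME, and the helper method exists only to show what a second thread observes.

```java
public class PerThreadTokenizerSketch {

    // Hypothetical stand-in for the shareable TokenizerModel.
    static final class Model {}

    // Hypothetical stand-in for a stateful, non-thread-safe tokenizer such as TokenizerME.
    static final class StatefulTokenizer {
        final Model model;

        StatefulTokenizer(Model model) {
            this.model = model;
        }
    }

    // One model instance, created once and shared by all threads.
    static final Model SHARED_MODEL = new Model();

    // Each thread lazily creates its own tokenizer on first access.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(SHARED_MODEL));

    // Returns the tokenizer instance that a freshly started thread observes.
    static StatefulTokenizer tokenizerSeenByNewThread() {
        StatefulTokenizer[] holder = new StatefulTokenizer[1];
        Thread worker = new Thread(() -> holder[0] = TOKENIZER.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }

    public static void main(String[] args) {
        StatefulTokenizer mine = TOKENIZER.get();
        StatefulTokenizer other = tokenizerSeenByNewThread();
        System.out.println("distinct tokenizers: " + (mine != other));
        System.out.println("shared model: " + (mine.model == other.model));
    }
}
```

With the real classes, the initializer would presumably become `() -> new TokenizerME(model)`, where `model` is the single loaded TokenizerModel.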
To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method
can be used.
Sample usage:
InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");
Field Summary
Fields
SPLIT
NO_SPLIT
-
Constructor Summary
Constructors
TokenizerME(String language) - Initializes a TokenizerME by downloading a default model.
TokenizerME(TokenizerModel model) - Instantiates a TokenizerME with an existing TokenizerModel.
TokenizerME(TokenizerModel model, Dictionary abbDict) - Instantiates a TokenizerME with an existing TokenizerModel.
-
Method Summary
Modifier and Type, Method - Description
double[] getTokenProbabilities() - Deprecated, for removal: This API element is subject to removal in a future version.
double[] probs() - The sequence was determined based on the previous call to tokenizePos(String).
void setKeepNewLines(boolean arg0)
String[] tokenize(String)
opennlp.tools.util.Span[] tokenizePos(String d) - Tokenizes the string.
static TokenizerModel train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams) - Trains a model for the TokenizerME.
boolean useAlphaNumericOptimization()
-
Field Details
-
SPLIT
-
NO_SPLIT
-
-
Constructor Details
-
TokenizerME
Initializes a TokenizerME by downloading a default model.
Parameters:
language - The language of the tokenizer.
Throws:
IOException - Thrown if the model cannot be downloaded or saved.
-
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
Parameters:
model - The TokenizerModel to be used.
-
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
Parameters:
model - The TokenizerModel to be used.
abbDict - The Dictionary to be used. It must fit the language of the model.
-
-
Method Details
-
probs
public double[] probs()
The sequence was determined based on the previous call to tokenizePos(String).
Specified by:
probs in interface opennlp.tools.ml.Probabilistic
Returns:
An array with one probability for each token determined by the most recent
call to tokenizePos(String). If not applicable, an empty array is returned.
-
getTokenProbabilities
Deprecated, for removal: This API element is subject to removal in a future version. Use probs() instead.
Returns:
The probabilities associated with the most recent call to
tokenizePos(String). If not applicable, an empty array is returned.
-
tokenizePos
Tokenizes the string.
Specified by:
tokenizePos in interface opennlp.tools.tokenize.Tokenizer
Parameters:
d - The string to be tokenized.
Returns:
A Span array containing individual tokens as elements.
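tokenizePos returns character offsets into the input, not token strings. The sketch below shows how such offsets map back to token text. To stay free of any OpenNLP dependency, it uses a hypothetical stand-in class, Offsets, in place of opennlp.tools.util.Span. It assumes, as Span does, an inclusive start and an exclusive end. The offsets are written by hand for illustration and are not the output of a model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanOffsetsSketch {

    // Hypothetical stand-in for opennlp.tools.util.Span:
    // start is inclusive, end is exclusive.
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Mirrors the idea behind Span.getCoveredText(CharSequence).
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    // Maps each offset pair back to the token text it covers.
    static List<String> toTokens(String text, List<Offsets> spans) {
        List<String> tokens = new ArrayList<>();
        for (Offsets span : spans) {
            tokens.add(span.coveredText(text));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world.";
        // Hand-written offsets for illustration; a real tokenizer would compute them.
        List<Offsets> spans = Arrays.asList(
                new Offsets(0, 5), new Offsets(5, 6),
                new Offsets(7, 12), new Offsets(12, 13));
        for (String token : toTokens(text, spans)) {
            System.out.println(token);
        }
    }
}
```

With the real API, the spans would come from tokenizePos(String), and Span.getCoveredText(CharSequence) would return the covered text.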
-
train
public static TokenizerModel train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
Parameters:
samples - The samples used for the training.
factory - A TokenizerFactory to get resources from.
mlParams - The machine learning training parameters.
Returns:
A trained TokenizerModel.
Throws:
IOException - Thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
-
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
Returns:
true if the tokenizer uses alphanumeric optimization, false otherwise.
-
tokenize
-
setKeepNewLines
public void setKeepNewLines(boolean arg0)
-