Class TokenizerME
- All Implemented Interfaces:
- Tokenizer
Tokenizer for converting raw text into separated tokens. It uses
 Maximum Entropy to make its decisions. The features are loosely
 based on Jeff Reynar's UPenn thesis "Topic Segmentation:
 Algorithms and Applications", which is available from his
 homepage: http://www.cis.upenn.edu/~jcreynar.
 
 This implementation needs a statistical model to tokenize text. The tokenizer reproduces
 the tokenization observed in the training data used to create the model.
 The TokenizerModel class encapsulates that model and provides
 methods to create it from the binary representation.
 
 A tokenizer instance is not thread-safe. Each thread must instantiate
 its own tokenizer; all of them can share one TokenizerModel instance
 to save memory.
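 The one-tokenizer-per-thread rule can be sketched with a ThreadLocal. This is only an illustration: PerThread is a hypothetical helper, not part of OpenNLP, and it is written generically so that the block compiles without OpenNLP on the classpath.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an OpenNLP class): hands each thread its own
 * instance of an object that is not thread-safe.
 */
final class PerThread<T> {

    private final ThreadLocal<T> local;

    PerThread(Supplier<T> factory) {
        // The factory runs once per thread, on that thread's first call to get().
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own instance, creating it on first use. */
    T get() {
        return local.get();
    }
}
```

 Assuming a loaded TokenizerModel named model, the holder could be created as `new PerThread<Tokenizer>(() -> new TokenizerME(model))`, so that every thread builds its own tokenizer while all of them read the same model.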
 
 To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method
 can be used.
 
Sample usage:
 
 InputStream modelIn;
 
 ...
 
 TokenizerModel model = new TokenizerModel(modelIn);
 
 Tokenizer tokenizer = new TokenizerME(model);
 
 String tokens[] = tokenizer.tokenize("A sentence to be tokenized.");
 
Field Summary
- SPLIT: Constant that indicates a token split.
- NO_SPLIT: Constant that indicates no token split.
- 
Constructor Summary
- TokenizerME(String language): Initializes a TokenizerME by downloading a default model.
- TokenizerME(TokenizerModel model): Instantiates a TokenizerME with an existing TokenizerModel.
- TokenizerME(TokenizerModel model, Dictionary abbDict): Instantiates a TokenizerME with an existing TokenizerModel.
- 
Method Summary
- double[] getTokenProbabilities()
- void setKeepNewLines(boolean keepNewLines): Switches whether to keep new lines or not.
- String[] tokenize(String): Splits a string into its atomic parts.
- Span[] tokenizePos(String d): Tokenizes the string.
- static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams): Trains a model for the TokenizerME.
- boolean useAlphaNumericOptimization()
- 
Field Details
- 
SPLIT
Constant that indicates a token split.
 
- 
NO_SPLIT
Constant that indicates no token split.
 
 
- 
- 
Constructor Details
- 
TokenizerME
Initializes a TokenizerME by downloading a default model.
- Parameters:
- language - The language of the tokenizer.
- Throws:
- IOException - Thrown if the model cannot be downloaded or saved.
 
- 
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
- Parameters:
- model - The TokenizerModel to be used.
 
- 
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
- Parameters:
- model - The TokenizerModel to be used.
- abbDict - The Dictionary to be used. It must fit the language of the model.
 
 
- 
- 
Method Details
- 
getTokenProbabilities
public double[] getTokenProbabilities()
- Returns:
- The probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String). If not applicable, an empty array is returned.
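One possible use of these probabilities is to flag tokens the model was unsure about. The sketch below assumes that the returned array lines up one-to-one with the tokens from the most recent call; TokenConfidence and its threshold parameter are hypothetical names introduced for illustration, and the helper takes plain arrays so that it runs without OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class) for inspecting token probabilities. */
final class TokenConfidence {

    private TokenConfidence() {
    }

    /**
     * Returns the tokens whose probability is below the threshold.
     * Assumes probs[i] belongs to tokens[i].
     */
    static List<String> below(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("tokens and probs differ in length");
        }
        List<String> uncertain = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                uncertain.add(tokens[i]);
            }
        }
        return uncertain;
    }
}
```

With a tokenizer at hand, the inputs would come from tokenizer.tokenize(text) followed immediately by tokenizer.getTokenProbabilities(), since the probabilities always refer to the most recent call.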
 
- 
tokenizePos
Tokenizes the string.
- Parameters:
- d - The string to be tokenized.
- Returns:
- A Span array containing individual tokens as elements.
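Each Span marks a token by its character offsets in the input rather than by its text. The sketch below shows how such offsets map back to token strings, assuming a start offset that is inclusive and an end offset that is exclusive. SpanText is a hypothetical helper, and plain int pairs stand in for Span objects so that the block runs without OpenNLP.

```java
/** Hypothetical helper (not an OpenNLP class): recovers token text from offsets. */
final class SpanText {

    private SpanText() {
    }

    /**
     * Returns the substring covered by each {start, end} pair.
     * start is inclusive, end is exclusive.
     */
    static String[] cover(String text, int[][] offsets) {
        String[] tokens = new String[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            tokens[i] = text.substring(offsets[i][0], offsets[i][1]);
        }
        return tokens;
    }
}
```

With real Span objects, getStart() and getEnd() supply the two offsets, and getCoveredText(CharSequence) performs the same lookup directly.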
 
- 
train
public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
- Parameters:
- samples - The samples used for the training.
- factory - A TokenizerFactory to get resources from.
- mlParams - The machine learning train parameters.
- Returns:
- A trained TokenizerModel.
- Throws:
- IOException - Thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
 
- 
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
- Returns:
- true if the tokenizer uses alphanumeric optimization, false otherwise.
 
- 
tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
- 
setKeepNewLines
public void setKeepNewLines(boolean keepNewLines)
Switches whether to keep new lines or not.
- Parameters:
- keepNewLines - true if new lines are kept, false otherwise.
 
 
-