Class TokenizerME
- All Implemented Interfaces:
Tokenizer
A Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage: http://www.cis.upenn.edu/~jcreynar.
This implementation needs a statistical model to tokenize a text. The model reproduces
the tokenization observed in the training data used to create it.
The TokenizerModel class encapsulates that model and provides
methods to create it from the binary representation.
A tokenizer instance is not thread-safe. Each thread must instantiate its own
tokenizer, but all of them can share one TokenizerModel instance to save memory.
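The per-thread rule above can be sketched with a ThreadLocal. This is a minimal illustration, not OpenNLP code: Model and StatefulTokenizer are hypothetical stand-ins for TokenizerModel and TokenizerME so that the example runs without the library or a model file, and the whitespace split inside tokenize is only a placeholder for the real tokenization.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerThreadTokenizerSketch {

    // Hypothetical stand-in for TokenizerModel: immutable, so safe to share.
    static final class Model {
        final String language;
        Model(String language) { this.language = language; }
    }

    // Hypothetical stand-in for TokenizerME: keeps per-call state, so not thread-safe.
    static final class StatefulTokenizer {
        final Model model;
        private String[] lastTokens = new String[0];

        StatefulTokenizer(Model model) { this.model = model; }

        // Placeholder logic: splits on whitespace and remembers the last result.
        String[] tokenize(String text) {
            lastTokens = text.trim().split("\\s+");
            return lastTokens;
        }
    }

    // Created once and shared by every thread.
    static final Model SHARED_MODEL = new Model("en");

    // Each thread lazily creates its own tokenizer on top of the shared model.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(SHARED_MODEL));

    static StatefulTokenizer current() {
        return TOKENIZER.get();
    }

    // Returns the tokenizer instance that a separate worker thread sees.
    static StatefulTokenizer tokenizerOnWorkerThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<StatefulTokenizer> task = PerThreadTokenizerSketch::current;
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException("worker thread failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With OpenNLP on the classpath, the same pattern would presumably reduce to ThreadLocal.withInitial(() -> new TokenizerME(model)), where model is the shared TokenizerModel.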
To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method can be used.
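Sample usage:
InputStream modelIn;
...
// Load the model once; it can be shared between tokenizer instances.
TokenizerModel model = new TokenizerModel(modelIn);
// Create one tokenizer per thread.
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");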
-
Field Summary
Fields
NO_SPLIT - Constant indicating no token split.
SPLIT - Constant indicating a token split.
-
Constructor Summary
Constructors
Constructor - Description
TokenizerME(String language) - Initializes a TokenizerME by downloading a default model.
TokenizerME(TokenizerModel model) - Instantiates a TokenizerME with an existing TokenizerModel.
TokenizerME(TokenizerModel model, Dictionary abbDict) - Instantiates a TokenizerME with an existing TokenizerModel and the given Dictionary.
-
Method Summary
Modifier and Type, Method - Description
double[] getTokenProbabilities()
void setKeepNewLines(boolean keepNewLines) - Switches whether to keep new lines or not.
String[] tokenize(String) - Splits a string into its atomic parts.
Span[] tokenizePos(String d) - Tokenizes the string.
static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) - Trains a model for the TokenizerME.
boolean useAlphaNumericOptimization()
-
Field Details
-
SPLIT
Constant indicating a token split.
-
NO_SPLIT
Constant indicating no token split.
-
-
Constructor Details
-
TokenizerME
public TokenizerME(String language) throws IOException
Initializes a TokenizerME by downloading a default model.
Parameters:
language - The language of the tokenizer.
Throws:
IOException - Thrown if the model cannot be downloaded or saved.
-
TokenizerME
public TokenizerME(TokenizerModel model)
Instantiates a TokenizerME with an existing TokenizerModel.
Parameters:
model - The TokenizerModel to be used.
-
TokenizerME
public TokenizerME(TokenizerModel model, Dictionary abbDict)
Instantiates a TokenizerME with an existing TokenizerModel.
Parameters:
model - The TokenizerModel to be used.
abbDict - The Dictionary to be used. It must fit the language of the model.
-
-
Method Details
-
getTokenProbabilities
public double[] getTokenProbabilities()
Returns:
The probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String). If not applicable, an empty array is returned.
-
tokenizePos
public Span[] tokenizePos(String d)
Tokenizes the string.
Parameters:
d - The string to be tokenized.
Returns:
A Span array containing individual tokens as elements.
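To illustrate how offset-based results like those of tokenizePos relate to the input string, here is a small sketch. It rests on stated assumptions: Offsets is a hypothetical stand-in for Span with an inclusive start and an exclusive end, and the splitting rule is plain whitespace rather than the maximum entropy model used by TokenizerME.

```java
import java.util.ArrayList;
import java.util.List;

class SpanOffsetsSketch {

    // Hypothetical stand-in for a Span: start is inclusive, end is exclusive.
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Recovers the token text from the original input.
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    // Stand-in splitting rule: runs of non-whitespace characters.
    // TokenizerME decides splits with its model instead.
    static List<Offsets> whitespaceSpans(String text) {
        List<Offsets> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && tokenStart < 0) {
                tokenStart = i;
            } else if (ws && tokenStart >= 0) {
                spans.add(new Offsets(tokenStart, i));
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {
            spans.add(new Offsets(tokenStart, text.length()));
        }
        return spans;
    }
}
```

Returning offsets instead of strings keeps the original text untouched, so each token can be mapped back to its exact position in the input, for example for highlighting or for aligning annotations.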
-
train
public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
Parameters:
samples - The samples used for the training.
factory - A TokenizerFactory to get resources from.
mlParams - The machine learning training parameters.
Returns:
A trained TokenizerModel.
Throws:
IOException - Thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
-
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
Returns:
true if the tokenizer uses alphanumeric optimization, false otherwise.
-
tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
-
setKeepNewLines
public void setKeepNewLines(boolean keepNewLines)
Switches whether to keep new lines or not.
Parameters:
keepNewLines - true if new lines are kept, false otherwise.
-