Package opennlp.tools.tokenize
Contains classes related to finding tokens or words in a string. All tokenizers implement the Tokenizer interface. Currently there are the learnable TokenizerME, the WhitespaceTokenizer, and the SimpleTokenizer, which is a character-class tokenizer.
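To illustrate the Tokenizer contract (tokenize(String) returns a String[] of tokens), here is a minimal standalone sketch of what a whitespace tokenizer does. This is not the OpenNLP WhitespaceTokenizer implementation, only a self-contained approximation of its behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizerSketch {
    // Splits text on runs of whitespace, mirroring the
    // tokenize(String) -> String[] shape of the Tokenizer interface.
    public static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, -1 if none open
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i; // a token begins
            } else if (ws && start >= 0) {
                tokens.add(text.substring(start, i)); // a token ends
                start = -1;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // trailing token
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] toks = tokenize("  Mr. Smith  joined in 2020. ");
        System.out.println(String.join("|", toks));
        // Mr.|Smith|joined|in|2020.
    }
}
```

Note that punctuation stays attached to the adjacent word ("2020."), which is exactly the limitation that motivates the character-class and learnable tokenizers below.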
Class descriptions:

DefaultTokenContextGenerator: A default TokenContextGenerator which produces events for maxent decisions for tokenization.
Detokenizer: A Detokenizer merges tokens back to their detokenized representation.
Detokenizer.DetokenizationOperation: This enum contains an operation for every token to merge the tokens together to their detokenized form.
DetokenizerEvaluator: The DetokenizerEvaluator measures the performance of the given Detokenizer with the provided reference samples.
DictionaryDetokenizer: A rule-based detokenizer.
SimpleTokenizer: A basic Tokenizer implementation which performs tokenization using character classes.
TokenContextGenerator: Interface for context generators required for TokenizerME.
Tokenizer: The interface for tokenizers, which segment a string into its tokens.
TokenizerCrossValidator: A cross validator for tokenizers.
TokenizerEvaluationMonitor: A marker interface for evaluating tokenizers.
TokenizerEvaluator: The TokenizerEvaluator measures the performance of the given Tokenizer with the provided reference samples.
TokenizerFactory: The factory that provides the default Tokenizer implementation and resources.
TokenizerME: A Tokenizer for converting raw text into separated tokens.
TokenizerModel: The TokenizerModel is the model used by a learnable Tokenizer.
TokenSample: A TokenSample is text with token spans.
TokenSampleStream: This class is a stream filter which reads in string-encoded samples and creates TokenSample objects out of them.
WhitespaceTokenizer: A basic Tokenizer implementation which performs tokenization using white spaces.
WhitespaceTokenStream: This stream formats an ObjectStream of TokenSample objects into whitespace-separated token strings.
WordpieceTokenizer: A Tokenizer implementation which performs tokenization using word pieces.
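The idea behind character-class tokenization can be sketched in a few lines: a new token starts whenever the class of the current character (letter, digit, whitespace, other) changes. This is a standalone approximation of SimpleTokenizer's behavior, not the OpenNLP source; the exact class boundaries used here are an assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassTokenizerSketch {
    // Assumed character classes: 0 = whitespace, 1 = letter,
    // 2 = digit, 3 = punctuation and everything else.
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Emits a token for every maximal run of same-class,
    // non-whitespace characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;     // start of the current run, -1 if none open
        int prevClass = 0;  // class of the previous character
        for (int i = 0; i <= text.length(); i++) {
            // Treat end-of-input as whitespace to close the last run.
            int cls = (i == text.length()) ? 0 : charClass(text.charAt(i));
            if (cls != prevClass) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                }
                start = (cls == 0) ? -1 : i;
                prevClass = cls;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello,world! 42x"));
        // [Hello, ,, world, !, 42, x]
    }
}
```

Unlike the whitespace approach, this splits "Hello,world!" into separate word and punctuation tokens; the learnable TokenizerME goes further by using a maxent model (trained from TokenSample data into a TokenizerModel) to decide ambiguous boundaries, such as whether a period ends a token.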