Tokenizer (Apache OpenNLP Tools 1.5.3 API)

opennlp.tools.tokenize
Interface Tokenizer

All Known Implementing Classes: SimpleTokenizer, TokenizerME, WhitespaceTokenizer

public interface Tokenizer

The interface for tokenizers, which segment a string into its tokens.

Tokenization is a necessary step before more complex NLP tasks can be applied, since these usually process text at the token level. The quality of tokenization matters because it influences the performance of the higher-level tasks built on it.

In segmented languages like English, most words are separated by white space, except for punctuation and similar characters, which are attached directly to a word with no white space in between. It is not possible to simply split at every punctuation character, because in abbreviations the dots are part of the token itself. The tokenizer is responsible for splitting these tokens correctly.
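The abbreviation problem described above can be illustrated with a minimal sketch in plain Java. It is not how any OpenNLP tokenizer works: the class name NaiveSplitDemo, the tiny abbreviation list, and the rule of detaching a single trailing punctuation character are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class NaiveSplitDemo {
    // Hypothetical abbreviation list, for illustration only.
    static final Set<String> ABBREVIATIONS = Set.of("Dr.", "Mr.", "e.g.");

    // Splits on white space, then detaches one trailing punctuation
    // character unless the whole chunk is a known abbreviation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // input was empty or only white space
            }
            char last = chunk.charAt(chunk.length() - 1);
            boolean endsWithPunct = ".,;:!?".indexOf(last) >= 0;
            if (endsWithPunct && chunk.length() > 1 && !ABBREVIATIONS.contains(chunk)) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(String.valueOf(last));
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Dr." stays whole, while "arrived," and "finally." are split.
        System.out.println(tokenize("Dr. Smith arrived, finally."));
    }
}
```

A fixed list like this does not scale to real text, which is one reason a tokenizer may rely on a trained model instead of hand-written rules.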

In non-segmented languages like Chinese, tokenization is more difficult because words are not separated by white space.

Tokenizers can also be used to segment already identified tokens further into more atomic parts to get a deeper understanding. This approach helps more complex tasks gain insight into tokens that do not represent words, such as numbers, units, or tokens that are part of a special notation.

For most downstream tasks it is preferable to over-tokenize rather than under-tokenize.
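As a rough sketch of segmenting an already identified token into more atomic parts, the following plain Java splits a token wherever the character class changes. The class SubTokenDemo and its three character classes (digit, letter, other) are hypothetical and not part of OpenNLP; the rule deliberately over-tokenizes.

```java
import java.util.ArrayList;
import java.util.List;

final class SubTokenDemo {
    // Splits one token wherever the character class changes.
    static List<String> refine(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            boolean atEnd = i == token.length();
            if (atEnd || kind(token.charAt(i)) != kind(token.charAt(i - 1))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    // Hypothetical character classes: 0 = digit, 1 = letter, 2 = anything else.
    static int kind(char c) {
        if (Character.isDigit(c)) {
            return 0;
        }
        if (Character.isLetter(c)) {
            return 1;
        }
        return 2;
    }

    public static void main(String[] args) {
        System.out.println(refine("10kg"));  // prints [10, kg]
        System.out.println(refine("3.5km")); // prints [3, ., 5, km]
    }
}
```

The second call shows the cost of over-tokenizing: the decimal number is broken apart, and a later step would have to recombine it if it needs the value as a unit.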

Method Summary
`String[]`	`tokenize(String s)` Splits a string into its atomic parts.
`Span[]`	`tokenizePos(String s)` Finds the boundaries of atomic parts in a string.

Method Detail

tokenize

String[] tokenize(String s)

Splits a string into its atomic parts.

Parameters: s - The string to be tokenized.
Returns: The String[] with the individual tokens as the array elements.

tokenizePos

Span[] tokenizePos(String s)

Finds the boundaries of atomic parts in a string.

Parameters: s - The string to be tokenized.
Returns: The Span[] with the spans (offsets into s) for each token as the individual array elements.
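The relationship between the two methods can be sketched without the library: if the offsets are known, the token strings follow by taking substrings of s. The stand-in below is a hypothetical white-space splitter in plain Java, not OpenNLP code. It uses int pairs in place of Span, and it assumes the end offset is exclusive so that it can be passed straight to substring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanSketch {
    // Returns {start, end} offset pairs for white-space-separated tokens.
    // The end offset is exclusive (an assumption of this sketch).
    static int[][] tokenizePos(String s) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= s.length(); i++) {
            boolean boundary = i == s.length() || Character.isWhitespace(s.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                spans.add(new int[] {start, i});
                start = -1;
            }
        }
        return spans.toArray(new int[0][]);
    }

    // Derives the token strings from the offsets into s.
    static String[] tokenize(String s) {
        int[][] spans = tokenizePos(s);
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = s.substring(spans[i][0], spans[i][1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "a  bc";
        System.out.println(Arrays.deepToString(tokenizePos(text))); // prints [[0, 1], [3, 5]]
        System.out.println(Arrays.toString(tokenize(text)));        // prints [a, bc]
    }
}
```

Offsets are useful when later processing has to map tokens back to positions in the original string, for example to highlight them, which the token strings alone do not allow.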