Interface Tokenizer
- All Known Implementing Classes:
- SimpleTokenizer, ThreadSafeTokenizerME, TokenizerME, WhitespaceTokenizer, WordpieceTokenizer
Tokenization is a necessary step before more complex NLP tasks can be applied. These usually process text at the token level. The quality of tokenization is important because it influences the performance of the higher-level tasks applied to the tokens.
In segmented languages like English, most words are delimited by whitespace, except for punctuation, which is attached directly to the word without a space in between. It is not possible to simply split at every punctuation mark, because in abbreviations the dots are part of the token itself. A Tokenizer is responsible for splitting such tokens correctly.
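The two pitfalls above can be demonstrated with a short, self-contained sketch (the class name, helper method, and example sentence are illustrative, not part of this API): splitting only on whitespace leaves punctuation attached, while splitting at every punctuation mark destroys abbreviations.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // Naive tokenization: split at whitespace and at every punctuation mark.
    // This incorrectly breaks the abbreviation "Mr." apart from its dot.
    static String[] naiveSplit(String text) {
        return text.trim().split("[\\s.,!?]+");
    }

    public static void main(String[] args) {
        String text = "He said, Mr. Smith left.";
        // Whitespace alone keeps punctuation glued to the words: "said," and "Mr."
        System.out.println(Arrays.toString(text.split("\\s+")));
        // Splitting at every punctuation mark loses the dot of "Mr."
        System.out.println(Arrays.toString(naiveSplit(text)));
    }
}
```

Neither strategy is correct on its own, which is why a dedicated Tokenizer implementation is needed.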
 
In non-segmented languages like Chinese, tokenization is more difficult, since words are not delimited by whitespace.
Tokenizers can also be used to segment already identified tokens further into more atomic parts to gain a deeper understanding. This helps more complex tasks gain insight into tokens that do not represent words, such as numbers, units, or tokens that are part of a special notation.
For most subsequent NLP tasks, it is preferable to over-tokenize rather than to under-tokenize.
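Such further segmentation can be sketched as follows; `splitNumberUnit` is a hypothetical helper (not part of this API) that splits an already identified measurement token like "10kg" into its number and unit parts:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubtokenSplitter {
    // Hypothetical further segmentation of an already identified token:
    // split a measurement like "10kg" into its number ("10") and unit ("kg").
    static String[] splitNumberUnit(String token) {
        Matcher m = Pattern.compile("(\\d+)(\\p{Alpha}+)").matcher(token);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2) };
        }
        // Not a number-unit token: leave it as a single atomic part.
        return new String[] { token };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitNumberUnit("10kg")));
        System.out.println(Arrays.toString(splitNumberUnit("rose")));
    }
}
```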
Method Summary

Method Details

tokenize
Splits a string into its atomic parts.
- Parameters:
- s - The string to be tokenized.
- Returns:
- The String[] with the individual tokens as the array elements.
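A minimal, self-contained implementation of this contract in the spirit of WhitespaceTokenizer (the class name `SimpleWhitespaceTokenizer` is illustrative, not part of this API):

```java
import java.util.Arrays;

public class SimpleWhitespaceTokenizer {
    // Sketch of the tokenize() contract: return the atomic parts of s
    // as the elements of a String[]. Here a token is any run of
    // non-whitespace characters.
    static String[] tokenize(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("  An input  sentence. ")));
    }
}
```

Note that punctuation stays attached to the preceding word here; a model-based implementation such as TokenizerME would separate it where appropriate.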
 
- 
tokenizePos
Finds the boundaries of atomic parts in a string.
- Parameters:
- s - The string to be tokenized.
- Returns:
- The spans (offsets into s) for each token as the individual array elements.
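In contrast to tokenize, tokenizePos leaves the input string intact and only reports offsets. A self-contained sketch, using a local `Span` record as a stand-in for OpenNLP's Span class (begin inclusive, end exclusive):

```java
import java.util.ArrayList;
import java.util.List;

public class SpanTokenizer {
    // Stand-in for opennlp.tools.util.Span: start inclusive, end exclusive.
    record Span(int start, int end) {}

    // Sketch of the tokenizePos() contract: report token boundaries as
    // offsets into s instead of returning the tokens themselves.
    static Span[] tokenizePos(String s) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            // Skip whitespace between tokens.
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            if (i > start) spans.add(new Span(start, i));
        }
        return spans.toArray(new Span[0]);
    }

    public static void main(String[] args) {
        String s = "A rose  is";
        for (Span sp : tokenizePos(s)) {
            // Each token can be recovered from the original string.
            System.out.println(sp + " -> " + s.substring(sp.start(), sp.end()));
        }
    }
}
```

Keeping offsets rather than copies of the tokens lets later components map annotations back onto the original text.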
 
 