Tokenizer (Apache OpenNLP Tools 2.1.1 API)

All Known Implementing Classes:

SimpleTokenizer, TokenizerME, WhitespaceTokenizer, WordpieceTokenizer
```
public interface Tokenizer
```
The interface for tokenizers, which segment a string into its tokens.
Tokenization is a necessary step before more complex NLP tasks can be applied, since these usually process text at the token level. The quality of tokenization matters because it influences the performance of the higher-level tasks that build on it.
In segmented languages like English, most words are separated by whitespace. The exception is punctuation and similar characters, which are attached directly to a word without whitespace in between. It is not possible to simply split at every punctuation character, because in abbreviations the dots are part of the token itself. A Tokenizer is responsible for splitting those tokens correctly.
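The abbreviation problem can be made concrete with a minimal sketch that uses only the Java standard library. The class `NaiveSplitDemo` and its method are hypothetical names chosen for illustration and are not part of OpenNLP; the sketch treats every character that is neither a letter, a digit, nor whitespace as a separate token.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveSplitDemo {

    // Hypothetical helper (not an OpenNLP API): splits at whitespace and
    // emits every non-alphanumeric, non-whitespace character as its own token.
    static List<String> splitAtWhitespaceAndPunctuation(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                // Any other character ends the token collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Punctuation becomes a token; whitespace is dropped.
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Dr, ., Smith, arrived, .]
        System.out.println(splitAtWhitespaceAndPunctuation("Dr. Smith arrived."));
    }
}
```

The sentence-final dot is separated as intended, but the abbreviation "Dr." is also broken into two tokens, which is the kind of error a proper Tokenizer implementation has to avoid.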
In non-segmented languages like Chinese, tokenization is more difficult because words are not separated by whitespace.
Tokenizers can also be used to segment already identified tokens further into more atomic parts to get a deeper understanding. This approach helps more complex tasks gain insight into tokens that do not represent words, such as numbers, units, or tokens that are part of a special notation.
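As a rough illustration of this further segmentation, the following sketch splits a single token wherever a digit is adjacent to a letter, so that a measurement such as "10km" is separated into a number and a unit. It relies only on `String.split` with regular-expression lookarounds from the Java standard library; `SubTokenSketch` and `splitDigitsFromLetters` are hypothetical names, and this is one possible rule rather than the behavior of any OpenNLP implementation.

```java
import java.util.Arrays;

final class SubTokenSketch {

    // Hypothetical helper (not an OpenNLP API): splits at every position
    // where a digit is directly followed by a letter, or a letter by a digit.
    // The lookarounds match zero-width positions, so no characters are lost.
    static String[] splitDigitsFromLetters(String token) {
        return token.split("(?<=\\d)(?=\\p{L})|(?<=\\p{L})(?=\\d)");
    }

    public static void main(String[] args) {
        // prints [10, km]
        System.out.println(Arrays.toString(splitDigitsFromLetters("10km")));
    }
}
```

A token without such a boundary, for example "km", is returned unchanged as a single-element array.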
For most subsequent NLP tasks, it is desirable to over-tokenize rather than to under-tokenize.

Method Summary

Modifier and Type	Method	Description
`String[]`	`tokenize(String s)`	Splits a string into its atomic parts.
`Span[]`	`tokenizePos(String s)`	Finds the boundaries of atomic parts in a string.

- Method Detail
  - tokenize
```
String[] tokenize(String s)
```
    Splits a string into its atomic parts.
    
    Parameters:
    
    s - The string to be tokenized.
    
    Returns:
    
    The String[] with the individual tokens as the array elements.
  - tokenizePos
```
Span[] tokenizePos(String s)
```
    Finds the boundaries of atomic parts in a string.
    
    Parameters:
    
    s - The string to be tokenized.
    
    Returns:
    
    The spans (offsets into the input string s) for each token as the individual array elements.
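The relationship between the two methods can be sketched with a toy whitespace tokenizer written against the Java standard library only. Everything in this block is an assumption made for illustration: `WhitespaceSketch` is a hypothetical class that does not implement the OpenNLP interface, and `Range` is a hypothetical stand-in for a span type that holds half-open character offsets `[start, end)`. The point is that the token strings can be derived from the boundaries by taking substrings of the input.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSketch {

    // Hypothetical stand-in for a span type: half-open offsets [start, end).
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Analogue of tokenizePos: finds the boundaries of whitespace-separated tokens.
    static Range[] tokenizePos(String s) {
        List<Range> ranges = new ArrayList<>();
        int tokenStart = -1; // -1 means "currently not inside a token"
        for (int i = 0; i < s.length(); i++) {
            boolean whitespace = Character.isWhitespace(s.charAt(i));
            if (!whitespace && tokenStart < 0) {
                tokenStart = i; // a new token begins here
            } else if (whitespace && tokenStart >= 0) {
                ranges.add(new Range(tokenStart, i)); // the token ends before i
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {
            ranges.add(new Range(tokenStart, s.length())); // token runs to the end
        }
        return ranges.toArray(new Range[0]);
    }

    // Analogue of tokenize: derives the token strings from the boundaries.
    static String[] tokenize(String s) {
        Range[] ranges = tokenizePos(s);
        String[] tokens = new String[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            tokens[i] = s.substring(ranges[i].start, ranges[i].end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A  short example";
        for (Range r : tokenizePos(text)) {
            System.out.println(r.start + ".." + r.end + " " + text.substring(r.start, r.end));
        }
    }
}
```

Returning offsets rather than strings keeps the link to the original text, which is useful when later processing steps need to point back to exact character positions in the input.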