Package opennlp.tools.tokenize
Class WordpieceTokenizer
java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
- All Implemented Interfaces:
- Tokenizer
A Tokenizer implementation which performs tokenization using word pieces.
 Adapted under MIT license from https://github.com/robrua/easy-bert.
For reference see:
- 
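
Word-piece tokenization is commonly implemented as a greedy longest-match-first search over a vocabulary. The following is a minimal, self-contained sketch of that idea for a single word. It is not the OpenNLP implementation: the `##` continuation prefix, the `[UNK]` marker, and the names `WordpieceSketch` and `splitWord` are assumptions borrowed from BERT conventions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WordpieceSketch {

    // Assumed markers (BERT convention); this page does not specify them.
    static final String CONTINUATION_PREFIX = "##";
    static final String UNKNOWN_TOKEN = "[UNK]";

    /** Splits a single word into the longest vocabulary pieces, left to right. */
    public static List<String> splitWord(String word, Set<String> vocabulary) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that do not begin the word carry the continuation prefix.
                    candidate = CONTINUATION_PREFIX + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position: treat the whole word as unknown.
                return List.of(UNKNOWN_TOKEN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("un", "##aff", "##able", "token", "##izer");
        System.out.println(splitWord("unaffable", vocabulary)); // [un, ##aff, ##able]
        System.out.println(splitWord("tokenizer", vocabulary)); // [token, ##izer]
        System.out.println(splitWord("xyz", vocabulary));       // [[UNK]]
    }
}
```

With OpenNLP on the classpath, the class documented here would be used through the Tokenizer interface instead, along the lines of `new WordpieceTokenizer(vocabulary).tokenize(text)`. Its exact output (for example, any special tokens it adds) may differ from this sketch.
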
Constructor Summary
Constructors
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
Method Summary
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
Constructor Details

WordpieceTokenizer(Set<String> vocabulary)
Parameters:
- vocabulary - A set of tokens considered the vocabulary.

WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
Parameters:
- vocabulary - A set of tokens considered the vocabulary.
- maxTokenLength - A non-negative number that is used as the maximum token length.
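
Both constructors take the vocabulary as a Set<String>. As a minimal sketch, assuming the vocabulary is stored as plain text with one token per line (a common convention for BERT vocabularies, not something this page specifies), the set could be built as follows. The names `VocabularyLoaderSketch` and `parseVocabulary` are hypothetical helpers, not part of OpenNLP.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

class VocabularyLoaderSketch {

    /**
     * Builds a vocabulary set from text with one token per line.
     * Blank lines are skipped and duplicates collapse into one entry.
     */
    public static Set<String> parseVocabulary(String content) {
        return content.lines()
                .map(String::strip)
                .filter(token -> !token.isEmpty())
                // LinkedHashSet keeps the order in which tokens first appear.
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        Set<String> vocabulary = parseVocabulary("[UNK]\nun\n##able\n\nun\n");
        System.out.println(vocabulary); // [[UNK], un, ##able]
    }
}
```

The resulting set could then be passed to either constructor, for example `new WordpieceTokenizer(vocabulary, 50)`, where 50 is an arbitrary illustrative value for maxTokenLength.
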
Method Details

tokenizePos
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
Specified by:
- tokenizePos in interface Tokenizer
Parameters:
- text - The string to be tokenized.
Returns:
- The spans (offsets into the input string) for each token as the individual array elements.
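
To illustrate what "spans as offsets" means, here is a small self-contained sketch that reports token boundaries instead of token strings. It splits on whitespace only, not into word pieces, and it assumes half-open offsets (start inclusive, end exclusive) so that `String.substring` recovers each token. The `Offsets` class is a hypothetical stand-in for OpenNLP's Span type, and `SpanSketch` and `tokenizePositions` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class SpanSketch {

    /** Hypothetical stand-in for a span: start inclusive, end exclusive. */
    public static final class Offsets {
        public final int start;
        public final int end;

        public Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    /** Finds the boundaries of whitespace-separated tokens in the text. */
    public static List<Offsets> tokenizePositions(String text) {
        List<Offsets> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means we are currently between tokens
        for (int i = 0; i <= text.length(); i++) {
            boolean atBoundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!atBoundary && tokenStart < 0) {
                tokenStart = i; // first character of a new token
            } else if (atBoundary && tokenStart >= 0) {
                spans.add(new Offsets(tokenStart, i)); // token ended just before i
                tokenStart = -1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "word  pieces here";
        for (Offsets span : tokenizePositions(text)) {
            // Offsets index into the original text, so substring recovers the token.
            System.out.println(span + " -> " + text.substring(span.start, span.end));
        }
    }
}
```

Returning offsets rather than strings lets a caller map each token back to its exact position in the original input, which is useful for highlighting or for aligning annotations.
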

tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.

getMaxTokenLength
public int getMaxTokenLength()
Returns:
- The maximum token length.