Package opennlp.tools.tokenize
Class WordpieceTokenizer
- java.lang.Object
  - opennlp.tools.tokenize.WordpieceTokenizer

All Implemented Interfaces:
- Tokenizer
 
public class WordpieceTokenizer extends Object implements Tokenizer

A Tokenizer implementation which performs tokenization using word pieces. Adapted under MIT license from https://github.com/robrua/easy-bert. For reference see:
Constructor Summary

Constructors:
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
Method Summary

Methods:
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
Method Detail

tokenizePos

public Span[] tokenizePos(String text)

Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.

Specified by:
- tokenizePos in interface Tokenizer
Parameters:
- text - The string to be tokenized.
Returns:
- The spans (offsets into text) for each token as the individual array elements.
 
tokenize

public String[] tokenize(String text)

Description copied from interface: Tokenizer
Splits a string into its atomic parts.
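
To illustrate what splitting into word pieces usually means, the following is a minimal, self-contained sketch of greedy longest-match-first word-piece splitting. It is not the code of this class. The class name WordpieceSketch, the "##" continuation prefix, the "[UNK]" placeholder, whitespace-only pre-splitting, and the treatment of words longer than maxTokenLength are all assumptions borrowed from the common BERT convention, and the real implementation may differ in any of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of opennlp.tools.tokenize.
class WordpieceSketch {

    // Assumed BERT-style conventions, not confirmed by this page.
    static final String UNKNOWN = "[UNK]";
    static final String CONTINUATION = "##";

    /** Splits each whitespace-separated word into the longest vocabulary pieces, left to right. */
    static List<String> tokenize(String text, Set<String> vocabulary, int maxTokenLength) {
        List<String> pieces = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty or whitespace-only input
            }
            if (word.length() > maxTokenLength) {
                pieces.add(UNKNOWN); // assumption: overly long words are not split
                continue;
            }
            List<String> wordPieces = new ArrayList<>();
            int start = 0;
            boolean matched = true;
            while (start < word.length()) {
                int end = word.length();
                String found = null;
                // Shrink the candidate from the right until it is in the vocabulary.
                while (end > start) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        candidate = CONTINUATION + candidate; // non-initial piece
                    }
                    if (vocabulary.contains(candidate)) {
                        found = candidate;
                        break;
                    }
                    end--;
                }
                if (found == null) {
                    matched = false; // no piece matches at this position
                    break;
                }
                wordPieces.add(found);
                start = end;
            }
            if (matched) {
                pieces.addAll(wordPieces);
            } else {
                pieces.add(UNKNOWN); // the whole word becomes unknown
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("un", "##aff", "##able", "the");
        // prints [the, un, ##aff, ##able, [UNK]]
        System.out.println(tokenize("the unaffable xyz", vocabulary, 50));
    }
}
```

In this sketch a word with no matching piece at some position is replaced as a whole by the unknown placeholder rather than being partially split, which keeps the output aligned one-to-one with in-vocabulary words.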
getMaxTokenLength

public int getMaxTokenLength()

Returns:
- The maximum token length.
 
 