Package opennlp.tools.tokenize
Class WordpieceTokenizer
- java.lang.Object
  - opennlp.tools.tokenize.WordpieceTokenizer
- All Implemented Interfaces: Tokenizer
public class WordpieceTokenizer extends Object implements Tokenizer
A Tokenizer implementation which performs tokenization using word pieces. Adapted under MIT license from https://github.com/robrua/easy-bert.
For reference see:
-
-
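The word-piece approach named in the class description can be illustrated with a minimal, self-contained sketch of greedy longest-match-first splitting of a single word. This is not the source of WordpieceTokenizer. The class name WordpieceSketch, the method tokenizeWord, the "##" continuation prefix, and the "[UNK]" placeholder are assumptions for illustration, following common BERT conventions; the behaviour of the real class may differ in its details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical names, not the OpenNLP implementation.
final class WordpieceSketch {

  // Assumed BERT-style conventions; this page does not state them.
  static final String UNKNOWN = "[UNK]";
  static final String CONTINUATION = "##";

  // Splits one whitespace-free word into the longest vocabulary pieces, left to right.
  static List<String> tokenizeWord(String word, Set<String> vocabulary, int maxTokenLength) {
    List<String> pieces = new ArrayList<>();

    // Words longer than the limit are not split at all.
    if (word.length() > maxTokenLength) {
      pieces.add(UNKNOWN);
      return pieces;
    }

    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;

      // Shrink the candidate from the right until it is found in the vocabulary.
      while (start < end) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          // Pieces that do not begin the word carry the continuation prefix.
          candidate = CONTINUATION + candidate;
        }
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }

      if (match == null) {
        // No piece fits at this position: treat the whole word as unknown.
        pieces.clear();
        pieces.add(UNKNOWN);
        return pieces;
      }

      pieces.add(match);
      start = end;
    }
    return pieces;
  }
}
```

With the vocabulary {"token", "##ize", "##r"} and a limit of 50, calling WordpieceSketch.tokenizeWord("tokenizer", vocabulary, 50) yields [token, ##ize, ##r]. A word with no matching piece, or one longer than the limit, yields [[UNK]].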
Constructor Summary
Constructors
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
-
Method Summary
Modifier and Type, Method, Description
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
Method Detail
-
tokenizePos
public Span[] tokenizePos(String text)
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
- Specified by: tokenizePos in interface Tokenizer
- Parameters: text - The string to be tokenized.
- Returns: The spans (offsets into text) for each token, as the individual array elements.
-
tokenize
public String[] tokenize(String text)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
-
getMaxTokenLength
public int getMaxTokenLength()
- Returns: The maximum token length.