Package opennlp.tools.tokenize
Class WordpieceTokenizer
java.lang.Object
  opennlp.tools.tokenize.WordpieceTokenizer

All Implemented Interfaces:
  Tokenizer
public class WordpieceTokenizer extends Object implements Tokenizer
A WordPiece tokenizer. Adapted from https://github.com/robrua/easy-bert under the MIT license. For reference see:
- https://www.tensorflow.org/text/guide/subwords_tokenizer#applying_wordpiece
- https://cran.r-project.org/web/packages/wordpiece/vignettes/basic_usage.html
Constructor Summary
Constructors:
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
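
Example (a minimal sketch, not part of the OpenNLP distribution): constructing the tokenizer with a small in-memory vocabulary. The token strings and the maxTokenLength value below are illustrative assumptions; in practice the vocabulary is typically loaded from a BERT-style vocab file.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import opennlp.tools.tokenize.WordpieceTokenizer;

public class WordpieceConstructionExample {

  public static void main(String[] args) {
    // Toy vocabulary (illustrative only); a real WordPiece vocabulary is
    // typically loaded from the model's vocab file and contains whole words
    // plus "##"-prefixed continuation pieces.
    Set<String> vocabulary = new HashSet<>(Arrays.asList(
        "[UNK]", "token", "##ization", "is", "fun"));

    // Construct with the default maximum token length ...
    WordpieceTokenizer tokenizer = new WordpieceTokenizer(vocabulary);

    // ... or supply an explicit maxTokenLength (the value 50 is an
    // assumption made for this sketch, not a recommended setting).
    WordpieceTokenizer boundedTokenizer = new WordpieceTokenizer(vocabulary, 50);

    System.out.println(tokenizer.getMaxTokenLength());
    System.out.println(boundedTokenizer.getMaxTokenLength());
  }
}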
Method Summary
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
Method Detail
tokenizePos
public Span[] tokenizePos(String text)
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
Specified by:
  tokenizePos in interface Tokenizer
Parameters:
  text - The string to be tokenized.
Returns:
  The Span[] with the spans (offsets into text) for each token as the individual array elements.
tokenize
public String[] tokenize(String text)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
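
Example (a minimal sketch, not part of the OpenNLP distribution): tokenizing a sentence with the same hypothetical toy vocabulary used in the constructor example above. The pieces produced depend entirely on the vocabulary supplied, so the output noted in the comment is indicative rather than guaranteed.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import opennlp.tools.tokenize.WordpieceTokenizer;

public class WordpieceTokenizeExample {

  public static void main(String[] args) {
    // Same hypothetical toy vocabulary as in the constructor example.
    Set<String> vocabulary = new HashSet<>(Arrays.asList(
        "[UNK]", "token", "##ization", "is", "fun"));

    WordpieceTokenizer tokenizer = new WordpieceTokenizer(vocabulary);

    // Words that are not whole vocabulary entries are split into subword
    // pieces; continuation pieces conventionally carry a "##" prefix, so
    // the output may look roughly like [token, ##ization, is, fun].
    String[] pieces = tokenizer.tokenize("tokenization is fun");
    System.out.println(Arrays.toString(pieces));
  }
}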
getMaxTokenLength
public int getMaxTokenLength()