Class WordpieceTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class WordpieceTokenizer
    extends Object
    implements Tokenizer
    A WordPiece tokenizer. Adapted from https://github.com/robrua/easy-bert under the MIT license.
    For reference, see:
    - https://www.tensorflow.org/text/guide/subwords_tokenizer#applying_wordpiece
    - https://cran.r-project.org/web/packages/wordpiece/vignettes/basic_usage.html
    • Constructor Detail

      • WordpieceTokenizer

        public WordpieceTokenizer(Set<String> vocabulary)
      • WordpieceTokenizer

        public WordpieceTokenizer(Set<String> vocabulary,
                                  int maxTokenLength)
    • Method Detail

      • tokenizePos

        public Span[] tokenizePos(String text)
        Description copied from interface: Tokenizer
        Finds the boundaries of atomic parts in a string.
        Specified by:
        tokenizePos in interface Tokenizer
        Parameters:
        text - The string to be tokenized.
        Returns:
        The Span[] with the spans (offsets into text) for each token as the individual array elements.
      • tokenize

        public String[] tokenize(String text)
        Description copied from interface: Tokenizer
        Splits a string into its atomic parts.
        Specified by:
        tokenize in interface Tokenizer
        Parameters:
        text - The string to be tokenized.
        Returns:
        The String[] with the individual tokens as the array elements.
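
        For orientation, the following is a minimal, self-contained sketch of the greedy
        longest-match-first splitting generally associated with WordPiece, as described in
        the references linked from the class description. It is not this class's
        implementation. The class name WordpieceSketch, the "[UNK]" unknown token, the "##"
        continuation prefix, whitespace pre-splitting, and the reading of maxTokenLength as
        a per-word character limit are all assumptions made for illustration; this page does
        not specify how the real class treats special tokens, punctuation, or casing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of the documented API.
class WordpieceSketch {

    // Assumed conventions for this sketch only.
    static final String UNKNOWN = "[UNK]";
    static final String CONTINUATION = "##";

    /** Greedy longest-match-first split of a single word into vocabulary pieces. */
    static List<String> splitWord(String word, Set<String> vocabulary, int maxTokenLength) {
        // Assumption: words longer than the limit map to the unknown token.
        if (word.length() > maxTokenLength) {
            return List.of(UNKNOWN);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Non-initial pieces carry the continuation prefix.
                    candidate = CONTINUATION + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position: the whole word becomes unknown.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Splits text on whitespace, then splits each word into pieces. */
    static String[] tokenize(String text, Set<String> vocabulary, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word, vocabulary, maxTokenLength));
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("token", "##izer", "##s", "word", "##piece");
        // Prints: word ##piece token ##izer ##s
        System.out.println(String.join(" ", tokenize("wordpiece tokenizers", vocabulary, 50)));
    }
}
```

        In this sketch, a word with no matching piece at some position collapses entirely
        to the unknown token rather than yielding a partial split, which is one common
        convention; other variants are possible.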
      • getMaxTokenLength

        public int getMaxTokenLength()