Class WordpieceTokenizer

    • Constructor Detail

      • WordpieceTokenizer

        public WordpieceTokenizer(Set<String> vocabulary)
        Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
        Parameters:
        vocabulary - A set of tokens considered the vocabulary.
      • WordpieceTokenizer

        public WordpieceTokenizer(Set<String> vocabulary,
                                  int maxTokenLength)
        Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
        Parameters:
        vocabulary - A set of tokens considered the vocabulary.
        maxTokenLength - A non-negative number used as the maximum token length.
    • Method Detail

      • tokenizePos

        public Span[] tokenizePos(String text)
        Description copied from interface: Tokenizer
        Finds the boundaries of atomic parts in a string.
        Specified by:
        tokenizePos in interface Tokenizer
        Parameters:
        text - The string to be tokenized.
        Returns:
        The spans (offsets into text) for each token as the individual array elements.
      • tokenize

        public String[] tokenize(String text)
        Description copied from interface: Tokenizer
        Splits a string into its atomic parts.
        Specified by:
        tokenize in interface Tokenizer
        Parameters:
        text - The string to be tokenized.
        Returns:
        The String[] with the individual tokens as the array elements.
      • getMaxTokenLength

        public int getMaxTokenLength()
        Returns:
        The maximum token length.
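
The page above documents only the signatures, so the following is a minimal, self-contained sketch of the greedy longest-match-first WordPiece technique, to illustrate how a vocabulary and a maximum token length can interact. It is not the library's implementation. The class name WordpieceSketch, the "##" continuation prefix, the "[UNK]" placeholder, and the whitespace pre-splitting are all assumptions made for illustration, and the real class may differ (for example in how it handles punctuation or special tokens).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of the documented API.
final class WordpieceSketch {

    // Assumed conventions, borrowed from common WordPiece vocabularies.
    static final String UNKNOWN = "[UNK]";
    static final String CONTINUATION = "##";

    // Splits text on whitespace, then breaks each word into the longest
    // vocabulary entries that match from left to right.
    static List<String> tokenize(String text, Set<String> vocabulary, int maxTokenLength) {
        List<String> pieces = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            if (word.length() > maxTokenLength) {
                pieces.add(UNKNOWN); // assumption: overlong words are not split
                continue;
            }
            List<String> wordPieces = new ArrayList<>();
            boolean matched = true;
            int start = 0;
            while (start < word.length()) {
                int end = word.length();
                String found = null;
                // Shrink the candidate from the right until it is in the vocabulary.
                while (end > start) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        candidate = CONTINUATION + candidate;
                    }
                    if (vocabulary.contains(candidate)) {
                        found = candidate;
                        break;
                    }
                    end--;
                }
                if (found == null) {
                    matched = false; // no prefix matched at this position
                    break;
                }
                wordPieces.add(found);
                start = end;
            }
            if (matched) {
                pieces.addAll(wordPieces);
            } else {
                pieces.add(UNKNOWN); // the whole word becomes unknown
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("un", "##aff", "##able");
        // prints [un, ##aff, ##able]
        System.out.println(tokenize("unaffable", vocabulary, 50));
    }
}
```

In this sketch a word that cannot be fully covered by vocabulary entries is replaced as a whole by the unknown placeholder, and the maxTokenLength check runs before any matching, which bounds the cost of the inner loop per word.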