Class WordpieceTokenizer

java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
All Implemented Interfaces:
Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer
  • Constructor Details

    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary)
      Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
      Parameters:
      vocabulary - The set of tokens that make up the vocabulary.
    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
      Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
      Parameters:
      vocabulary - The set of tokens that make up the vocabulary.
      maxTokenLength - A non-negative number used as the maximum token length.
  • Method Details

    • tokenizePos

      public Span[] tokenizePos(String text)
      Description copied from interface: Tokenizer
      Finds the boundaries of atomic parts in a string.
      Specified by:
      tokenizePos in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The spans (start and end offsets into text) for each token, as the individual array elements.
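
To make the idea of token boundaries as offsets concrete, here is a minimal, self-contained sketch. It does not use OpenNLP: the `Offsets` class is a hypothetical stand-in for `Span`, it splits on whitespace only, and it assumes the start offset is inclusive and the end offset is exclusive. It is not a description of how WordpieceTokenizer computes its spans.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {

  /** Hypothetical stand-in for a span: start is inclusive, end is exclusive (assumed). */
  static final class Offsets {
    final int start;
    final int end;

    Offsets(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Returns the offsets of every maximal run of non-whitespace characters in text. */
  static List<Offsets> whitespaceSpans(String text) {
    List<Offsets> spans = new ArrayList<>();
    int tokenStart = -1; // -1 means "not currently inside a token"
    for (int i = 0; i < text.length(); i++) {
      boolean whitespace = Character.isWhitespace(text.charAt(i));
      if (!whitespace && tokenStart < 0) {
        tokenStart = i; // a token begins here
      } else if (whitespace && tokenStart >= 0) {
        spans.add(new Offsets(tokenStart, i)); // the token ended just before i
        tokenStart = -1;
      }
    }
    if (tokenStart >= 0) {
      spans.add(new Offsets(tokenStart, text.length())); // token runs to the end of text
    }
    return spans;
  }

  public static void main(String[] args) {
    String text = "Hello  world";
    for (Offsets s : whitespaceSpans(text)) {
      // The token text can be recovered from the offsets with substring.
      System.out.println(s.start + ".." + s.end + " " + text.substring(s.start, s.end));
    }
  }
}
```

Because the offsets point back into the original string, callers can recover each token with `text.substring(start, end)` without the tokenizer copying any text.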
    • tokenize

      public String[] tokenize(String text)
      Description copied from interface: Tokenizer
      Splits a string into its atomic parts.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The String[] with the individual tokens as the array elements.
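
As an illustration of what "atomic parts" can mean for a word-piece vocabulary, the following self-contained sketch shows the commonly described greedy longest-match-first WordPiece scheme. It is not taken from the OpenNLP source. The `##` continuation prefix, the `[UNK]` placeholder, the whitespace-only pre-splitting, and the handling of words longer than `maxTokenLength` are all assumptions made for this example, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

  // Assumed BERT-style conventions; this page does not confirm them.
  static final String UNKNOWN = "[UNK]";
  static final String CONTINUATION = "##";

  /** Splits one whitespace-free word into pieces by greedy longest match. */
  static List<String> splitWord(String word, Set<String> vocabulary, int maxTokenLength) {
    if (word.length() > maxTokenLength) {
      return List.of(UNKNOWN); // assumption: overly long words become unknown
    }
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Try the longest remaining substring first, then shrink from the right.
      while (start < end) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = CONTINUATION + candidate; // non-initial pieces carry the prefix
        }
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        return List.of(UNKNOWN); // no piece fits, so the whole word is unknown
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  /** Splits text on whitespace, then splits each word into word pieces. */
  static String[] tokenize(String text, Set<String> vocabulary, int maxTokenLength) {
    List<String> tokens = new ArrayList<>();
    for (String word : text.trim().split("\\s+")) {
      if (!word.isEmpty()) {
        tokens.addAll(splitWord(word, vocabulary, maxTokenLength));
      }
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    Set<String> vocabulary = Set.of("token", "##izer", "##s", "word", "##piece");
    // prints: word ##piece token ##izer ##s
    System.out.println(String.join(" ", tokenize("wordpiece tokenizers", vocabulary, 50)));
  }
}
```

The sketch takes the vocabulary and maximum length as method arguments to stay self-contained; with the class documented here, the same two values are supplied through the constructor instead.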
    • getMaxTokenLength

      public int getMaxTokenLength()
      Returns:
      The maximum token length.