Class WordpieceTokenizer

java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
All Implemented Interfaces:
Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer
  • Constructor Details

    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary)
      Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
      Parameters:
      vocabulary - A set of tokens considered the vocabulary.
    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
      Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
      Parameters:
      vocabulary - A set of tokens considered the vocabulary.
      maxTokenLength - A non-negative number used as the maximum token length.
  • Method Details

    • tokenizePos

      public Span[] tokenizePos(String text)
      Description copied from interface: Tokenizer
      Finds the boundaries of atomic parts in a string.
      Specified by:
      tokenizePos in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The spans (offsets into text) for each token, as the individual array elements.
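To illustrate what "offsets into the text" means for a Tokenizer in general, here is a minimal, self-contained sketch. It does not use the OpenNLP classes: `Offsets` is a hypothetical stand-in for a span (start inclusive, end exclusive, which is an assumption of this sketch), and `whitespaceSpans` is a simple whitespace splitter rather than the wordpiece logic of this class.

```java
import java.util.ArrayList;
import java.util.List;

class SpanSketch {

    /** Hypothetical stand-in for a span: start inclusive, end exclusive. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the character offsets of each whitespace-delimited token in text. */
    static List<Offsets> whitespaceSpans(String text) {
        List<Offsets> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace that separates tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the current token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Offsets(start, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "open  nlp tools";
        for (Offsets s : whitespaceSpans(text)) {
            // Each token can be recovered from the input by its offsets.
            System.out.println(s.start + "-" + s.end + ": " + text.substring(s.start, s.end));
        }
    }
}
```

The point of returning offsets rather than strings is that the caller can map each token back to its exact position in the original input.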
    • tokenize

      public String[] tokenize(String text)
      Description copied from interface: Tokenizer
      Splits a string into its atomic parts.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The String[] with the individual tokens as the array elements.
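This page does not describe how the vocabulary and maxTokenLength are applied during tokenization. As a rough illustration only, the sketch below shows the greedy longest-match-first procedure commonly associated with WordPiece tokenization. The class name `WordpieceSketch`, the method `splitWord`, the `##` continuation prefix, and the `[UNK]` placeholder are all assumptions taken from BERT-style conventions, not confirmed behavior of this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WordpieceSketch {

    // Assumed BERT-style conventions; not confirmed by this page.
    static final String CONTINUATION_PREFIX = "##";
    static final String UNKNOWN_TOKEN = "[UNK]";

    /** Splits one whitespace-free word into pieces by greedy longest-match-first lookup. */
    static List<String> splitWord(String word, Set<String> vocabulary, int maxTokenLength) {
        // In this sketch, words longer than the limit are not split at all.
        if (word.length() > maxTokenLength) {
            return List.of(UNKNOWN_TOKEN);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that do not begin the word carry the continuation prefix.
                    candidate = CONTINUATION_PREFIX + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing at this position is in the vocabulary: treat the whole word as unknown.
                return List.of(UNKNOWN_TOKEN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("token", "##izer", "##s");
        System.out.println(splitWord("tokenizers", vocabulary, 50)); // [token, ##izer, ##s]
        System.out.println(splitWord("xyz", vocabulary, 50));        // [UNK]
        System.out.println(splitWord("tokenizers", vocabulary, 5));  // [UNK]
    }
}
```

In this sketch the length limit acts as a cheap guard: very long inputs are mapped to the unknown placeholder without running the inner matching loop.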
    • getMaxTokenLength

      public int getMaxTokenLength()
      Returns:
      The maximum token length.