java.lang.Object

opennlp.tools.tokenize.WordpieceTokenizer

All Implemented Interfaces: Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer

A Tokenizer implementation which performs tokenization using word pieces.

Adapted under MIT license from https://github.com/robrua/easy-bert.
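Word-piece tokenization is commonly implemented as a greedy longest-match-first split of each word against the vocabulary. The following is a minimal sketch of that general technique, not the code of this class: the class name `WordpieceSketch`, the `##` continuation prefix, the `[UNK]` placeholder, and the handling of over-long words are BERT-style assumptions that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the OpenNLP implementation.
final class WordpieceSketch {

  // Assumed BERT-style conventions (not stated on this page).
  static final String UNKNOWN = "[UNK]";    // stands in for a word that cannot be split
  static final String CONTINUATION = "##";  // marks a piece that does not start a word

  // Splits one word into the longest matching vocabulary pieces, left to right.
  static List<String> splitWord(String word, Set<String> vocabulary, int maxTokenLength) {
    if (word.length() > maxTokenLength) {
      // Assumption: words longer than the limit are not split at all.
      return List.of(UNKNOWN);
    }
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Shrink the candidate from the right until it is found in the vocabulary.
      while (end > start) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = CONTINUATION + candidate;
        }
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // No piece fits at this position, so the whole word is unknown.
        return List.of(UNKNOWN);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  // Splits the text on whitespace, then splits every word into pieces.
  static String[] tokenize(String text, Set<String> vocabulary, int maxTokenLength) {
    List<String> tokens = new ArrayList<>();
    for (String word : text.trim().split("\\s+")) {
      if (!word.isEmpty()) {
        tokens.addAll(splitWord(word, vocabulary, maxTokenLength));
      }
    }
    return tokens.toArray(new String[0]);
  }
}
```

With the hypothetical vocabulary `un`, `##aff`, `##able`, the word `unaffable` is split into those three pieces, while a word with no matching pieces collapses to the placeholder.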

For reference see:

Constructor Summary

Constructors

Constructor

Description

WordpieceTokenizer(Set<String> vocabulary)

Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.

WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)

Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
Method Summary

Modifier and Type

Method

Description

int

getMaxTokenLength()

String[]

tokenize(String text)

Splits a string into its atomic parts.

Span[]

tokenizePos(String text)

Finds the boundaries of atomic parts in a string.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- WordpieceTokenizer
  
  public WordpieceTokenizer(Set<String> vocabulary)
  
  Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
  
  Parameters:
  
  vocabulary - A set of tokens considered the vocabulary.
- WordpieceTokenizer
  
  public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
  
  Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
  
  Parameters:
  
  vocabulary - A set of tokens considered the vocabulary.
  
  maxTokenLength - A non-negative number that is used as the maximum token length.
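The relationship between the two constructors can be sketched as follows. This is a minimal illustration rather than the OpenNLP source: the class name `WordpieceConfigSketch`, the helper `vocabularyFromLines`, and the rejection of negative lengths are assumptions made for the example. Only the default of 50 and the `Set<String>` vocabulary come from the descriptions above.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; mirrors the documented constructor pair.
final class WordpieceConfigSketch {

  // Default taken from the constructor description above.
  static final int DEFAULT_MAX_TOKEN_LENGTH = 50;

  private final Set<String> vocabulary;
  private final int maxTokenLength;

  // One-argument form: delegates with the default maximum token length.
  WordpieceConfigSketch(Set<String> vocabulary) {
    this(vocabulary, DEFAULT_MAX_TOKEN_LENGTH);
  }

  // Two-argument form: the caller chooses the maximum token length.
  WordpieceConfigSketch(Set<String> vocabulary, int maxTokenLength) {
    // Assumption: a negative length is rejected; the real class may behave differently.
    if (maxTokenLength < 0) {
      throw new IllegalArgumentException("maxTokenLength must be non-negative");
    }
    this.vocabulary = Collections.unmodifiableSet(new LinkedHashSet<>(vocabulary));
    this.maxTokenLength = maxTokenLength;
  }

  int getMaxTokenLength() {
    return maxTokenLength;
  }

  Set<String> getVocabulary() {
    return vocabulary;
  }

  // Hypothetical helper: builds a vocabulary from lines holding one token each,
  // trimming whitespace and skipping blank lines.
  static Set<String> vocabularyFromLines(List<String> lines) {
    Set<String> tokens = new LinkedHashSet<>();
    for (String line : lines) {
      String token = line.trim();
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }
}
```

With the documented class, the analogous calls would be `new WordpieceTokenizer(vocabulary)` and `new WordpieceTokenizer(vocabulary, 100)`.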
Method Details
- tokenizePos
  
  public Span[] tokenizePos(String text)
  
  Description copied from interface: Tokenizer
  
  Finds the boundaries of atomic parts in a string.
  
  Specified by:
  
  tokenizePos in interface Tokenizer
  
  Parameters:
  
  text - The string to be tokenized.
  
  Returns:
  
  The spans (offsets into text) for each token as the individual array elements.
- tokenize
  
  public String[] tokenize(String text)
  
  Description copied from interface: Tokenizer
  
  Splits a string into its atomic parts.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  text - The string to be tokenized.
  
  Returns:
  
  The String[] with the individual tokens as the array elements.
- getMaxTokenLength
  
  public int getMaxTokenLength()
  
  Returns:
  
  The maximum token length.
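The "boundaries of atomic parts" returned by tokenizePos are offsets into the input string. The sketch below illustrates that idea for plain whitespace-delimited tokens only. It is an assumption-laden illustration, not this class's behavior: `SpanSketch` and `Offsets` are hypothetical stand-ins for the library's `Span` type, and the convention of an inclusive start and exclusive end is assumed here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the OpenNLP implementation.
final class SpanSketch {

  // Hypothetical stand-in for a span: start is inclusive, end is exclusive.
  static final class Offsets {
    final int start;
    final int end;

    Offsets(int start, int end) {
      this.start = start;
      this.end = end;
    }

    // Returns the part of the text that this span covers.
    String coveredText(String text) {
      return text.substring(start, end);
    }
  }

  // Finds the offsets of every run of non-whitespace characters.
  static List<Offsets> whitespaceSpans(String text) {
    List<Offsets> spans = new ArrayList<>();
    int tokenStart = -1; // -1 means "not inside a token"
    for (int i = 0; i < text.length(); i++) {
      boolean whitespace = Character.isWhitespace(text.charAt(i));
      if (!whitespace && tokenStart < 0) {
        tokenStart = i; // a token begins here
      } else if (whitespace && tokenStart >= 0) {
        spans.add(new Offsets(tokenStart, i)); // the token ended before i
        tokenStart = -1;
      }
    }
    if (tokenStart >= 0) {
      spans.add(new Offsets(tokenStart, text.length())); // token runs to the end
    }
    return spans;
  }
}
```

Each span can be turned back into its token text with `coveredText`, which is the link between a span array and the corresponding token array.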
