Package opennlp.tools.tokenize
Class WordpieceTokenizer
java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
- All Implemented Interfaces:
Tokenizer
A Tokenizer implementation which performs tokenization using word pieces.
Adapted under MIT license from https://github.com/robrua/easy-bert.
For reference see:
-
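To illustrate what tokenization using word pieces means, here is a minimal, self-contained sketch of greedy longest-match-first splitting of a single word. It is not the code of this class. The names WordpieceSketch and splitWord are hypothetical, and the "##" continuation prefix and "[UNK]" fallback token are assumptions borrowed from common BERT conventions rather than from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: a sketch of the general technique, not the OpenNLP implementation. */
final class WordpieceSketch {

    // Assumed BERT-style conventions; not taken from the documentation above.
    static final String CONTINUATION_PREFIX = "##";
    static final String UNKNOWN_TOKEN = "[UNK]";

    /** Splits one whitespace-free word into word pieces by greedy longest match. */
    static List<String> splitWord(String word, Set<String> vocabulary) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that do not begin the word carry the continuation prefix.
                    candidate = CONTINUATION_PREFIX + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position, so the whole word is treated as unknown.
                return List.of(UNKNOWN_TOKEN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("un", "##aff", "##able", "token");
        System.out.println(splitWord("unaffable", vocabulary)); // [un, ##aff, ##able]
        System.out.println(splitWord("token", vocabulary));     // [token]
        System.out.println(splitWord("tokens", vocabulary));    // [[UNK]]
    }
}
```

In this sketch a word that cannot be fully covered by vocabulary entries collapses to a single unknown token. Whether this class behaves the same way is not stated on this page.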
Constructor Summary
Constructors:
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
Method Summary
Modifier and type, method, description:
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
-
Constructor Details
-
WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary)
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
-
WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
maxTokenLength - A non-negative number that is used as the maximum token length.
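The relationship between the two constructors and getMaxTokenLength() can be pictured with the hypothetical class below. It is a sketch under stated assumptions, not the library's code. The class name MaxLengthSketch, the default of 50, the rejection of negative values, and the inclusive length check in accepts are all illustrative choices that this page does not confirm.

```java
import java.util.Set;

/** Hypothetical stand-in showing one way the two constructors could relate. */
final class MaxLengthSketch {

    // Assumed default; the documentation above does not state one.
    static final int ASSUMED_DEFAULT_MAX_TOKEN_LENGTH = 50;

    private final Set<String> vocabulary;
    private final int maxTokenLength;

    /** Uses the assumed default maximum token length. */
    MaxLengthSketch(Set<String> vocabulary) {
        this(vocabulary, ASSUMED_DEFAULT_MAX_TOKEN_LENGTH);
    }

    /** Stores the vocabulary and an explicit, non-negative maximum token length. */
    MaxLengthSketch(Set<String> vocabulary, int maxTokenLength) {
        if (maxTokenLength < 0) {
            // One way to enforce the documented "non-negative" requirement.
            throw new IllegalArgumentException("maxTokenLength must be non-negative");
        }
        this.vocabulary = Set.copyOf(vocabulary);
        this.maxTokenLength = maxTokenLength;
    }

    int getMaxTokenLength() {
        return maxTokenLength;
    }

    /** True if the token is short enough (inclusive limit) and present in the vocabulary. */
    boolean accepts(String token) {
        return token.length() <= maxTokenLength && vocabulary.contains(token);
    }

    public static void main(String[] args) {
        MaxLengthSketch strict = new MaxLengthSketch(Set.of("token"), 3);
        System.out.println(strict.getMaxTokenLength()); // 3
        System.out.println(strict.accepts("token"));    // false: 5 characters exceed the limit of 3
    }
}
```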
-
-
Method Details
-
tokenizePos
public Span[] tokenizePos(String text)
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
- Specified by:
tokenizePos in interface Tokenizer
- Parameters:
text - The string to be tokenized.
- Returns:
The spans (offsets into text) for each token as the individual array elements.
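To make "spans as offsets into the string" concrete, the sketch below computes start and end offsets for each token of an input string. For brevity it splits on whitespace instead of using word pieces, and it does not use the library's Span type. The classes SpanSketch and Offsets are hypothetical, and the convention of an inclusive start and an exclusive end is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: whitespace-based offsets, not word piece boundaries. */
final class SpanSketch {

    /** Hypothetical stand-in for a span: start inclusive, end exclusive. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Returns the part of the text that this span covers. */
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /** Returns the character offsets of each whitespace-delimited token in text. */
    static List<Offsets> whitespaceSpans(String text) {
        List<Offsets> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means "currently not inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && tokenStart < 0) {
                tokenStart = i; // a token begins here
            } else if (whitespace && tokenStart >= 0) {
                spans.add(new Offsets(tokenStart, i)); // the token ended before i
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {
            // The last token runs to the end of the text.
            spans.add(new Offsets(tokenStart, text.length()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "word  pieces";
        for (Offsets span : whitespaceSpans(text)) {
            System.out.println(span.start + ".." + span.end + " " + span.coveredText(text));
        }
    }
}
```

Because each span indexes into the original string, the token text can always be recovered with a substring call, which is what coveredText does here.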
-
tokenize
public String[] tokenize(String text)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
getMaxTokenLength
public int getMaxTokenLength()
- Returns:
The maximum token length.
-