Package opennlp.tools.tokenize
Class WordpieceTokenizer
java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
- All Implemented Interfaces:
Tokenizer
A Tokenizer implementation which performs tokenization using word pieces.
Adapted under MIT license from https://github.com/robrua/easy-bert.
For reference see:
-
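As a rough illustration of what tokenization "using word pieces" means, the sketch below splits a single word greedily into the longest pieces found in a vocabulary. It is not the OpenNLP implementation: the class `WordpieceSketch`, the `##` continuation prefix, and the `[UNK]` placeholder are assumptions borrowed from common BERT-style conventions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the OpenNLP implementation. */
final class WordpieceSketch {

  static final String UNKNOWN = "[UNK]";    // assumed placeholder for unmatched words
  static final String CONTINUATION = "##";  // assumed prefix for non-initial pieces

  /** Splits one word into the longest vocabulary pieces, left to right. */
  static List<String> split(String word, Set<String> vocabulary) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      // Shrink the candidate from the right until it is in the vocabulary.
      while (end > start) {
        String candidate = word.substring(start, end);
        if (start > 0) {
          candidate = CONTINUATION + candidate;
        }
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // No piece fits at this position: treat the whole word as unknown.
        return List.of(UNKNOWN);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    Set<String> vocabulary = Set.of("un", "##aff", "##able", "token");
    System.out.println(split("unaffable", vocabulary)); // [un, ##aff, ##able]
    System.out.println(split("xyz", vocabulary));       // [[UNK]]
  }
}
```

The greedy longest-match strategy keeps the sketch short; a real tokenizer may also normalize text, split punctuation, or add special tokens, none of which is shown here.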
Constructor Summary
Constructors
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
-
Method Summary
- int getMaxTokenLength()
- String[] tokenize(String text): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Finds the boundaries of atomic parts in a string.
-
Constructor Details
-
WordpieceTokenizer
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
-
WordpieceTokenizer
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
maxTokenLength - A non-negative number that is used as the maximum token length.
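The documentation states only that maxTokenLength is a non-negative maximum token length; it does not say what happens to longer input. The sketch below shows one plausible reading, in which any whitespace-separated word longer than the cap is replaced by an unknown-token placeholder. The class `MaxLengthSketch`, the `[UNK]` placeholder, and the replacement behavior are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; behavior for over-long words is an assumption. */
final class MaxLengthSketch {

  static final String UNKNOWN = "[UNK]"; // assumed placeholder

  private final int maxTokenLength;

  MaxLengthSketch(int maxTokenLength) {
    if (maxTokenLength < 0) {
      throw new IllegalArgumentException("maxTokenLength must be non-negative");
    }
    this.maxTokenLength = maxTokenLength;
  }

  int getMaxTokenLength() {
    return maxTokenLength;
  }

  /** Keeps words up to the cap and replaces longer ones with UNKNOWN. */
  List<String> capWords(String text) {
    List<String> result = new ArrayList<>();
    for (String word : text.trim().split("\\s+")) {
      if (word.isEmpty()) {
        continue; // blank input yields a single empty string from split
      }
      result.add(word.length() > maxTokenLength ? UNKNOWN : word);
    }
    return result;
  }

  public static void main(String[] args) {
    MaxLengthSketch sketch = new MaxLengthSketch(5);
    System.out.println(sketch.getMaxTokenLength());               // 5
    System.out.println(sketch.capWords("short extraordinarily")); // [short, [UNK]]
  }
}
```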
-
-
Method Details
-
tokenizePos
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
- Specified by:
tokenizePos in interface Tokenizer
- Parameters:
text - The string to be tokenized.
- Returns:
The spans (offsets into text) for each token as the individual array elements.
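To make the idea of spans as offsets into the input concrete, the sketch below computes start and end offsets for whitespace-separated tokens and recovers each token with `substring`. The class `SpanSketch` and its nested `Offsets` type are hypothetical stand-ins, not the library's Span class, and the end-exclusive convention is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; Offsets is a hypothetical stand-in for a span. */
final class SpanSketch {

  /** Start offset is inclusive, end offset is exclusive (assumed convention). */
  static final class Offsets {
    final int start;
    final int end;

    Offsets(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Returns the offsets of each run of non-whitespace characters. */
  static List<Offsets> whitespaceSpans(String text) {
    List<Offsets> spans = new ArrayList<>();
    int start = -1; // -1 means "not currently inside a token"
    for (int i = 0; i < text.length(); i++) {
      boolean whitespace = Character.isWhitespace(text.charAt(i));
      if (!whitespace && start < 0) {
        start = i;
      } else if (whitespace && start >= 0) {
        spans.add(new Offsets(start, i));
        start = -1;
      }
    }
    if (start >= 0) {
      spans.add(new Offsets(start, text.length()));
    }
    return spans;
  }

  public static void main(String[] args) {
    String text = "word pieces";
    for (Offsets span : whitespaceSpans(text)) {
      // Each span maps back to its token via substring.
      System.out.println(span.start + ".." + span.end + " "
          + text.substring(span.start, span.end));
    }
  }
}
```

Returning offsets rather than strings lets a caller map tokens back to their exact position in the original text.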
-
tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
-
getMaxTokenLength
public int getMaxTokenLength()
- Returns:
The maximum token length.
-