WordpieceTokenizer (Apache OpenNLP Tools 2.1.0 API)

java.lang.Object
- opennlp.tools.tokenize.WordpieceTokenizer

All Implemented Interfaces:

Tokenizer
```
public class WordpieceTokenizer
extends Object
implements Tokenizer
```
A WordPiece tokenizer. Adapted from https://github.com/robrua/easy-bert under the MIT license.

For reference, see:
- https://www.tensorflow.org/text/guide/subwords_tokenizer#applying_wordpiece
- https://cran.r-project.org/web/packages/wordpiece/vignettes/basic_usage.html
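The references above describe WordPiece as a greedy, longest-match-first split of each word against a fixed vocabulary. The following is a minimal, self-contained sketch of that idea and is not the code of this class. The class name `WordPieceSketch`, the `[UNK]` placeholder, the `##` continuation prefix, and the rule that over-long words map to the placeholder are all assumptions borrowed from the common BERT convention; this page does not confirm how `WordpieceTokenizer` handles any of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical class, not part of OpenNLP.
final class WordPieceSketch {

    static final String UNKNOWN = "[UNK]";    // assumed placeholder for unmatched words
    static final String CONTINUATION = "##";  // assumed prefix for non-initial pieces

    // Splits one word into vocabulary pieces, always taking the longest match first.
    static List<String> splitWord(String word, Set<String> vocabulary, int maxTokenLength) {
        // Assumption: words longer than the limit are not split at all.
        if (word.length() > maxTokenLength) {
            return List.of(UNKNOWN);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = CONTINUATION + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            // No piece matched at this position: the whole word is unknown.
            if (match == null) {
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("un", "##aff", "##able", "play", "##ing");
        System.out.println(splitWord("unaffable", vocabulary, 50)); // [un, ##aff, ##able]
        System.out.println(splitWord("playing", vocabulary, 50));   // [play, ##ing]
        System.out.println(splitWord("xyz", vocabulary, 50));       // [[UNK]]
    }
}
```

The vocabulary is passed as a `Set<String>` to mirror the constructor signature of this class; the limit of 50 used in `main` is an arbitrary value chosen for the example.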

Constructor Summary

Constructors
Constructor	Description
`WordpieceTokenizer(Set<String> vocabulary)`
`WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)`

Method Summary

Modifier and Type	Method	Description
`int`	`getMaxTokenLength()`
`String[]`	`tokenize(String text)`	Splits a string into its atomic parts.
`Span[]`	`tokenizePos(String text)`	Finds the boundaries of atomic parts in a string.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - WordpieceTokenizer
```
public WordpieceTokenizer(Set<String> vocabulary)
```
  - WordpieceTokenizer
```
public WordpieceTokenizer(Set<String> vocabulary,
                          int maxTokenLength)
```
- Method Detail
  - tokenizePos
```
public Span[] tokenizePos(String text)
```
    Description copied from interface: Tokenizer
    
    Finds the boundaries of atomic parts in a string.
    
    Specified by:
    
    tokenizePos in interface Tokenizer
    
    Parameters:
    
    text - The string to be tokenized.
    
    Returns:
    
    The Span[] with the spans (offsets into text) for each token as the individual array elements.
  - tokenize
```
public String[] tokenize(String text)
```
    Description copied from interface: Tokenizer
    
    Splits a string into its atomic parts.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    
    text - The string to be tokenized.
    
    Returns:
    
    The String[] with the individual tokens as the array elements.
  - getMaxTokenLength
```
public int getMaxTokenLength()
```
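The `tokenizePos` description above speaks of spans as offsets into the input text. As a rough illustration of that contract, the sketch below computes start and end offsets for whitespace-separated tokens and recovers each token from the original string. `SpanOffsetsSketch` and `Offsets` are hypothetical stand-ins written for this example (they are not `opennlp.tools.util.Span`), the end offset is assumed to be exclusive, and plain whitespace splitting is used only to keep the example short; how this class derives its spans is not described on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; hypothetical classes, not part of OpenNLP.
final class SpanOffsetsSketch {

    // Minimal stand-in for a span: start is inclusive, end is exclusive (assumed).
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Returns the part of the text that this span covers.
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    // Finds the offsets of every run of non-whitespace characters.
    static List<Offsets> whitespaceSpans(String text) {
        List<Offsets> spans = new ArrayList<>();
        int tokenStart = -1; // -1 means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (tokenStart >= 0) {
                    spans.add(new Offsets(tokenStart, i));
                    tokenStart = -1;
                }
            } else if (tokenStart < 0) {
                tokenStart = i;
            }
        }
        // Close a token that runs to the end of the text.
        if (tokenStart >= 0) {
            spans.add(new Offsets(tokenStart, text.length()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "hello  wordpiece world";
        for (Offsets span : whitespaceSpans(text)) {
            System.out.println(span.start + ".." + span.end + " " + span.coveredText(text));
        }
    }
}
```

Because the offsets refer to the unmodified input, the covered text can always be read back from the original string, even when tokens are separated by more than one whitespace character.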