WordpieceTokenizer (Apache OpenNLP Tools 2.5.3 API)

java.lang.Object

opennlp.tools.tokenize.WordpieceTokenizer

All Implemented Interfaces: Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer

A Tokenizer implementation which performs tokenization using word pieces.

Adapted under MIT license from https://github.com/robrua/easy-bert.

For reference see:

Constructor Summary

- WordpieceTokenizer(Set<String> vocabulary)
  
  Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
  
  Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.

Method Summary

- int getMaxTokenLength()
  
  Returns the maximum token length.
- String[] tokenize(String text)
  
  Splits a string into its atomic parts.
- Span[] tokenizePos(String text)
  
  Finds the boundaries of atomic parts in a string.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- WordpieceTokenizer
  
  public WordpieceTokenizer(Set<String> vocabulary)
  
  Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
  
  Parameters:
  
  vocabulary - A set of tokens considered the vocabulary.
- WordpieceTokenizer
  
  public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
  
  Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
  
  Parameters:
  
  vocabulary - A set of tokens considered the vocabulary.
  
  maxTokenLength - A non-negative number used as the maximum token length.
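
Both constructors expect the vocabulary as a Set<String>. The following is a minimal sketch of one way to build such a set, assuming a BERT-style vocabulary with one token per line. The class name VocabularyLoader, the helper readVocabulary, and the sample tokens are hypothetical and not part of OpenNLP; the sketch uses only the Java standard library.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OpenNLP.
public class VocabularyLoader {

    /**
     * Reads one token per line, trimming surrounding whitespace and
     * skipping blank lines. The caller remains responsible for closing
     * the reader.
     */
    static Set<String> readVocabulary(Reader source) {
        return new BufferedReader(source).lines()
                .map(String::strip)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        // Inline sample data (assumption); a real vocabulary would
        // typically be read from a file.
        Set<String> vocabulary =
                readVocabulary(new StringReader("[UNK]\ntoken\n##izer\n"));
        System.out.println(vocabulary);

        // With OpenNLP on the classpath, the set could then be passed to
        // either constructor documented above:
        //   new WordpieceTokenizer(vocabulary);       // default maxTokenLength of 50
        //   new WordpieceTokenizer(vocabulary, 100);  // custom maxTokenLength
    }
}
```

A LinkedHashSet is used here only to keep the tokens in file order when printing; any Set<String> satisfies the constructor signatures.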
Method Details
- tokenizePos
  
  public Span[] tokenizePos(String text)
  
  Description copied from interface: Tokenizer
  
  Finds the boundaries of atomic parts in a string.
  
  Specified by:
  
  tokenizePos in interface Tokenizer
  
  Parameters:
  
  text - The string to be tokenized.
  
  Returns:
  
  The spans (offsets into text) for each token as the individual array elements.
- tokenize
  
  public String[] tokenize(String text)
  
  Description copied from interface: Tokenizer
  
  Splits a string into its atomic parts.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  text - The string to be tokenized.
  
  Returns:
  
  The String[] with the individual tokens as the array elements.
- getMaxTokenLength
  
  public int getMaxTokenLength()
  
  Returns:
  
  The maximum token length.
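
To illustrate how a vocabulary and a maximum token length can interact in word-piece tokenization, here is a minimal, self-contained sketch of the commonly used greedy longest-match-first strategy for a single word. It is not the OpenNLP implementation: the class WordpieceSketch, the method splitWord, the "##" continuation prefix, and the "[UNK]" fallback token are assumptions based on BERT-style conventions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the OpenNLP implementation.
public class WordpieceSketch {

    // Assumed BERT-style conventions, not taken from this API page.
    static final String CONTINUATION_PREFIX = "##";
    static final String UNKNOWN_TOKEN = "[UNK]";

    /** Splits one word into pieces by greedy longest-match-first lookup. */
    static List<String> splitWord(String word, Set<String> vocabulary,
                                  int maxTokenLength) {
        // Words longer than the limit are mapped to the unknown token.
        if (word.length() > maxTokenLength) {
            return List.of(UNKNOWN_TOKEN);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, then shrink it
            // from the right until a vocabulary entry is found.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces after the first carry the continuation prefix.
                    candidate = CONTINUATION_PREFIX + candidate;
                }
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matches at this position: the whole word is unknown.
                return List.of(UNKNOWN_TOKEN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("token", "##izer", "##s");
        System.out.println(splitWord("tokenizers", vocabulary, 50)); // [token, ##izer, ##s]
        System.out.println(splitWord("xyz", vocabulary, 50));        // [[UNK]]
    }
}
```

The length check comes first so that very long inputs are rejected without scanning the vocabulary, which is the usual motivation for a maximum token length in this kind of tokenizer.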