Package opennlp.tools.tokenize
Class BPETokenizerTrainer
java.lang.Object
opennlp.tools.tokenize.BPETokenizerTrainer
- All Implemented Interfaces:
opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
public final class BPETokenizerTrainer
extends Object
implements opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
Learns byte pair encoding (BPE) merge operations from a training corpus and produces a BPEModel.
Implements the BPE learning algorithm from Sennrich et al. (2016):
- Build a vocabulary of character-level symbol sequences from the corpus, where each word is split into individual characters with an end-of-word marker.
- Count all adjacent symbol pairs across the vocabulary, weighted by word frequency.
- Merge the most frequent pair into a single new symbol.
- Repeat the counting and merging steps until the desired number of merges (numMerges) is reached.
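The steps above can be sketched in plain Java as follows. This is a minimal illustration of the general algorithm, not the code of BPETokenizerTrainer: the class name BpeLearnSketch, the "</w>" end-of-word marker, whitespace word splitting, and the first-seen tie-breaking rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class; it does not use or mirror any OpenNLP internals.
final class BpeLearnSketch {

  static final String EOW = "</w>"; // assumed end-of-word marker

  /** Learns at most numMerges merges; each merge is a two-element array {left, right}. */
  static List<String[]> learn(Iterable<String> corpus, int numMerges) {
    // Step 1: count word frequencies, then split each distinct word into characters + marker.
    Map<String, Integer> wordFreq = new LinkedHashMap<>();
    for (String line : corpus) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          wordFreq.merge(word, 1, Integer::sum);
        }
      }
    }
    List<List<String>> words = new ArrayList<>();
    List<Integer> freqs = new ArrayList<>();
    for (Map.Entry<String, Integer> e : wordFreq.entrySet()) {
      List<String> symbols = new ArrayList<>();
      e.getKey().codePoints().forEach(cp -> symbols.add(new String(Character.toChars(cp))));
      symbols.add(EOW);
      words.add(symbols);
      freqs.add(e.getValue());
    }

    List<String[]> merges = new ArrayList<>();
    for (int m = 0; m < numMerges; m++) {
      // Step 2: count adjacent symbol pairs, weighted by word frequency.
      Map<List<String>, Integer> pairCounts = new LinkedHashMap<>();
      for (int w = 0; w < words.size(); w++) {
        List<String> s = words.get(w);
        for (int i = 0; i + 1 < s.size(); i++) {
          pairCounts.merge(List.of(s.get(i), s.get(i + 1)), freqs.get(w), Integer::sum);
        }
      }
      // Step 3: pick the most frequent pair (ties go to the pair seen first).
      List<String> best = null;
      int bestCount = 0;
      for (Map.Entry<List<String>, Integer> e : pairCounts.entrySet()) {
        if (e.getValue() > bestCount) {
          best = e.getKey();
          bestCount = e.getValue();
        }
      }
      if (best == null) {
        break; // every word is already a single symbol, nothing left to merge
      }
      merges.add(new String[] {best.get(0), best.get(1)});
      // Replace each occurrence of the pair with the merged symbol, left to right.
      for (List<String> s : words) {
        for (int i = 0; i + 1 < s.size(); i++) {
          if (s.get(i).equals(best.get(0)) && s.get(i + 1).equals(best.get(1))) {
            s.set(i, best.get(0) + best.get(1));
            s.remove(i + 1);
          }
        }
      }
    }
    return merges;
  }

  public static void main(String[] args) {
    List<String> corpus = List.of("the cat sat on the mat", "the dog sat on the log");
    for (String[] merge : learn(corpus, 5)) {
      System.out.println(merge[0] + " + " + merge[1]);
    }
  }
}
```

On the two-sentence corpus in main, "the" occurs four times, so under the first-seen tie-breaking rule assumed here the first three merges are t + h, th + e, and the + </w>. A real trainer may break ties differently and will usually avoid recounting all pairs on every iteration.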
The number of merges controls the granularity of the resulting vocabulary: fewer merges produce finer-grained (more character-level) tokens, while more merges produce coarser (more word-level) tokens. A typical value ranges from a few thousand to tens of thousands, depending on the corpus size and language.
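To make the granularity effect concrete, the sketch below applies a growing prefix of a merge list to a single word. It is an illustration only: the class BpeGranularitySketch, the "</w>" marker, and the hard-coded merge list are hypothetical, and the way BPETokenizer actually applies merges may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class showing how the number of merges changes token granularity.
final class BpeGranularitySketch {

  static final String EOW = "</w>"; // assumed end-of-word marker

  /** Splits a word into characters plus the marker, then applies the merges in order. */
  static List<String> segment(String word, List<String[]> merges) {
    List<String> symbols = new ArrayList<>();
    word.codePoints().forEach(cp -> symbols.add(new String(Character.toChars(cp))));
    symbols.add(EOW);
    for (String[] merge : merges) {
      for (int i = 0; i + 1 < symbols.size(); i++) {
        if (symbols.get(i).equals(merge[0]) && symbols.get(i + 1).equals(merge[1])) {
          symbols.set(i, merge[0] + merge[1]);
          symbols.remove(i + 1);
        }
      }
    }
    return symbols;
  }

  public static void main(String[] args) {
    // Illustrative merge list, as might be learned from a corpus where "the" is frequent.
    List<String[]> merges = List.of(
        new String[] {"t", "h"},
        new String[] {"th", "e"},
        new String[] {"the", EOW});
    for (int n = 0; n <= merges.size(); n++) {
      System.out.println(n + " merges: " + segment("the", merges.subList(0, n)));
    }
  }
}
```

With no merges the word stays at character level (t, h, e, </w>); with all three merges it collapses into the single word-level symbol the</w>.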
Usage:
List<String> corpus = List.of(
    "the cat sat on the mat",
    "the dog sat on the log"
);

BPETokenizerTrainer trainer = new BPETokenizerTrainer();
BPEModel model = trainer.train(corpus, 10000, "en");

// Persist the model
model.serialize(Path.of("bpe-en.bin"));

// Use it for tokenization
BPETokenizer tokenizer = new BPETokenizer(model);
String[] tokens = tokenizer.tokenize("the cat");
For reference see:
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. https://arxiv.org/abs/1508.07909
Constructor Details
BPETokenizerTrainer
public BPETokenizerTrainer()
Creates a new BPETokenizerTrainer.
Method Details
init
- Specified by:
init in interface opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
init
public void init(opennlp.tools.util.Parameters trainParams, Map<String, String> reportMap, opennlp.tools.util.TrainingConfiguration config)
- Specified by:
init in interface opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
train
Learns BPE merge operations from a training corpus and returns a BPEModel.
- Parameters:
corpus - An iterable of text strings (e.g., sentences or documents). Must not be null.
numMerges - The number of merge operations to learn. Must be positive.
languageCode - The ISO language code (e.g., "en", "de"). Must not be null.
- Returns:
A trained BPEModel containing the learned merge operations.
- Throws:
IllegalArgumentException - if numMerges is not positive, or if corpus or languageCode is null.