Class BPETokenizerTrainer

java.lang.Object
opennlp.tools.tokenize.BPETokenizerTrainer
All Implemented Interfaces:
opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>

public final class BPETokenizerTrainer extends Object implements opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
Learns BPE merge operations from a training corpus and produces a BPEModel.

Implements the BPE learning algorithm from Sennrich et al. (2016):

  1. Build a vocabulary of character-level symbol sequences from the corpus, where each word is split into individual characters with an end-of-word marker.
  2. Count all adjacent symbol pairs across the vocabulary, weighted by word frequency.
  3. Merge the most frequent pair into a single new symbol.
  4. Repeat steps 2 and 3 until the requested number of merges (numMerges) has been learned.
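
The four steps above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the code used by BPETokenizerTrainer: the class name BpeSketch, the helper names, the "</w>" end-of-word marker, whitespace-based word splitting, and the first-seen tie-breaking rule are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of BPE merge learning; not the OpenNLP implementation.
final class BpeSketch {

    // Assumed end-of-word marker; the marker used by the library may differ.
    static final String END_OF_WORD = "</w>";

    /** Returns the learned merges in order, each encoded as "left right". */
    static List<String> learnMerges(Iterable<String> corpus, int numMerges) {
        // Step 1: map each word (as space-separated symbols) to its frequency.
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String line : corpus) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                StringBuilder symbols = new StringBuilder();
                word.codePoints().forEach(cp -> symbols.appendCodePoint(cp).append(' '));
                symbols.append(END_OF_WORD);
                vocab.merge(symbols.toString(), 1, Integer::sum);
            }
        }

        List<String> merges = new ArrayList<>();
        for (int i = 0; i < numMerges; i++) {
            // Step 2: count adjacent symbol pairs, weighted by word frequency.
            Map<String, Integer> pairCounts = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : vocab.entrySet()) {
                String[] s = e.getKey().split(" ");
                for (int j = 0; j + 1 < s.length; j++) {
                    pairCounts.merge(s[j] + " " + s[j + 1], e.getValue(), Integer::sum);
                }
            }

            // Pick the most frequent pair; on ties the first pair seen wins.
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            if (best == null) {
                break; // every word is already a single symbol
            }
            merges.add(best);

            // Step 3: replace the pair with one merged symbol in every word.
            String[] pair = best.split(" ");
            Map<String, Integer> next = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : vocab.entrySet()) {
                String merged = mergePair(e.getKey().split(" "), pair[0], pair[1]);
                next.merge(merged, e.getValue(), Integer::sum);
            }
            vocab = next; // Step 4: repeat with the updated vocabulary.
        }
        return merges;
    }

    /** Joins every adjacent (left, right) occurrence into a single symbol. */
    static String mergePair(String[] symbols, String left, String right) {
        List<String> out = new ArrayList<>();
        int j = 0;
        while (j < symbols.length) {
            if (j + 1 < symbols.length && symbols[j].equals(left) && symbols[j + 1].equals(right)) {
                out.add(left + right);
                j += 2;
            } else {
                out.add(symbols[j]);
                j++;
            }
        }
        return String.join(" ", out);
    }
}
```

With the two-sentence corpus from the usage example, the word "the" occurs four times, so under this sketch's tie-breaking rule the first three merges build it up symbol by symbol: "t h", then "th e", then "the </w>".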

The number of merges controls the granularity of the resulting vocabulary: fewer merges yield finer-grained tokens closer to single characters, while more merges yield coarser tokens closer to whole words. Typical values range from a few thousand to tens of thousands, depending on corpus size and language.
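
To make the granularity effect concrete, the sketch below segments one word using only the first k entries of an ordered merge list. It is an illustration under assumptions, not the behavior of BPETokenizer: the class BpeGranularitySketch, the segment helper, the "left right" merge encoding, and the "</w>" marker are hypothetical, and merges are applied in learned order over the whole word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shows how the merge count changes token granularity.
final class BpeGranularitySketch {

    // Assumed end-of-word marker; the marker used by the library may differ.
    static final String END_OF_WORD = "</w>";

    /** Segments a word using only the first k merges of the ordered list. */
    static List<String> segment(String word, List<String> merges, int k) {
        // Start from single characters plus the end-of-word marker.
        List<String> symbols = new ArrayList<>();
        for (int cp : word.codePoints().toArray()) {
            symbols.add(new String(Character.toChars(cp)));
        }
        symbols.add(END_OF_WORD);

        int limit = Math.min(k, merges.size());
        for (int m = 0; m < limit; m++) {
            String[] pair = merges.get(m).split(" ");
            List<String> next = new ArrayList<>();
            int i = 0;
            while (i < symbols.size()) {
                boolean match = i + 1 < symbols.size()
                        && symbols.get(i).equals(pair[0])
                        && symbols.get(i + 1).equals(pair[1]);
                if (match) {
                    next.add(pair[0] + pair[1]); // fuse the pair into one symbol
                    i += 2;
                } else {
                    next.add(symbols.get(i));
                    i++;
                }
            }
            symbols = next;
        }
        return symbols;
    }
}
```

Given the merge list ["t h", "th e", "the </w>"], the word "the" is split into four symbols with k = 0, three symbols with k = 1, and a single whole-word symbol with k = 3.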

Usage:


 List<String> corpus = List.of(
     "the cat sat on the mat",
     "the dog sat on the log"
 );

 BPETokenizerTrainer trainer = new BPETokenizerTrainer();
 BPEModel model = trainer.train(corpus, 10000, "en");

 // Persist the model
 model.serialize(Path.of("bpe-en.bin"));

 // Use it for tokenization
 BPETokenizer tokenizer = new BPETokenizer(model);
 String[] tokens = tokenizer.tokenize("the cat");
 

For reference see: Sennrich, Haddow, and Birch (2016), "Neural Machine Translation of Rare Words with Subword Units".

  • Constructor Details

  • Method Details

    • init

      public void init(opennlp.tools.util.Parameters trainParams, Map<String,String> reportMap)
      Specified by:
      init in interface opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
    • init

      public void init(opennlp.tools.util.Parameters trainParams, Map<String,String> reportMap, opennlp.tools.util.TrainingConfiguration config)
      Specified by:
      init in interface opennlp.tools.commons.Trainer<opennlp.tools.util.Parameters>
    • train

      public BPEModel train(Iterable<String> corpus, int numMerges, String languageCode)
      Learns BPE merge operations from a training corpus and returns a BPEModel.
      Parameters:
      corpus - An iterable of text strings (e.g., sentences or documents). Must not be null.
      numMerges - The number of merge operations to learn. Must be positive.
      languageCode - The ISO language code (e.g., "en", "de"). Must not be null.
      Returns:
      A trained BPEModel containing the learned merge operations.
      Throws:
      IllegalArgumentException - if numMerges is not positive, or if corpus or languageCode is null.
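
Callers that assemble these arguments from configuration or user input may want to check them before calling train. The helper below is a hypothetical sketch that mirrors the documented preconditions; the class TrainArgsSketch and the method requireValidTrainArgs are not part of OpenNLP, and the messages are illustrative only.

```java
// Hypothetical caller-side guard; mirrors the documented contract of train(...).
final class TrainArgsSketch {

    static void requireValidTrainArgs(Iterable<String> corpus, int numMerges, String languageCode) {
        if (corpus == null) {
            throw new IllegalArgumentException("corpus must not be null");
        }
        if (languageCode == null) {
            throw new IllegalArgumentException("languageCode must not be null");
        }
        if (numMerges <= 0) {
            throw new IllegalArgumentException("numMerges must be positive, got " + numMerges);
        }
    }
}
```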