Apache OpenNLP 3.0.0-M4 released

The Apache OpenNLP team is pleased to announce the release of Apache OpenNLP 3.0.0-M4.

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

Apache OpenNLP 3.0.0-M4 binary and source distributions are available for download from our download page.

The OpenNLP library is distributed by Maven Central as well. See the Maven dependency page for more details.

What’s new in Apache OpenNLP 3.0.0-M4

This release focuses on security hardening, new NLP capabilities, and dependency maintenance.

Security Fixes

One security issue is addressed in this release.

CWE-502 Deserialization of Untrusted Data (OPENNLP-1823, CVE-2026-43825)

Fixed unsafe Java deserialization in SvmDoccatModel.deserialize() (libsvm doccat module, 3.x before 3.0.0-M4) that could allow remote code execution via a crafted stream when a gadget chain is present on the classpath; an ObjectInputFilter is now applied.
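
For readers who deserialize data in their own code, the general technique can be sketched as follows. This is a minimal illustration of an allow-list ObjectInputFilter with resource limits, using only the JDK; it is not the filter shipped in OpenNLP. The Payload class, the limits, and the allow-list are hypothetical values chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Set;

class FilteredDeserializationSketch {

    /** Hypothetical payload standing in for a serialized model. */
    static class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        final double[] weights;

        Payload(String label, double[] weights) {
            this.label = label;
            this.weights = weights;
        }
    }

    // Assumption: only these classes are expected in the stream.
    private static final Set<Class<?>> ALLOWED =
        Set.of(Payload.class, String.class, double[].class);

    /** Rejects oversized streams and every class that is not on the allow-list. */
    static ObjectInputFilter.Status check(ObjectInputFilter.FilterInfo info) {
        // Resource limits (illustrative values, not OpenNLP's).
        if (info.depth() > 8
            || info.references() > 1_000
            || info.streamBytes() > 1_000_000
            || info.arrayLength() > 100_000) {
            return ObjectInputFilter.Status.REJECTED;
        }
        Class<?> type = info.serialClass();
        if (type == null) {
            // Called without a class, e.g. for back references: only the limits apply.
            return ObjectInputFilter.Status.UNDECIDED;
        }
        return ALLOWED.contains(type)
            ? ObjectInputFilter.Status.ALLOWED
            : ObjectInputFilter.Status.REJECTED;
    }

    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            // The filter must be set before the first object is read.
            in.setObjectInputFilter(FilteredDeserializationSketch::check);
            return in.readObject();
        }
    }
}
```

With this filter in place, a stream containing a class outside the allow-list fails with an InvalidClassException before any of that class's deserialization logic runs.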

New Features and Improvements

Include list of stop words for various languages (OPENNLP-660)
Add SymSpell-based spell correction component (OPENNLP-1832)
Add BertTokenizer with BERT basic tokenization (OPENNLP-1837)
Harden SvmDoccatModel.deserialize() with ObjectInputFilter and resource limits (OPENNLP-1823)
Tolerate unsupported XML parser security options (OPENNLP-1835)
Fix NameFinderDL, which only worked with Person entities, and expand it to all entity types (OPENNLP-1846)
Several dependencies were updated; see the Jira release notes referenced below
Some minor tasks have been completed

Bug Fixes

This release ships four bug fixes: OPENNLP-1826, OPENNLP-1836, OPENNLP-1839, and OPENNLP-1840.

IMPORTANT CHANGES

The ONNX input encoding in SentenceVectorsDL was fixed, which changes the produced sentence vectors. Any embeddings persisted with the old encoding are not comparable to the new output and must be re-generated. (OPENNLP-1836 - PR #1072)
WordpieceTokenizer (public API, used by opennlp-dl) now splits punctuation runs into single tokens, collapses partially-matched words to a single [UNK], and throws from tokenizePos instead of returning null. These changes alter the tokenization output for existing callers. (OPENNLP-1837 - PR #1073)
NameFinderDL now decodes all BIO entity types (PER/ORG/LOC/…) instead of only persons. Span.getType() now returns the entity label rather than the covered text, which is a contract change for existing callers. (OPENNLP-1846 - PR #1086)
The opennlp-dl components are now thread-safe; as part of this, loadVocab became public static (source- and binary-incompatible) and AbstractDL’s implicit no-arg constructor was removed. Both affect downstream code that calls loadVocab or extends AbstractDL. (OPENNLP-1844 - PR #1084)
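
To make the NameFinderDL contract change more concrete, the sketch below shows one common way to decode a BIO tag sequence into typed spans, where the span type carries the entity label (PER, ORG, LOC) rather than the covered text. It is a self-contained illustration under stated assumptions, not the NameFinderDL implementation: TypedSpan is a hypothetical stand-in for OpenNLP's Span, and the tag names are examples.

```java
import java.util.ArrayList;
import java.util.List;

class BioDecodingSketch {

    /** Hypothetical span: token range [start, end) plus the entity label. */
    record TypedSpan(int start, int end, String type) {}

    /**
     * Decodes tags such as B-PER, I-PER, O into typed spans.
     * An I- tag that does not continue an open span of the same type is ignored.
     */
    static List<TypedSpan> decode(String[] tags) {
        List<TypedSpan> spans = new ArrayList<>();
        int start = -1;
        String type = null; // label of the currently open span, if any

        // One extra iteration with a virtual "O" closes a span that ends at the last token.
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = type != null && tag.equals("I-" + type);

            if (type != null && !continues) {
                spans.add(new TypedSpan(start, i, type));
                type = null;
            }
            if (tag.startsWith("B-")) {
                start = i;
                type = tag.substring(2);
            }
        }
        return spans;
    }
}
```

Callers that previously read the covered text from the span type would, under this contract, take the label from the type and recover the text from the token range instead.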

Dependency Updates

Update ONNX runtime to 1.26.0 (OPENNLP-1824)
Update gRPC to 1.81.0 (OPENNLP-1825)
Update log4j2 to 2.26.0 (OPENNLP-1827)
Update Slf4j to 2.0.18 (OPENNLP-1831)
Update gRPC to 1.82.0 (OPENNLP-1847)
Update Morfologik to 2.2.0 (OPENNLP-1848)

For further details, check the full list of changes via the project’s issue tracker.

--The Apache OpenNLP Team