New pre-trained sentence detection, tokenization, parts of speech tagging, and lemmatization models are now available for 4 languages:
Afrikaans, Indonesian, Irish, and Persian.
The Czech models required a change in training, that is a different treebank: 'PDTC', see: https://github.com/UniversalDependencies/UD_Czech-PDTC/blob/master/README.md.
All models, for a total of 36 languages, were trained with OpenNLP 2.5.4 based on the UD 2.16 release 2.16 and are intended to provide usable models under the Apache 2.0 license. These models will soon be available as JAR artifacts via Maven Central, or directly as binary files via our models page. See the models' README for more information, including how each was created and evaluated.
--The Apache OpenNLP Team
16 June 2025