OpenNLP Pre-trained Models 1.2 released

New pre-trained sentence detection, tokenization, part-of-speech tagging, and lemmatization models are now available for 9 languages:

Armenian, Basque, Catalan, Georgian, Greek, Kazakh, Korean, Icelandic, and Turkish.

The existing sentence detection, tokenization, and part-of-speech tagging models for the 23 languages published with models release 1.1 have been re-trained. In addition, new lemmatization models for these languages have been trained and added, based on Universal Dependencies (UD) treebanks.

All models, covering 32 languages in total, were trained with OpenNLP 2.5.0 on UD release 2.15 and are intended as usable models under the Apache 2.0 license. They will be available as JAR artifacts via Maven Central, or directly as binary files via our models page. See the models' README for more information, including how each model was created and evaluated.

--The Apache OpenNLP Team

23 November 2024