New pre-trained sentence detection, tokenization, and part-of-speech tagging models are now available for 18 (mostly Indo-European) languages:
Bulgarian, Czech, Croatian, Danish, Estonian, Finnish, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, and Ukrainian.
The existing sentence detection, tokenization, and part-of-speech tagging models for these 5 languages were re-trained:
Dutch, English, French, German, and Italian.
The French models are now based on the GSD treebank, as the previously used FTB treebank is no longer maintained and has therefore been discontinued by the Universal Dependencies (UD) project.
All models were trained with OpenNLP 2.4.0 on UD release 2.14 and are intended as ready-to-use models under the Apache 2.0 license. They are available as JAR artifacts via Maven Central, or directly as plain binary files via our models page. See the models' README for more information, including how each model was created and evaluated.
--The Apache OpenNLP Team
28 October 2024