New pre-trained sentence, part-of-speech, and token models are now available for 18 (mostly Indo-European) languages:
Bulgarian, Czech, Croatian, Danish, Estonian, Finnish, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, and Ukrainian.
The existing sentence, part-of-speech, and token models for the following 5 languages were re-trained:
Dutch, English, French, German, and Italian.
The French models are now based on the GSD treebank, because the previously used FTB treebank is no longer maintained and has therefore been discontinued by the Universal Dependencies (UD) project.
All models were trained with OpenNLP 2.4.0 on UD release 2.14 and are intended to serve as usable models under the Apache 2.0 license. They are available as JAR artifacts via Maven Central or as plain binary files via our models page. See the models' README for more information, including how each model was created and evaluated.
28 October 2024