opennlp.tools.formats
Class LeipzigDoccatSampleStream
java.lang.Object
opennlp.tools.util.FilterObjectStream<String,DocumentSample>
opennlp.tools.formats.LeipzigDoccatSampleStream
- All Implemented Interfaces:
- ObjectStream<DocumentSample>
public class LeipzigDoccatSampleStream
- extends FilterObjectStream<String,DocumentSample>
Stream filter to produce document samples out of a Leipzig sentences.txt file.
In the Leipzig corpus, the encoding of each sentences.txt file is determined by
its language. The language must be specified; it is used to produce the category tags
and to determine the correct input encoding.
The input text is tokenized with the SimpleTokenizer. Text that is later classified
by the language model must also be tokenized with the SimpleTokenizer so that
tokenization is exactly the same during training and testing.
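The sketch below illustrates why training and classification need the same tokenization. It does not use OpenNLP: charClassTokenize is a hypothetical stand-in that splits on whitespace and on changes of character class, and whitespaceTokenize is a second hypothetical tokenizer used only for contrast. Neither is claimed to match SimpleTokenizer exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizationMismatchSketch {

    // Hypothetical stand-in, NOT OpenNLP's SimpleTokenizer: starts a new token
    // whenever the character class (whitespace, letter, digit, other) changes.
    static List<String> charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int lastClass = -1;
        for (char c : text.toCharArray()) {
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2 : 3;
            // A class change closes the token collected so far.
            if (cls != lastClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Whitespace separates tokens but is never part of one.
            if (cls != 0) {
                current.append(c);
            }
            lastClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical alternative tokenizer: splits on whitespace only.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String sentence = "Hello, world!";
        List<String> trainingTokens = charClassTokenize(sentence);
        List<String> testingTokens = whitespaceTokenize(sentence);
        System.out.println(trainingTokens);
        System.out.println(testingTokens);
        // The token lists differ, so a model trained on one tokenization
        // would see unfamiliar tokens at classification time.
        System.out.println(trainingTokens.equals(testingTokens)); // prints "false"
    }
}
```

With a mismatched tokenizer, punctuation stays attached to words, so the tokens seen at classification time no longer correspond to those seen during training.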
read
public DocumentSample read()
throws IOException
- Description copied from interface: ObjectStream
- Returns the next object. Calling this method repeatedly until it returns
null will return each object from the underlying source exactly once.
- Returns:
- the next object or null to signal that the stream is exhausted
- Throws:
IOException
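The read contract described above (call read() repeatedly until it returns null) can be sketched as follows. To keep the example self-contained, it uses a hypothetical SampleSource interface that mirrors only the read() signature; it is not the OpenNLP ObjectStream interface, and the fromList and drain helpers are likewise illustrative names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReadUntilNullSketch {

    // Hypothetical stand-in that mirrors the read() contract only.
    interface SampleSource<T> {
        T read() throws IOException;
    }

    // Collects every object by calling read() until it returns null.
    static <T> List<T> drain(SampleSource<T> source) throws IOException {
        List<T> items = new ArrayList<>();
        T item;
        while ((item = source.read()) != null) {
            items.add(item);
        }
        return items;
    }

    // Wraps an in-memory list so it can be read like a stream;
    // returns null once the list is exhausted.
    static <T> SampleSource<T> fromList(List<T> values) {
        Iterator<T> it = values.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) throws IOException {
        SampleSource<String> source =
                fromList(Arrays.asList("doc-1", "doc-2", "doc-3"));
        System.out.println(drain(source));  // prints "[doc-1, doc-2, doc-3]"
        System.out.println(source.read());  // prints "null": the source is exhausted
    }
}
```

The same loop shape applies to any stream that follows this contract: each object is returned exactly once, and null is the only end-of-stream signal.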
Copyright © 2013 The Apache Software Foundation. All Rights Reserved.