opennlp.tools.formats
Class LeipzigDoccatSampleStream

java.lang.Object
  extended by opennlp.tools.util.FilterObjectStream<String,DocumentSample>
      extended by opennlp.tools.formats.LeipzigDoccatSampleStream
All Implemented Interfaces:
ObjectStream<DocumentSample>

public class LeipzigDoccatSampleStream
extends FilterObjectStream<String,DocumentSample>

Stream filter that produces document samples from a Leipzig sentences.txt file. In the Leipzig corpus, the encoding of each sentences.txt file is determined by its language. The language must be specified: it is used to produce the category tags and to determine the correct input encoding.

The input text is tokenized with the SimpleTokenizer. Any text that is later classified by the language model must also be tokenized with the SimpleTokenizer, so that tokenization is exactly the same during training and testing.
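The tokenization requirement can be illustrated with a minimal, self-contained sketch. It does not use the OpenNLP SimpleTokenizer; the class SharedTokenizationSketch and its tokenize method are hypothetical stand-ins that split text into runs of letters, digits, and other characters, which only loosely approximates character-class tokenization. The point shown is that training text and text to be classified go through the same tokenize call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT the OpenNLP SimpleTokenizer.
final class SharedTokenizationSketch {

    // Assumed character classes: 0 = whitespace, 1 = letter, 2 = digit, 3 = other.
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Splits text into maximal runs of one non-whitespace character class.
    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;          // start index of the current token, -1 if none is open
        int previousClass = 0;   // class of the previous character
        for (int i = 0; i < text.length(); i++) {
            int currentClass = charClass(text.charAt(i));
            if (start >= 0 && currentClass != previousClass) {
                tokens.add(text.substring(start, i)); // class changed: close the token
                start = -1;
            }
            if (start < 0 && currentClass != 0) {
                start = i;                            // open a new token
            }
            previousClass = currentClass;
        }
        if (start >= 0) {
            tokens.add(text.substring(start));        // close the trailing token
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Training side: a sentence as it might appear in a sentences.txt file.
        String[] trainingTokens = tokenize("Hello, world 42!");
        // Classification side: the same routine is applied to unseen input.
        String[] inputTokens = tokenize("Hello, world 42!");
        System.out.println(Arrays.toString(trainingTokens));
        System.out.println(Arrays.equals(trainingTokens, inputTokens));
    }
}
```

If a different tokenizer were used on either side, the tokens seen at classification time could differ from those seen during training.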


Method Summary
 DocumentSample read()
          Returns the next object.
 
Methods inherited from class opennlp.tools.util.FilterObjectStream
close, reset
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

read

public DocumentSample read()
                    throws IOException
Description copied from interface: ObjectStream
Returns the next object. Calling this method repeatedly until it returns null will return each object from the underlying source exactly once.

Returns:
the next object or null to signal that the stream is exhausted
Throws:
IOException
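
The read contract described above (call read() until it returns null) can be sketched as follows. To keep the example self-contained it does not depend on OpenNLP: SampleSource is a hypothetical stand-in that mirrors only the read() method of ObjectStream, and fromList and drain are illustrative helpers, not part of the library.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of the read-until-null contract.
final class ReadLoopSketch {

    // Stand-in mirroring only the read() method of ObjectStream (an assumption
    // made for this sketch; the real interface also declares reset and close).
    interface SampleSource<T> {
        T read() throws IOException;
    }

    // Wraps a list so that read() yields each element once and then null.
    static <T> SampleSource<T> fromList(List<T> items) {
        final Iterator<T> it = items.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Calls read() repeatedly until it returns null and collects the results.
    static <T> List<T> drain(SampleSource<T> source) throws IOException {
        List<T> collected = new ArrayList<>();
        T item;
        while ((item = source.read()) != null) {
            collected.add(item);
        }
        return collected;
    }

    public static void main(String[] args) throws IOException {
        SampleSource<String> source = fromList(Arrays.asList("doc-1", "doc-2", "doc-3"));
        System.out.println(drain(source)); // every element, in order, exactly once
        System.out.println(source.read()); // the exhausted source keeps returning null
    }
}
```

With a real stream, the same loop shape would apply to DocumentSample objects, and the stream should be closed once it is exhausted.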


Copyright © 2013 The Apache Software Foundation. All Rights Reserved.