Class MascDocument

java.lang.Object
opennlp.tools.formats.masc.MascDocument

public class MascDocument extends Object
  • Constructor Details

  • Method Details

    • parseDocument

      public static MascDocument parseDocument(String path, InputStream f_primary, InputStream f_seg, InputStream f_penn, InputStream f_s, InputStream f_ne) throws IOException
      Initializes a MascDocument with all the stand-off annotations translated into the internal structure.
      Parameters:
      path - The path to the document header.
      f_primary - The file with the raw corpus text.
      f_seg - The file with segmentation into quarks.
      f_penn - The file with tokenization and Penn POS tags produced by GATE-5.0 ANNIE application.
      f_s - The file with sentence boundaries.
      f_ne - The file with named entities.
      Returns:
      A document containing the text and its annotations. Immutability is not guaranteed yet.
      Throws:
      IOException - if the raw data cannot be read or the alignment of the raw data with annotations fails
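A caller has to open five streams and pass them in the order the signature expects. The following is a minimal sketch of one way to do that with try-with-resources. `MascStreamsSketch`, `FiveStreamParser`, and `withStreams` are hypothetical names, not part of OpenNLP, and the sketch assumes the parser has finished reading by the time it returns, so that closing the streams afterwards is safe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class MascStreamsSketch {

    /** Hypothetical callback with the same parameter order as parseDocument. */
    @FunctionalInterface
    interface FiveStreamParser<T> {
        T parse(String path, InputStream primary, InputStream seg,
                InputStream penn, InputStream s, InputStream ne) throws IOException;
    }

    /**
     * Opens the five files, hands the streams to the parser in signature order,
     * and closes every stream afterwards, even if parsing throws.
     * Assumption: the parser reads everything it needs before returning.
     */
    static <T> T withStreams(Path header, Path primary, Path seg, Path penn,
                             Path s, Path ne, FiveStreamParser<T> parser)
            throws IOException {
        try (InputStream inPrimary = Files.newInputStream(primary);
             InputStream inSeg = Files.newInputStream(seg);
             InputStream inPenn = Files.newInputStream(penn);
             InputStream inS = Files.newInputStream(s);
             InputStream inNe = Files.newInputStream(ne)) {
            return parser.parse(header.toString(), inPrimary, inSeg, inPenn, inS, inNe);
        }
    }
}
```

Because the callback has the same shape as the signature above, passing `MascDocument::parseDocument` as the last argument should compile against the real class. Whether the streams may be closed right after parsing is an assumption to verify against the library.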
    • hasPennTags

      public boolean hasPennTags()
      Checks whether there is Penn tagging produced by GATE-5.0 ANNIE.
      Returns:
      true if this file has aligned tags/tokens, false otherwise.
    • hasNamedEntities

      public boolean hasNamedEntities()
      Checks whether there are named-entity annotations produced by GATE-5.0 ANNIE.
      Returns:
      true if this file has named entities, false otherwise.
    • read

      public MascSentence read()
      Returns:
      The next sentence, or null if the end of the document has been reached.
    • reset

      public void reset()
      Resets the reading of sentences to the beginning of the document.
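
Since read() signals the end of the document by returning null, callers typically loop until null. The sketch below shows that loop as a generic helper. `ReadLoopSketch` and `drain` are hypothetical names, and the in-memory iterator only stands in for a real document so that the example runs without the corpus files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

final class ReadLoopSketch {

    /** Calls a read()-style supplier until it returns null and collects the results. */
    static <T> List<T> drain(Supplier<T> reader) {
        List<T> items = new ArrayList<>();
        T item;
        while ((item = reader.get()) != null) {
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a document; a real caller would pass doc::read.
        Iterator<String> sentences =
                List.of("First sentence.", "Second sentence.").iterator();
        List<String> all = drain(() -> sentences.hasNext() ? sentences.next() : null);
        System.out.println(all.size()); // prints 2
    }
}
```

With a MascDocument named doc, `drain(doc::read)` would collect the sentences not yet read. Calling doc.reset() and draining again would start over from the first sentence.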