Class MascDocument


  • public class MascDocument
    extends Object
    • Method Detail

      • parseDocument

        public static MascDocument parseDocument(String path,
                                                 InputStream f_primary,
                                                 InputStream f_seg,
                                                 InputStream f_penn,
                                                 InputStream f_s,
                                                 InputStream f_ne)
                                          throws IOException
        Creates a MASC document with all of the stand-off annotations translated into the internal structure.
        Parameters:
        path - The path to the document header.
        f_primary - The file with the raw corpus text.
        f_seg - The file with the segmentation into quarks.
        f_penn - The file with tokenization and Penn POS tags produced by the GATE-5.0 ANNIE application.
        f_s - The file with sentence boundaries.
        f_ne - The file with named entities.
        Returns:
        A document containing the text and its annotations. Immutability is not yet guaranteed.
        Throws:
        IOException - if the raw data cannot be read, or if aligning the raw data with the annotations fails.
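
The five stream parameters of parseDocument are easy to pass in the wrong order, since four of them share the same type. The sketch below pairs each parameter name with a file path in signature order. The class name MascFileLayout, the helper standOffFiles, and the file suffixes (.txt, -seg.xml, -penn.xml, -s.xml, -ne.xml) are assumptions made for illustration; this page does not specify how the files are named.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class MascFileLayout {

    // Pairs each parseDocument stream parameter with a file, in signature order.
    // The suffixes are assumptions for illustration, not taken from this page.
    public static Map<String, Path> standOffFiles(Path dir, String baseName) {
        Map<String, Path> files = new LinkedHashMap<>(); // keeps insertion order
        files.put("f_primary", dir.resolve(baseName + ".txt"));
        files.put("f_seg", dir.resolve(baseName + "-seg.xml"));
        files.put("f_penn", dir.resolve(baseName + "-penn.xml"));
        files.put("f_s", dir.resolve(baseName + "-s.xml"));
        files.put("f_ne", dir.resolve(baseName + "-ne.xml"));
        return files;
    }

    public static void main(String[] args) {
        Map<String, Path> files = standOffFiles(Paths.get("corpus"), "sample");
        // Print the parameter-to-file pairing in the order the method expects.
        files.forEach((param, file) -> System.out.println(param + " <- " + file));
    }
}
```

Each path would then be opened as an InputStream, for example with Files.newInputStream inside a try-with-resources statement, and passed to parseDocument in the order shown.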
      • hasPennTags

        public boolean hasPennTags()
        Check whether the document has Penn tagging produced by GATE-5.0 ANNIE.
        Returns:
        true if this file has aligned tags/tokens
      • hasNamedEntities

        public boolean hasNamedEntities()
        Check whether the document has named entities.
        Returns:
        true if this document has named entities
      • read

        public MascSentence read()
        Get the next sentence.
        Returns:
        The next sentence, or null if the end of the document has been reached.
      • reset

        public void reset()
        Reset sentence reading to the beginning of the document.
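
The read() and reset() methods together describe a cursor: read() returns sentences until it yields null, and reset() rewinds the cursor so the document can be traversed again. The sketch below shows that read-until-null loop. The SentenceSource interface and its list-backed implementation are hypothetical stand-ins that mirror the two methods so the example runs without the library; they are not part of this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SentenceReadLoop {

    // Hypothetical stand-in mirroring the read()/reset() contract of MascDocument.
    interface SentenceSource<T> {
        T read();     // next item, or null at the end of the document
        void reset(); // rewind to the first item
    }

    // List-backed implementation, used only to make the sketch runnable.
    static <T> SentenceSource<T> fromList(List<T> items) {
        return new SentenceSource<T>() {
            private Iterator<T> it = items.iterator();

            @Override
            public T read() {
                return it.hasNext() ? it.next() : null;
            }

            @Override
            public void reset() {
                it = items.iterator();
            }
        };
    }

    // Collects every remaining item with the read-until-null idiom, then rewinds.
    static <T> List<T> readAll(SentenceSource<T> source) {
        List<T> out = new ArrayList<>();
        T sentence;
        while ((sentence = source.read()) != null) {
            out.add(sentence);
        }
        source.reset(); // leave the source ready for another pass
        return out;
    }

    public static void main(String[] args) {
        SentenceSource<String> doc =
                fromList(Arrays.asList("First sentence.", "Second sentence."));
        System.out.println(readAll(doc).size()); // first pass
        System.out.println(readAll(doc).size()); // second pass, after reset()
    }
}
```

With a real MascDocument the same loop shape would apply, with MascSentence as the element type. Forgetting reset() after a full pass means the next read() returns null immediately.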