Class MascDocument


  • public class MascDocument
    extends Object
    • Method Detail

      • parseDocument

        public static MascDocument parseDocument(String path,
                                                 InputStream f_primary,
                                                 InputStream f_seg,
                                                 InputStream f_penn,
                                                 InputStream f_s,
                                                 InputStream f_ne)
                                          throws IOException
        Initializes a MascDocument with all the stand-off annotations translated into the internal structure.
        Parameters:
        path - The path to the document header.
        f_primary - Input stream of the file with the raw corpus text.
        f_seg - Input stream of the file with the segmentation into quarks.
        f_penn - Input stream of the file with tokenization and Penn POS tags produced by the GATE-5.0 ANNIE application.
        f_s - Input stream of the file with sentence boundaries.
        f_ne - Input stream of the file with named entities.
        Returns:
        A document containing the text and its annotations. Immutability is not guaranteed yet.
        Throws:
        IOException - if the raw data cannot be read or the alignment of the raw data with annotations fails
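The parameter order is easy to get wrong because five of the six arguments share the type InputStream. The sketch below is a minimal illustration, not the library's own code: StandOffParser is a hypothetical functional interface that only mirrors the signature of parseDocument, and the in-memory strings and the path "doc.hdr" are placeholders for real MASC files. The streams are opened in a try-with-resources block so they are closed whether or not parsing throws. Whether parseDocument closes its streams itself is not stated here, so having the caller close them is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ParseCallSketch {

    /** Hypothetical stand-in that mirrors the parameter order of parseDocument. */
    interface StandOffParser<T> {
        T parse(String path, InputStream fPrimary, InputStream fSeg,
                InputStream fPenn, InputStream fS, InputStream fNe) throws IOException;
    }

    /** Wraps a string as a stream; a real caller would open files instead. */
    static InputStream inMemory(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    /** Opens all five streams, passes them in signature order, then closes them. */
    static <T> T parseWithManagedStreams(String path, StandOffParser<T> parser,
            String primary, String seg, String penn, String s, String ne) throws IOException {
        try (InputStream fPrimary = inMemory(primary);
             InputStream fSeg = inMemory(seg);
             InputStream fPenn = inMemory(penn);
             InputStream fS = inMemory(s);
             InputStream fNe = inMemory(ne)) {
            return parser.parse(path, fPrimary, fSeg, fPenn, fS, fNe);
        }
    }

    /** Counts the bytes left in a stream; used only by the demo parser. */
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // Demo parser: reports the size of the raw text instead of building a document.
        StandOffParser<Integer> demo =
                (path, fPrimary, fSeg, fPenn, fS, fNe) -> countBytes(fPrimary);
        int size = parseWithManagedStreams("doc.hdr", demo, "Hello world.", "", "", "", "");
        System.out.println(size);
    }
}
```

With the real class, the lambda would be replaced by a method reference to parseDocument, keeping the same argument order: primary text, segmentation, Penn tags, sentence boundaries, named entities.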
      • hasPennTags

        public boolean hasPennTags()
        Checks whether the document has Penn POS tagging produced by GATE-5.0 ANNIE.
        Returns:
        true if this file has aligned tags/tokens, false otherwise.
      • hasNamedEntities

        public boolean hasNamedEntities()
        Checks whether the document has named-entity annotations produced by GATE-5.0 ANNIE.
        Returns:
        true if this file has named entities, false otherwise.
      • read

        public MascSentence read()
        Returns:
        The next sentence, or null if the end of the document has been reached.
      • reset

        public void reset()
        Resets the reading of sentences to the beginning of the document.
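Taken together, read() and reset() describe a cursor over the document's sentences: read() advances until it returns null, and reset() rewinds to the start. A minimal sketch of that calling pattern follows. SentenceReaderSketch is a hypothetical stand-in that holds plain strings rather than MascSentence objects; it is not the real class and only imitates the contract documented above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the read()/reset() contract; NOT the real MascDocument.
final class SentenceReaderSketch {
    private final List<String> sentences;
    private Iterator<String> cursor;

    SentenceReaderSketch(List<String> sentences) {
        this.sentences = new ArrayList<>(sentences);
        this.cursor = this.sentences.iterator();
    }

    /** Returns the next sentence, or null once the end of the document is reached. */
    String read() {
        return cursor.hasNext() ? cursor.next() : null;
    }

    /** Rewinds reading to the first sentence. */
    void reset() {
        cursor = sentences.iterator();
    }

    /** Drains the reader with the usual read-until-null loop. */
    static List<String> readAll(SentenceReaderSketch doc) {
        List<String> out = new ArrayList<>();
        String sentence;
        while ((sentence = doc.read()) != null) {
            out.add(sentence);
        }
        return out;
    }

    public static void main(String[] args) {
        SentenceReaderSketch doc =
                new SentenceReaderSketch(Arrays.asList("First sentence.", "Second sentence."));
        List<String> firstPass = readAll(doc);
        doc.reset(); // without this, the second pass would be empty
        List<String> secondPass = readAll(doc);
        System.out.println(firstPass.equals(secondPass));
    }
}
```

Because read() signals the end with null rather than an exception, a caller that needs a second pass over the same document has to call reset() first.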