Class DictionaryLemmatizer

  • All Implemented Interfaces:
    Lemmatizer

    public class DictionaryLemmatizer
    extends Object
    implements Lemmatizer
    A Lemmatizer implementation that performs a simple dictionary lookup in a Map built from a file in which each line has the form:

    word\tabpostag\tablemma

    Here, \tab denotes a tab character.
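
To make the line format concrete, the following is a minimal sketch of how one dictionary line could be split into its three fields. The class DictionaryLineSketch and its parseLine method are hypothetical names introduced only for illustration; they are not part of the library, and the error handling shown is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the dictionary line format; not part of the library. */
final class DictionaryLineSketch {

  /** Splits one dictionary line into its tab-separated fields: word, postag, lemma. */
  static List<String> parseLine(String line) {
    String[] fields = line.split("\t");
    if (fields.length != 3) {
      // Rejecting malformed lines is an assumption of this sketch.
      throw new IllegalArgumentException("Expected word<TAB>postag<TAB>lemma but got: " + line);
    }
    return Arrays.asList(fields);
  }
}
```

For example, parseLine("houses\tNNS\thouse") yields the fields houses, NNS and house.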

    • Constructor Detail

      • DictionaryLemmatizer

        public DictionaryLemmatizer(InputStream dictionaryStream,
                                    Charset charset)
                             throws IOException
        Initializes a DictionaryLemmatizer and its associated HashMap from the tab-separated input dictionary.

        Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has multiple possible lemmas, the line should instead have the form word\tabpostag\tablemma01#lemma02#lemma03.

        Parameters:
        dictionaryStream - The dictionary referenced by an open InputStream.
        charset - The character encoding of the dictionary.
        Throws:
        IOException - Thrown if an I/O error occurs while reading from dictionaryStream.
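
As an illustration of the loading step described above, here is a rough sketch that reads a tab-separated dictionary from an InputStream with an explicit Charset and builds a HashMap keyed by the word-postag pair. DictionaryLoadSketch, load and loadSample are hypothetical names; skipping malformed lines and closing the stream after reading are assumptions of this sketch, not documented behaviour of the constructor.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the loading step; not the library's implementation. */
final class DictionaryLoadSketch {

  /** Maps each (word, postag) pair to the lemmas listed on its dictionary line. */
  static Map<List<String>, List<String>> load(InputStream in, Charset charset) throws IOException {
    Map<List<String>, List<String>> dict = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
          continue; // assumption: malformed lines are skipped
        }
        // Multiple lemmas for one word-postag pair are separated by '#'.
        dict.put(Arrays.asList(fields[0], fields[1]), Arrays.asList(fields[2].split("#")));
      }
    }
    return dict;
  }

  /** Loads a two-line in-memory sample so the sketch can be tried without a file. */
  static Map<List<String>, List<String>> loadSample() {
    String sample = "houses\tNNS\thouse\nsaw\tVBD\tsee#saw\n";
    try {
      return load(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)),
          StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Passing the charset explicitly matters when the dictionary contains non-ASCII words, since decoding with the wrong encoding would produce keys that never match the input tokens.
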
      • DictionaryLemmatizer

        public DictionaryLemmatizer(InputStream dictionaryStream)
                             throws IOException
        Initializes a DictionaryLemmatizer and its associated HashMap from the tab-separated input dictionary.

        Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has multiple possible lemmas, the line should instead have the form word\tabpostag\tablemma01#lemma02#lemma03.

        Parameters:
        dictionaryStream - The dictionary referenced by an open InputStream.
        Throws:
        IOException - Thrown if an I/O error occurs while reading from dictionaryStream.
      • DictionaryLemmatizer

        public DictionaryLemmatizer(File dictionaryFile)
                             throws IOException
        Initializes a DictionaryLemmatizer and its associated HashMap from the tab-separated input dictionary.

        Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has multiple possible lemmas, the line should instead have the form word\tabpostag\tablemma01#lemma02#lemma03.

        Parameters:
        dictionaryFile - The dictionary referenced by a valid, readable File.
        Throws:
        IOException - Thrown if an I/O error occurs while reading from dictionaryFile.
      • DictionaryLemmatizer

        public DictionaryLemmatizer(File dictionaryFile,
                                    Charset charset)
                             throws IOException
        Initializes a DictionaryLemmatizer and its associated HashMap from the tab-separated input dictionary.

        Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has multiple possible lemmas, the line should instead have the form word\tabpostag\tablemma01#lemma02#lemma03.

        Parameters:
        dictionaryFile - The dictionary referenced by a valid, readable File.
        charset - The character encoding of the dictionary.
        Throws:
        IOException - Thrown if an I/O error occurs while reading from dictionaryFile.
      • DictionaryLemmatizer

        public DictionaryLemmatizer(Path dictionaryPath)
                             throws IOException
        Initializes a DictionaryLemmatizer and its associated HashMap from the tab-separated input dictionary.

        Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has multiple possible lemmas, the line should instead have the form word\tabpostag\tablemma01#lemma02#lemma03.

        Parameters:
        dictionaryPath - The dictionary referenced by a valid, readable Path.
        Throws:
        IOException - Thrown if an I/O error occurs while reading from dictionaryPath.
    • Method Detail

      • lemmatize

        public String[] lemmatize(String[] tokens,
                                  String[] postags)
        Description copied from interface: Lemmatizer
        Generates a lemma for each token, given its POS tag.
        Specified by:
        lemmatize in interface Lemmatizer
        Parameters:
        tokens - An array of the tokens.
        postags - An array of the POS tags.
        Returns:
        An array of lemmas, one for each token in the tokens sequence.
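
The array-based lookup can be pictured with the sketch below, which works on a plain Map rather than on the class itself. ArrayLookupSketch, its lemmatize method and demo are hypothetical names. Two behaviours are assumptions of this sketch and are not confirmed by the description above: the first listed lemma is chosen when several are available, and the placeholder "O" is returned for a word-postag pair that is not in the dictionary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a per-token dictionary lookup; not the library's code. */
final class ArrayLookupSketch {

  /** Assumed placeholder for word-postag pairs missing from the dictionary. */
  static final String UNKNOWN = "O";

  /** Returns one lemma per token, looked up by the (token, postag) pair. */
  static String[] lemmatize(Map<List<String>, List<String>> dict,
                            String[] tokens, String[] postags) {
    if (tokens.length != postags.length) {
      throw new IllegalArgumentException("tokens and postags must have the same length");
    }
    String[] lemmas = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      List<String> candidates = dict.get(Arrays.asList(tokens[i], postags[i]));
      // Assumption: take the first candidate, or the placeholder if there is none.
      lemmas[i] = (candidates == null || candidates.isEmpty()) ? UNKNOWN : candidates.get(0);
    }
    return lemmas;
  }

  /** Runs the lookup on a tiny in-memory dictionary. */
  static String[] demo() {
    Map<List<String>, List<String>> dict = Map.of(
        List.of("houses", "NNS"), List.of("house"),
        List.of("saw", "VBD"), List.of("see", "saw"));
    return lemmatize(dict,
        new String[] {"houses", "saw", "xyzzy"},
        new String[] {"NNS", "VBD", "NN"});
  }
}
```

With the sample dictionary, demo() returns house, see and O, in that order, so the result stays index-aligned with the input tokens.
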
      • lemmatize

        public List<List<String>> lemmatize(List<String> tokens,
                                            List<String> posTags)
        Description copied from interface: Lemmatizer
        Generates all possible lemmas for each token, given its POS tag.
        Specified by:
        lemmatize in interface Lemmatizer
        Parameters:
        tokens - A list of the tokens.
        posTags - A list of the POS tags.
        Returns:
        A list of every possible lemma for each token in the tokens sequence.
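
The list-based variant differs in that it keeps every candidate lemma for each token. A minimal sketch over a plain Map follows; ListLookupSketch, its lemmatize method and demo are hypothetical names, and returning a single-element list containing "O" for unknown word-postag pairs is an assumption made here for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of an all-candidates lookup; not the library's code. */
final class ListLookupSketch {

  /** Returns, for each token, the list of all lemmas stored for its (token, postag) pair. */
  static List<List<String>> lemmatize(Map<List<String>, List<String>> dict,
                                      List<String> tokens, List<String> posTags) {
    if (tokens.size() != posTags.size()) {
      throw new IllegalArgumentException("tokens and posTags must have the same size");
    }
    List<List<String>> allLemmas = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      List<String> candidates = dict.get(Arrays.asList(tokens.get(i), posTags.get(i)));
      // Assumption: unknown pairs map to a single placeholder entry.
      allLemmas.add(candidates == null ? List.of("O") : candidates);
    }
    return allLemmas;
  }

  /** Runs the lookup on a tiny in-memory dictionary. */
  static List<List<String>> demo() {
    Map<List<String>, List<String>> dict = Map.of(
        List.of("houses", "NNS"), List.of("house"),
        List.of("saw", "VBD"), List.of("see", "saw"));
    return lemmatize(dict, List.of("houses", "saw", "xyzzy"), List.of("NNS", "VBD", "NN"));
  }
}
```

Here the entry for saw/VBD keeps both see and saw, which is the case the lemma01#lemma02 dictionary form is meant to cover.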