Class DictionaryLemmatizer

  • All Implemented Interfaces:
    Lemmatizer

    public class DictionaryLemmatizer
    extends Object
    implements Lemmatizer
    Lemmatizes by simple dictionary lookup into a hash map built from a tab-separated file in which each line has the form word\tpostag\tlemma.
    Version:
    2014-07-08
    • Constructor Detail

      • DictionaryLemmatizer

        public DictionaryLemmatizer(InputStream dictionary,
                                    Charset charset)
                             throws IOException
        Constructs a hash map from the tab-separated input dictionary. Each line of the input file should have the form word\tpostag\tlemma. If multiple lemmas are possible for a (word, postag) pair, separate them with '#': word\tpostag\tlemma01#lemma02#lemma03.
        Parameters:
        dictionary - the input stream from which the dictionary is read
        charset - the character encoding of the input stream
        Throws:
        IOException
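
The line format described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class name DictionaryFormatSketch, the helper parse, the choice to skip malformed lines, and the sample entries are all assumptions made for illustration. It only shows how word\tpostag\tlemma01#lemma02 lines could be read into a map keyed by the (word, postag) pair.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryFormatSketch {

    // Hypothetical helper: reads "word<TAB>postag<TAB>lemma[#lemma...]" lines
    // into a map from (word, postag) to the list of lemmas.
    public static Map<List<String>, List<String>> parse(InputStream in, Charset charset)
            throws IOException {
        Map<List<String>, List<String>> dict = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 3) {
                    continue; // assumption of this sketch: silently skip malformed lines
                }
                List<String> key = Arrays.asList(fields[0], fields[1]);
                // Multiple lemmas for one (word, postag) pair are separated by '#'.
                List<String> lemmas = Arrays.asList(fields[2].split("#"));
                dict.put(key, lemmas);
            }
        }
        return dict;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample entries, for illustration only.
        String data = "houses\tNNS\thouse\nsaw\tVBD\tsee#saw\n";
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        Map<List<String>, List<String>> dict = parse(in, StandardCharsets.UTF_8);
        System.out.println(dict.get(Arrays.asList("houses", "NNS"))); // prints [house]
        System.out.println(dict.get(Arrays.asList("saw", "VBD")));    // prints [see, saw]
    }
}
```

Keying the map on a List of two strings means that two keys with the same word and postag compare equal, so a lookup needs only a freshly built (word, postag) list.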
    • Method Detail

      • getDictMap

        public Map<List<String>, List<String>> getDictMap()
        Gets the Map containing the dictionary.
        Returns:
        the Map containing the dictionary
      • lemmatize

        public String[] lemmatize(String[] tokens,
                                  String[] postags)
        Description copied from interface: Lemmatizer
        Generates lemmas for each word and postag pair, returning the result in an array.
        Specified by:
        lemmatize in interface Lemmatizer
        Parameters:
        tokens - an array of the tokens
        postags - an array of the pos tags
        Returns:
        an array of possible lemmas for each token in the sequence.
      • lemmatize

        public List<List<String>> lemmatize(List<String> tokens,
                                            List<String> posTags)
        Description copied from interface: Lemmatizer
        Generates lemmas for each word and postag pair, returning a list of every possible lemma for each token and postag.
        Specified by:
        lemmatize in interface Lemmatizer
        Parameters:
        tokens - a list of the tokens
        posTags - a list of the pos tags
        Returns:
        a list of every possible lemma for each token in the sequence.
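
As a rough illustration of a lookup that returns every possible lemma per token, the sketch below walks two parallel lists and consults a map of the shape returned by getDictMap(). It does not reproduce the class's actual behavior: the class name LemmaLookupSketch, the helper lookupAll, the sample entries, and in particular the empty-list result for pairs missing from the dictionary are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaLookupSketch {

    // Hypothetical helper: for each (token, posTag) pair, return all lemmas
    // stored under that key, or an empty list if the pair is not present
    // (the empty-list fallback is an assumption of this sketch).
    public static List<List<String>> lookupAll(Map<List<String>, List<String>> dict,
                                               List<String> tokens,
                                               List<String> posTags) {
        if (tokens.size() != posTags.size()) {
            throw new IllegalArgumentException("tokens and posTags must have the same length");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> key = Arrays.asList(tokens.get(i), posTags.get(i));
            result.add(dict.getOrDefault(key, Collections.emptyList()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up dictionary entries, for illustration only.
        Map<List<String>, List<String>> dict = new HashMap<>();
        dict.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));
        dict.put(Arrays.asList("houses", "NNS"), Arrays.asList("house"));

        List<String> tokens = Arrays.asList("saw", "houses", "xyz");
        List<String> posTags = Arrays.asList("VBD", "NNS", "NN");
        System.out.println(lookupAll(dict, tokens, posTags)); // prints [[see, saw], [house], []]
    }
}
```

The result has one inner list per input token, in input order, so position i of the output always corresponds to position i of tokens and posTags.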