Class DictionaryLemmatizer

java.lang.Object
opennlp.tools.lemmatizer.DictionaryLemmatizer
All Implemented Interfaces:
opennlp.tools.lemmatizer.Lemmatizer

public class DictionaryLemmatizer extends Object implements opennlp.tools.lemmatizer.Lemmatizer
A Lemmatizer implementation that works by simple dictionary lookup into a Map built from a file in which each line has the form:

word\tabpostag\tablemma

where \tab denotes a tab character.
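To make the line format concrete, the following is a minimal sketch of how such a file could be read into a Map keyed by the word-postag pair. It uses only the Java standard library and is not the library's own implementation: the class DictionaryFormatSketch, the method parse, the sample entries, and the choice to skip malformed lines are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryFormatSketch {

    // Hypothetical helper: reads tab-separated lines into a map from
    // [word, postag] to the list of lemmas found in the third column.
    public static Map<List<String>, List<String>> parse(BufferedReader reader) throws IOException {
        Map<List<String>, List<String>> dict = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // skipping malformed lines is a choice made for this sketch
            }
            List<String> key = Arrays.asList(fields[0], fields[1]);
            // Several lemmas for one word-postag pair are separated by '#'.
            List<String> lemmas = Arrays.asList(fields[2].split("#"));
            dict.put(key, lemmas);
        }
        return dict;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample entries, one per line.
        String sample = "houses\tNNS\thouse\nsaw\tVBD\tsee#saw\n";
        Map<List<String>, List<String>> dict = parse(new BufferedReader(new StringReader(sample)));
        System.out.println(dict.get(Arrays.asList("houses", "NNS"))); // prints [house]
        System.out.println(dict.get(Arrays.asList("saw", "VBD")));    // prints [see, saw]
    }
}
```

The map type mirrors the Map<List<String>, List<String>> returned by getDictMap(); whether the library normalizes words (for example by lowercasing) before building its keys is not covered by this sketch.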

  • Constructor Details

    • DictionaryLemmatizer

      public DictionaryLemmatizer(InputStream dictionaryStream, Charset charset) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input should have the form word\tabpostag\tablemma. If a word-postag pair has several possible lemmas, separate them with '#': word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryStream - The dictionary referenced by an open InputStream.
      charset - The character encoding of the dictionary.
      Throws:
      IOException - Thrown if an I/O error occurs while reading from dictionaryStream.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(InputStream dictionaryStream) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input should have the form word\tabpostag\tablemma. If a word-postag pair has several possible lemmas, separate them with '#': word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryStream - The dictionary referenced by an open InputStream.
      Throws:
      IOException - Thrown if an I/O error occurs while reading from dictionaryStream.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(File dictionaryFile) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has several possible lemmas, separate them with '#': word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryFile - The dictionary referenced by a valid, readable File.
      Throws:
      IOException - Thrown if an I/O error occurs while reading from dictionaryFile.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(File dictionaryFile, Charset charset) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has several possible lemmas, separate them with '#': word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryFile - The dictionary referenced by a valid, readable File.
      charset - The character encoding of the dictionary.
      Throws:
      IOException - Thrown if an I/O error occurs while reading from dictionaryFile.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(Path dictionaryPath) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. If a word-postag pair has several possible lemmas, separate them with '#': word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryPath - The dictionary referenced via a valid, readable Path.
      Throws:
      IOException - Thrown if an I/O error occurs while reading from dictionaryPath.
  • Method Details

    • getDictMap

      public Map<List<String>, List<String>> getDictMap()
      Returns:
      The Map containing the dictionary.
    • lemmatize

      public String[] lemmatize(String[] tokens, String[] postags)
      Specified by:
      lemmatize in interface opennlp.tools.lemmatizer.Lemmatizer
    • lemmatize

      public List<List<String>> lemmatize(List<String> tokens, List<String> posTags)
      Specified by:
      lemmatize in interface opennlp.tools.lemmatizer.Lemmatizer
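
The two lemmatize signatures differ in their return shape: the array variant yields one String per token, while the List variant yields a List of Strings per token. The sketch below illustrates that difference with a plain lookup over a Map<List<String>, List<String>>. It is a standard-library illustration under stated assumptions, not the library's code: the class LemmaLookupSketch, the "O" placeholder for entries missing from the dictionary, the rule of taking the first lemma in the array variant, and the sample entry are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaLookupSketch {

    // Assumed placeholder for word-postag pairs that are not in the dictionary.
    static final String UNKNOWN = "O";

    // One lemma per token: takes the first candidate (an assumption of this sketch).
    public static String[] lemmatize(Map<List<String>, List<String>> dict,
                                     String[] tokens, String[] postags) {
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            List<String> candidates = dict.get(Arrays.asList(tokens[i], postags[i]));
            lemmas[i] = (candidates == null || candidates.isEmpty()) ? UNKNOWN : candidates.get(0);
        }
        return lemmas;
    }

    // All candidate lemmas per token.
    public static List<List<String>> lemmatize(Map<List<String>, List<String>> dict,
                                               List<String> tokens, List<String> posTags) {
        List<List<String>> all = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> candidates = dict.get(Arrays.asList(tokens.get(i), posTags.get(i)));
            all.add(candidates == null ? Collections.singletonList(UNKNOWN) : candidates);
        }
        return all;
    }

    public static void main(String[] args) {
        // Made-up dictionary with a single ambiguous entry.
        Map<List<String>, List<String>> dict = new HashMap<>();
        dict.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));

        String[] tokens = {"saw", "xyz"};
        String[] postags = {"VBD", "NN"};

        System.out.println(Arrays.toString(lemmatize(dict, tokens, postags)));
        // prints [see, O]
        System.out.println(lemmatize(dict, Arrays.asList(tokens), Arrays.asList(postags)));
        // prints [[see, saw], [O]]
    }
}
```

In both variants the tokens and postags are paired by position, so the two inputs are expected to have the same length.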