Class DictionaryLemmatizer

java.lang.Object
opennlp.tools.lemmatizer.DictionaryLemmatizer
All Implemented Interfaces:
Lemmatizer

public class DictionaryLemmatizer extends Object implements Lemmatizer
A Lemmatizer implementation that works by simple dictionary lookup into a Map built from a file in which each line has the form:

word\tabpostag\tablemma

where \tab denotes a tab character.
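As an illustration of the line format, the following minimal sketch parses such lines into a lookup map. The class name DictionaryFormatSketch and the helper parseLines are hypothetical. The key layout ([word, postag]) is an assumption suggested by the Map<List<String>,List<String>> type returned by getDictMap(). This is not the library's own parsing code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryFormatSketch {

  // Hypothetical helper: turns "word<TAB>postag<TAB>lemma" lines into a lookup map.
  public static Map<List<String>, List<String>> parseLines(List<String> lines) {
    Map<List<String>, List<String>> dict = new HashMap<>();
    for (String line : lines) {
      String[] fields = line.split("\t");
      if (fields.length != 3) {
        continue; // skip lines that do not have exactly three fields
      }
      // Assumed key layout: [word, postag]; value: a one-element list holding the lemma.
      dict.put(Arrays.asList(fields[0], fields[1]), Arrays.asList(fields[2]));
    }
    return dict;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("houses\tNNS\thouse", "ran\tVBD\trun");
    Map<List<String>, List<String>> dict = parseLines(lines);
    System.out.println(dict.get(Arrays.asList("houses", "NNS"))); // prints [house]
  }
}
```

Keying on the word and POS tag together lets the same surface form map to different lemmas depending on its tag.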

  • Constructor Details

    • DictionaryLemmatizer

      public DictionaryLemmatizer(InputStream dictionaryStream, Charset charset) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input should have the form word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for a word-postag pair, the line should have the form word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryStream - The dictionary referenced by an open InputStream.
      charset - The character encoding of the dictionary.
      Throws:
      IOException - Thrown if I/O errors occur while reading from dictionaryStream.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(InputStream dictionaryStream) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input should have the form word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for a word-postag pair, the line should have the form word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryStream - The dictionary referenced by an open InputStream.
      Throws:
      IOException - Thrown if I/O errors occur while reading from dictionaryStream.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(File dictionaryFile) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for a word-postag pair, the line should have the form word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryFile - The dictionary referenced by a valid, readable File.
      Throws:
      IOException - Thrown if I/O errors occur while reading from dictionaryFile.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(File dictionaryFile, Charset charset) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for a word-postag pair, the line should have the form word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryFile - The dictionary referenced by a valid, readable File.
      charset - The character encoding of the dictionary.
      Throws:
      IOException - Thrown if I/O errors occur while reading from dictionaryFile.
    • DictionaryLemmatizer

      public DictionaryLemmatizer(Path dictionaryPath) throws IOException
      Initializes a DictionaryLemmatizer and its related HashMap from the tab-separated input dictionary.

      Each line of the input file should have the form word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for a word-postag pair, the line should have the form word\tabpostag\tablemma01#lemma02#lemma03.

      Parameters:
      dictionaryPath - The dictionary referenced via a valid, readable Path.
      Throws:
      IOException - Thrown if I/O errors occur while reading from dictionaryPath.
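
The constructors above all read the same format and differ only in how the dictionary source and its encoding are supplied. The sketch below, under stated assumptions, shows how a stream and an explicit Charset might be decoded line by line, and how a '#'-separated lemma field can be split into several lemmas. DictionaryStreamSketch, read, and splitLemmas are hypothetical names, and the [word, postag] key layout is an assumption. It does not reproduce the library's implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryStreamSketch {

  // "lemma01#lemma02" becomes [lemma01, lemma02]; a single lemma becomes a one-element list.
  public static List<String> splitLemmas(String lemmaField) {
    return Arrays.asList(lemmaField.split("#"));
  }

  // Hypothetical helper taking the same inputs as the (InputStream, Charset) constructor.
  public static Map<List<String>, List<String>> read(InputStream in, Charset charset)
      throws IOException {
    Map<List<String>, List<String>> dict = new HashMap<>();
    // The explicit charset controls how the dictionary bytes are decoded into text.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
          continue; // ignore lines that are not word<TAB>postag<TAB>lemma(s)
        }
        // Assumed key layout: [word, postag].
        dict.put(Arrays.asList(fields[0], fields[1]), splitLemmas(fields[2]));
      }
    }
    return dict;
  }

  public static void main(String[] args) throws IOException {
    String text = "saw\tNN\tsaw\n" + "saw\tVBD\tsee#saw\n";
    InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    Map<List<String>, List<String>> dict = read(in, StandardCharsets.UTF_8);
    System.out.println(dict.get(Arrays.asList("saw", "VBD"))); // prints [see, saw]
  }
}
```

Passing the charset explicitly avoids depending on the platform default encoding when the dictionary contains non-ASCII words.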
  • Method Details

    • getDictMap

      public Map<List<String>,List<String>> getDictMap()
      Returns:
      The Map containing the dictionary.
    • lemmatize

      public String[] lemmatize(String[] tokens, String[] postags)
      Description copied from interface: Lemmatizer
      Generates a lemma for each token, given its POS tag.
      Specified by:
      lemmatize in interface Lemmatizer
      Parameters:
      tokens - An array of the tokens.
      postags - An array of the POS tags.
      Returns:
      An array containing a lemma for each token in the tokens array.
    • lemmatize

      public List<List<String>> lemmatize(List<String> tokens, List<String> posTags)
      Description copied from interface: Lemmatizer
      Generates the possible lemmas for each token, given its POS tag.
      Specified by:
      lemmatize in interface Lemmatizer
      Parameters:
      tokens - A list of the tokens.
      posTags - A list of the POS tags.
      Returns:
      A list of every possible lemma for each token in the tokens list.
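
The two lemmatize overloads differ in the shape of their result: one entry per token in the array form, and a list of candidate lemmas per token in the list form. The following sketch illustrates that difference over a map of the kind returned by getDictMap(). LemmaLookupSketch and its methods are hypothetical. The [word, postag] key layout, the "O" placeholder for pairs missing from the dictionary, and the choice of the first listed lemma in the array form are all assumptions made for illustration, not behavior confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaLookupSketch {

  // Assumption: placeholder used when a (word, postag) pair is not in the dictionary.
  static final String UNKNOWN = "O";

  // Array-style lookup: one lemma per token (here, the first listed lemma).
  public static String[] firstLemmas(Map<List<String>, List<String>> dict,
                                     String[] tokens, String[] postags) {
    String[] lemmas = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      List<String> found = dict.get(Arrays.asList(tokens[i], postags[i]));
      lemmas[i] = (found == null || found.isEmpty()) ? UNKNOWN : found.get(0);
    }
    return lemmas;
  }

  // List-style lookup: every candidate lemma per token.
  public static List<List<String>> allLemmas(Map<List<String>, List<String>> dict,
                                             List<String> tokens, List<String> posTags) {
    List<List<String>> lemmas = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      List<String> found = dict.get(Arrays.asList(tokens.get(i), posTags.get(i)));
      lemmas.add(found == null ? Collections.singletonList(UNKNOWN) : found);
    }
    return lemmas;
  }

  public static void main(String[] args) {
    Map<List<String>, List<String>> dict = new HashMap<>();
    dict.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));
    String[] tokens = {"saw", "zzz"};
    String[] tags = {"VBD", "NN"};
    System.out.println(Arrays.toString(firstLemmas(dict, tokens, tags))); // prints [see, O]
    System.out.println(allLemmas(dict, Arrays.asList(tokens), Arrays.asList(tags))); // prints [[see, saw], [O]]
  }
}
```

In both forms the tokens and POS tags are parallel sequences, so position i of the result corresponds to position i of the input.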