Class DictionaryLemmatizer
java.lang.Object
    opennlp.tools.lemmatizer.DictionaryLemmatizer
All Implemented Interfaces: Lemmatizer
public class DictionaryLemmatizer extends Object implements Lemmatizer
A Lemmatizer implementation that works by simple dictionary lookup into a Map built from a file containing, for each line: word\tabpostag\tablemma (where \tab denotes a tab character).
Constructor Summary
Constructors
DictionaryLemmatizer(File dictionaryFile)
    Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
DictionaryLemmatizer(File dictionaryFile, Charset charset)
    Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
DictionaryLemmatizer(InputStream dictionaryStream)
    Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
DictionaryLemmatizer(InputStream dictionaryStream, Charset charset)
    Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
DictionaryLemmatizer(Path dictionaryPath)
    Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
Method Summary
Modifier and Type, Method, Description
Map<List<String>,List<String>> getDictMap()
    Retrieves the Map containing the dictionary.
String[] lemmatize(String[] tokens, String[] postags)
    Generates lemmas for the word and postag.
List<List<String>> lemmatize(List<String> tokens, List<String> posTags)
    Generates lemma tags for the word and postag.
Constructor Detail
-
DictionaryLemmatizer
public DictionaryLemmatizer(InputStream dictionaryStream, Charset charset) throws IOException
Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
The input file should have, for each line, word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for each word-postag pair, then the format should be word\tabpostag\tablemma01#lemma02#lemma03.
Parameters:
dictionaryStream - The dictionary referenced by an open InputStream.
charset - The character encoding of the dictionary.
Throws:
IOException - Thrown if IO errors occurred while reading in from dictionaryStream.
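The line format described above can be illustrated with a minimal, JDK-only sketch. It is not the library's implementation: the class DictionaryFormatSketch and its load method are hypothetical names, and skipping malformed lines is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: parses the documented line format.
class DictionaryFormatSketch {

    // Reads "word<TAB>postag<TAB>lemma[#lemma...]" lines into a (word, postag) -> lemmas map.
    static Map<List<String>, List<String>> load(InputStream in, Charset charset) throws IOException {
        Map<List<String>, List<String>> dict = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 3) {
                    continue; // assumption of this sketch: skip malformed lines
                }
                // Multiple lemmas for one word-postag pair are separated by '#'.
                dict.put(Arrays.asList(fields[0], fields[1]), Arrays.asList(fields[2].split("#")));
            }
        }
        return dict;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample entries, one per line.
        String data = "houses\tNNS\thouse\nsaw\tNN\tsaw\nsaw\tVBD\tsee#saw\n";
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        Map<List<String>, List<String>> dict = load(in, StandardCharsets.UTF_8);
        System.out.println(dict.get(Arrays.asList("saw", "VBD"))); // prints [see, saw]
    }
}
```

The sample shows both documented variants: a single lemma per line and several lemmas joined with #.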
-
DictionaryLemmatizer
public DictionaryLemmatizer(InputStream dictionaryStream) throws IOException
Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
The input file should have, for each line, word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for each word-postag pair, then the format should be word\tabpostag\tablemma01#lemma02#lemma03.
Parameters:
dictionaryStream - The dictionary referenced by an open InputStream.
Throws:
IOException - Thrown if IO errors occurred while reading in from dictionaryStream.
-
DictionaryLemmatizer
public DictionaryLemmatizer(File dictionaryFile) throws IOException
Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
The input file should have, for each line, word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for each word-postag pair, then the format should be word\tabpostag\tablemma01#lemma02#lemma03.
Parameters:
dictionaryFile - The dictionary referenced by a valid, readable File.
Throws:
IOException - Thrown if IO errors occurred while reading in from dictionaryFile.
-
DictionaryLemmatizer
public DictionaryLemmatizer(File dictionaryFile, Charset charset) throws IOException
Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
The input file should have, for each line, word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for each word-postag pair, then the format should be word\tabpostag\tablemma01#lemma02#lemma03.
Parameters:
dictionaryFile - The dictionary referenced by a valid, readable File.
charset - The character encoding of the dictionary.
Throws:
IOException - Thrown if IO errors occurred while reading in from dictionaryFile.
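The charset parameter matters whenever dictionary entries contain non-ASCII characters. The following JDK-only sketch (CharsetMismatchSketch and decode are hypothetical names, and the sample entry is invented) shows how decoding the same bytes with the wrong charset yields a word that no longer matches the intended lookup key.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of OpenNLP.
class CharsetMismatchSketch {

    // Decodes raw dictionary bytes the way a reader configured with `charset` would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // An invented dictionary line, stored as UTF-8 bytes ("\u00e9" is e with acute accent).
        byte[] raw = "caf\u00e9s\tNNS\tcaf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        // The word field only matches the intended key when the charset is correct.
        System.out.println(right.split("\t")[0].equals("caf\u00e9s")); // prints true
        System.out.println(wrong.split("\t")[0].equals("caf\u00e9s")); // prints false
    }
}
```

Passing the charset explicitly avoids depending on whatever default the single-argument constructors use, which this page does not state.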
-
DictionaryLemmatizer
public DictionaryLemmatizer(Path dictionaryPath) throws IOException
Initializes a DictionaryLemmatizer and related HashMap from the input tab-separated dictionary.
The input file should have, for each line, word\tabpostag\tablemma. Alternatively, if multiple lemmas are possible for each word-postag pair, then the format should be word\tabpostag\tablemma01#lemma02#lemma03.
Parameters:
dictionaryPath - The dictionary referenced via a valid, readable Path.
Throws:
IOException - Thrown if IO errors occurred while reading in from dictionaryPath.
-
-
Method Detail
-
getDictMap
public Map<List<String>,List<String>> getDictMap()
Returns:
The Map containing the dictionary.
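The return type Map<List<String>,List<String>> suggests that each key is a two-element list and each value a list of lemmas. The sketch below assumes keys of the form [word, postag], inferred from the dictionary format rather than stated on this page. DictMapLookupSketch and lemmasFor are hypothetical names, and the map is a hand-built stand-in for the one returned by getDictMap().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of OpenNLP.
class DictMapLookupSketch {

    // Looks up the lemmas stored for a (word, postag) pair; returns an empty list if absent.
    static List<String> lemmasFor(Map<List<String>, List<String>> dictMap, String word, String postag) {
        // List keys compare by content, so a freshly built list finds the stored entry.
        List<String> found = dictMap.get(Arrays.asList(word, postag));
        return found != null ? found : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        // Stand-in for the map returned by getDictMap(); the entry is invented.
        Map<List<String>, List<String>> dictMap = new HashMap<>();
        dictMap.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));

        System.out.println(lemmasFor(dictMap, "saw", "VBD")); // prints [see, saw]
        System.out.println(lemmasFor(dictMap, "saw", "NN"));  // prints []
    }
}
```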
-
lemmatize
public String[] lemmatize(String[] tokens, String[] postags)
Description copied from interface: Lemmatizer
Generates lemmas for the word and postag.
Specified by:
lemmatize in interface Lemmatizer
Parameters:
tokens - An array of the tokens.
postags - An array of the POS tags.
Returns:
An array of possible lemmas for each token in the tokens sequence.
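As a rough model of the array variant, the sketch below walks two parallel arrays and returns one lemma per token. It is not the library's code. Three things are assumptions of this sketch: taking the first stored lemma, using "O" as the placeholder for unknown pairs, and rejecting arrays of different lengths. ArrayLemmatizeSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of OpenNLP.
class ArrayLemmatizeSketch {

    static final String UNKNOWN = "O"; // assumed placeholder for pairs missing from the dictionary

    // Returns one lemma per token by looking up each (token, postag) pair.
    static String[] lemmatize(Map<List<String>, List<String>> dict, String[] tokens, String[] postags) {
        if (tokens.length != postags.length) {
            throw new IllegalArgumentException("tokens and postags must have the same length");
        }
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            List<String> found = dict.get(Arrays.asList(tokens[i], postags[i]));
            // Assumption: when several lemmas are stored, keep the first one.
            lemmas[i] = (found == null || found.isEmpty()) ? UNKNOWN : found.get(0);
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // Invented dictionary entries.
        Map<List<String>, List<String>> dict = new HashMap<>();
        dict.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));
        dict.put(Arrays.asList("houses", "NNS"), Arrays.asList("house"));

        String[] tokens = {"saw", "houses", "xyz"};
        String[] postags = {"VBD", "NNS", "NN"};
        System.out.println(Arrays.toString(lemmatize(dict, tokens, postags))); // prints [see, house, O]
    }
}
```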
-
lemmatize
public List<List<String>> lemmatize(List<String> tokens, List<String> posTags)
Description copied from interface: Lemmatizer
Generates lemma tags for the word and postag.
Specified by:
lemmatize in interface Lemmatizer
Parameters:
tokens - A list of the tokens.
posTags - A list of the POS tags.
Returns:
A list of every possible lemma for each token in the tokens sequence.
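The list variant differs from the array variant in that it returns every lemma stored for a token rather than a single one. A minimal JDK-only sketch of that shape follows. ListLemmatizeSketch is a hypothetical name, and the single-element ["O"] result for unknown pairs is an assumption of this sketch, not something this page confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of OpenNLP.
class ListLemmatizeSketch {

    // Returns, for each token, the full list of lemmas stored for its (token, postag) pair.
    static List<List<String>> lemmatize(Map<List<String>, List<String>> dict,
                                        List<String> tokens, List<String> posTags) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> found = dict.get(Arrays.asList(tokens.get(i), posTags.get(i)));
            // Assumption: unknown pairs yield a one-element list holding the placeholder "O".
            result.add(found != null ? found : Collections.singletonList("O"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented dictionary entry with two possible lemmas.
        Map<List<String>, List<String>> dict = new HashMap<>();
        dict.put(Arrays.asList("saw", "VBD"), Arrays.asList("see", "saw"));

        List<List<String>> lemmas = lemmatize(dict, Arrays.asList("saw", "xyz"), Arrays.asList("VBD", "NN"));
        System.out.println(lemmas); // prints [[see, saw], [O]]
    }
}
```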