Package opennlp.tools.ngram
Class NGramUtils
- java.lang.Object
  - opennlp.tools.ngram.NGramUtils
public class NGramUtils extends Object
Utility class for ngrams. Some methods apply only to specific values of n, e.g. trigrams, bigrams, or unigrams.
-
-
Constructor Summary
NGramUtils()
-
Method Summary
static double calculateBigramMLProbability(String x0, String x1, Collection<StringList> set)
Calculates the probability of a bigram in a vocabulary using maximum likelihood estimation.
static double calculateBigramPriorSmoothingProbability(String x0, String x1, Collection<StringList> set, Double k)
Calculates the probability of a bigram in a vocabulary using the prior Laplace smoothing algorithm.
static double calculateLaplaceSmoothingProbability(StringList ngram, Iterable<StringList> set, Double k)
Calculates the probability of an ngram in a vocabulary using the Laplace smoothing algorithm.
static double calculateMissingNgramProbabilityMass(StringList ngram, Double discount, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using the missing probability mass algorithm.
static double calculateNgramMLProbability(StringList ngram, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using maximum likelihood estimation.
static double calculateTrigramLinearInterpolationProbability(String x0, String x1, String x2, Collection<StringList> set, Double lambda1, Double lambda2, Double lambda3)
Calculates the probability of a trigram in a vocabulary using a linear interpolation algorithm.
static double calculateTrigramMLProbability(String x0, String x1, String x2, Iterable<StringList> set)
Calculates the probability of a trigram in a vocabulary using maximum likelihood estimation.
static double calculateUnigramMLProbability(String word, Collection<StringList> set)
Calculates the probability of a unigram in a vocabulary using maximum likelihood estimation.
static Collection<String[]> getNGrams(String[] sequence, int size)
Gets the ngrams of dimension n of a certain input sequence of tokens.
static Collection<StringList> getNGrams(StringList sequence, int size)
Gets the ngrams of dimension n of a certain input sequence of tokens.
static StringList getNMinusOneTokenFirst(StringList ngram)
Gets the (n-1)-gram of a given ngram, i.e. the same ngram without its last word.
static StringList getNMinusOneTokenLast(StringList ngram)
Gets the (n-1)-gram of a given ngram, i.e. the same ngram without its first word.
-
-
-
Method Detail
-
calculateLaplaceSmoothingProbability
public static double calculateLaplaceSmoothingProbability(StringList ngram, Iterable<StringList> set, Double k)
Calculates the probability of an ngram in a vocabulary using the Laplace smoothing algorithm.
- Parameters:
- ngram - the ngram to get the probability for
- set - the vocabulary
- k - the smoothing factor
- Returns:
- the Laplace smoothing probability
- See Also:
- Additive Smoothing
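As a rough illustration of the additive (Laplace) smoothing idea referenced above, the sketch below computes a smoothed probability directly from counts. It uses the textbook form `(count + k) / (contextCount + k * vocabularySize)`; this page does not state which normalization `NGramUtils` applies, so the class name `AdditiveSmoothingSketch`, the method `smoothed`, and the `vocabularySize` parameter are assumptions made for this example, not part of the OpenNLP API.

```java
final class AdditiveSmoothingSketch {

    // Textbook additive smoothing (an assumption, not necessarily what NGramUtils does):
    // every possible continuation receives k pseudo-counts, so unseen ngrams get a
    // small non-zero probability instead of zero.
    static double smoothed(long ngramCount, long contextCount, long vocabularySize, double k) {
        return (ngramCount + k) / (contextCount + k * vocabularySize);
    }

    public static void main(String[] args) {
        // An ngram never observed after a context seen 3 times, with 10 word types and k = 1.
        System.out.println(smoothed(0, 3, 10, 1.0));
        // With k = 0 the formula reduces to the plain maximum likelihood ratio.
        System.out.println(smoothed(2, 3, 10, 0.0));
    }
}
```

With `k = 0` the estimate falls back to maximum likelihood, and larger values of `k` move the distribution toward uniform.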
-
calculateUnigramMLProbability
public static double calculateUnigramMLProbability(String word, Collection<StringList> set)
Calculates the probability of a unigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- word - the only word in the unigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
-
calculateBigramMLProbability
public static double calculateBigramMLProbability(String x0, String x1, Collection<StringList> set)
Calculates the probability of a bigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- x0 - first word in the bigram
- x1 - second word in the bigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
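To make the maximum likelihood estimate concrete, here is a minimal, self-contained sketch of the usual bigram formula `count(x0 x1) / count(x0)`. It uses `List<List<String>>` as a stand-in for `Collection<StringList>` so that it runs without OpenNLP on the classpath; the class `BigramMleSketch` and its methods are hypothetical, and returning 0 for an unseen context is a choice made for this example rather than documented library behavior.

```java
import java.util.List;

final class BigramMleSketch {

    // Counts how often the token sequence `gram` occurs contiguously in any sentence.
    static int count(List<String> gram, List<List<String>> sentences) {
        int n = gram.size();
        int total = 0;
        for (List<String> sentence : sentences) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(gram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // P(x1 | x0) = count(x0 x1) / count(x0); returns 0 when x0 never occurs (sketch assumption).
    static double bigramML(String x0, String x1, List<List<String>> sentences) {
        int context = count(List.of(x0), sentences);
        if (context == 0) {
            return 0d;
        }
        return (double) count(List.of(x0, x1), sentences) / context;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("<s>", "I", "am", "Sam", "</s>"),
            List.of("<s>", "Sam", "I", "am", "</s>"),
            List.of("<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"));
        // Two of the three occurrences of "<s>" are followed by "I".
        System.out.println(bigramML("<s>", "I", corpus));
    }
}
```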
-
calculateTrigramMLProbability
public static double calculateTrigramMLProbability(String x0, String x1, String x2, Iterable<StringList> set)
Calculates the probability of a trigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- x0 - first word in the trigram
- x1 - second word in the trigram
- x2 - third word in the trigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
-
calculateNgramMLProbability
public static double calculateNgramMLProbability(StringList ngram, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using maximum likelihood estimation.
- Parameters:
- ngram - an ngram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
-
calculateBigramPriorSmoothingProbability
public static double calculateBigramPriorSmoothingProbability(String x0, String x1, Collection<StringList> set, Double k)
Calculates the probability of a bigram in a vocabulary using the prior Laplace smoothing algorithm.
- Parameters:
- x0 - the first word in the bigram
- x1 - the second word in the bigram
- set - the vocabulary
- k - the smoothing factor
- Returns:
- the prior Laplace smoothing probability
-
calculateTrigramLinearInterpolationProbability
public static double calculateTrigramLinearInterpolationProbability(String x0, String x1, String x2, Collection<StringList> set, Double lambda1, Double lambda2, Double lambda3)
Calculates the probability of a trigram in a vocabulary using a linear interpolation algorithm.
- Parameters:
- x0 - the first word in the trigram
- x1 - the second word in the trigram
- x2 - the third word in the trigram
- set - the vocabulary
- lambda1 - trigram interpolation factor
- lambda2 - bigram interpolation factor
- lambda3 - unigram interpolation factor
- Returns:
- the linear interpolation probability
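The role of the three factors can be sketched as a weighted sum of the trigram, bigram, and unigram estimates. The example below takes the three component probabilities as inputs rather than computing them from a vocabulary; the class `LinearInterpolationSketch`, the method `interpolate`, and the check that the weights are non-negative and sum to 1 are assumptions for illustration, and this page does not say whether `NGramUtils` performs any such validation.

```java
final class LinearInterpolationSketch {

    // Weighted sum of the three estimates: lambda1 weights the trigram estimate,
    // lambda2 the bigram estimate, and lambda3 the unigram estimate.
    static double interpolate(double pTrigram, double pBigram, double pUnigram,
                              double lambda1, double lambda2, double lambda3) {
        // Sketch assumption: weights form a convex combination so the result stays a probability.
        if (lambda1 < 0 || lambda2 < 0 || lambda3 < 0
                || Math.abs(lambda1 + lambda2 + lambda3 - 1d) > 1e-9) {
            throw new IllegalArgumentException("lambdas must be non-negative and sum to 1");
        }
        return lambda1 * pTrigram + lambda2 * pBigram + lambda3 * pUnigram;
    }

    public static void main(String[] args) {
        // An unseen trigram (estimate 0) still gets probability mass from the lower orders.
        System.out.println(interpolate(0.0, 0.5, 0.1, 0.6, 0.3, 0.1));
    }
}
```

The benefit of interpolation shows up when the trigram was never observed: the bigram and unigram terms keep the result above zero.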
-
calculateMissingNgramProbabilityMass
public static double calculateMissingNgramProbabilityMass(StringList ngram, Double discount, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using the missing probability mass algorithm.
- Parameters:
- ngram - the ngram
- discount - discount factor
- set - the vocabulary
- Returns:
- the probability
-
getNMinusOneTokenFirst
public static StringList getNMinusOneTokenFirst(StringList ngram)
Gets the (n-1)-gram of a given ngram, i.e. the same ngram without its last word.
- Parameters:
- ngram - an ngram
- Returns:
- the ngram without its last word
-
getNMinusOneTokenLast
public static StringList getNMinusOneTokenLast(StringList ngram)
Gets the (n-1)-gram of a given ngram, i.e. the same ngram without its first word.
- Parameters:
- ngram - an ngram
- Returns:
- the ngram without its first word
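The difference between the two (n-1)-gram helpers is easiest to see side by side. The sketch below mirrors them on plain `List<String>` values instead of `StringList`; the class `NMinusOneSketch` and the methods `dropLast` and `dropFirst` are hypothetical names for this example, and returning an empty input unchanged is an assumption, since this page does not describe that edge case.

```java
import java.util.List;

final class NMinusOneSketch {

    // Analogue of getNMinusOneTokenFirst: keeps the first n-1 tokens.
    static List<String> dropLast(List<String> ngram) {
        return ngram.isEmpty() ? ngram : ngram.subList(0, ngram.size() - 1);
    }

    // Analogue of getNMinusOneTokenLast: keeps the last n-1 tokens.
    static List<String> dropFirst(List<String> ngram) {
        return ngram.isEmpty() ? ngram : ngram.subList(1, ngram.size());
    }

    public static void main(String[] args) {
        List<String> trigram = List.of("I", "am", "Sam");
        System.out.println(dropLast(trigram));  // the history "I am"
        System.out.println(dropFirst(trigram)); // the suffix "am Sam"
    }
}
```

In ngram language models the first form typically serves as the conditioning history, for example as the denominator count in a maximum likelihood estimate.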
-
getNGrams
public static Collection<StringList> getNGrams(StringList sequence, int size)
Gets the ngrams of dimension n of a certain input sequence of tokens.
- Parameters:
- sequence - a sequence of tokens
- size - the size of the resulting ngrams
- Returns:
- all the possible ngrams of the given size derivable from the input sequence
-
getNGrams
public static Collection<String[]> getNGrams(String[] sequence, int size)
Gets the ngrams of dimension n of a certain input sequence of tokens.
- Parameters:
- sequence - a sequence of tokens
- size - the size of the resulting ngrams
- Returns:
- all the possible ngrams of the given size derivable from the input sequence
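Extracting all ngrams of a given size is commonly done with a sliding window over the token sequence, which the following self-contained sketch illustrates for the `String[]` variant. The class `SlidingNGramSketch` and the method `ngrams` are hypothetical; in particular, returning an empty list when `size` is less than 1 or larger than the sequence is a choice made here, and the behavior of `getNGrams` in those cases is not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SlidingNGramSketch {

    // Slides a window of `size` tokens over the sequence and copies each window.
    static List<String[]> ngrams(String[] sequence, int size) {
        List<String[]> result = new ArrayList<>();
        if (size < 1) {
            return result; // sketch assumption: no ngrams for non-positive sizes
        }
        for (int i = 0; i + size <= sequence.length; i++) {
            result.add(Arrays.copyOfRange(sequence, i, i + size));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "do", "not", "like"};
        for (String[] gram : ngrams(tokens, 2)) {
            System.out.println(Arrays.toString(gram));
        }
    }
}
```

A sequence of m tokens yields m - size + 1 windows, so four tokens produce three bigrams.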
-
-