Package opennlp.tools.ngram
Class NGramUtils
java.lang.Object
opennlp.tools.ngram.NGramUtils
Utility class for ngrams.
Some methods apply only to specific values of n, e.g. tri/bi/uni-grams.

Constructor Summary
- NGramUtils()

Method Summary
- static double calculateBigramMLProbability(String x0, String x1, Collection<StringList> set) - Calculates the probability of a bigram in a vocabulary using maximum likelihood estimation.
- static double calculateBigramPriorSmoothingProbability(String x0, String x1, Collection<StringList> set, Double k) - Calculates the probability of a bigram in a vocabulary using the prior Laplace smoothing algorithm.
- static double calculateLaplaceSmoothingProbability(StringList ngram, Iterable<StringList> set, Double k) - Calculates the probability of an ngram in a vocabulary using the Laplace smoothing algorithm.
- static double calculateMissingNgramProbabilityMass(StringList ngram, double discount, Iterable<StringList> set) - Calculates the probability of an ngram in a vocabulary using the missing probability mass algorithm.
- static double calculateNgramMLProbability(StringList ngram, Iterable<StringList> set) - Calculates the probability of an ngram in a vocabulary using maximum likelihood estimation.
- static double calculateTrigramLinearInterpolationProbability(String x0, String x1, String x2, Collection<StringList> set, Double lambda1, Double lambda2, Double lambda3) - Calculates the probability of a trigram in a vocabulary using a linear interpolation algorithm.
- static double calculateTrigramMLProbability(String x0, String x1, String x2, Iterable<StringList> set) - Calculates the probability of a trigram in a vocabulary using maximum likelihood estimation.
- static double calculateUnigramMLProbability(String word, Collection<StringList> set) - Calculates the probability of a unigram in a vocabulary using maximum likelihood estimation.
- static Collection<String[]> getNGrams(String[] sequence, int size) - Gets the ngrams of dimension n from an input sequence of tokens.
- static Collection<StringList> getNGrams(StringList sequence, int size) - Gets the ngrams of dimension n from an input sequence of tokens.
- static StringList getNMinusOneTokenFirst(StringList ngram) - Gets the (n-1)-gram of a given ngram, that is, the same ngram without its last word.
- static StringList getNMinusOneTokenLast(StringList ngram) - Gets the (n-1)-gram of a given ngram, that is, the same ngram without its first word.

Constructor Details
NGramUtils
public NGramUtils()

Method Details
calculateLaplaceSmoothingProbability
public static double calculateLaplaceSmoothingProbability(StringList ngram, Iterable<StringList> set, Double k)
Calculates the probability of an ngram in a vocabulary using the Laplace smoothing algorithm.
- Parameters:
- ngram - the ngram to get the probability for
- set - the vocabulary
- k - the smoothing factor
- Returns:
- the Laplace smoothing probability
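
As a rough illustration of add-k (Laplace) smoothing, the sketch below uses the textbook form P(w_n | w_1..w_{n-1}) = (c(w_1..w_n) + k) / (c(w_1..w_{n-1}) + k * V), where V is the number of distinct words. The class `LaplaceSketch`, its helper methods, and the `List<List<String>>` corpus representation are assumptions made for this example. It does not call OpenNLP, and the library's exact denominator may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class LaplaceSketch {

    // Counts contiguous occurrences of ngram across all sentences.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int n = ngram.size();
        int total = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(ngram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // Number of distinct tokens in the corpus (V in the formula).
    static int vocabularySize(List<List<String>> corpus) {
        Set<String> words = new HashSet<>();
        for (List<String> sentence : corpus) {
            words.addAll(sentence);
        }
        return words.size();
    }

    // Add-k estimate of P(last word | preceding words); expects at least a bigram.
    static double laplace(List<String> ngram, List<List<String>> corpus, double k) {
        if (ngram.size() < 2) {
            throw new IllegalArgumentException("expects an ngram with n >= 2");
        }
        List<String> prefix = ngram.subList(0, ngram.size() - 1);
        double numerator = count(ngram, corpus) + k;
        double denominator = count(prefix, corpus) + k * vocabularySize(corpus);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"));
        // Seen bigram: (1 + 1) / (2 + 1 * 4)
        System.out.println(laplace(Arrays.asList("the", "cat"), corpus, 1.0));
        // Unseen bigram still receives some mass: (0 + 1) / (2 + 1 * 4)
        System.out.println(laplace(Arrays.asList("the", "sat"), corpus, 1.0));
    }
}
```

With k greater than zero, an ngram that never occurs in the corpus still gets a non-zero estimate, which is the point of smoothing.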

calculateUnigramMLProbability
public static double calculateUnigramMLProbability(String word, Collection<StringList> set)
Calculates the probability of a unigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- word - the only word in the unigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
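
For a unigram, the maximum likelihood estimate is usually taken to be the word's relative frequency: its number of occurrences divided by the total number of tokens. The following is a minimal sketch under that assumption. `UnigramMlSketch` and the `List<List<String>>` corpus type are hypothetical stand-ins and do not use OpenNLP's `StringList`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class UnigramMlSketch {

    // Maximum likelihood estimate: occurrences of word / total number of tokens.
    static double unigramMl(String word, List<List<String>> corpus) {
        long occurrences = 0;
        long totalTokens = 0;
        for (List<String> sentence : corpus) {
            totalTokens += sentence.size();
            for (String token : sentence) {
                if (token.equals(word)) {
                    occurrences++;
                }
            }
        }
        // An empty corpus carries no evidence; return 0 instead of dividing by zero.
        return totalTokens == 0 ? 0.0 : (double) occurrences / totalTokens;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"));
        // "the" accounts for 2 of the 6 tokens.
        System.out.println(unigramMl("the", corpus));
    }
}
```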
 
calculateBigramMLProbability
public static double calculateBigramMLProbability(String x0, String x1, Collection<StringList> set)
Calculates the probability of a bigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- x0 - first word in the bigram
- x1 - second word in the bigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
 
calculateTrigramMLProbability
public static double calculateTrigramMLProbability(String x0, String x1, String x2, Iterable<StringList> set)
Calculates the probability of a trigram in a vocabulary using maximum likelihood estimation.
- Parameters:
- x0 - first word in the trigram
- x1 - second word in the trigram
- x2 - third word in the trigram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
 
calculateNgramMLProbability
public static double calculateNgramMLProbability(StringList ngram, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using maximum likelihood estimation.
- Parameters:
- ngram - an ngram
- set - the vocabulary
- Returns:
- the maximum likelihood probability
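
The bigram and trigram methods above can be read as special cases of the general ngram estimate. A common formulation divides the count of the ngram by the count of its (n-1)-word prefix. The sketch below assumes that formulation. `NgramMlSketch` and its `count` helper are hypothetical names, and the code is not taken from OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class NgramMlSketch {

    // Counts contiguous occurrences of ngram across all sentences.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int n = ngram.size();
        int total = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(ngram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // P(last word | preceding words) = c(ngram) / c(prefix); expects n >= 2.
    static double ngramMl(List<String> ngram, List<List<String>> corpus) {
        if (ngram.size() < 2) {
            throw new IllegalArgumentException("expects an ngram with n >= 2");
        }
        List<String> prefix = ngram.subList(0, ngram.size() - 1);
        int prefixCount = count(prefix, corpus);
        // An unseen prefix gives no evidence; return 0 instead of dividing by zero.
        return prefixCount == 0 ? 0.0 : (double) count(ngram, corpus) / prefixCount;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"));
        // Bigram: c(the cat) / c(the) = 1 / 2
        System.out.println(ngramMl(Arrays.asList("the", "cat"), corpus));
        // Trigram: c(the cat sat) / c(the cat) = 1 / 1
        System.out.println(ngramMl(Arrays.asList("the", "cat", "sat"), corpus));
    }
}
```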
 
calculateBigramPriorSmoothingProbability
public static double calculateBigramPriorSmoothingProbability(String x0, String x1, Collection<StringList> set, Double k)
Calculates the probability of a bigram in a vocabulary using the prior Laplace smoothing algorithm.
- Parameters:
- x0 - the first word in the bigram
- x1 - the second word in the bigram
- set - the vocabulary
- k - the smoothing factor
- Returns:
- the prior Laplace smoothing probability
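
Prior smoothing replaces the uniform pseudo-count of plain add-k smoothing with a share proportional to the unigram probability of the second word. The sketch below assumes the textbook form P(x1 | x0) = (c(x0 x1) + k * P(x1)) / (c(x0) + k). `PriorSmoothingSketch` and its helpers are hypothetical, and OpenNLP's own normalisation may differ from this formulation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class PriorSmoothingSketch {

    // Counts contiguous occurrences of ngram across all sentences.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int n = ngram.size();
        int total = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(ngram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // Relative frequency of a single word over all tokens.
    static double unigramMl(String word, List<List<String>> corpus) {
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            totalTokens += sentence.size();
        }
        int occurrences = count(Collections.singletonList(word), corpus);
        return totalTokens == 0 ? 0.0 : (double) occurrences / totalTokens;
    }

    // (c(x0 x1) + k * P(x1)) / (c(x0) + k); k is expected to be positive.
    static double priorSmoothing(String x0, String x1, List<List<String>> corpus, double k) {
        double numerator = count(Arrays.asList(x0, x1), corpus) + k * unigramMl(x1, corpus);
        double denominator = count(Collections.singletonList(x0), corpus) + k;
        return numerator / denominator;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"));
        // (1 + 1 * 1/6) / (2 + 1)
        System.out.println(priorSmoothing("the", "cat", corpus, 1.0));
    }
}
```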
 
calculateTrigramLinearInterpolationProbability
public static double calculateTrigramLinearInterpolationProbability(String x0, String x1, String x2, Collection<StringList> set, Double lambda1, Double lambda2, Double lambda3)
Calculates the probability of a trigram in a vocabulary using a linear interpolation algorithm.
- Parameters:
- x0 - the first word in the trigram
- x1 - the second word in the trigram
- x2 - the third word in the trigram
- set - the vocabulary
- lambda1 - trigram interpolation factor
- lambda2 - bigram interpolation factor
- lambda3 - unigram interpolation factor
- Returns:
- the linear interpolation probability
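
Linear interpolation mixes the trigram, bigram and unigram maximum likelihood estimates, weighted by lambda1, lambda2 and lambda3 as described in the parameters above. The sketch below assumes the usual requirement that the three weights sum to 1. `InterpolationSketch` and its helpers are hypothetical names, and the code is an illustration rather than OpenNLP's implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class InterpolationSketch {

    // Counts contiguous occurrences of ngram across all sentences.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int n = ngram.size();
        int total = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(ngram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // Safe division: an unseen context contributes 0 instead of NaN.
    static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    static double interpolate(String x0, String x1, String x2, List<List<String>> corpus,
                              double lambda1, double lambda2, double lambda3) {
        if (Math.abs(lambda1 + lambda2 + lambda3 - 1.0) > 1e-9) {
            throw new IllegalArgumentException("lambdas must sum to 1");
        }
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            totalTokens += sentence.size();
        }
        // P(x2 | x0 x1), P(x2 | x1) and P(x2) as maximum likelihood estimates.
        double trigram = ratio(count(Arrays.asList(x0, x1, x2), corpus),
                count(Arrays.asList(x0, x1), corpus));
        double bigram = ratio(count(Arrays.asList(x1, x2), corpus),
                count(Collections.singletonList(x1), corpus));
        double unigram = ratio(count(Collections.singletonList(x2), corpus), totalTokens);
        return lambda1 * trigram + lambda2 * bigram + lambda3 * unigram;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"));
        // 0.5 * 1 + 0.3 * 1 + 0.2 * (2 / 6)
        System.out.println(interpolate("the", "cat", "sat", corpus, 0.5, 0.3, 0.2));
    }
}
```

Because the unigram term is non-zero for any word seen in the corpus, the interpolated estimate stays positive even when the trigram itself was never observed.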
 
calculateMissingNgramProbabilityMass
public static double calculateMissingNgramProbabilityMass(StringList ngram, double discount, Iterable<StringList> set)
Calculates the probability of an ngram in a vocabulary using the missing probability mass algorithm.
- Parameters:
- ngram - the ngram
- discount - discount factor
- set - the vocabulary
- Returns:
- the probability
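
In discounting methods, a fixed discount is subtracted from every observed count, and the probability mass freed this way is the "missing" mass available for unseen continuations. The sketch below assumes the textbook definition: 1 minus the sum, over every word w seen after the context, of (c(context w) - discount) / c(context). `MissingMassSketch` is a hypothetical name, and treating the `ngram` argument as the context is an assumption of this example, not a statement about OpenNLP's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class MissingMassSketch {

    // Counts contiguous occurrences of ngram across all sentences.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int n = ngram.size();
        int total = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + n <= sentence.size(); i++) {
                if (sentence.subList(i, i + n).equals(ngram)) {
                    total++;
                }
            }
        }
        return total;
    }

    // Probability mass left over after discounting every seen continuation of context.
    static double missingMass(List<String> context, double discount, List<List<String>> corpus) {
        int contextCount = count(context, corpus);
        if (contextCount == 0) {
            return 1.0; // nothing was observed after this context
        }
        Set<String> vocabulary = new HashSet<>();
        for (List<String> sentence : corpus) {
            vocabulary.addAll(sentence);
        }
        double seenMass = 0.0;
        for (String word : vocabulary) {
            List<String> extended = new ArrayList<>(context);
            extended.add(word);
            int extendedCount = count(extended, corpus);
            if (extendedCount > 0) {
                seenMass += (extendedCount - discount) / contextCount;
            }
        }
        return 1.0 - seenMass;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"));
        // 1 - ((2 - 0.5) / 3 + (1 - 0.5) / 3)
        System.out.println(missingMass(Arrays.asList("the"), 0.5, corpus));
    }
}
```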
 
getNMinusOneTokenFirst
public static StringList getNMinusOneTokenFirst(StringList ngram)
Gets the (n-1)-gram of a given ngram, that is, the same ngram without its last word.
- Parameters:
- ngram - an ngram
- Returns:
- the ngram without its last word
 
getNMinusOneTokenLast
public static StringList getNMinusOneTokenLast(StringList ngram)
Gets the (n-1)-gram of a given ngram, that is, the same ngram without its first word.
- Parameters:
- ngram - an ngram
- Returns:
- the ngram without its first word
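
The two (n-1)-gram helpers differ only in which end of the ngram they drop. The sketch below shows the idea on plain `List<String>` values. `NMinusOneSketch`, `dropLast` and `dropFirst` are hypothetical names chosen for this example and are not OpenNLP methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class NMinusOneSketch {

    // Same ngram without its last word, e.g. [a, b, c] becomes [a, b].
    static List<String> dropLast(List<String> ngram) {
        if (ngram.isEmpty()) {
            throw new IllegalArgumentException("ngram must not be empty");
        }
        return new ArrayList<>(ngram.subList(0, ngram.size() - 1));
    }

    // Same ngram without its first word, e.g. [a, b, c] becomes [b, c].
    static List<String> dropFirst(List<String> ngram) {
        if (ngram.isEmpty()) {
            throw new IllegalArgumentException("ngram must not be empty");
        }
        return new ArrayList<>(ngram.subList(1, ngram.size()));
    }

    public static void main(String[] args) {
        List<String> trigram = Arrays.asList("a", "b", "c");
        System.out.println(dropLast(trigram));  // prints [a, b]
        System.out.println(dropFirst(trigram)); // prints [b, c]
    }
}
```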
 
getNGrams
Gets the ngrams of dimension n from an input sequence of tokens.
- Parameters:
- sequence - a sequence of tokens
- size - the size of the resulting ngrams
- Returns:
- all the possible ngrams of the given size derivable from the input sequence
 
getNGrams
Gets the ngrams of dimension n from an input sequence of tokens.
- Parameters:
- sequence - a sequence of tokens
- size - the size of the resulting ngrams
- Returns:
- all the possible ngrams of the given size derivable from the input sequence
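
Extracting every ngram of a given size amounts to sliding a fixed-width window over the token sequence. The sketch below illustrates this on a `List<String>`. `NGramsSketch.ngrams` is a hypothetical helper, and its choice to return an empty list when the sequence is shorter than `size` is an assumption of this example that may not match OpenNLP's behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use opennlp.tools.ngram.
class NGramsSketch {

    // Returns every contiguous window of length size, in order of appearance.
    static List<List<String>> ngrams(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            // Copy the window so the result does not depend on the input list.
            result.add(new ArrayList<>(tokens.subList(i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d");
        System.out.println(ngrams(tokens, 2)); // prints [[a, b], [b, c], [c, d]]
    }
}
```

A sequence of m tokens yields m - size + 1 ngrams when m is at least `size`.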
 
 