Class NGramUtils


  • public class NGramUtils
    extends Object
    Utility class for ngrams. Some methods apply only to specific values of n, e.g. trigrams, bigrams, or unigrams.
    • Constructor Detail

      • NGramUtils

        public NGramUtils()
    • Method Detail

      • calculateLaplaceSmoothingProbability

        public static double calculateLaplaceSmoothingProbability(StringList ngram,
                                                                  Iterable<StringList> set,
                                                                  Double k)
        Calculates the probability of an ngram in a vocabulary using the Laplace smoothing algorithm.
        Parameters:
        ngram - the ngram to get the probability for
        set - the vocabulary
        k - the smoothing factor
        Returns:
        the Laplace smoothing probability
        See Also:
        Additive Smoothing
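        A minimal sketch of textbook additive smoothing, (count(ngram) + k) / (count(history) + k * |V|), may help when reading this method. It is not the library's implementation: the class LaplaceSmoothingSketch and its helpers are hypothetical, java.util.List<String> stands in for StringList so the code runs on its own, and both the counting rule (every contiguous occurrence) and the k * |V| denominator are assumptions that this page does not confirm.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LaplaceSmoothingSketch {

    // Counts how often `ngram` occurs as a contiguous run of tokens in the corpus.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int hits = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
                if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // Additive smoothing: (count(ngram) + k) / (count(history) + k * |V|).
    // Assumes a non-empty ngram; the history is the ngram without its last token.
    static double laplace(List<String> ngram, List<List<String>> corpus, double k) {
        Set<String> vocabulary = new HashSet<>();
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            vocabulary.addAll(sentence);
            totalTokens += sentence.size();
        }
        List<String> history = ngram.subList(0, ngram.size() - 1);
        // A unigram has an empty history, so normalise by the total token count instead.
        int historyCount = history.isEmpty() ? totalTokens : count(history, corpus);
        return (count(ngram, corpus) + k) / (historyCount + k * vocabulary.size());
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(laplace(List.of("the", "cat"), corpus, 1.0));  // seen bigram
        System.out.println(laplace(List.of("the", "fish"), corpus, 1.0)); // unseen bigram, still non-zero
    }
}
```

        With k greater than zero, an unseen ngram receives a small non-zero probability instead of zero, which is the point of the smoothing factor.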
      • calculateUnigramMLProbability

        public static double calculateUnigramMLProbability(String word,
                                                           Collection<StringList> set)
        Calculates the probability of a unigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        word - the only word in the unigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
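        As an illustration, the maximum likelihood estimate of a unigram is usually its relative frequency: occurrences of the word divided by the total number of tokens. The sketch below assumes that definition; UnigramMleSketch is a hypothetical name and java.util.List<String> stands in for StringList.

```java
import java.util.List;

final class UnigramMleSketch {

    // Relative frequency: occurrences of `word` divided by the number of tokens in the corpus.
    static double unigramMle(String word, List<List<String>> corpus) {
        int occurrences = 0;
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            for (String token : sentence) {
                if (token.equals(word)) {
                    occurrences++;
                }
                totalTokens++;
            }
        }
        // Sketch choice: an empty corpus yields 0.0 rather than NaN.
        return totalTokens == 0 ? 0.0 : (double) occurrences / totalTokens;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(unigramMle("the", corpus)); // 2 occurrences out of 6 tokens
    }
}
```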
      • calculateBigramMLProbability

        public static double calculateBigramMLProbability(String x0,
                                                          String x1,
                                                          Collection<StringList> set)
        Calculates the probability of a bigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        x0 - first word in the bigram
        x1 - second word in the bigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
      • calculateTrigramMLProbability

        public static double calculateTrigramMLProbability(String x0,
                                                           String x1,
                                                           String x2,
                                                           Iterable<StringList> set)
        Calculates the probability of a trigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        x0 - first word in the trigram
        x1 - second word in the trigram
        x2 - third word in the trigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
      • calculateNgramMLProbability

        public static double calculateNgramMLProbability(StringList ngram,
                                                         Iterable<StringList> set)
        Calculates the probability of an ngram in a vocabulary using maximum likelihood estimation.
        Parameters:
        ngram - an ngram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
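        For bigrams and longer ngrams, the maximum likelihood estimate is commonly the count of the ngram divided by the count of its history (the ngram without its last token). The sketch below assumes that definition and counts every contiguous occurrence; NgramMleSketch and its helpers are hypothetical, and java.util.List<String> stands in for StringList.

```java
import java.util.List;

final class NgramMleSketch {

    // Counts how often `ngram` occurs as a contiguous run of tokens in the corpus.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int hits = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
                if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // P(last token | history) = count(ngram) / count(history).
    static double ngramMle(List<String> ngram, List<List<String>> corpus) {
        if (ngram.size() < 2) {
            throw new IllegalArgumentException("expects a bigram or longer");
        }
        int historyCount = count(ngram.subList(0, ngram.size() - 1), corpus);
        // Sketch choice: an unseen history yields 0.0 rather than a division by zero.
        return historyCount == 0 ? 0.0 : (double) count(ngram, corpus) / historyCount;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(ngramMle(List.of("the", "cat"), corpus));        // bigram
        System.out.println(ngramMle(List.of("the", "cat", "sat"), corpus)); // trigram
    }
}
```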
      • calculateBigramPriorSmoothingProbability

        public static double calculateBigramPriorSmoothingProbability(String x0,
                                                                      String x1,
                                                                      Collection<StringList> set,
                                                                      Double k)
        Calculates the probability of a bigram in a vocabulary using the prior Laplace smoothing algorithm.
        Parameters:
        x0 - the first word in the bigram
        x1 - the second word in the bigram
        set - the vocabulary
        k - the smoothing factor
        Returns:
        the prior Laplace smoothing probability
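        One common textbook form of prior smoothing replaces the flat pseudo-count with the unigram probability of the second word: (count(x0 x1) + k * P(x1)) / (count(x0) + k). The sketch below assumes that form; whether the library normalises its denominator the same way is not stated on this page. BigramPriorSmoothingSketch is a hypothetical name and java.util.List<String> stands in for StringList.

```java
import java.util.List;

final class BigramPriorSmoothingSketch {

    // Assumed formula: (count(x0 x1) + k * P_ML(x1)) / (count(x0) + k), with k > 0.
    static double priorSmoothed(String x0, String x1, List<List<String>> corpus, double k) {
        int bigramCount = 0;
        int x0Count = 0;
        int x1Count = 0;
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i < sentence.size(); i++) {
                totalTokens++;
                if (sentence.get(i).equals(x1)) {
                    x1Count++;
                }
                if (sentence.get(i).equals(x0)) {
                    x0Count++;
                    // The bigram occurs when x1 directly follows this x0.
                    if (i + 1 < sentence.size() && sentence.get(i + 1).equals(x1)) {
                        bigramCount++;
                    }
                }
            }
        }
        // Unigram maximum likelihood estimate of x1, used as the prior.
        double unigramPrior = totalTokens == 0 ? 0.0 : (double) x1Count / totalTokens;
        return (bigramCount + k * unigramPrior) / (x0Count + k);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(priorSmoothed("the", "cat", corpus, 1.0));
    }
}
```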
      • calculateTrigramLinearInterpolationProbability

        public static double calculateTrigramLinearInterpolationProbability(String x0,
                                                                            String x1,
                                                                            String x2,
                                                                            Collection<StringList> set,
                                                                            Double lambda1,
                                                                            Double lambda2,
                                                                            Double lambda3)
        Calculates the probability of a trigram in a vocabulary using a linear interpolation algorithm.
        Parameters:
        x0 - the first word in the trigram
        x1 - the second word in the trigram
        x2 - the third word in the trigram
        set - the vocabulary
        lambda1 - trigram interpolation factor
        lambda2 - bigram interpolation factor
        lambda3 - unigram interpolation factor
        Returns:
        the linear interpolation probability
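        Linear interpolation is usually written as lambda1 * P(x2 | x0, x1) + lambda2 * P(x2 | x1) + lambda3 * P(x2), with the three factors summing to 1. The sketch below assumes that form and plain maximum likelihood estimates for each term; TrigramInterpolationSketch and its helpers are hypothetical, and java.util.List<String> stands in for StringList.

```java
import java.util.List;

final class TrigramInterpolationSketch {

    // Counts how often `ngram` occurs as a contiguous run of tokens in the corpus.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int hits = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
                if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // count(ngram) / count(history); 0.0 when the history was never seen (sketch choice).
    static double conditional(List<String> ngram, List<List<String>> corpus) {
        int historyCount = count(ngram.subList(0, ngram.size() - 1), corpus);
        return historyCount == 0 ? 0.0 : (double) count(ngram, corpus) / historyCount;
    }

    // Relative frequency of a single word among all tokens.
    static double unigram(String word, List<List<String>> corpus) {
        int totalTokens = 0;
        for (List<String> sentence : corpus) {
            totalTokens += sentence.size();
        }
        return totalTokens == 0 ? 0.0 : (double) count(List.of(word), corpus) / totalTokens;
    }

    // Weighted sum of the trigram, bigram, and unigram estimates for x2.
    static double interpolate(String x0, String x1, String x2, List<List<String>> corpus,
                              double lambda1, double lambda2, double lambda3) {
        return lambda1 * conditional(List.of(x0, x1, x2), corpus)
                + lambda2 * conditional(List.of(x1, x2), corpus)
                + lambda3 * unigram(x2, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(interpolate("the", "cat", "sat", corpus, 0.5, 0.3, 0.2));
    }
}
```

        Because the lower-order terms are rarely zero, the interpolated value stays positive even when the full trigram was never observed.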
      • calculateMissingNgramProbabilityMass

        public static double calculateMissingNgramProbabilityMass(StringList ngram,
                                                                  Double discount,
                                                                  Iterable<StringList> set)
        Calculates the probability of an ngram in a vocabulary using the missing probability mass algorithm.
        Parameters:
        ngram - the ngram
        discount - discount factor
        set - the vocabulary
        Returns:
        the probability
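        In discounting-based back-off models, the missing probability mass of a context is commonly defined as 1 minus the sum, over the words seen after that context, of (count(context + word) - discount) / count(context). The sketch below assumes that textbook definition, which may differ in detail from the library's computation. MissingMassSketch and its helpers are hypothetical, and java.util.List<String> stands in for StringList.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MissingMassSketch {

    // Counts how often `ngram` occurs as a contiguous run of tokens in the corpus.
    static int count(List<String> ngram, List<List<String>> corpus) {
        int hits = 0;
        for (List<String> sentence : corpus) {
            for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
                if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // 1 - sum over seen continuations w of (count(ngram + w) - discount) / count(ngram).
    static double missingMass(List<String> ngram, double discount, List<List<String>> corpus) {
        int ngramCount = count(ngram, corpus);
        if (ngramCount == 0) {
            return 1.0; // sketch choice: an unseen context leaves all mass unassigned
        }
        Set<String> vocabulary = new HashSet<>();
        for (List<String> sentence : corpus) {
            vocabulary.addAll(sentence);
        }
        double seenMass = 0.0;
        for (String word : vocabulary) {
            List<String> extended = new ArrayList<>(ngram);
            extended.add(word);
            int extendedCount = count(extended, corpus);
            if (extendedCount > 0) {
                seenMass += (extendedCount - discount) / ngramCount;
            }
        }
        return 1.0 - seenMass;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "dog", "sat"));
        System.out.println(missingMass(List.of("the"), 0.5, corpus));
    }
}
```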
      • getNMinusOneTokenFirst

        public static StringList getNMinusOneTokenFirst(StringList ngram)
        Gets the (n-1)-gram of a given ngram, that is, the same ngram without its last word.
        Parameters:
        ngram - an ngram
        Returns:
        the ngram without its last word
      • getNMinusOneTokenLast

        public static StringList getNMinusOneTokenLast(StringList ngram)
        Gets the (n-1)-gram of a given ngram, that is, the same ngram without its first word.
        Parameters:
        ngram - an ngram
        Returns:
        the ngram without its first word
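        The two (n-1)-gram helpers differ only in which end they trim. A minimal sketch, with the hypothetical names dropLast and dropFirst and java.util.List<String> standing in for StringList, assuming a non-empty input:

```java
import java.util.ArrayList;
import java.util.List;

final class NMinusOneSketch {

    // Same ngram without its last word (the "token first" variant described above).
    static List<String> dropLast(List<String> ngram) {
        return new ArrayList<>(ngram.subList(0, ngram.size() - 1));
    }

    // Same ngram without its first word (the "token last" variant described above).
    static List<String> dropFirst(List<String> ngram) {
        return new ArrayList<>(ngram.subList(1, ngram.size()));
    }

    public static void main(String[] args) {
        List<String> trigram = List.of("a", "b", "c");
        System.out.println(dropLast(trigram));  // prints [a, b]
        System.out.println(dropFirst(trigram)); // prints [b, c]
    }
}
```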
      • getNGrams

        public static Collection<StringList> getNGrams(StringList sequence,
                                                       int size)
        Gets the ngrams of the given size from an input sequence of tokens.
        Parameters:
        sequence - a sequence of tokens
        size - the size of the resulting ngrams
        Returns:
        all the possible ngrams of the given size derivable from the input sequence
      • getNGrams

        public static Collection<String[]> getNGrams(String[] sequence,
                                                     int size)
        Gets the ngrams of the given size from an input sequence of tokens.
        Parameters:
        sequence - a sequence of tokens
        size - the size of the resulting ngrams
        Returns:
        all the possible ngrams of the given size derivable from the input sequence
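        Extracting all ngrams of a given size amounts to sliding a fixed-width window over the token sequence. The sketch below illustrates that idea with java.util.List<String> in place of StringList or String[]; NGramWindowSketch is a hypothetical name, and returning an empty list when the window is longer than the sequence is this sketch's choice, since this page does not describe how the library handles that case.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramWindowSketch {

    // Slides a window of `size` tokens over the sequence, one position at a time.
    static List<List<String>> ngrams(List<String> sequence, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + size <= sequence.size(); i++) {
            // Copy the window so each ngram is independent of the input list.
            result.add(new ArrayList<>(sequence.subList(i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("a", "b", "c", "d"), 2)); // prints [[a, b], [b, c], [c, d]]
    }
}
```

        A sequence of m tokens yields m - size + 1 ngrams under this scheme, so four tokens produce three bigrams.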