Class StringUtil


  • public class StringUtil
    extends Object
    • Constructor Detail

      • StringUtil

        public StringUtil()
    • Method Detail

      • isWhitespace

        public static boolean isWhitespace(char charCode)
        Determines whether the specified character is whitespace. A character is considered whitespace when Character.isWhitespace(int) returns true for it or when it is a no-break space. Character.isWhitespace(int) does not include no-break spaces, but OpenNLP treats them as whitespace as well.
        Parameters:
        charCode - The character to check.
        Returns:
        true if charCode represents a white space, false otherwise.
      • isWhitespace

        public static boolean isWhitespace(int charCode)
        Determines whether the specified character is whitespace. A character is considered whitespace when Character.isWhitespace(int) returns true for it or when it is a no-break space. Character.isWhitespace(int) does not include no-break spaces, but OpenNLP treats them as whitespace as well.
        Parameters:
        charCode - An int representation of a character to check.
        Returns:
        true if charCode represents a white space, false otherwise.
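        The gap described above can be checked directly against the JDK. The following is a minimal sketch, not the OpenNLP implementation: the class WhitespaceSketch and the method isWhitespaceLike are hypothetical names, and using the Unicode space-separator category to catch no-break spaces is an assumption about one way to cover them.

```java
// Hypothetical helper for illustration; not part of OpenNLP.
final class WhitespaceSketch {

    // Assumption: a code point counts as whitespace if the JDK says so,
    // or if it belongs to the Unicode space-separator category, which
    // contains the no-break spaces that Character.isWhitespace excludes.
    static boolean isWhitespaceLike(int codePoint) {
        return Character.isWhitespace(codePoint)
            || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0;
        System.out.println(Character.isWhitespace(noBreakSpace)); // false
        System.out.println(isWhitespaceLike(noBreakSpace));       // true
        System.out.println(isWhitespaceLike('\t'));               // true
        System.out.println(isWhitespaceLike('a'));                // false
    }
}
```

        Taking an int rather than a char lets the same check handle supplementary code points, which mirrors the two overloads documented above.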
      • levenshteinDistance

        public static int[][] levenshteinDistance(String wordForm,
                                                  String lemma)
        Computes the Levenshtein distance of two strings in a matrix.

        Based on pseudo-code that is in turn based on the paper: Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.

        Parameters:
        wordForm - The word form given as input.
        lemma - The target lemma.
        Returns:
        A 2-dimensional Levenshtein distance matrix.
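        To make the shape of such a matrix concrete, here is a minimal sketch of the Wagner-Fischer dynamic program cited above. It is not the OpenNLP source: the class LevenshteinSketch and the method distanceMatrix are hypothetical names, and the orientation (rows for the first string, columns for the second) is an assumption.

```java
// Hypothetical illustration of the Wagner-Fischer algorithm; not OpenNLP code.
final class LevenshteinSketch {

    // Returns a (source.length() + 1) x (target.length() + 1) matrix in which
    // cell [i][j] holds the edit distance between the first i characters of
    // source and the first j characters of target.
    static int[][] distanceMatrix(String source, String target) {
        int rows = source.length() + 1;
        int cols = target.length() + 1;
        int[][] d = new int[rows][cols];

        for (int i = 0; i < rows; i++) {
            d[i][0] = i; // i deletions turn a prefix into the empty string
        }
        for (int j = 0; j < cols; j++) {
            d[0][j] = j; // j insertions build a prefix from the empty string
        }

        for (int i = 1; i < rows; i++) {
            for (int j = 1; j < cols; j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        // The bottom-right cell is the distance between the full strings.
        System.out.println(d[6][7]); // 3
    }
}
```

        Keeping the whole matrix, rather than only the final distance, is what allows a later pass to walk back through it and recover the individual edits.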
      • computeShortestEditScript

        public static void computeShortestEditScript(String wordForm,
                                                     String lemma,
                                                     int[][] distance,
                                                     StringBuffer permutations)
        Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
        Parameters:
        wordForm - The token.
        lemma - The target lemma.
        distance - A 2-dimensional Levenshtein distance matrix.
        permutations - The buffer that receives the computed permutations (the edit script).
      • decodeShortestEditScript

        public static String decodeShortestEditScript(String wordForm,
                                                      String permutations)
        Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies its permutations to wordForm to obtain the lemma.
        Parameters:
        wordForm - The wordForm as input.
        permutations - The permutations predicted by the lemmatizer model.
        Returns:
        The decoded lemma.
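        Computing an edit script from a distance matrix and then applying it form a round trip. The sketch below illustrates that idea in a self-contained way, under stated assumptions: the class EditScriptSketch, its methods, and the operation encoding ("K" keep, "S" substitute, "D" delete, "I" insert) are all hypothetical and are not the SES format that OpenNLP produces or consumes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of an edit-script round trip; not OpenNLP code.
final class EditScriptSketch {

    // Standard Wagner-Fischer matrix: d[i][j] is the distance between the
    // first i characters of s and the first j characters of t.
    static int[][] distanceMatrix(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    // Walks the matrix back from the bottom-right cell and records one
    // operation per step, then reverses them into left-to-right order.
    // Assumed encoding: "K" keep, "S<c>" substitute with c, "D" delete,
    // "I<c>" insert c.
    static List<String> editScript(String source, String target) {
        int[][] d = distanceMatrix(source, target);
        List<String> ops = new ArrayList<>();
        int i = source.length();
        int j = target.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && source.charAt(i - 1) == target.charAt(j - 1)
                    && d[i][j] == d[i - 1][j - 1]) {
                ops.add("K");
                i--;
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("S" + target.charAt(j - 1));
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("D");
                i--;
            } else {
                ops.add("I" + target.charAt(j - 1));
                j--;
            }
        }
        Collections.reverse(ops);
        return ops;
    }

    // Applies the operations to source from left to right.
    static String apply(String source, List<String> ops) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // next unread character of source
        for (String op : ops) {
            switch (op.charAt(0)) {
                case 'K': out.append(source.charAt(pos++)); break;
                case 'S': out.append(op.charAt(1)); pos++; break;
                case 'D': pos++; break;
                case 'I': out.append(op.charAt(1)); break;
                default: throw new IllegalArgumentException("Unknown operation: " + op);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> script = editScript("walked", "walk");
        System.out.println(apply("walked", script)); // walk
    }
}
```

        A lemmatizer that predicts such a script instead of the lemma itself can generalize across words that share the same transformation, which is the motivation for encoding the edits separately from the word form.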
      • getShortestEditScript

        public static String getShortestEditScript(String wordForm,
                                                   String lemma)
        Retrieves the Shortest Edit Script (SES) required to go from a word to a lemma.
        Parameters:
        wordForm - The word as input.
        lemma - The target lemma.
        Returns:
        The Shortest Edit Script (SES) that converts wordForm into lemma.