Class StringUtil


  • public class StringUtil
    extends Object
    • Constructor Detail

      • StringUtil

        public StringUtil()
    • Method Detail

      • isWhitespace

        public static boolean isWhitespace(char charCode)
        Determines whether the specified character is whitespace. A character is considered whitespace when one of the following conditions is met: it is whitespace according to Character.isWhitespace(int), or it belongs to the Unicode Zs category (Character.SPACE_SEPARATOR). Character.isWhitespace(int) does not include no-break spaces; in OpenNLP, no-break spaces are also considered whitespace.
        Parameters:
        charCode - the character to test
        Returns:
        true if the character is whitespace, otherwise false
      • isWhitespace

        public static boolean isWhitespace(int charCode)
        Determines whether the specified character is whitespace. A character is considered whitespace when one of the following conditions is met: it is whitespace according to Character.isWhitespace(int), or it belongs to the Unicode Zs category (Character.SPACE_SEPARATOR). Character.isWhitespace(int) does not include no-break spaces; in OpenNLP, no-break spaces are also considered whitespace.
        Parameters:
        charCode - the Unicode code point to test
        Returns:
        true if the character is whitespace, otherwise false
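The difference from Character.isWhitespace(int) can be illustrated with a minimal sketch. It assumes the extra condition is membership in the Unicode Zs category; the class and helper names are hypothetical and are not part of StringUtil.

```java
final class WhitespaceSketch {

    // Hypothetical helper (not the OpenNLP source): accepts everything that
    // Character.isWhitespace(int) accepts, plus any code point in Unicode Zs.
    static boolean isWhitespaceOrSpaceSeparator(int codePoint) {
        return Character.isWhitespace(codePoint)
            || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0; // NO-BREAK SPACE, Unicode category Zs

        // The JDK check excludes no-break spaces.
        System.out.println(Character.isWhitespace(noBreakSpace));        // false
        // The combined check accepts them.
        System.out.println(isWhitespaceOrSpaceSeparator(noBreakSpace));  // true
        System.out.println(isWhitespaceOrSpaceSeparator('\t'));          // true
        System.out.println(isWhitespaceOrSpaceSeparator('a'));           // false
    }
}
```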
      • toLowerCase

        public static String toLowerCase(CharSequence string)
        Converts the given character sequence to lower case, independent of the current locale, via Character.toLowerCase(int), which uses mapping information from the UnicodeData file.
        Parameters:
        string - the character sequence to convert
        Returns:
        the lower-cased String
      • toUpperCase

        public static String toUpperCase(CharSequence string)
        Converts the given character sequence to upper case, independent of the current locale, via Character.toUpperCase(char), which uses mapping information from the UnicodeData file.
        Parameters:
        string - the character sequence to convert
        Returns:
        the upper-cased String
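Why locale independence matters for toLowerCase and toUpperCase can be shown with a short sketch. It assumes a simple per-char mapping with Character.toLowerCase(char); the class and helper names are hypothetical, and the sketch does not handle supplementary code points.

```java
import java.util.Locale;

final class CaseSketch {

    // Hypothetical helper (not the OpenNLP source): maps each char with
    // Character.toLowerCase(char), which never consults the default locale.
    static String lowerCaseLocaleIndependent(CharSequence text) {
        char[] chars = new char[text.length()];
        for (int i = 0; i < text.length(); i++) {
            chars[i] = Character.toLowerCase(text.charAt(i));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-sensitive: under Turkish rules 'I' becomes dotless 'ı' (U+0131).
        System.out.println("TITLE".toLowerCase(turkish));

        // Locale-independent: 'I' always becomes 'i'.
        System.out.println(lowerCaseLocaleIndependent("TITLE"));
    }
}
```

With String.toLowerCase(Locale), the result depends on the locale passed in (or on the JVM default when none is given), which can silently change features or dictionary lookups between machines. The per-character mapping avoids that dependency.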
      • levenshteinDistance

        public static int[][] levenshteinDistance(String wordForm,
                                                  String lemma)
        Computes the Levenshtein distance between two strings and returns the full distance matrix. Based on the pseudo-code at https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance, which in turn is based on Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.
        Parameters:
        wordForm - the word form
        lemma - the lemma
        Returns:
        the Levenshtein distance matrix
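The Wagner-Fischer dynamic program referenced above can be sketched as follows. This is an illustrative version under the assumption that rows index the first string and columns the second; the class and method names are hypothetical, and the matrix layout used by StringUtil may differ.

```java
final class LevenshteinSketch {

    // Cell [i][j] holds the edit distance between the first i chars of
    // source and the first j chars of target.
    static int[][] distanceMatrix(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];

        // Turning a prefix into the empty string costs one deletion per char,
        // and building a prefix from the empty string costs one insertion per char.
        for (int i = 0; i <= source.length(); i++) d[i][0] = i;
        for (int j = 0; j <= target.length(); j++) d[0][j] = j;

        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        // The bottom-right cell is the distance between the full strings.
        System.out.println(d["kitten".length()]["sitting".length()]); // 3
    }
}
```

Returning the whole matrix rather than only the final cell lets a caller walk back through it to recover which edits were made, not just how many.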
      • computeShortestEditScript

        public static void computeShortestEditScript(String wordForm,
                                                     String lemma,
                                                     int[][] distance,
                                                     StringBuffer permutations)
        Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
        Parameters:
        wordForm - the token
        lemma - the target lemma
        distance - the Levenshtein distance matrix
        permutations - the buffer that receives the computed edit script
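How an edit script can be recovered from a distance matrix is sketched below by walking back from the bottom-right cell. This is an illustration only: the class name, method names, and the textual operation notation (for example "delete@4:e") are invented here and are not the encoding OpenNLP uses for its SES strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class EditScriptSketch {

    // Standard Wagner-Fischer matrix: d[i][j] is the distance between the
    // first i chars of source and the first j chars of target.
    static int[][] distanceMatrix(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];
        for (int i = 0; i <= source.length(); i++) d[i][0] = i;
        for (int j = 0; j <= target.length(); j++) d[0][j] = j;
        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    // Backtraces the matrix and records one operation per non-matching step.
    // Positions are 0-based indices into the original source string.
    static List<String> editScript(String source, String target) {
        int[][] d = distanceMatrix(source, target);
        List<String> ops = new ArrayList<>();
        int i = source.length();
        int j = target.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && source.charAt(i - 1) == target.charAt(j - 1)
                    && d[i][j] == d[i - 1][j - 1]) {
                i--; j--; // characters match, nothing to record
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("replace@" + (i - 1) + ":" + source.charAt(i - 1)
                        + "->" + target.charAt(j - 1));
                i--; j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("delete@" + (i - 1) + ":" + source.charAt(i - 1));
                i--;
            } else {
                ops.add("insert@" + i + ":" + target.charAt(j - 1));
                j--;
            }
        }
        Collections.reverse(ops); // report operations left to right
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(editScript("walked", "walk")); // [delete@4:e, delete@5:d]
    }
}
```

The number of recorded operations always equals the value in the bottom-right cell of the matrix, which is why the script is a shortest one.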
      • decodeShortestEditScript

        public static String decodeShortestEditScript(String wordForm,
                                                      String permutations)
        Reads the SES predicted by the lemmatizer model and applies its permutations to the word form to obtain the lemma.
        Parameters:
        wordForm - the word form
        permutations - the permutations predicted by the lemmatizer model
        Returns:
        the lemma
      • getShortestEditScript

        public static String getShortestEditScript(String wordForm,
                                                   String lemma)
        Gets the Shortest Edit Script (SES) required to transform a word into its lemma.
        Parameters:
        wordForm - the word
        lemma - the lemma
        Returns:
        the shortest edit script