Class StringUtil

java.lang.Object
opennlp.tools.util.StringUtil

public class StringUtil extends Object
  • Constructor Details

    • StringUtil

      public StringUtil()
  • Method Details

    • isWhitespace

      public static boolean isWhitespace(char charCode)
      Determines whether the specified character is whitespace. A character is considered whitespace if Character.isWhitespace(int) accepts it or if it is a no-break space. Character.isWhitespace(int) does not include no-break spaces; in OpenNLP, no-break spaces are also considered whitespace.
      Parameters:
      charCode - The character to check.
      Returns:
      true if charCode represents a white space, false otherwise.
    • isWhitespace

      public static boolean isWhitespace(int charCode)
      Determines whether the specified character is whitespace. A character is considered whitespace if Character.isWhitespace(int) accepts it or if it is a no-break space. Character.isWhitespace(int) does not include no-break spaces; in OpenNLP, no-break spaces are also considered whitespace.
      Parameters:
      charCode - An int representation of a character to check.
      Returns:
      true if charCode represents a white space, false otherwise.
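      The difference from the JDK check can be illustrated with a small standalone sketch. It does not call OpenNLP; WhitespaceSketch is a hypothetical class, and testing for the Unicode space separator category (Zs) to catch no-break spaces is an assumption of this sketch, not a statement about the library's source.

```java
class WhitespaceSketch {

    // Hypothetical re-implementation of the documented behavior.
    // Assumption: no-break spaces are detected through the Unicode
    // SPACE_SEPARATOR (Zs) category, which contains U+00A0.
    public static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
            || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0;
        // The JDK check rejects the no-break space.
        System.out.println(Character.isWhitespace(noBreakSpace)); // false
        // The sketch accepts it, as the description above requires.
        System.out.println(isWhitespace(noBreakSpace));           // true
    }
}
```

      With a check of this kind, text that separates tokens with no-break spaces can be split the same way as text that uses ordinary spaces.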
    • toLowerCase

      public static String toLowerCase(CharSequence string)
      Converts a CharSequence to lower case, independent of the current Locale via Character.toLowerCase(int) which uses mapping information from the UnicodeData file.
      Parameters:
      string - The CharSequence to transform.
      Returns:
      The lower-cased String.
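      To show why locale independence matters, here is a minimal standalone sketch. LowerCaseSketch is a hypothetical class that applies Character.toLowerCase(int) to each code point; it is not the OpenNLP implementation, only an illustration of the approach described above.

```java
import java.util.Locale;

class LowerCaseSketch {

    // Hypothetical helper: lower-cases every code point with
    // Character.toLowerCase(int), never consulting a Locale.
    public static String toLowerCase(CharSequence text) {
        StringBuilder result = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> result.appendCodePoint(Character.toLowerCase(cp)));
        return result.toString();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-sensitive mapping: 'I' becomes dotless i (U+0131) in Turkish.
        System.out.println("TITLE".toLowerCase(turkish).equals("t\u0131tle")); // true
        // Locale-independent mapping: always the plain ASCII result.
        System.out.println(toLowerCase("TITLE")); // title
    }
}
```

      A locale-independent mapping keeps features derived from lower-cased tokens identical on every machine, regardless of the default Locale of the JVM.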
    • toLowerCaseCharBuffer

      public static CharBuffer toLowerCaseCharBuffer(CharSequence sequence)
    • toUpperCase

      public static String toUpperCase(CharSequence string)
      Converts a CharSequence to upper case, independent of the current Locale via Character.toUpperCase(char) which uses mapping information from the UnicodeData file.
      Parameters:
      string - The CharSequence to transform.
      Returns:
      The upper-cased String.
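      One consequence of mapping through Character.toUpperCase(char) is that each input character yields exactly one output character. The following standalone sketch illustrates this; UpperCaseSketch is a hypothetical class and not the OpenNLP source.

```java
import java.util.Locale;

class UpperCaseSketch {

    // Hypothetical helper: upper-cases one char at a time, so the
    // result always has the same length as the input.
    public static String toUpperCase(CharSequence text) {
        char[] chars = new char[text.length()];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toUpperCase(text.charAt(i));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe"; // "strasse" written with sharp s (U+00DF)
        // Per-character mapping leaves the sharp s unchanged: 6 chars in, 6 out.
        System.out.println(toUpperCase(word).length());            // 6
        // String.toUpperCase expands the sharp s to "SS": 7 chars out.
        System.out.println(word.toUpperCase(Locale.ROOT).length()); // 7
    }
}
```

      Preserving length in this way keeps character offsets into the original text valid after case conversion.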
    • isEmpty

      public static boolean isEmpty(CharSequence theString)
      Returns:
      true if theString is null or its CharSequence.length() is 0, otherwise false.
      Since:
      1.5.1
    • levenshteinDistance

      public static int[][] levenshteinDistance(String wordForm, String lemma)
      Computes the Levenshtein distance of two strings in a matrix.

      Based on pseudo-code that is in turn based on the paper: Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.

      Parameters:
      wordForm - The form as input.
      lemma - The target lemma.
      Returns:
      A 2-dimensional Levenshtein distance matrix.
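      For readers unfamiliar with the Wagner-Fischer algorithm, the following is a minimal standalone sketch of how such a matrix can be filled. LevenshteinSketch and distanceMatrix are hypothetical names; the layout (one extra row and column for the empty prefix) is the textbook one and is assumed here, so the OpenNLP method may differ in detail.

```java
class LevenshteinSketch {

    // Textbook Wagner-Fischer: d[i][j] is the edit distance between the
    // first i chars of source and the first j chars of target.
    public static int[][] distanceMatrix(String source, String target) {
        int rows = source.length() + 1;
        int cols = target.length() + 1;
        int[][] d = new int[rows][cols];

        // Turning a prefix into the empty string costs one deletion per char.
        for (int i = 0; i < rows; i++) {
            d[i][0] = i;
        }
        // Building a prefix from the empty string costs one insertion per char.
        for (int j = 0; j < cols; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i < rows; i++) {
            for (int j = 1; j < cols; j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        // The bottom-right cell holds the distance between the full strings.
        System.out.println(d[6][7]); // 3
    }
}
```

      Keeping the whole matrix, rather than only the final distance, is what allows an edit script to be recovered afterwards by tracing back from the bottom-right cell.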
    • computeShortestEditScript

      public static void computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)
      Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
      Parameters:
      wordForm - The token.
      lemma - The target lemma.
      distance - A 2-dimensional Levenshtein distance matrix.
      permutations - The buffer that receives the computed permutations.
    • decodeShortestEditScript

      public static String decodeShortestEditScript(String wordForm, String permutations)
      Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies its permutations to the wordForm to obtain the lemma.
      Parameters:
      wordForm - The wordForm as input.
      permutations - The permutations predicted by the lemmatizer model.
      Returns:
      The decoded lemma.
    • getShortestEditScript

      public static String getShortestEditScript(String wordForm, String lemma)
      Parameters:
      wordForm - The word as input.
      lemma - The target lemma.
      Returns:
      The Shortest Edit Script (SES) required to go from a word to a lemma.