Package opennlp.tools.util
Class StringUtil
java.lang.Object
    opennlp.tools.util.StringUtil
-
public class StringUtil extends Object
-
Constructor Summary
Constructor      Description
StringUtil()
-
Method Summary
All Methods / Static Methods / Concrete Methods
Modifier and Type, Method, Description:
static void computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)
    Computes the Shortest Edit Script (SES) to convert a word into its lemma.
static String decodeShortestEditScript(String wordForm, String permutations)
    Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
static String getShortestEditScript(String wordForm, String lemma)
static boolean isEmpty(CharSequence theString)
static boolean isWhitespace(char charCode)
    Determines if the specified character is a whitespace.
static boolean isWhitespace(int charCode)
    Determines if the specified character is a whitespace.
static int[][] levenshteinDistance(String wordForm, String lemma)
    Computes the Levenshtein distance of two strings in a matrix.
static String toLowerCase(CharSequence string)
    Converts a CharSequence to lower case, independent of the current Locale, via Character.toLowerCase(int), which uses mapping information from the UnicodeData file.
static String toUpperCase(CharSequence string)
    Converts a CharSequence to upper case, independent of the current Locale, via Character.toUpperCase(char), which uses mapping information from the UnicodeData file.
-
Method Detail
-
isWhitespace
public static boolean isWhitespace(char charCode)
Determines if the specified character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It is a Character.isWhitespace(int) whitespace.
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
Parameters:
charCode - The character to check.
Returns:
true if charCode represents a whitespace, false otherwise.
-
isWhitespace
public static boolean isWhitespace(int charCode)
Determines if the specified character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It is a Character.isWhitespace(int) whitespace.
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
Parameters:
charCode - An int representation of the character to check.
Returns:
true if charCode represents a whitespace, false otherwise.
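The two conditions listed for both isWhitespace overloads can be illustrated with plain JDK calls. The following is a minimal sketch, not the OpenNLP implementation: the class WhitespaceSketch and the helper isWhitespaceSketch are hypothetical names introduced only for this example.

```java
class WhitespaceSketch {

    // Hypothetical helper: true if either documented condition holds.
    static boolean isWhitespaceSketch(int codePoint) {
        return Character.isWhitespace(codePoint)
            || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0; // NO-BREAK SPACE, Unicode category Zs

        // The JDK check alone excludes no-break spaces.
        System.out.println(Character.isWhitespace(noBreakSpace)); // false

        // Adding the Zs category check accepts them.
        System.out.println(isWhitespaceSketch(noBreakSpace));     // true
        System.out.println(isWhitespaceSketch('\t'));             // true
        System.out.println(isWhitespaceSketch('a'));              // false
    }
}
```

The no-break space is the case where the two checks disagree, which is why the Zs category is tested in addition to Character.isWhitespace(int).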
-
toLowerCase
public static String toLowerCase(CharSequence string)
Converts a CharSequence to lower case, independent of the current Locale, via Character.toLowerCase(int), which uses mapping information from the UnicodeData file.
Parameters:
string - The CharSequence to transform.
Returns:
The lower-cased String.
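To show why locale independence matters, here is a minimal sketch that lower-cases character by character with the JDK's Character.toLowerCase and compares the result with the locale-sensitive String.toLowerCase(Locale). The class CaseSketch and the helper toLowerCaseSketch are hypothetical names for this example and are not taken from OpenNLP.

```java
import java.util.Locale;

class CaseSketch {

    // Hypothetical helper: lower-cases each char without consulting any Locale.
    static String toLowerCaseSketch(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(Character.toLowerCase(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "TITLE";

        // Locale-sensitive: under Turkish rules 'I' becomes dotless 'ı' (U+0131).
        String turkish = input.toLowerCase(Locale.forLanguageTag("tr"));
        System.out.println(turkish.equals("title")); // false

        // Locale-independent: the result does not depend on any Locale.
        System.out.println(toLowerCaseSketch(input)); // title
    }
}
```

The same reasoning applies to upper-casing with Character.toUpperCase.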
-
toUpperCase
public static String toUpperCase(CharSequence string)
Converts a CharSequence to upper case, independent of the current Locale, via Character.toUpperCase(char), which uses mapping information from the UnicodeData file.
Parameters:
string - The CharSequence to transform.
Returns:
The upper-cased String.
-
isEmpty
public static boolean isEmpty(CharSequence theString)
Returns:
true if CharSequence.length() is 0 or null, otherwise false
Since:
1.5.1
-
levenshteinDistance
public static int[][] levenshteinDistance(String wordForm, String lemma)
Computes the Levenshtein distance of two strings in a matrix. Based on pseudo-code which in turn is based on the paper: Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.
Parameters:
wordForm - The form as input.
lemma - The target lemma.
Returns:
A 2-dimensional Levenshtein distance matrix.
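For readers unfamiliar with the Wagner-Fischer approach, the following is a minimal sketch of how such a distance matrix can be filled. It is a textbook version and not the OpenNLP source: the class LevenshteinSketch, the helper distanceMatrix, and the (m+1) x (n+1) matrix layout are assumptions made for this example.

```java
class LevenshteinSketch {

    // Hypothetical helper: d[i][j] is the edit distance between the first
    // i chars of source and the first j chars of target.
    static int[][] distanceMatrix(String source, String target) {
        int m = source.length();
        int n = target.length();
        int[][] d = new int[m + 1][n + 1];

        // Turning a prefix into the empty string takes one deletion per char.
        for (int i = 0; i <= m; i++) {
            d[i][0] = i;
        }
        // Building a prefix from the empty string takes one insertion per char.
        for (int j = 0; j <= n; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        // The bottom-right cell holds the distance between the full strings.
        System.out.println(d[6][7]); // 3
    }
}
```

Keeping the whole matrix, rather than only the last row, is what makes it possible to trace back through the cells afterwards and recover the individual edit operations.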
-
computeShortestEditScript
public static void computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)
Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
Parameters:
wordForm - The token.
lemma - The target lemma.
distance - A 2-dimensional Levenshtein distance matrix.
permutations - The buffer to which the computed edit script is written.
-
decodeShortestEditScript
public static String decodeShortestEditScript(String wordForm, String permutations)
Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
Parameters:
wordForm - The word form as input.
permutations - The permutations predicted by the lemmatizer model.
Returns:
The decoded lemma.
-