Package opennlp.tools.util
Class StringUtil
java.lang.Object
opennlp.tools.util.StringUtil
-
Constructor Summary
Constructors
StringUtil()
Method Summary
Modifier and Type | Method | Description
static void | computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations) | Computes the Shortest Edit Script (SES) to convert a word into its lemma.
static String | decodeShortestEditScript(String wordForm, String permutations) | Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
static String | getShortestEditScript(String wordForm, String lemma) |
static boolean | isEmpty(CharSequence theString) |
static boolean | isWhitespace(char charCode) | Determines if the specified Character is a whitespace.
static boolean | isWhitespace(int charCode) | Determines if the specified Character is a whitespace.
static int[][] | levenshteinDistance(String wordForm, String lemma) | Computes the Levenshtein distance of two strings in a matrix.
static String | toLowerCase(CharSequence string) | Converts a CharSequence to lower case, independent of the current Locale via Character.toLowerCase(int) which uses mapping information from the UnicodeData file.
static CharBuffer | toLowerCaseCharBuffer(CharSequence sequence) |
static String | toUpperCase(CharSequence string) | Converts a CharSequence to upper case, independent of the current Locale via Character.toUpperCase(char) which uses mapping information from the UnicodeData file.
-
Constructor Details
-
StringUtil
public StringUtil()
-
-
Method Details
-
isWhitespace
public static boolean isWhitespace(char charCode)
Determines if the specified Character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It's a Character.isWhitespace(int) whitespace.
- It's a part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP no-break spaces are also considered as white spaces.
Parameters:
charCode - The character to check.
Returns:
true if charCode represents a white space, false otherwise.
-
isWhitespace
public static boolean isWhitespace(int charCode)
Determines if the specified Character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It's a Character.isWhitespace(int) whitespace.
- It's a part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP no-break spaces are also considered as white spaces.
Parameters:
charCode - An int representation of a character to check.
Returns:
true if charCode represents a white space, false otherwise.
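The two conditions above can be illustrated with a minimal sketch that uses only JDK calls. The class name WhitespaceSketch is hypothetical and this is not the OpenNLP source; it only mirrors the rule as documented, to show how a no-break space (U+00A0) is treated differently from plain Character.isWhitespace(int).

```java
public class WhitespaceSketch {

    // Hypothetical re-statement of the documented rule:
    // JDK whitespace OR a member of the Unicode Zs (space separator) category.
    static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0; // NO-BREAK SPACE, Unicode category Zs

        System.out.println(Character.isWhitespace(noBreakSpace)); // false
        System.out.println(isWhitespace(noBreakSpace));           // true
        System.out.println(isWhitespace('\t'));                   // true
        System.out.println(isWhitespace('a'));                    // false
    }
}
```

The same check works for both overloads, because a char argument widens to an int code point.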
-
toLowerCase
Converts a CharSequence to lower case, independent of the current Locale via Character.toLowerCase(int) which uses mapping information from the UnicodeData file.
Parameters:
string - The CharSequence to transform.
Returns:
The lower-cased String.
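To show why locale independence matters, here is a small sketch that lower-cases code point by code point with Character.toLowerCase(int) and compares the result with String.toLowerCase under a Turkish locale. The class name LowerCaseSketch and its helper are hypothetical; this is an illustration of the documented behaviour, not the library's implementation.

```java
import java.util.Locale;

public class LowerCaseSketch {

    // Hypothetical helper: applies Character.toLowerCase(int) to every code point,
    // so the result never depends on the default or any explicit Locale.
    static String toLowerCase(CharSequence string) {
        StringBuilder lowered = new StringBuilder(string.length());
        string.codePoints()
              .forEach(cp -> lowered.appendCodePoint(Character.toLowerCase(cp)));
        return lowered.toString();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-sensitive: in Turkish, 'I' maps to dotless i (U+0131).
        System.out.println("TITLE".toLowerCase(turkish).equals("t\u0131tle")); // true

        // Locale-independent: 'I' always maps to 'i'.
        System.out.println(toLowerCase("TITLE")); // title
    }
}
```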
-
toLowerCaseCharBuffer
-
toUpperCase
Converts a CharSequence to upper case, independent of the current Locale via Character.toUpperCase(char) which uses mapping information from the UnicodeData file.
Parameters:
string - The CharSequence to transform.
Returns:
The upper-cased String.
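One consequence of mapping character by character with Character.toUpperCase(char) is that only one-to-one mappings apply, so the output always has the same length as the input. The sketch below, with the hypothetical class UpperCaseSketch, contrasts this with String.toUpperCase, which also applies special-casing rules such as expanding the German sharp s. It illustrates the documented approach and is not taken from the OpenNLP source.

```java
import java.util.Locale;

public class UpperCaseSketch {

    // Hypothetical helper: upper-cases each char with Character.toUpperCase(char),
    // which only uses the one-to-one mappings from UnicodeData.
    static String toUpperCase(CharSequence string) {
        char[] upper = new char[string.length()];
        for (int i = 0; i < upper.length; i++) {
            upper[i] = Character.toUpperCase(string.charAt(i));
        }
        return new String(upper);
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe"; // "strasse" written with sharp s (U+00DF)

        // String.toUpperCase expands the sharp s to "SS": 7 characters.
        System.out.println(word.toUpperCase(Locale.ROOT).length()); // 7

        // Per-char mapping leaves the sharp s unchanged: 6 characters.
        System.out.println(toUpperCase(word).length()); // 6
    }
}
```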
-
isEmpty
Returns:
true if CharSequence.length() is 0 or null, otherwise false
Since:
1.5.1
-
levenshteinDistance
Computes the Levenshtein distance of two strings in a matrix. Based on pseudo-code that is in turn based on the paper Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.
Parameters:
wordForm - The form as input.
lemma - The target lemma.
Returns:
A 2-dimensional Levenshtein distance matrix.
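For readers unfamiliar with the Wagner-Fischer approach, the following sketch builds such a distance matrix with the textbook dynamic-programming recurrence. The class LevenshteinSketch and the method distanceMatrix are hypothetical names, and the matrix layout (rows for the source string, columns for the target) is an assumption; the library's own matrix may be laid out differently.

```java
public class LevenshteinSketch {

    // Textbook Wagner-Fischer recurrence: d[i][j] is the edit distance between
    // the first i chars of source and the first j chars of target.
    static int[][] distanceMatrix(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];

        for (int i = 0; i <= source.length(); i++) {
            d[i][0] = i; // delete all i chars
        }
        for (int j = 0; j <= target.length(); j++) {
            d[0][j] = j; // insert all j chars
        }
        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        // The bottom-right cell holds the distance between the full strings.
        System.out.println(d[6][7]); // 3
    }
}
```

Keeping the whole matrix, rather than only the final distance, is what makes it possible to trace back the individual edit operations afterwards.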
-
computeShortestEditScript
public static void computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)
Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
Parameters:
wordForm - The token.
lemma - The target lemma.
distance - A 2-dimensional Levenshtein distance matrix.
permutations - The number of permutations.
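The general idea of deriving an edit script from a distance matrix can be sketched by walking back from the bottom-right cell and recording one operation per step. Everything in this example is an assumption made for illustration: the class EditScriptSketch, its methods, and the single-letter operation codes (K keep, R replace, D delete, I insert) are hypothetical and do not reproduce the SES encoding that OpenNLP or Chrupala's thesis uses.

```java
public class EditScriptSketch {

    // Textbook Wagner-Fischer matrix (rows: word, columns: lemma).
    static int[][] distanceMatrix(String word, String lemma) {
        int[][] d = new int[word.length() + 1][lemma.length() + 1];
        for (int i = 0; i <= word.length(); i++) d[i][0] = i;
        for (int j = 0; j <= lemma.length(); j++) d[0][j] = j;
        for (int i = 1; i <= word.length(); i++) {
            for (int j = 1; j <= lemma.length(); j++) {
                int cost = word.charAt(i - 1) == lemma.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    // Backtraces through the matrix and returns one hypothetical op code per step:
    // K = keep, R = replace, D = delete from word, I = insert from lemma.
    static String editScript(String word, String lemma) {
        int[][] d = distanceMatrix(word, lemma);
        StringBuilder ops = new StringBuilder();
        int i = word.length();
        int j = lemma.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && word.charAt(i - 1) == lemma.charAt(j - 1)
                    && d[i][j] == d[i - 1][j - 1]) {
                ops.append('K');
                i--;
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.append('R');
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.append('D');
                i--;
            } else {
                ops.append('I');
                j--;
            }
        }
        // Operations were collected back to front.
        return ops.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(editScript("walked", "walk")); // KKKKDD
        System.out.println(editScript("cat", "cut"));     // KRK
    }
}
```

The number of non-K operations in the script equals the Levenshtein distance in the bottom-right cell of the matrix.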
-
decodeShortestEditScript
Reads the Shortest Edit Script (SES) predicted by a lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
Parameters:
wordForm - The wordForm as input.
permutations - The permutations predicted by the lemmatizer model.
Returns:
The decoded lemma.
-
getShortestEditScript
Parameters:
wordForm - The word as input.
lemma - The target lemma.
Returns:
The Shortest Edit Script (SES) required to go from a word to a lemma.
-