public class StringUtil extends Object
Constructor and Description |
---|
StringUtil() |
Modifier and Type | Method and Description |
---|---|
static void | computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)<br>Computes the Shortest Edit Script (SES) to convert a word into its lemma. |
static String | decodeShortestEditScript(String wordForm, String permutations)<br>Reads the SES predicted by the lemmatizer model and applies the permutations to obtain the lemma from the wordForm. |
static String | getShortestEditScript(String wordForm, String lemma)<br>Gets the SES required to go from a word to a lemma. |
static boolean | isEmpty(CharSequence theString) |
static boolean | isWhitespace(char charCode)<br>Determines if the specified character is a whitespace. |
static boolean | isWhitespace(int charCode)<br>Determines if the specified character is a whitespace. |
static int[][] | levenshteinDistance(String wordForm, String lemma)<br>Computes the Levenshtein distance of two strings in a matrix. |
static String | toLowerCase(CharSequence string)<br>Converts to lower case independent of the current locale via Character.toLowerCase(int) which uses mapping information from the UnicodeData file. |
static String | toUpperCase(CharSequence string)<br>Converts to upper case independent of the current locale via Character.toUpperCase(char) which uses mapping information from the UnicodeData file. |
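The summary describes levenshteinDistance as returning the distance "in a matrix" rather than as a single number. The sketch below shows one common way such a matrix is built with the textbook dynamic-programming recurrence. It is an illustration only: the class name LevenshteinSketch and the method name distanceMatrix are hypothetical, and the cell layout (rows for wordForm prefixes, columns for lemma prefixes) is an assumption, not a statement about the OpenNLP implementation.

```java
// Illustrative sketch; not the OpenNLP source.
final class LevenshteinSketch {

  // Fills an (n+1) x (m+1) table in which cell [i][j] holds the edit distance
  // between the first i characters of wordForm and the first j of lemma.
  static int[][] distanceMatrix(String wordForm, String lemma) {
    int n = wordForm.length();
    int m = lemma.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) {
      d[i][0] = i; // delete i characters to reach the empty prefix
    }
    for (int j = 0; j <= m; j++) {
      d[0][j] = j; // insert j characters starting from the empty prefix
    }
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = wordForm.charAt(i - 1) == lemma.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
            d[i - 1][j - 1] + cost);                    // match or substitution
      }
    }
    return d;
  }

  public static void main(String[] args) {
    int[][] d = distanceMatrix("walked", "walk");
    // The bottom-right cell is the distance between the full strings.
    System.out.println(d["walked".length()]["walk".length()]); // prints 2
  }
}
```

Keeping the whole table, and not just the final cell, is what makes it possible to trace back which insertions, deletions and substitutions lead from the word form to the lemma.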
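public static boolean isWhitespace(char charCode)
Determines if the specified character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It is a Character.isWhitespace(int) whitespace.
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP no-break spaces are also considered as white spaces.
Parameters:
charCode - the character to test

public static boolean isWhitespace(int charCode)
Determines if the specified character is a whitespace. A character is considered a whitespace when one of the following conditions is met:
- It is a Character.isWhitespace(int) whitespace.
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP no-break spaces are also considered as white spaces.
Parameters:
charCode - the code point to test

public static String toLowerCase(CharSequence string)
Converts to lower case independent of the current locale via Character.toLowerCase(int) which uses mapping information from the UnicodeData file.
Parameters:
string - the character sequence to convert

public static String toUpperCase(CharSequence string)
Converts to upper case independent of the current locale via Character.toUpperCase(char) which uses mapping information from the UnicodeData file.
Parameters:
string - the character sequence to convert

public static boolean isEmpty(CharSequence theString)
Returns:
true if CharSequence.length() is 0, otherwise false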
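
The whitespace, case-conversion and emptiness rules described above can be sketched with JDK calls alone. This is a minimal illustration under stated assumptions: the class StringUtilSketch is hypothetical, and its methods restate the documented rules rather than reproduce the OpenNLP source.

```java
// Illustrative sketch of the documented rules; not the OpenNLP source.
final class StringUtilSketch {

  // Whitespace per Character.isWhitespace(int), or any character in the
  // Unicode Zs category, which also covers no-break spaces.
  static boolean isWhitespace(int charCode) {
    return Character.isWhitespace(charCode)
        || Character.getType(charCode) == Character.SPACE_SEPARATOR;
  }

  // Lower-cases one code point at a time, so the default locale is never consulted.
  static String toLowerCase(CharSequence string) {
    StringBuilder sb = new StringBuilder(string.length());
    string.codePoints().forEach(cp -> sb.appendCodePoint(Character.toLowerCase(cp)));
    return sb.toString();
  }

  // True only for a zero-length sequence.
  static boolean isEmpty(CharSequence theString) {
    return theString.length() == 0;
  }

  public static void main(String[] args) {
    char noBreakSpace = '\u00A0';
    System.out.println(Character.isWhitespace(noBreakSpace)); // prints false
    System.out.println(isWhitespace(noBreakSpace));           // prints true
    System.out.println(toLowerCase("TITLE"));                 // prints title
    System.out.println(isEmpty(""));                          // prints true
  }
}
```

The per-character conversion matters under locales such as Turkish, where String.toLowerCase() with the default locale maps "I" to a dotless "ı" instead of "i".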
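
public static int[][] levenshteinDistance(String wordForm, String lemma)
Computes the Levenshtein distance of two strings in a matrix.
Parameters:
wordForm - the form
lemma - the lemma

public static void computeShortestEditScript(String wordForm, String lemma, int[][] distance, StringBuffer permutations)
Computes the Shortest Edit Script (SES) to convert a word into its lemma.
Parameters:
wordForm - the token
lemma - the target lemma
distance - the Levenshtein distance
permutations - the number of permutations

public static String decodeShortestEditScript(String wordForm, String permutations)
Reads the SES predicted by the lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
Parameters:
wordForm - the wordForm
permutations - the permutations predicted by the lemmatizer model

Copyright © 2021 The Apache Software Foundation. All rights reserved.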