Package opennlp.tools.ml.model
Class ModelParameterChunker
- java.lang.Object
-
- opennlp.tools.ml.model.ModelParameterChunker
-
public final class ModelParameterChunker extends Object
A helper class that handles Strings of more than 64k (65535 bytes) in length. This is achieved via the signature SIGNATURE_CHUNKED_PARAMS at the beginning of the String instance to be written to a DataOutputStream.
Background: In OpenNLP, for large(r) corpora, we train models whose (UTF String) parameters will exceed the MAX_CHUNK_SIZE_BYTES limit set in DataOutputStream. For writing and reading those models, we have to chunk up those String instances into 64kB blocks and recombine them correctly upon reading a (binary) model file.
The problem was raised in ticket OPENNLP-1366.
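The limit described above can be reproduced with the JDK alone. The following is a minimal sketch, not part of OpenNLP; the class name `WriteUtfLimitDemo` and the helper `fitsInSingleWrite` are made up for illustration. It relies only on the documented behaviour of `DataOutputStream.writeUTF`, which rejects strings whose encoded form is longer than 65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical demo class, not part of OpenNLP.
final class WriteUtfLimitDemo {

    /** Returns true if a single writeUTF call accepts the string, false if it is too long. */
    static boolean fitsInSingleWrite(String s) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream())) {
            dos.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded string needs more than 65535 bytes.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // ASCII characters take one byte each in the encoding used by writeUTF.
        System.out.println(fitsInSingleWrite("a".repeat(65_535))); // true
        System.out.println(fitsInSingleWrite("a".repeat(65_536))); // false
    }
}
```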
Solution strategy:
- If writing parameters to a DataOutputStream fails with a UTFDataFormatException, the large String instance is chunked up and written as appropriately sized blocks.
- To indicate that chunking was conducted, we start with the SIGNATURE_CHUNKED_PARAMS indicator, directly followed by the number of chunks used. This way, when reading in chunked model parameters, recombination is achieved transparently.
Note: Both existing (binary) model files and newly trained models which don't require the chunking technique will be supported as in previous OpenNLP versions.
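The strategy above can be sketched with plain `java.io` classes. This is an illustrative approximation under stated assumptions, not the OpenNLP implementation: the class `ChunkingSketch`, the marker value `"CHUNKED:"` and the character-based splitting rule are all hypothetical, since this page does not show the actual value of SIGNATURE_CHUNKED_PARAMS or how chunk boundaries are chosen.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical sketch of the chunking idea, not the OpenNLP code.
final class ChunkingSketch {

    // Placeholder marker (assumption); stands in for SIGNATURE_CHUNKED_PARAMS.
    static final String SIGNATURE = "CHUNKED:";

    // writeUTF needs at most 3 bytes per char, so 21845 chars always fit into 65535 bytes.
    static final int MAX_CHARS_PER_CHUNK = 65_535 / 3;

    static void write(DataOutputStream dos, String s) throws IOException {
        try {
            dos.writeUTF(s); // small strings are written 'as is'
        } catch (UTFDataFormatException e) {
            // Too long for one call: write the marker plus chunk count, then the chunks.
            int chunks = (s.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
            dos.writeUTF(SIGNATURE + chunks);
            for (int start = 0; start < s.length(); start += MAX_CHARS_PER_CHUNK) {
                int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
                dos.writeUTF(s.substring(start, end));
            }
        }
    }

    static String read(DataInputStream dis) throws IOException {
        String first = dis.readUTF();
        if (!first.startsWith(SIGNATURE)) {
            return first; // unchunked value, e.g. from an older model file
        }
        int chunks = Integer.parseInt(first.substring(SIGNATURE.length()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(dis.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String big = "x".repeat(200_000);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        write(new DataOutputStream(bos), big);
        String back = read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(back.equals(big));
    }
}
```

Splitting by character count keeps the sketch short at the cost of producing more chunks than strictly necessary for mostly ASCII data; a byte-accurate splitter would pack chunks more tightly.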
- Author:
- Martin Wiesner, Mark Struberg
-
-
Field Summary
Modifier and Type | Field
static String | SIGNATURE_CHUNKED_PARAMS
-
Method Summary
Modifier and Type | Method | Description
static String | readUTF(DataInputStream dis) | Reads model parameters from dis.
static void | writeUTF(DataOutputStream dos, String s) | Writes the model parameters to dos.
-
-
-
Field Detail
-
SIGNATURE_CHUNKED_PARAMS
public static final String SIGNATURE_CHUNKED_PARAMS
- See Also:
- Constant Field Values
-
-
Method Detail
-
readUTF
public static String readUTF(DataInputStream dis) throws IOException
Reads model parameters from dis. If the stream starts with SIGNATURE_CHUNKED_PARAMS, the number of chunks is detected and the original large parameter string is reconstructed from several chunks.
- Parameters:
- dis - The stream from which the model parameters are read.
- Throws:
- IOException
-
writeUTF
public static void writeUTF(DataOutputStream dos, String s) throws IOException
Writes the model parameters to dos. If s exceeds MAX_CHUNK_SIZE_BYTES in length, the chunking mechanism is used; otherwise the parameter is written 'as is'.
- Parameters:
- dos - The DataOutputStream which will be used to persist the model.
- s - The input string that is checked for length and chunked if MAX_CHUNK_SIZE_BYTES is exceeded.
- Throws:
- IOException
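Whether a string needs chunking depends on its encoded size, not on its character count. The sketch below shows one way to compute that size up front, using the modified UTF-8 rules that `DataOutputStream.writeUTF` applies. The class `EncodedLength`, its methods and the constant value 65535 are assumptions for illustration; this page does not state how ModelParameterChunker measures the length internally.

```java
// Hypothetical helper, not part of OpenNLP.
final class EncodedLength {

    // Assumed to correspond to MAX_CHUNK_SIZE_BYTES (the 65535-byte writeUTF limit).
    static final int MAX_BYTES = 65_535;

    /** Number of bytes writeUTF would need for the characters of s (excluding the 2-byte length prefix). */
    static long modifiedUtf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;
            } else if (c <= 0x07FF) {
                bytes += 2; // also covers '\u0000', which is encoded in two bytes
            } else {
                bytes += 3; // includes each half of a surrogate pair
            }
        }
        return bytes;
    }

    static boolean needsChunking(String s) {
        return modifiedUtf8Length(s) > MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(needsChunking("a".repeat(65_535)));      // false
        System.out.println(needsChunking("\u20ac".repeat(21_846))); // true
    }
}
```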
-
-