Package opennlp.tools.ml.model
Class ModelParameterChunker
- java.lang.Object
-
- opennlp.tools.ml.model.ModelParameterChunker
-
public final class ModelParameterChunker extends Object
A helper class that handles Strings of more than 64k (65535 bytes) in length. This is achieved via the signature SIGNATURE_CHUNKED_PARAMS at the beginning of the String instance to be written to a DataOutputStream.
Background: In OpenNLP, for large(r) corpora, we train models whose (UTF String) parameters will exceed the MAX_CHUNK_SIZE_BYTES limit set in DataOutputStream. For writing and reading those models, we have to chunk up those string instances in 64kB blocks and recombine them correctly upon reading a (binary) model file.
The problem was raised in ticket OPENNLP-1366.
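The limit described in the background can be reproduced with the JDK alone. The following is a minimal sketch, not OpenNLP code: the class name WriteUtfLimitDemo and the helper fitsInWriteUTF are hypothetical and exist only for this illustration. It relies on the documented behaviour of DataOutputStream.writeUTF, which rejects strings whose encoded form is longer than 65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

public class WriteUtfLimitDemo {

    /**
     * Hypothetical helper: reports whether a string of the given number of
     * ASCII characters (one encoded byte each) is accepted by writeUTF.
     */
    static boolean fitsInWriteUTF(int asciiChars) {
        StringBuilder sb = new StringBuilder(asciiChars);
        for (int i = 0; i < asciiChars; i++) {
            sb.append('a');
        }
        try (DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream())) {
            dos.writeUTF(sb.toString());
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded string is longer than 65535 bytes.
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUTF(65535)); // true
        System.out.println(fitsInWriteUTF(65536)); // false
    }
}
```

Characters outside the ASCII range take two or three encoded bytes each, so real model parameters can hit the limit with far fewer than 65535 characters.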
Solution strategy:
- If writing parameters to a DataOutputStream blows up with a UTFDataFormatException, a large String instance is chunked up and written as appropriate blocks.
- To indicate that chunking was conducted, we start with the SIGNATURE_CHUNKED_PARAMS indicator, directly followed by the number of chunks used. This way, when reading in chunked model parameters, recombination is achieved transparently.
Note: Both existing (binary) model files and newly trained models that don't require the chunking technique will be supported as in previous OpenNLP versions.
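The strategy can be sketched with JDK classes only. This is an illustration under stated assumptions, not the actual ModelParameterChunker implementation: the class ChunkedUtfSketch, its methods, and the marker value "CHUNKED-SKETCH:" are all hypothetical (the real value of SIGNATURE_CHUNKED_PARAMS is not shown on this page). For simplicity the sketch decides by character count whether to chunk, instead of reacting to a UTFDataFormatException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ChunkedUtfSketch {

    // Hypothetical marker, chosen for this sketch only.
    static final String SIGNATURE = "CHUNKED-SKETCH:";

    // A Java char needs at most 3 bytes in the encoding used by writeUTF,
    // so 21845 chars always fit into the 65535 byte limit.
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    static void writeChunked(DataOutputStream dos, String s) throws IOException {
        if (s.length() <= MAX_CHARS_PER_CHUNK) {
            dos.writeUTF(s); // small enough: written 'as is'
            return;
        }
        int chunks = (s.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
        // Header: marker directly followed by the number of chunks.
        dos.writeUTF(SIGNATURE + chunks);
        for (int i = 0; i < chunks; i++) {
            int start = i * MAX_CHARS_PER_CHUNK;
            int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
            dos.writeUTF(s.substring(start, end));
        }
    }

    static String readChunked(DataInputStream dis) throws IOException {
        String first = dis.readUTF();
        if (!first.startsWith(SIGNATURE)) {
            return first; // unchunked string
        }
        // Caveat of this sketch: an unchunked string that happens to start
        // with the marker would be misread as a header.
        int chunks = Integer.parseInt(first.substring(SIGNATURE.length()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(dis.readUTF());
        }
        return sb.toString();
    }

    // Writes s to an in-memory stream and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream dos = new DataOutputStream(bytes)) {
                writeChunked(dos, s);
            }
            try (DataInputStream dis =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readChunked(dis);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append('x');
        }
        String large = sb.toString();
        System.out.println(roundTrip(large).equals(large));   // true
        System.out.println(roundTrip("small").equals("small")); // true
    }
}
```

Splitting by a fixed character count is conservative: it may produce more chunks than strictly necessary for ASCII-heavy parameters, but it never yields a chunk that exceeds the byte limit.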
- Author:
- Martin Wiesner, Mark Struberg
-
-
Field Summary
Fields
- static String SIGNATURE_CHUNKED_PARAMS
-
Method Summary
Methods
- static String readUTF(DataInputStream dis): Reads model parameters from dis.
- static void writeUTF(DataOutputStream dos, String s): Writes the model parameter s to dos.
-
-
-
Field Detail
-
SIGNATURE_CHUNKED_PARAMS
public static final String SIGNATURE_CHUNKED_PARAMS
- See Also:
- Constant Field Values
-
-
Method Detail
-
readUTF
public static String readUTF(DataInputStream dis) throws IOException
Reads model parameters from dis. If the stream starts with SIGNATURE_CHUNKED_PARAMS, the number of chunks is detected and the original large parameter string is reconstructed from several chunks.
- Parameters:
- dis - The stream from which the model parameter is read.
- Throws:
- IOException
-
writeUTF
public static void writeUTF(DataOutputStream dos, String s) throws IOException
Writes the model parameter s to dos. If s exceeds MAX_CHUNK_SIZE_BYTES in length, the chunking mechanism is used; otherwise the parameter is written 'as is'.
- Parameters:
- dos - The DataOutputStream which will be used to persist the model.
- s - The input string that is checked for length and chunked if MAX_CHUNK_SIZE_BYTES is exceeded.
- Throws:
- IOException
-
-