Class DocumentCategorizerME

java.lang.Object
opennlp.tools.doccat.DocumentCategorizerME
All Implemented Interfaces:
DocumentCategorizer

public class DocumentCategorizerME extends Object implements DocumentCategorizer
A maximum entropy (MaxEnt) based implementation of DocumentCategorizer.
  • Constructor Details

    • DocumentCategorizerME

      public DocumentCategorizerME(DoccatModel model)
      Initializes a DocumentCategorizerME instance with a doccat model. Default feature generation is used.
      Parameters:
      model - the DoccatModel to be used for categorization.
  • Method Details

    • categorize

      public double[] categorize(String[] text, Map<String,Object> extraInformation)
      Categorizes the given text, provided as tokens, using the supplied extra information.
      Specified by:
      categorize in interface DocumentCategorizer
      Parameters:
      text - The text tokens to categorize.
      extraInformation - Additional information for context to be used by the feature generator.
      Returns:
      The per category probabilities.
    • categorize

      public double[] categorize(String[] text)
      Description copied from interface: DocumentCategorizer
      Categorizes the given text, provided as separate tokens.
      Specified by:
      categorize in interface DocumentCategorizer
      Parameters:
      text - The tokens of text to categorize.
      Returns:
      The per category probabilities.
    • scoreMap

      public Map<String,Double> scoreMap(String[] text)
      Description copied from interface: DocumentCategorizer
      Retrieves a Map in which the key is the category name and the value is the score.
      Specified by:
      scoreMap in interface DocumentCategorizer
      Parameters:
      text - The tokenized input text to classify.
      Returns:
      A Map with the category name as the key and the score as the value.
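      The shape of the returned Map can be pictured with a minimal sketch. ScoreMapSketch and toScoreMap are hypothetical names, not part of OpenNLP, and the sketch assumes that categories[i] names the category whose score is scores[i]. It illustrates the data structure only and is not the library's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: pairs category names with scores.
final class ScoreMapSketch {

    // Assumption: categories[i] is the name of the category scored by scores[i].
    static Map<String, Double> toScoreMap(String[] categories, double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores must have equal length");
        }
        // LinkedHashMap keeps the categories in index order, which is convenient for printing.
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < categories.length; i++) {
            result.put(categories[i], scores[i]);
        }
        return result;
    }
}
```

      A caller would then look up a score by category name, for example toScoreMap(names, probs).get("sports").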
    • sortedScoreMap

      public SortedMap<Double,Set<String>> sortedScoreMap(String[] text)
      Description copied from interface: DocumentCategorizer
      Retrieves a SortedMap of the scores sorted in ascending order, together with their associated categories.

      Many categories can have the same score, hence the Set as value.

      Specified by:
      sortedScoreMap in interface DocumentCategorizer
      Parameters:
      text - The tokenized input text to classify.
      Returns:
      A SortedMap with the score as a key.
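      The grouping of tied categories under one score can be sketched as follows. SortedScoreMapSketch and toSortedScoreMap are hypothetical names, not part of OpenNLP, and the sketch assumes that categories[i] corresponds to scores[i]. It shows one way to build such a structure, not the library's implementation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenNLP: groups category names by score.
final class SortedScoreMapSketch {

    // Assumption: categories[i] is the name of the category scored by scores[i].
    static SortedMap<Double, Set<String>> toSortedScoreMap(String[] categories, double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores must have equal length");
        }
        // TreeMap orders its keys ascending, so lastKey() is the highest score.
        SortedMap<Double, Set<String>> result = new TreeMap<>();
        for (int i = 0; i < categories.length; i++) {
            // Categories with an identical score end up in the same Set.
            result.computeIfAbsent(scores[i], k -> new HashSet<>()).add(categories[i]);
        }
        return result;
    }
}
```

      Because the keys are in ascending order, the top-scoring categories are found at get(lastKey()).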
    • getBestCategory

      public String getBestCategory(double[] outcome)
      Description copied from interface: DocumentCategorizer
      Retrieves the best category from previously generated outcome probabilities.
      Specified by:
      getBestCategory in interface DocumentCategorizer
      Parameters:
      outcome - An array of computed outcome probabilities.
      Returns:
      The best category represented as String.
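      One plausible reading of "best" is the category with the highest probability, which amounts to an argmax over the outcome array. The sketch below makes that assumption explicit. BestCategorySketch and bestCategory are hypothetical names, not part of OpenNLP, and the tie-breaking rule (first index wins) is a choice of this sketch, not documented library behavior.

```java
// Hypothetical helper, not part of OpenNLP: picks the highest-probability category.
final class BestCategorySketch {

    // Assumption: categories[i] is the name of the category whose probability is outcome[i].
    static String bestCategory(String[] categories, double[] outcome) {
        if (outcome.length == 0 || categories.length != outcome.length) {
            throw new IllegalArgumentException("need non-empty arrays of equal length");
        }
        int best = 0;
        for (int i = 1; i < outcome.length; i++) {
            // Strict comparison: on a tie the earlier index is kept.
            if (outcome[i] > outcome[best]) {
                best = i;
            }
        }
        return categories[best];
    }
}
```

      With the real class, the equivalent two-step call would pass the array returned by categorize to getBestCategory.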
    • getIndex

      public int getIndex(String category)
      Description copied from interface: DocumentCategorizer
      Retrieves the index of a certain category.
      Specified by:
      getIndex in interface DocumentCategorizer
      Parameters:
      category - The category for which the index is to be found.
      Returns:
      The index of the given category.
    • getCategory

      public String getCategory(int index)
      Description copied from interface: DocumentCategorizer
      Retrieves the category at a given index.
      Specified by:
      getCategory in interface DocumentCategorizer
      Parameters:
      index - The index for which the category shall be found.
      Returns:
      The category represented as String.
    • getNumberOfCategories

      public int getNumberOfCategories()
      Description copied from interface: DocumentCategorizer
      Retrieves the number of categories.
      Specified by:
      getNumberOfCategories in interface DocumentCategorizer
      Returns:
      The number of categories.
    • getAllResults

      public String getAllResults(double[] results)
      Description copied from interface: DocumentCategorizer
      Retrieves a String that lists all categories together with their associated probabilities.
      Specified by:
      getAllResults in interface DocumentCategorizer
      Parameters:
      results - The probabilities of each category.
      Returns:
      A String containing each category name paired with its probability.
    • train

      public static DoccatModel train(String lang, ObjectStream<DocumentSample> samples, TrainingParameters mlParams, DoccatFactory factory) throws IOException
      Trains a DoccatModel with the given parameters.
      Parameters:
      lang - The ISO-conformant language code.
      samples - The ObjectStream of DocumentSample used as input for training.
      mlParams - The TrainingParameters for the context of the training.
      factory - The DoccatFactory for creating related objects defined via mlParams.
      Returns:
      A valid, trained DoccatModel instance.
      Throws:
      IOException - Thrown if an I/O error occurs.