I'm trying to do binary classification with OpenNLP. I have already successfully classified movies into different genres. The training data has the form:

Genre1 Description 1
Genre1 Description 2
Genre2 Description 3
etc.

When I evaluate my model on another movie description, it is classified correctly:

Thriller : 2.1301979251908337E-14
Romantic : 0.9999999999999787
---------------------------------

Romantic : is the predicted category for the given sentence.


But how can I apply a binary classification, where the result should be either yes or no? So in my case, I would like to know if a movie belongs to the genre romantic or not. I'm not interested in the other genres.

When I just delete all the other genres from my data sample:

Genre1 Description 1
Genre1 Description 2

I get the following error:

Training data must contain more than one outcome

This is my code (source):

public static void main(String[] args) {
try {
    // read the training data
    Path path = Paths.get("C:", "Users");

    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File(path + File.separator + "training" + File.separator + "en-movie-categoryNew.train"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

    // define the training parameters
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 10 + "");
    params.put(TrainingParameters.CUTOFF_PARAM, 0 + "");
    params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

    // create a model from training data
    DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
    System.out.println("\nModel is successfully trained.");

    // save the model to local
    File modelFile = new File(path + File.separator + "model" + File.separator + "en-movie-classifier-maxent.bin");
    // try-with-resources closes (and flushes) the stream even if serialization fails
    try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
        model.serialize(modelOut);
    }
    System.out.println("\nTrained Model is saved locally at : " + modelFile);

    // test the model file by subjecting it to prediction
    DocumentCategorizer doccat = new DocumentCategorizerME(model);
    String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");
    double[] aProbs = doccat.categorize(docWords);

    // print the probabilities of the categories
    System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
    for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
        System.out.println(doccat.getCategory(i) + " : " + aProbs[i]);
    }
    System.out.println("---------------------------------");

    System.out.println("\n" + doccat.getBestCategory(aProbs) + " : is the predicted category for the given sentence.");
} catch (IOException e) {
    System.out.println("An exception in reading the training file. Please check.");
    e.printStackTrace();
}

}

gaout5
  • 3

1 Answer

Most machine learning algorithms are "soft" classifiers rather than "hard" ones. Instead of returning the predicted class (or a yes/no answer in the binary case), they return a score. The score is often the probability that the sample belongs to the class, as in your example. To convert the score into a hard classification, you need a decision function. The simplest choice for binary classification is a 0.5 cutoff: a probability > 0.5 means that the model predicts the positive class. Note, however, that the 0.5 threshold is quite arbitrary, as discussed in the thread "Why P>0.5 cutoff is not "optimal" for logistic regression?".
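
As a rough illustration of such a decision function, here is a minimal sketch in plain Java. It assumes you already have the category names and probabilities (for example from `getCategory(i)` and `categorize(...)` in the question's code); the class name `ThresholdDecision`, the helper names, and the 0.5 default are hypothetical choices for this example, not part of OpenNLP.

```java
public class ThresholdDecision {

    // Returns the probability assigned to the named category, or 0.0 if it is absent.
    static double probabilityOf(String[] categories, double[] probs, String name) {
        for (int i = 0; i < categories.length; i++) {
            if (categories[i].equals(name)) {
                return probs[i];
            }
        }
        return 0.0;
    }

    // Hard decision: true when the positive-class probability exceeds the cutoff.
    static boolean isPositive(double positiveProbability, double cutoff) {
        return positiveProbability > cutoff;
    }

    public static void main(String[] args) {
        // Probabilities copied from the output shown in the question.
        String[] categories = {"Thriller", "Romantic"};
        double[] probs = {2.1301979251908337E-14, 0.9999999999999787};

        double cutoff = 0.5; // arbitrary default; ideally tuned on held-out data
        double romantic = probabilityOf(categories, probs, "Romantic");
        System.out.println(isPositive(romantic, cutoff) ? "yes" : "no");
    }
}
```

Keeping the cutoff as a parameter makes it easy to move away from 0.5 if, say, false positives are more costly than false negatives in your application.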

Tim
  • 138,066
  • Ok, I understand the classification part. But how can I modify my training data, so that it accepts only 1 category (i.e. romantic)? I'm not interested in the other categories. So my question is a little OpenNLP-specific. – gaout5 Jun 30 '21 at 08:56
  • @gaout5 if you modify the training data so that it contains only samples of a single class, the only thing the model could learn is that it always saw romantic movies in the training set, so it would always predict the "romantic" class. In practice it does not even get that far: the code simply fails. A classifier always predicts the probability of some class versus the other classes. If you just want to detect things that aren't romantic movies, you could use anomaly-detection algorithms instead. – Tim Jun 30 '21 at 09:01
  • Note, however, that treating a classification problem as anomaly detection is usually not the best idea. – Tim Jun 30 '21 at 09:02
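
Since the classifier always needs some class to contrast with the others, one possible way to get a romantic-vs-rest model is to keep every description and collapse all non-romantic genres into a single negative label before training. The sketch below does this for lines in the "Genre Description" format from the question. The class name `BinaryRelabel`, the label `NotRomantic`, and the sample descriptions are made up for illustration; this is not something the answer above prescribes.

```java
public class BinaryRelabel {

    // Keeps the label when it equals the target genre; otherwise replaces it
    // with negativeLabel. The description after the first token is left as-is.
    static String relabel(String line, String target, String negativeLabel) {
        String[] parts = line.trim().split("\\s+", 2); // [label, description]
        String label = parts[0].equals(target) ? target : negativeLabel;
        return parts.length > 1 ? label + " " + parts[1] : label;
    }

    public static void main(String[] args) {
        // Hypothetical training lines; real data would be read from the training file.
        String[] lines = {
            "Romantic Two strangers fall in love",
            "Thriller A detective hunts a killer",
            "Comedy A wedding goes wrong"
        };
        for (String line : lines) {
            System.out.println(relabel(line, "Romantic", "NotRomantic"));
        }
    }
}
```

The relabelled lines contain two outcomes, so they could be written to a new training file and fed to the same training code as in the question; the "Romantic" probability can then be read as the yes/no score.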