
I know of four possible reasons:

  • overfitting
  • underfitting
  • input data doesn't represent the problem (which I guess is underfitting)
  • classifier isn't suitable (e.g. problem is not linear)

Are there any other reasons why a classifier produces bad results? When should one try a different model? Could you point me to some good literature I could read?

DerTom
    What do you mean by "bad" results? – Tim Nov 03 '15 at 11:26
  • @Tim By bad results I mean producing wrong predictions regarding a problem. What are other meanings of "bad" results? – DerTom Nov 03 '15 at 11:29
  • @Tim I see, well in this matter I am looking at all different kind of models and the goal is to achieve the best possible prediction. – DerTom Nov 03 '15 at 12:06
  • 2
    To @tim's point, "bad" is a subjective and nontechnical term. This paper from the Jour ML Research Do we Need Hundreds of Classiers to Solve Real World Classication Problems? reviews and compares a boatload of methods with the goal of objectively identifying the "best" performing algorithm. Random Forests won. http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf – user78229 Nov 03 '15 at 13:57
  • Interesting. I might add "the input data is unmodelable", i.e. it is nothing but random noise. The definition of "bad" is worth examining too, of course. – Andrew Charneski Nov 03 '15 at 16:12
  • @AndrewCharneski isn't that point number 3: input data doesn't represent the problem? or am I missing something? – DerTom Nov 03 '15 at 21:11

1 Answer


You either fail to capture patterns that explain the outcome (underfitting) or fit to coincidences that will not be present in new data (overfitting). All four bullet points correspond to one of these.
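The two failure modes show up clearly when you vary model capacity and compare training to held-out performance. A minimal sketch (assuming scikit-learn and a synthetic dataset, not any data from the question): a too-shallow decision tree scores poorly on both training and test sets (underfitting), while an unconstrained one memorizes the training set but generalizes worse (overfitting).

```python
# Sketch: under- vs. overfitting as a function of model capacity.
# Assumes scikit-learn; the dataset here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Too little capacity: one split cannot capture the signal (underfitting).
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

# Unlimited capacity: the tree memorizes the training data (overfitting),
# so training accuracy is perfect but test accuracy drops.
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

print("shallow: train", shallow.score(X_tr, y_tr), "test", shallow.score(X_te, y_te))
print("deep:    train", deep.score(X_tr, y_tr), "test", deep.score(X_te, y_te))
```

The telltale pattern is the gap between training and test accuracy: small for the underfit model (both low), large for the overfit one.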

The first two are obvious. The third could be either overfitting or underfitting. It would be underfitting if you simply do not measure an important determinant of the outcome of interest, and it would be overfitting if you have biased data that lead to a model that only applies to a subset of the population where you would want to apply the model (e.g., training using data collected from children, applying to children and adults). The fourth is another example of underfitting. Plenty of models allow you to model nonlinear boundaries between groups. If you force the boundary to be a line when it should be curved, you have underfit.
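The nonlinear-boundary case can be demonstrated directly. A sketch (again assuming scikit-learn, with a toy dataset whose two classes are separated by a curved boundary): a logistic regression is forced to draw a straight line and underfits, while an RBF-kernel SVM can follow the curve.

```python
# Sketch: a linear boundary underfits data separated by a curved boundary.
# Assumes scikit-learn; make_moons is a standard toy dataset for this.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaving half-moons: no straight line separates them well.
X, y = make_moons(n_samples=500, noise=0.15, random_state=0)

linear = LogisticRegression().fit(X, y)    # linear decision boundary
curved = SVC(kernel="rbf").fit(X, y)       # boundary may bend with the data

print("linear boundary accuracy:", round(linear.score(X, y), 2))
print("curved boundary accuracy:", round(curved.score(X, y), 2))
```

The kernel model outscores the linear one not because it is "better" in general, but because the linear model's hypothesis class cannot express the true boundary here.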

Note that you can simultaneously overfit and underfit.

Dave