
I've been analyzing a data set of ~400k records and 9 variables. The dependent variable is binary. I've fitted a logistic regression, a regression tree, a random forest, and a gradient boosted tree. All of them give virtually identical goodness-of-fit numbers when I validate them on another data set.

Why is this so? I'm guessing that it's because my observations-to-variables ratio is so high. If this is correct, at what observations-to-variables ratio will different models start to give different results?

IgorS
JenSCDC

4 Answers


This result means that whatever method you use, you are able to get reasonably close to the optimal decision rule (aka the Bayes rule). The underlying reasons have been explained in Hastie, Tibshirani and Friedman's "Elements of Statistical Learning". They demonstrate how the different methods perform by comparing Figs. 2.1, 2.2, 2.3, 5.11 (in my first edition -- in the section on multidimensional splines), 12.2, 12.3 (support vector machines), and probably some others. If you have not read that book, you need to drop everything RIGHT NOW and read it. (I mean, it isn't worth losing your job over, but it is worth missing a homework or two if you are a student.)

I don't think the observations-to-variables ratio is the explanation. In light of the rationale offered above, it is more likely the relatively simple form of the boundary separating your classes in the multidimensional space, which all of the methods you tried have been able to identify.
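As a quick sanity check of this point, here is a sketch with synthetic data and scikit-learn. Everything in it is an illustrative assumption, not the OP's actual setup: a made-up data set with 9 predictors, a deliberately simple (linear) optimal boundary, and default-ish model settings. When the boundary is that simple, four very different classifiers land within a few points of each other on held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 9))                 # 9 predictors, as in the question
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # simple linear optimal boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(),
    "tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
# Validation accuracy for each model on the same hold-out set
accs = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(accs)
```

With a more convoluted boundary (see the x * y > 0 example discussed in the comments below), the same four models would spread apart.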

StasK
  • I'll ask my boss if I can get the company to pay for it. – JenSCDC Aug 19 '14 at 18:39
  • ESL is 'free' as a pdf from their homepage...also worth downloading is ISL (by many of same authors) - more practical http://www-bcf.usc.edu/~gareth/ISL/ – seanv507 Aug 20 '14 at 12:02

It's also worth looking at the training errors.

Basically, I disagree with your analysis. If logistic regression etc. are all giving the same results, it would suggest that the 'best model' is a very simple one (one that all models can fit equally well -- e.g. essentially linear).

So the question might then be: why is the best model a simple one? It might suggest that your variables are not very predictive. It's of course hard to analyse without knowing the data.
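The "not very predictive" diagnosis above can be illustrated with a sketch (synthetic data and scikit-learn; the data set, signal strength, and model choices are all made-up assumptions, not the OP's). When the labels are mostly noise with a weak signal, a flexible model like a random forest still drives its training error to nearly zero, yet its test accuracy collapses to the same mediocre level as logistic regression -- which is exactly why comparing training errors is informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 9))
p = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))   # weak signal from a single variable
y = (rng.random(10000) < p).astype(int)    # labels are mostly noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

scores = {}
for name, m in [("logistic", LogisticRegression()),
                ("forest", RandomForestClassifier(random_state=1))]:
    m.fit(X_tr, y_tr)
    # (training accuracy, test accuracy)
    scores[name] = (m.score(X_tr, y_tr), m.score(X_te, y_te))
print(scores)
```

The forest memorizes the training noise (training accuracy near 1.0) but gains nothing on the hold-out set; both models end up with similar, modest test accuracy.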

seanv507

As @seanv507 suggested, the similar performance may simply be due to the data being best separated by a linear model. But in general, the statement that it is because the "observations to variable ratio is so high" is incorrect. Even as your ratio of sample size to number of variables goes to infinity, you should not expect different models to perform nearly identically, unless they all provide the same predictive bias.

bogatron
  • I just edited my question to add that the dependent variable is binary. Hence, a linear model isn't suitable. – JenSCDC Aug 18 '14 at 17:36
  • " you should not expect different models to perform nearly identically, unless they all provide the same predictive bias." I used MAE and the ratio of actual to predicted outcomes as validation measures and the ratios were very close. – JenSCDC Aug 18 '14 at 17:40
  • Andy, I would include logistic regression (and linear SVM) as 'linear' model. They are all just separating the data by a weighted sum of the inputs. – seanv507 Aug 18 '14 at 17:59
  • @seanv507 Exactly - the decision boundary is still linear. The fact that binary classification is being performed doesn't change that. – bogatron Aug 18 '14 at 18:07
  • What about trees? They really don't seem linear to me. – JenSCDC Aug 19 '14 at 14:52
  • Random forests can approximate a linear boundary pretty well. But there are no details in your question regarding characteristics of the data, model parameters, or model performance such that anyone can give a definitive answer as to why they all perform similarly. – bogatron Aug 19 '14 at 15:10
  • With regard to the observations-to-variables ratio, consider a 2-variable problem where x * y > 0 is classified as True and x * y <= 0 as False. Even as sample size goes to infinity, a logistic regression model will, on average, have accuracy no better than 0.5, whereas a 2x2x2 neural network will have accuracy approaching 1.0. One can easily construct such examples where a sufficiently high ratio of sample size to number of variables will not guarantee similar performance of classifiers. – bogatron Aug 19 '14 at 15:19
  • bogatron- could you elaborate? I'm not quite sure what you're asking for. – JenSCDC Aug 19 '14 at 18:37
  • It may be that for your data set, each of the models can accurately approximate the optimal decision boundary (see answer by @StasK), though I wouldn't say that for sure without looking at the model parameters and doing a comparative error analysis. My main point was simply that a high sample-to-variable ratio does not guarantee similar performance for a given set of classifiers because they each have different biases/limitations on the kind of solutions they can produce. – bogatron Aug 19 '14 at 19:17
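The x * y > 0 counterexample from the comment thread above is easy to reproduce as a sketch (synthetic data and scikit-learn; a random forest is used here in place of the 2x2x2 neural network mentioned, which is a substitution, not what the commenter ran). The linear decision boundary of logistic regression cannot separate opposite quadrants, while a tree ensemble approximates the boundary almost perfectly -- so a huge observations-to-variables ratio does not make the two models agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(10000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # True in quadrants I and III

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Any linear boundary misclassifies about half of this data
logit_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
# Axis-aligned splits carve out the quadrants easily
forest_acc = RandomForestClassifier(random_state=2).fit(X_tr, y_tr).score(X_te, y_te)
print(logit_acc, forest_acc)
```

Adding more observations only sharpens this gap; it never closes it, because the limitation is the model's bias, not the sample size.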

I'm guessing that it's because my observations to variable ratio is so high.

I think this explanation makes perfect sense.

If this is correct, at what observation to variable ratio will different models start to give different results?

This will probably depend very much on your specific data (for instance, whether your nine variables are continuous, factors, ordinal or binary), as well as any tuning decisions you made while fitting your models.

But you can play around with the observation-to-variable ratio - not by increasing the number of variables, but by decreasing the number of observations. Randomly draw 100 observations, fit models and see whether different models yield different results. (I guess they will.) Do this multiple times with different samples drawn from your total number of observations. Then look at subsamples of 1,000 observations... 10,000 observations... and so forth.
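The subsampling experiment described above can be sketched as follows (synthetic data and scikit-learn; the data set, the two models chosen, and the subsample sizes are illustrative assumptions -- with the OP's real data one would also repeat each size with many redraws, as suggested). At small subsamples the models' hold-out accuracies spread apart; by the time the subsample is large, they converge.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(50000, 9))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=50000) > 0).astype(int)

X_pool, y_pool = X[:40000], y[:40000]   # draw training subsamples from here
X_te, y_te = X[40000:], y[40000:]       # fixed hold-out set

results = {}
for n in (100, 1000, 10000):
    idx = rng.choice(40000, size=n, replace=False)
    accs = {}
    for name, m in [("logistic", LogisticRegression()),
                    ("forest", RandomForestClassifier(random_state=3))]:
        m.fit(X_pool[idx], y_pool[idx])
        accs[name] = m.score(X_te, y_te)
    results[n] = accs
print(results)
```

Evaluating every subsample against the same fixed hold-out set keeps the comparison fair across sizes.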

Stephan Kolassa
  • Hm, why is that? More observations seem to increase the chance that the decision boundary is more complex -- i.e. definitely not linear. And these models do different things in complex cases, and tend to do the same in simple ones. – Sean Owen Aug 20 '14 at 09:53
  • @SeanOwen: I think I'm not understanding your comment. What part of my answer does "why is that" refer to? The OP said nothing about using linear decision boundaries - after all, he might be transforming predictors in some way. – Stephan Kolassa Aug 20 '14 at 10:31
  • Why would more observations make different classifiers give more similar decisions? My intuition is the opposite. Yes, I'm not thinking of just linear decision boundaries. The more complex the optimal boundary, the less likely they will all fit something similar to that boundary. And the boundary tends to be more complex with more observations. – Sean Owen Aug 20 '14 at 10:51