3

I am using linear regression to fit a y = mx + b line through my data, and I just want to know how good a fit my best linear line is. So I thought I would just use clf.score(X_train, y_train) on the points I've already used to train my algorithm; I just want to see how my line compares to the average-y baseline. Do I need to split my data into train and test data, and then score on the test set? Or should I just test on my train data, because it can't deviate from the line anyway? And why?
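Roughly what I have in mind (a minimal sketch with made-up data, just to show what I mean):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up 1-D data, only for illustration
    X_train = np.arange(20.0).reshape(-1, 1)
    y_train = 3.0 * X_train.ravel() + 2.0 + np.random.randn(20)

    clf = LinearRegression().fit(X_train, y_train)

    # .score() returns R^2: how much better the fitted line predicts y
    # than the flat baseline that always predicts the mean of y
    print(clf.score(X_train, y_train))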

2 Answers

5

If you're not trying to generalise to new data, then you don't need to.

If you are trying to generalise to new data, and if your algorithm has no hyper-parameters (i.e. settings you can tweak), then you don't need to.

If you are trying to generalise to new data, and (as is usual), you have hyper-parameters to tune, then you need to.

For example, if you were using regularised linear regression (a.k.a. "ridge" regression), then you would need some way of choosing the regularisation parameter such that the model remains valid when tested on new data, rather than just fitting the "training" data perfectly.
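As a minimal sketch of that idea in scikit-learn (the data here is made up; RidgeCV picks the regularisation strength by internal cross-validation, and the held-out test set then judges the result on data the model never saw):

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import train_test_split

    # Toy data standing in for your X, y
    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.randn(100)

    # Hold out a test set so the chosen alpha can be judged on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # RidgeCV selects alpha by cross-validation within the training data
    model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
    print(model.alpha_, model.score(X_test, y_test))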

tdc
  • 7,569
  • 1
    To add to this answer: Apart from regularization, other possible hyper-parameters for linear regression could be: adding/not adding an intercept, or using another error metric (e.g. MAE instead of MSE) – Denwid Apr 02 '18 at 22:08
0

I just want to know how good a fit my best linear line is

It might seem that there is only a single way to fit a best-fit line to your data. In fact that isn't true, because you can vary the kind and strength of regularization and so on. But even if it were true, your training data is a single draw from the underlying production distribution. You don't know how well that draw matches the actual distribution. It will probably match somewhat, but it clearly can't match perfectly, since the distribution itself is continuous (I assume?), whereas your drawn examples are discrete, Dirac-delta-type points.

To estimate how well your line of best fit will generalize to new data, you'll therefore almost certainly want to do some kind of cross-fold validation and/or bootstrapping, to get some measure of what happens to the line, and to the test scores, under different draws.
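A sketch of the bootstrapping side of that (made-up data; resample the training set with replacement and watch how much the fitted coefficients move around across draws):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(50, 1))
    y = 3.0 * X.ravel() + 2.0 + rng.randn(50)

    slopes = []
    for _ in range(1000):
        # Draw a bootstrap sample: same size, sampled with replacement
        idx = rng.randint(0, len(X), size=len(X))
        m = LinearRegression().fit(X[idx], y[idx])
        slopes.append(m.coef_[0])

    # Spread of the slope across bootstrap draws estimates its sampling variability
    print(np.mean(slopes), np.std(slopes))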

A simple standard approach is cross-validation: randomly split the data you have into e.g. 80% train and 20% test, train on the training portion, and evaluate on the test portion. Do this 5 times with different random splits, and average the score from each split.
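A minimal version of that using scikit-learn's helpers (again with toy data; ShuffleSplit does the repeated random 80/20 splitting, and cross_val_score reports the held-out R^2 for each split):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(50, 1))
    y = 3.0 * X.ravel() + 2.0 + rng.randn(50)

    # 5 random 80/20 train/test splits; the score is R^2 on each held-out 20%
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv)
    print(scores, scores.mean())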

Hugh Perkins
  • 4,697