I have a 70K x 30 dataset and I want to build a regression model on it. Right now, I am running a number of algorithms in the Weka tool with cross-validation and comparing the RMSE values reported by Weka in order to decide which model works best.

However, after experimenting with a multilayer perceptron, linear regression, and several tree-based algorithms, the best performance I got was from the k-NN algorithm. Since this algorithm is very naive and instance-based, I am not sure whether just comparing RMSE is the right way.

When evaluating regression models, what kind of process should I follow?

1 Answer

As long as you repeat 10-fold cross-validation many times to achieve adequate precision, RMSE is a good measure for comparison, as are mean absolute error and median absolute error, the latter two being more robust.
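The repeated cross-validation the answer describes can be sketched as follows. This is a minimal, illustrative numpy implementation (not Weka's): the `repeated_cv_errors` helper, the least-squares model, and the synthetic data are all assumptions for demonstration, and the three metrics are the ones named above (RMSE, mean absolute error, median absolute error).

```python
import numpy as np

def repeated_cv_errors(X, y, fit, predict, k=10, repeats=5, seed=0):
    """Repeat k-fold cross-validation `repeats` times with fresh shuffles;
    return the average RMSE, mean absolute error, and median absolute error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rmse, mae, medae = [], [], []
    for _ in range(repeats):
        idx = rng.permutation(n)           # new random fold assignment each repeat
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            model = fit(X[train], y[train])
            err = predict(model, X[fold]) - y[fold]
            rmse.append(np.sqrt(np.mean(err ** 2)))
            mae.append(np.mean(np.abs(err)))
            medae.append(np.median(np.abs(err)))
    return np.mean(rmse), np.mean(mae), np.mean(medae)

# A simple least-squares linear model as the candidate under comparison.
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda beta, X: np.c_[np.ones(len(X)), X] @ beta

# Synthetic data: linear signal plus Gaussian noise with sd 0.3.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

print(repeated_cv_errors(X, y, fit, predict))
```

Running the same loop for each candidate model on the same fold assignments gives directly comparable error estimates; averaging over repeats reduces the variance of the comparison.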

Frank Harrell
  • The Weka tool also reports mean/median absolute error, so I can compare them. I see a correlation coefficient value reported as well. Does it play a role in comparing two regression models? – Weka Jun 12 '13 at 04:24
  • $R^2$ is another excellent measure, but if calculated the usual way it allows for a linear recalibration of the predictions, so it does not penalize predictions for being off by a constant or a constant multiple. – Frank Harrell Jun 12 '13 at 12:41
  • @FrankHarrell What would you consider the usual way of calculating $R^2$, $cor(\hat y, y)?$ – Dave Nov 16 '21 at 14:31
  • Correct. Instead you need to go back to basics, i.e., $R^2$ is one minus the sum of squared errors divided by the total sum of squares (the latter being proportional to the variance of Y in the new sample). This correct formula can yield negative $R^2$ (predictions worse than random) which is how you know it's correct. – Frank Harrell Nov 17 '21 at 12:50
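The distinction in these comments can be shown numerically. Below is a small illustrative sketch (the data are made up): a prediction that is perfectly correlated with the outcome but off by a constant scores a squared correlation near 1, while the from-basics formula $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ correctly goes negative, i.e. worse than predicting the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
yhat = y + 5.0  # perfectly correlated with y, but off by a constant

# "Usual" way: squared correlation -- blind to the constant offset.
r2_cor = np.corrcoef(yhat, y)[0, 1] ** 2

# From basics: one minus sum of squared errors over total sum of squares.
sse = np.sum((y - yhat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2_proper = 1 - sse / sst

print(r2_cor)     # essentially 1 -- looks like a perfect model
print(r2_proper)  # strongly negative -- the miscalibration is penalized
```

Only the second formula can go below zero, which, as the comment notes, is how you know it is measuring predictive accuracy rather than mere association.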