
I am sorry, I have a simple question that I am confused about (I AM STILL A BEGINNER):

When I create a model, say a decision tree, and specify random_state=integer to get reproducible outputs, then run cross-validation (say k-fold with k=5) where I also specify random_state=integer to get reproducible splits, and finally take the average R^2 across my k folds, is this enough to give me a clue about how good my model is?

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
import numpy as np

new_model = DecisionTreeRegressor(max_depth=9, min_samples_split=2, random_state=0)

crossvalidation_Decision_Trees = KFold(n_splits=5, random_state=0, shuffle=True)
model2 = new_model.fit(X_normalized, y_for_normalized)

scores_D_Trees = cross_val_score(model2, X_normalized, y_for_normalized,
                                 scoring='r2', cv=crossvalidation_Decision_Trees, n_jobs=1)

print("\n\nDecision Trees" + ": R^2 for every fold: " + str(scores_D_Trees))

print('\033[1m' + "Decision Trees" + '\033[1m' + ": Average R^2 for all the folds: "
      + str(np.mean(scores_D_Trees)) + '\033[0m' + ", STD: " + str(np.std(scores_D_Trees)))

OR: Should I remove the random_state from my decision tree model AND from my CV, let the code take different training and testing splits every time I run it, repeat that many times (say 5 iterations), and at the end average the per-run average R^2 of my k folds over those 5 iterations as an indicator of my model's performance? Would this be a better evaluation of my model?

new_model = DecisionTreeRegressor(max_depth=9, min_samples_split=2)

crossvalidation_Decision_Trees = KFold(n_splits=5, shuffle=True)
model2 = new_model.fit(X_normalized, y_for_normalized)

scores_D_Trees = cross_val_score(model2, X_normalized, y_for_normalized,
                                 scoring='r2', cv=crossvalidation_Decision_Trees, n_jobs=1)

print("\n\nDecision Trees" + ": R^2 for every fold: " + str(scores_D_Trees))

print('\033[1m' + "Decision Trees" + '\033[1m' + ": Average R^2 for all the folds: "
      + str(np.mean(scores_D_Trees)) + '\033[0m' + ", STD: " + str(np.std(scores_D_Trees)))
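(For reference, a minimal sketch of this second, repeated-CV idea using scikit-learn's RepeatedKFold, which repeats k-fold CV with a different random split on each repetition; n_repeats=5 here stands in for the 5 iterations mentioned above, and X_normalized / y_for_normalized are the same variables as in my snippets.)

from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
import numpy as np

new_model = DecisionTreeRegressor(max_depth=9, min_samples_split=2)

# 5 folds, repeated 5 times, each repetition with a different shuffle
repeated_cv = RepeatedKFold(n_splits=5, n_repeats=5)

# 25 scores in total (5 folds x 5 repetitions)
scores = cross_val_score(new_model, X_normalized, y_for_normalized,
                         scoring='r2', cv=repeated_cv, n_jobs=1)

print("Average R^2 over all folds and repetitions:", np.mean(scores))
print("STD:", np.std(scores))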

OR: Is either of these approaches acceptable?

Note: Let's ignore hyperparameter tuning for now.

Z47

1 Answer


The scikit-learn team has written extensive tutorials on how to do cross-validation well. You might want to give GridSearchCV a try: you can use it to cross-validate one model or many. This will also make your code straightforward to extend when you want to use cross-validation for model selection/hyperparameter tuning, as in your previous question.

Selecting dimensionality reduction with Pipeline and GridSearchCV
Cross-validation on diabetes Dataset Exercise
Comparing randomized search and grid search for hyperparameter estimation

from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
model = DecisionTreeRegressor()

# Default parameters
params = {}

# or

# Hard-coded parameters
params = {
    "max_depth": [9],
    "min_samples_split": [2],
}

n_splits = 10

# The KFold default is no shuffling,
# so we explicitly turn shuffling on.
cv = GridSearchCV(
    model,
    params,
    scoring="r2",
    cv=KFold(n_splits, shuffle=True),
)
cv.fit(X, y)

# Average R-squared across the k folds
cv.cv_results_["mean_test_score"]

# Standard deviation of the R-squared
cv.cv_results_["std_test_score"]

dipetkov
  • Thank you. I've used RandomizedSearchCV and GridSearchCV with CV in each, but I haven't included that part of my code in the question. I was hoping for more of a straightforward answer to my question, like which approach, 1 or 2 or both, is acceptable? – Z47 Aug 11 '22 at 22:39
  • I think GridSearchCV is a better option than either approach 1 or 2, so that's why I wrote the answer. – dipetkov Aug 11 '22 at 22:45
  • Let's assume GridSearchCV is not an option; may I know which approach you would go with, and why? – Z47 Aug 11 '22 at 22:47
  • As the comment says, this is the average R-squared across the 10 folds. You can also get the scores for each fold. You will need to actually read the documentation to learn how to use these tools. – dipetkov Aug 12 '22 at 00:53
  • Thank you. I've figured that out; that's why I've deleted my comment. The thing is, now my R^2 is -216, and using k-fold it was 0.42! – Z47 Aug 12 '22 at 01:00
  • Oversight on my part: I didn't show you how to shuffle the data, as scikit-learn KFold doesn't shuffle by default. You have to know how to turn it on. The most efficient way to learn this is reading the documentation. – dipetkov Aug 12 '22 at 19:25
  • I think I have shuffle=True in the code I posted. – Z47 Aug 12 '22 at 19:33
  • It matters where the shuffle=True is added. Anyway, I think the best use of your time will be to learn enough scikit-learn. And the best way to achieve this is to read the docs and do the tutorials. – dipetkov Aug 12 '22 at 19:38
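A minimal sketch of that last distinction, reusing model, params, X and y from the answer's snippet above: passing an integer to cv makes GridSearchCV use a plain, unshuffled KFold internally, whereas passing a KFold object is where shuffle=True actually takes effect.

# Integer cv: GridSearchCV builds an unshuffled KFold internally,
# which can hurt badly on ordered data.
unshuffled = GridSearchCV(model, params, scoring="r2", cv=10).fit(X, y)

# KFold object: the shuffle happens inside the CV splitter itself.
shuffled = GridSearchCV(
    model, params, scoring="r2",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
).fit(X, y)

print(unshuffled.cv_results_["mean_test_score"],
      shuffled.cv_results_["mean_test_score"])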