Coming from statistics, I'm just starting to learn machine learning. I've read a lot of ML tutorials, but have no formal training.
I'm working on a little project where my dataset has about 6k rows and around 300 features.
Following what I've read in my tutorials, I split my dataset into a training sample (80%) and a testing sample (20%), and then train my algorithm on the training sample with 5-fold cross-validation (roughly as sketched below).
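To make it concrete, my setup looks roughly like this (a minimal sketch; data.csv and the outcome column are just placeholders for my actual data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("data.csv")      # placeholder file: ~6k rows, ~300 features
X = df.drop(columns="outcome")    # "outcome" is a placeholder label column
y = df["outcome"]

# 80% training sample / 20% testing sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 5-fold cross-validation on the training sample only
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_train, y_train, cv=5)
print(scores.mean(), scores.std())
```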
When I re-ran my program a couple of times (I've only tested KNN, which I now know is not really appropriate), I got quite different results, with different sensitivity, specificity and precision each time.
I guess that if I keep re-running the program until the metrics look good, my algorithm will end up overfitted, and I also guess the variation comes from re-sampling the training/testing split on every run, but please correct me if I'm wrong.
If I'm going to try a lot of algorithms to see what I can get, should I fix my samples somewhere? Is it even OK to try several algorithms? (It would not always be acceptable in statistics.)
In case it matters, I'm working with Python's scikit-learn module.
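If fixing the samples is the right thing to do, I imagine it would be something like passing a fixed random_state so the split is identical on every run (a sketch, reusing the placeholder X and y from above):

```python
from sklearn.model_selection import train_test_split

# A fixed random_state makes the split reproducible, so every algorithm
# I try later is evaluated on exactly the same training/testing samples;
# stratify keeps the class balance identical in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```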
PS: my outcome is binary and my features are mostly binary, with a few categorical and a few numeric ones. I'm thinking about logistic regression, but which algorithm would be the best one?
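The kind of comparison I have in mind would look roughly like this (again just a sketch, using the placeholder training sample from above; "recall" here is what I call sensitivity):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Same fixed training sample and the same 5 folds for every candidate,
# so the algorithms can be compared on equal footing.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```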
Follow-up comments:

"… `model_selection.cross_validate`) on the two former ones to auto-select the best K hyperparameter (hence having only 2 populations is OK). I'm a bit confused here, should I do cross-validation on the test sample too? With `model_selection.cross_validate` too?" – Dan Chaltiel Mar 15 '19 at 22:16

"… `model_selection.train_test_split` in scikit) will be called each day. Is this an issue? If yes, is there an optimized method to deal with it, or should I just save the result as `train.csv` and `test.csv`?" – Dan Chaltiel Mar 16 '19 at 10:53
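A sketch of what these comments describe, assuming GridSearchCV is used for the K selection and plain CSV files for persisting the split (the column and file names are just examples, and X_train/X_test are pandas DataFrames as in the sketches above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Auto-select the best K on the training sample with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 31))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# Persist the split so a script that runs every day reuses the same samples
X_train.assign(outcome=y_train).to_csv("train.csv", index=False)
X_test.assign(outcome=y_test).to_csv("test.csv", index=False)
```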