
I'm running an ML algorithm on some data, and I noticed that if I change the random state inside the train_test_split function, the accuracy score changes over a fairly wide range.

For example, with random state = 4, I get an accuracy score that varies from 0.78 to 0.8 (depending on the seed in the algorithm). With another value, like 42, it drops to 0.65 - 0.69.

I don't have duplicates in the dataset, and the task is a multi-class text classification.

I really don't understand this behaviour. Is there an explanation?

Thanks.

You have $560$ total observations. Harrell has noted that splitting the data the way you have often leads to instability in the performance metrics until the sample size reaches $20000$ (this is probably in his Regression Modeling Strategies textbook with references to the primary literature, and he has written this sort of comment on here [1, 2] and on his blog [3], too). Consequently, your results, even if disappointing, are not surprising.

You might be curious to run your code through something like this:

import numpy as np

# Loop over 500 seeds
for i in range(500):
    # Set a new seed
    np.random.seed(i)

    # Run the rest of your code to split the data, train the model,
    # and report the performance for seed i

This will try many seeds and will likely show a strong dependence of accuracy on the seed. The interpretation is that your performance in production is subject to major variability, and the single number you get from one seed is not trustworthy.
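As a concrete sketch of what such a loop tends to reveal, here is a self-contained NumPy-only version. It builds a small synthetic 560-observation multi-class dataset, re-splits it 80/20 with a different seed each iteration, and scores a trivial nearest-centroid classifier. The dataset and classifier are hypothetical stand-ins for your pipeline; only the seed-to-seed spread of the accuracy is the point.

```python
import numpy as np

# Hypothetical synthetic dataset: 560 observations, 3 classes,
# weakly separable features (stand-in for your real text features).
rng_data = np.random.default_rng(0)
n, n_classes = 560, 3
y = rng_data.integers(0, n_classes, size=n)
X = rng_data.normal(size=(n, 5)) + y[:, None]

accuracies = []
for seed in range(100):  # 100 seeds to keep the sketch quick
    # Re-split the data with this seed (80% train, 20% test)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(0.8 * n)
    train, test = idx[:cut], idx[cut:]

    # "Train": one mean vector (centroid) per class on the training split
    centroids = np.stack([X[train][y[train] == c].mean(axis=0)
                          for c in range(n_classes)])

    # "Predict": assign each test point to the nearest centroid
    dists = np.linalg.norm(X[test][:, None, :] - centroids[None], axis=2)
    pred = dists.argmin(axis=1)
    accuracies.append((pred == y[test]).mean())

accuracies = np.array(accuracies)
print(f"min={accuracies.min():.3f}  max={accuracies.max():.3f}  "
      f"sd={accuracies.std():.3f}")
```

Even with a model this simple, the accuracy differs noticeably between seeds at this sample size, which is exactly the instability Harrell describes.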

Dave