Performing ordered data encoding gives more accurate results than integer encoding or one-hot encoding

Question

I have been playing around with the Titanic data set from Kaggle. The aim of this data set is to predict whether a person will survive, given the other variables. I tried a few methods of encoding a person's title (Mr, Sir, Dr, etc.), including integer and one-hot encoding, but I got the highest accuracy when I used integer encoding with the title numbers ordered by survival rate, e.g.

Title    %_survived    integer_encoding
Mr        20           0
Dr        40           1
Mrs       75           3

I assumed this worked because it helps an algorithm like linear regression fit an otherwise non-linear relationship when the encoded values have a roughly positive correlation with survival. An ordering with a negative correlation also improved the accuracy, but by a much smaller amount.
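For concreteness, here is a minimal sketch of the kind of ordered encoding I mean, assuming pandas. The rows are toy data invented to reproduce the survival percentages in the table above, not the real Kaggle file, and the column name `Title_ordered` is just a placeholder.

```python
import pandas as pd

# Toy rows (invented for illustration): Mr 1/5, Dr 2/5, Mrs 3/4 survived.
df = pd.DataFrame({
    "Title":    ["Mr"] * 5 + ["Dr"] * 5 + ["Mrs"] * 4,
    "Survived": [1, 0, 0, 0, 0,  1, 1, 0, 0, 0,  1, 1, 1, 0],
})

# Mean survival per title, sorted from lowest to highest.
rates = df.groupby("Title")["Survived"].mean().sort_values()

# Assign consecutive integer codes in that order.
order = {title: code for code, title in enumerate(rates.index)}

# Replace each title with its survival-ordered code.
df["Title_ordered"] = df["Title"].map(order)
```

With only these three titles the codes come out consecutive; on the full data set other titles would fall in between.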

I noticed that age did not have a completely negative correlation with survival, as survival rates dropped for ages 6-12. I tried using this ordered encoding again, treating each year of age as a group and ordering the groups by percentage survived. This again increased accuracy. I am a little confused as to how this worked for continuous data. Is it just a quirk of the data set and models I'm using, or is this a reliable way of pre-processing data?
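Roughly, the age version looks like the sketch below, again assuming pandas and toy values made up for illustration. Truncating age to a whole year with `astype(int)` is my assumption about how to form the yearly groups, and `AgeYear` / `Age_ordered` are placeholder names.

```python
import pandas as pd

# Toy rows (invented for illustration), not real passenger data.
ages = pd.DataFrame({
    "Age":      [2.0, 2.5, 8.0, 8.0, 30.0, 30.0, 30.0],
    "Survived": [1,   1,   0,   1,   0,    0,    1],
})

# Treat each whole year of age as its own group.
ages["AgeYear"] = ages["Age"].astype(int)

# Mean survival per year, sorted from lowest to highest.
year_rates = ages.groupby("AgeYear")["Survived"].mean().sort_values()

# Replace each year with its rank in that survival ordering.
year_code = {year: code for code, year in enumerate(year_rates.index)}
ages["Age_ordered"] = ages["AgeYear"].map(year_code)
```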

This sounds like target encoding https://stats.stackexchange.com/questions/490810/target-encoder-for-logistic-regression maybe add that tag? — kjetil b halvorsen, Dec 01 '23 at 17:11
