I have been playing around with the Titanic data set from Kaggle. The aim of this data set is to predict whether a person survived, given the other variables. I tried a few methods of encoding a person's title (Mr, Sir, Dr, etc.), including integer and one-hot encoding, but I got the highest accuracy when I used integer encoding with the title numbers ordered by survival rate, e.g.
Title   %_survived   integer_encoding
Mr      20           0
Dr      40           1
Mrs     75           3
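I assumed this worked because ordering the categories by survival rate gives the encoded feature a roughly positive, monotonic relationship with survival, which lets a linear model such as linear regression capture a pattern it could not fit with an arbitrary ordering. An encoding with a negative correlation also improved the accuracy, but by a much smaller amount.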
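To make the encoding concrete, here is a minimal sketch of what I mean, assuming pandas. The data frame is a toy example invented for illustration (it is not the real Titanic data), and the column names `Title`, `Survived` and `Title_encoded` are just placeholders. With only three titles the codes run 0 to 2.

```python
import pandas as pd

# Toy data, invented for illustration only (not the real Titanic rows).
df = pd.DataFrame({
    "Title": ["Mr"] * 5 + ["Dr"] * 5 + ["Mrs"] * 4,
    "Survived": [1, 0, 0, 0, 0,   # Mr:  1 of 5 survived
                 1, 1, 0, 0, 0,   # Dr:  2 of 5 survived
                 1, 1, 1, 0],     # Mrs: 3 of 4 survived
})

# Survival rate per title, sorted from lowest to highest.
rates = df.groupby("Title")["Survived"].mean().sort_values()

# Each title gets its rank in that ordering as its integer code.
title_to_code = {title: code for code, title in enumerate(rates.index)}

df["Title_encoded"] = df["Title"].map(title_to_code)
print(title_to_code)
```

The codes only carry the order of the survival rates, not their sizes, so two titles with very different rates can still end up one step apart.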
I noticed that age did not have a consistently negative relationship with survival, as survival rates dropped for ages 6-12. I tried the same ordered encoding again, treating each year of age as its own group and ordering the groups by percentage survived. This again increased accuracy. I am a little confused as to how this worked for continuous data. Is it just a quirk of the data set and models I'm using, or is this a reliable way of pre-processing data?
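For reference, the per-year version looks roughly like the sketch below, again assuming pandas and using invented toy data with placeholder names (`train`, `test`, `Age_encoded`). Because the mapping is built from the survival labels, this sketch learns it from the training rows only and then applies it to the held-out rows; an age that never appears in training simply has no code.

```python
import pandas as pd

# Toy training and held-out data, invented for illustration only.
train = pd.DataFrame({
    "Age":      [3, 3, 8, 8, 8, 30, 30, 30, 30],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0],
})
test = pd.DataFrame({"Age": [3, 8, 30, 55]})

# Survival rate for each distinct age (one group per year), lowest first.
year_rates = train.groupby("Age")["Survived"].mean().sort_values()

# Rank of each age in that ordering becomes its code.
age_to_code = {int(age): code for code, age in enumerate(year_rates.index)}

train["Age_encoded"] = train["Age"].map(age_to_code)
# Ages not seen in training (55 here) map to NaN and need separate handling.
test["Age_encoded"] = test["Age"].map(age_to_code)
```

With one group per year, many groups contain only a handful of passengers, so their survival rates (and therefore their ranks) are estimated from very few rows.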