When turning continuous values into discrete ones, should you use the mean of the observed data or the expected normal distribution?

Question

Recently I did a school project where, instead of predicting height as a continuous value, we turned it into three categories ('tall', 'average' and 'short') and predicted it via logistic regression using some other, mostly categorical, variables. We were told to use a first quartile of 175 and a third quartile of 185 to divide height into categories, as height follows a normal distribution from 170cm to 200cm. However, once I plotted the data we were given, it didn't seem to fit a normal distribution, as the vast majority of samples fall outside of what should be the bell curve.

My question is this: when analyzing problems like this and dividing a variable into categories, should you use the quartiles of the actual dataset you are observing? Or, if the variable is something that should follow a normal distribution, should you assume normal-distribution values for things like the mean and quartiles in order to get results that match the real world?
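To make the two options concrete, here is a minimal sketch in Python (standard library only). The heights are synthetic and bimodal, generated purely for illustration; the group means of 172 and 190, the spread of 3, and the helper names `categorize` and `shares` are all assumptions, not the course data or the course's method. It compares the fixed cut points of 175 and 185 with the quartiles computed from the sample itself.

```python
import random
from statistics import NormalDist, quantiles

# Synthetic, hypothetical bimodal heights in cm (NOT the course data).
random.seed(0)
heights = ([random.gauss(172, 3) for _ in range(500)]
           + [random.gauss(190, 3) for _ in range(500)])

# Normal distribution implied by quartiles at 175 and 185:
# Q1 = mu - z*sigma and Q3 = mu + z*sigma, where z is the 75th percentile
# of the standard normal (about 0.6745).
z75 = NormalDist().inv_cdf(0.75)
mu = (175 + 185) / 2
sigma = (185 - 175) / (2 * z75)
assumed = NormalDist(mu, sigma)

# Quartiles estimated from the observed sample.
q1, q2, q3 = quantiles(heights, n=4)


def categorize(height, low, high):
    """Label a height using two cut points (hypothetical helper)."""
    if height < low:
        return "short"
    if height > high:
        return "tall"
    return "average"


def shares(data, low, high):
    """Fraction of observations that falls into each category."""
    counts = {"short": 0, "average": 0, "tall": 0}
    for height in data:
        counts[categorize(height, low, high)] += 1
    return {label: n / len(data) for label, n in counts.items()}


fixed_shares = shares(heights, 175, 185)      # cut points from the task
empirical_shares = shares(heights, q1, q3)    # cut points from the sample

print("implied normal: mu =", round(mu, 1), "sigma =", round(sigma, 2))
print("sample quartiles:", round(q1, 1), round(q3, 1))
print("shares with fixed cut points:", fixed_shares)
print("shares with sample quartiles:", empirical_shares)
```

With sample quartiles the 'average' class holds about half of the observations by construction, while the fixed cut points can leave it nearly empty when the data are bimodal. That difference in class balance is what I am unsure how to handle.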

I apologize if my question seems stupid; I am pretty new to data science, and this task confused me.

Welcome to Cross Validated! Binning like this tends to be a bad idea (no matter how common it is). Consequently, it is hard to recommend a way to do this. // We recently had a question about this topic that has some good answers and links to others. — Dave, Oct 19 '22 at 00:37
@Dave Thank you for the link with resources. It was a great read, and although I didn't understand some parts of it, I feel like I learned more! The part that confused me is that the task description mentioned a normal distribution, while the plot looked more like a bimodal one. But after reading the questions you linked, I believe these were arbitrary choices made so as not to confuse us, since it is an introductory statistics course. Thank you again! — Euchidna, Oct 19 '22 at 02:01
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Oct 19 '22 at 02:33


0 Answers