Say I have a continuous Age column, and I then derive a new ordinal feature, Age Groups, like so:

Children (00-14)
Youth    (15-24)
Adults   (25-64)
Seniors  (65 and over)
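
To make the setup concrete, here is a minimal sketch of how such bins could be turned into ordinal codes. The sample ages, the names `EDGES`, `LABELS`, and `age_group_code`, and the choice of integer codes 0 to 3 are assumptions for illustration only, not part of my actual data set.

```python
from bisect import bisect_right

# Lower edges of Youth, Adults and Seniors; ages below 15 fall into Children.
EDGES = [15, 25, 65]
LABELS = ["Children", "Youth", "Adults", "Seniors"]


def age_group_code(age):
    """Return the ordinal code (0..3) of the age group an age belongs to."""
    # bisect_right counts how many lower edges are <= age.
    return bisect_right(EDGES, age)


# Hypothetical sample values that sit on both sides of every bin edge.
ages = [3, 14, 15, 24, 25, 64, 65, 90]
codes = [age_group_code(a) for a in ages]
groups = [LABELS[c] for c in codes]

for age, code, group in zip(ages, codes, groups):
    print(age, code, group)
```

Integer codes keep the ordering of the groups, which is what allows the column to appear in a correlation matrix at all.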

When using the new Age Groups column in a correlation matrix during feature selection, should I remove the original Age column before computing the matrix? I've already noticed that if I keep both columns, a strong positive or negative correlation between the two can appear.

Since I'm just looking at the graph, can I simply ignore any correlations involving the Age column and only look at how Age Groups relates to the target class and all the other inputs?

Would having both of those inputs in the same matrix create multicollinearity issues among the other inputs, or can I simply ignore any correlations involving Age because it's only a graph? Do I just have to make sure Age is removed before modeling?
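
One way to check this empirically is sketched below. It relies on the fact that each entry of a Pearson correlation matrix is computed from one pair of columns only, so the other entries should not change when Age is dropped. The simulated ages, the unrelated `Income` column, and the helper names `pearson` and `corr_matrix` are all hypothetical.

```python
import random
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def corr_matrix(cols):
    """Pairwise correlations as a dict keyed by (column, column)."""
    return {(a, b): pearson(cols[a], cols[b]) for a in cols for b in cols}


random.seed(0)
age = [random.uniform(0, 90) for _ in range(1000)]      # simulated ages
age_group = [bisect_right([15, 25, 65], a) for a in age]  # codes 0..3
income = [random.gauss(0, 1) for _ in age]              # hypothetical unrelated input

columns = {"Age": age, "AgeGroup": age_group, "Income": income}
full = corr_matrix(columns)
without_age = corr_matrix({k: v for k, v in columns.items() if k != "Age"})

# Age and AgeGroup are strongly correlated by construction.
print("Age vs AgeGroup:", round(full[("Age", "AgeGroup")], 3))
# The AgeGroup vs Income entry is identical with or without Age in the matrix.
print(full[("AgeGroup", "Income")] == without_age[("AgeGroup", "Income")])
```

This only covers the plot. A regression model fitted with both Age and Age Groups is a different matter, because there the coefficients are estimated jointly.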

Edison
  • You can do whatever you want, and correlation between features is not necessarily so bad. However, why do you want to bin the values like this? Binning tends to be discouraged, as links I expect others to post (I can’t find the one I like) will show. – Dave Jun 10 '22 at 00:44
  • Really? I thought binning, i.e. turning a continuous variable into an ordinal one (for experimental feature engineering), was normal? How else would you do feature engineering with an Age column, for example? And going back to my original question, what if there are some strong positive or negative correlations between Age and my new Age Group variable? – Edison Jun 10 '22 at 00:46
  • It is common; that doesn’t make it advisable. // What engineering of features do you find necessary? Why not take the raw ages? (There are alternatives, and how you answer why you don’t just take the raw ages will be informative.) // Please read my link about correlated features. – Dave Jun 10 '22 at 00:48
  • I know correlation isn't necessarily a bad thing. I'm just curious about multicollinearity and selection. If the EDA tells me that Age is insignificant, should I stop right there and forget Age ever existed in the universe? Or do I create features from Age e.g. Age Group, to see if I can make age data significant in modeling? – Edison Jun 10 '22 at 00:51
  • Again, why do you want to do anything with the age beyond including raw age as a predictor? // I absolutely can envision a scenario where raw age has an insignificant correlation with $y$ while your binned variable has a highly significant correlation with $y$. Imagine $y$ being low for children, medium for youth, high for adults, and very low for seniors. That would be minimal to no correlation between raw age and $y$, yet categorical bins would be able to model that. (However, there are alternatives, such as splines, that don’t destroy Shannon information in the age data.) – Dave Jun 10 '22 at 01:00
  • Because I thought that was one of the purposes of feature engineering: to find more significance in an input that may not have much significance on its own. For example, in a matrix plot or coef table Age is shown to have a low value, but then Age Group -> Youth has a higher value. If that's not one of the purposes of feature engineering, then why do it? Aren't we supposed to be creative and experiment? I don't even know what Shannon information is, so I'll Google it. – Edison Jun 10 '22 at 01:22
  • I’m not sure why this didn’t post the chat link. – Dave Jun 10 '22 at 02:28
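
The non-monotone scenario described in the comments (low for children, medium for youth, high for adults, very low for seniors) can be sketched as follows. The target levels in `LEVELS`, the evenly spread ages 0 to 89, and the absence of noise are assumptions chosen for illustration; with other levels or another age distribution the numbers will differ. The sketch compares the squared linear correlation of raw age with $y$ against the correlation ratio $\eta^2$, the share of the variance of $y$ explained by the group means.

```python
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical noise-free target level per group:
# Children low, Youth medium, Adults high, Seniors very low.
LEVELS = [1.0, 2.0, 3.0, 0.0]

ages = list(range(90))                                 # one person per age 0..89
codes = [bisect_right([15, 25, 65], a) for a in ages]  # group codes 0..3
y = [LEVELS[c] for c in codes]

# Linear association between raw age and y.
r_raw = pearson(ages, y)

# Correlation ratio: between-group sum of squares over total sum of squares.
grand_mean = sum(y) / len(y)
group_means = {}
for c in set(codes):
    values = [v for v, g in zip(y, codes) if g == c]
    group_means[c] = sum(values) / len(values)
ss_total = sum((v - grand_mean) ** 2 for v in y)
ss_between = sum((group_means[g] - grand_mean) ** 2 for g in codes)
eta_sq = ss_between / ss_total

print("r^2 of raw age:", round(r_raw ** 2, 3))
print("eta^2 of age groups:", round(eta_sq, 3))
```

In this constructed case the groups explain all of the variance because $y$ depends on age only through the group, while a straight line in raw age explains only a small share of it. A spline in raw age, as suggested in the comments, could capture the same shape without discarding the within-group age information.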
