
I recently started working with sklearn, and I often find myself creating new features (e.g. with KBinsDiscretizer, with various encoders, and so on).

What I noticed, though, is that it is very difficult to create new features systematically (i.e. to have a general approach for creating them). For example, I was recently analyzing the Titanic dataset on Kaggle (https://www.kaggle.com/c/titanic/data).

Let's assume, for the sake of example, that I split the age into n bins and find that the bin containing ages 20 to 25 is much more likely to survive than the other age bins, while the other age bins are uncorrelated with the survival feature.

In this case it makes sense to split the age into bins, as the 20-25 age bin gives me more insight into the data.
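For concreteness, binning like this can be done with sklearn's KBinsDiscretizer; the ages below are made-up values, not rows from the actual Titanic data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical age column; values are illustrative only.
ages = np.array([[4.0], [22.0], [24.0], [35.0], [58.0], [71.0]])

# Split ages into 3 equal-width bins; each row becomes a one-hot bin indicator.
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
age_bins = binner.fit_transform(ages)

print(age_bins.shape)  # (6, 3): one indicator column per bin
```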

Once I start thinking like this, however, the possibilities are endless: how about creating an additional feature for the people who are aged 20 to 25 and come from a certain port? How about a feature for people who are aged 20 to 25, come from a certain port, have a certain social class, and are all male?
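The kind of crossed feature I mean could be built like this in pandas (the rows below are toy values mimicking the Titanic columns, not real data):

```python
import pandas as pd

# Toy rows with made-up values, using the Titanic column names.
df = pd.DataFrame({
    "Age": [22, 24, 30, 21],
    "Embarked": ["S", "C", "S", "C"],
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "male", "male"],
})

# One crossed feature: aged 20-25, embarked at port "C", third class, male.
df["young_C_3rd_male"] = (
    df["Age"].between(20, 25)
    & (df["Embarked"] == "C")
    & (df["Pclass"] == 3)
    & (df["Sex"] == "male")
).astype(int)

print(df["young_C_3rd_male"].tolist())  # [0, 0, 0, 1]
```

Every extra condition I cross in multiplies the number of such candidate features, which is exactly the explosion I am worried about.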

From some examples online (on this dataset and others), the general approach seems to be to look at the data visually (charts, etc.) and then think about which new features to create. But is there a general way of creating new features without ending up with a dataset that is far bigger than the original one?

Lorenzo

1 Answer


I am afraid that your question has no single good answer. There is no "best" or universal way of creating features for a machine learning model. It is usually done by combining your experience, domain knowledge, and the results of exploratory data analysis on the available data (what you mentioned).

For example, say that there is a disease that is known to progress with age differently for males and females. Knowing this, you could add an interaction age_gender = age * gender to your features. This is something you would learn from your domain knowledge or from interviewing domain experts. It may or may not be immediately visible from exploratory data analysis, especially if there are other contributing factors.
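A minimal sketch of such an interaction term, assuming gender is already encoded as 0/1 (the values are made up for illustration):

```python
import numpy as np

# Hypothetical feature columns; gender encoded as 0 (female) / 1 (male).
age = np.array([25.0, 40.0, 33.0, 60.0])
gender = np.array([0, 1, 1, 0])

# Interaction term: lets a linear model fit a different age slope per gender.
age_gender = age * gender

print(age_gender.tolist())  # [0.0, 40.0, 33.0, 0.0]
```

In a linear model, including age, gender, and age_gender together allows the effect of age to differ between the two groups.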

You mentioned binning the data as an example, but it is in fact quite a poor choice of feature, as you can learn from the What is the benefit of breaking up a continuous predictor variable? thread.

Finally, you may choose machine learning models that do feature engineering by themselves (like neural networks). Keep in mind, still, that even for such models creating good features may improve performance or make things easier, as you can learn from the Utility of feature-engineering: Why create new features based on existing features? thread.

If you are looking for further resources on this subject, there is a nice introductory book Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari.

Tim