2

I have a weather dataset containing four features that are continuous values. Temperature is almost normal, but precipitation is highly negatively skewed. In addition, wind speed and humidity are positively skewed! Performing a log transformation on all features somewhat improves the skewness of precipitation, but not of the other features. Can I perform a different transformation on each feature separately? For example, applying a log transformation to temperature and a cube root to precipitation. It should be noted that the features are independent and not correlated.

Asa Ya
  • 73
  • Are you really sure those features are independent? Precipitation and Humidity? – frank Aug 05 '22 at 09:07
  • Yes, they are independent. Their dependencies are not my issue. My objective in asking this question is: can I transform each column according to its skewness? For example, are np.log(df['Precipitation']) and np.sqrt(df['Temperature']) acceptable or not? – Asa Ya Aug 05 '22 at 11:24

2 Answers

0

It is absolutely standard to perform a transformation of your features and then fit a model to those transformed features. Authors like Bishop even use the name "linear regression" for this combination of transformation and fitting a linear model; see, e.g., his book:

Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.

So, in particular, if your transformations np.log(df['Precipitation']) and np.sqrt(df['Temperature']) satisfy the assumptions of your model better than the original features, you should apply them.
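As a minimal sketch of this (assuming a pandas DataFrame named df with the four columns mentioned in the question; the column names, simulated data, and the choice of transform per column are illustrative only):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the weather data described in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Temperature":   rng.normal(20, 3, 500),              # roughly symmetric
    "Precipitation": 100 - rng.lognormal(1.0, 1.0, 500),  # negatively skewed
    "WindSpeed":     rng.lognormal(1.5, 0.6, 500),        # positively skewed
    "Humidity":      rng.lognormal(3.5, 0.4, 500),        # positively skewed
})

# A different transformation per column, chosen by each column's skew.
transformed = pd.DataFrame({
    "Temperature":   np.sqrt(df["Temperature"]),
    # Reflect before logging to handle the negative skew.
    "Precipitation": np.log(df["Precipitation"].max() + 1 - df["Precipitation"]),
    "WindSpeed":     np.log(df["WindSpeed"]),
    "Humidity":      np.cbrt(df["Humidity"]),
})

print(df.skew())           # skewness of the raw columns
print(transformed.skew())  # skewness after the per-column transforms
```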

frank
  • 10,797
0

First, note that there is no normality assumption about the features (in regression, the distributional assumption concerns the errors, not the predictors), so you might be transforming for nothing.

However, if you choose to transform, it is fine to use different transformations of different variables. The following is still a linear regression.

$$ \hat y=\hat\beta_0 + \hat\beta_1 \log(x_1) + \hat\beta_2 \sqrt[3]{x_2} $$
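A minimal sketch of fitting exactly that model (the names x1, x2, y and the simulated data are placeholders, not anything from the question):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.lognormal(1.0, 0.5, 300)   # positive, skewed predictor
x2 = rng.lognormal(2.0, 0.7, 300)
y = 1.0 + 2.0 * np.log(x1) + 0.5 * np.cbrt(x2) + rng.normal(0, 0.1, 300)

# The design matrix holds the transformed features; the model is still
# linear in its coefficients, so it is still a linear regression.
X = np.column_stack([np.log(x1), np.cbrt(x2)])
fit = LinearRegression().fit(X, y)
print(fit.intercept_, fit.coef_)    # estimates of beta_0, beta_1, beta_2
```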

If you have theoretical reasons to care about the logarithm and cube root, or if those transformations improve your performance, then perhaps those transformed variables are the features to input into your regression.

Jeffrey Miller’s “mathematicalmonk” channel on YouTube has a good video about this topic. He takes it a step further and allows for transformations involving multiple variables, such as using $\sqrt{x_1x_2}$ as a feature.
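For example (a small sketch with made-up variable names), such a combined transformation just becomes one more column in the design matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 200)
x2 = rng.uniform(1, 10, 200)
y = 3.0 * np.sqrt(x1 * x2) + rng.normal(0, 0.2, 200)

# One feature built from two raw variables.
X = np.sqrt(x1 * x2).reshape(-1, 1)
print(LinearRegression().fit(X, y).coef_)   # roughly [3.0]
```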

Dave
  • 62,186