
I am currently trying to run a linear regression model to identify the effect of several explanatory variables X on a response variable Y. My advisor has asked me to scale the continuous explanatory and control variables by standardization or normalization, anticipating that this will improve the regression results, but without giving details.

From my understanding, scaling only affects the size of the coefficients, not their significance (the t-statistics) or the R-squared. Unlike a log transformation applied to obtain a roughly normal distribution, it does not even change the distribution of the variable. Thus, scaling would only make the coefficients harder to interpret. If I were preprocessing the data for a machine learning algorithm (say, a classifier) that ranks the importance of each explanatory variable, scaling would make sense. However, I am more interested in explaining the causal effect of X on Y than in prediction with machine learning.

In this case, is it necessary to use standardization or normalization for my regression model?

J.K.
  • It depends on the algorithm. Some procedures automatically perform the standardization, or its equivalent, under the hood, rendering your efforts superfluous. When the ranges of the variables are vastly different (this often happens when including a date measured in seconds), crude algorithms can experience numerical instabilities, so it is not a bad practice to choose suitable units of measurement for all your variables that place each of them on a readily interpretable scale (see the sketch after these comments). There's never a need to transform any of your variables into quantities that are difficult to interpret. – whuber Dec 05 '22 at 22:08
  • See https://stats.stackexchange.com/a/563207/919, https://stats.stackexchange.com/a/508961/919, and https://stats.stackexchange.com/a/202209/919 for additional comments. – whuber Dec 05 '22 at 22:09
  • @whuber Thank you very much for your comments! Do you possibly have any idea what standardization of variables by groups (specified by a categorical variable) achieves in terms of the significance of the regression model? – J.K. Dec 05 '22 at 22:21
  • It depends on what the model is and what you might mean by such standardization. I am having some difficulty imagining a situation in which a varying form of "standardization" would be either a simplification or helpful, but I'm not familiar with every possible application of statistics by any means. – whuber Dec 06 '22 at 15:43
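
For illustration, here is a minimal sketch of the kind of unit choice suggested in the comments (all variable names and numbers are hypothetical): re-expressing a timestamp recorded in seconds as years since a reference date, and income in thousands, keeps every regressor directly interpretable while avoiding the huge differences in numeric range that can destabilize naive fitting routines.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a timestamp in seconds since the Unix epoch
# spans ~10^9, dwarfing the other regressor.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date_seconds": rng.uniform(1.5e9, 1.7e9, size=200),
    "income": rng.normal(50_000, 10_000, size=200),
})

# Re-express the date as years since 2017-01-01 and income in thousands:
# both variables stay directly interpretable, and their numeric ranges
# are no longer many orders of magnitude apart.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
df["years_since_2017"] = (df["date_seconds"] - 1.4832e9) / SECONDS_PER_YEAR
df["income_k"] = df["income"] / 1_000
print(df[["years_since_2017", "income_k"]].describe())
```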

1 Answer


Standardization for causal inference is a tricky business

This is an amazing question because it raises an unresolved issue at the heart of causal inference: what is a causally meaningful level of representation? The representation includes the choice of variables and, as in your case, a suitable measurement scale. Standardization amounts to assuming that a re-scaled version of the measured data is more suitable. This may be justifiable for predictive tasks, but it becomes much trickier in the context of causal inference. The scale of the variables may contain useful information, but it can also distort your results.

Information in the data scale

From a statistical perspective, you are right in observing that standardization (in linear regression) affects only the coefficient magnitudes, not statistical significance. So if your method of causal inference relies only on significance, this is not a problem in the first place. However, the data scale (if it is known, as can be the case with count data, for example) may hold useful information about the relationship between variables. This fact was exploited, for example, by the winning submission to a competition on finding causal links.
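
As a quick check of the invariance claim, here is a minimal sketch (assuming numpy and statsmodels are available; the data are simulated, so all numbers are purely illustrative): standardizing a regressor rescales its coefficient, but leaves the t-statistic and the R-squared unchanged.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
x = rng.normal(100, 20, size=n)          # regressor on its raw scale
y = 0.5 * x + rng.normal(0, 10, size=n)  # illustrative linear relationship

x_std = (x - x.mean()) / x.std()         # standardized version of x

fit_raw = sm.OLS(y, sm.add_constant(x)).fit()
fit_std = sm.OLS(y, sm.add_constant(x_std)).fit()

# The coefficients differ by the factor sd(x), but the t-statistics
# and R^2 are identical.
print("raw:  beta=%.3f  t=%.2f  R2=%.3f"
      % (fit_raw.params[1], fit_raw.tvalues[1], fit_raw.rsquared))
print("std:  beta=%.3f  t=%.2f  R2=%.3f"
      % (fit_std.params[1], fit_std.tvalues[1], fit_std.rsquared))
```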

The promise and pitfalls of scale-sensitivity

At the same time, attempting to use information in the data scale may change results in unintended ways. If a suitable data scale is not known, variables with scale-dependent properties such as high variance may come to dominate the results (as would be the case, e.g., for penalized regression; see the sketch below). Recent work on synthetic data generated from causal models shows that many such models produce data with strong scale patterns, and that these patterns dominate the performance of algorithms estimating causal structure and influence between variables. This might be a good thing if there were genuine information in the data scale, but in practice that is hard to be sure of. After all, many scales are arbitrary: who is to say whether we should measure in meters or millimeters, pounds or kilograms, bitcoin or USD, and so on.
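
To illustrate the scale sensitivity of penalized regression, here is a small sketch using scikit-learn's Ridge (again with simulated data, so the setup is purely illustrative): two equally informative predictors are shrunk very differently when one of them is recorded on a scale a thousand times larger, and the asymmetry disappears after standardization.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = z1 + z2 + rng.normal(scale=0.5, size=n)   # both predictors matter equally

# Same information, different units: the second column is z2 recorded on a
# scale 1000 times larger (e.g. millimeters instead of meters).
X_raw = np.column_stack([z1, 1000 * z2])
X_std = StandardScaler().fit_transform(X_raw)

ridge = Ridge(alpha=500.0)
coef_raw = ridge.fit(X_raw, y).coef_
coef_std = ridge.fit(X_std, y).coef_

# Compare the effect of a one-standard-deviation change in each predictor.
# On the raw scale the penalty shrinks the small-scale predictor's effect
# strongly while leaving the large-scale predictor almost untouched; after
# standardization both predictors are shrunk symmetrically.
print("raw fit, effect per 1 SD:   ", coef_raw * X_raw.std(axis=0))
print("standardized fit, per 1 SD: ", coef_std)
```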

Take-away

Whether or not you should standardize depends on your method and your domain. If you are confident that you know the right scale and that the raw coefficients may carry information, then keeping the original scale may give you additional insight. If you do not know the right scale, or are not trying to use it, standardization is a good way to reduce the risk that the data scale changes your results in unintended ways.

Scriddie