
When transforming variables, do you have to use the same transformation for all of them? For example, can I pick and choose differently transformed variables, as in:

Let $x_1, x_2, x_3$ be age, length of employment, and income.

$$Y = \beta_1\sqrt{x_1} + \beta_2\left(-\frac{1}{x_2}\right) + \beta_3\log(x_3)$$

Or must you be consistent and use the same transformation for all of them, as in:

$$Y = \beta_1\log(x_1) + \beta_2\log(x_2) + \beta_3\log(x_3)$$

My understanding is that the goal of transformation is to address the problem of non-normality. Looking at histograms of each variable, we can see that they have very different distributions, which leads me to believe that the required transformations will differ on a variable-by-variable basis.

## R Code
library(foreign)  # provides read.spss()
df <- read.spss(file="http://www.bertelsen.ca/R/logistic-regression.sav", 
                use.value.labels=TRUE, to.data.frame=TRUE)
# hist() works on a single vector, so draw one histogram per column
par(mfrow=c(2, 4))
for (v in names(df)[1:7]) hist(df[[v]], main=v, xlab=v)

[Figure: histograms of the first seven variables]

Lastly, how valid is it to transform variables using $\log(x_n + 1)$ when $x_n$ contains $0$ values? Does this transformation need to be applied consistently across all variables, or can it be used ad hoc, even for variables that do not include $0$'s?

## R Code 
plot(df[1:7])  # for a data frame with several numeric columns this draws a scatterplot matrix

[Figure: scatterplot matrix of the first seven variables]


## 1 Answer


One transforms the dependent variable to achieve approximate symmetry and homoscedasticity of the residuals. Transformations of the independent variables have a different purpose: after all, in this regression all the independent values are taken as fixed, not random, so "normality" is inapplicable. The main objective in these transformations is to achieve linear relationships with the dependent variable (or, really, with its logit). (This objective over-rides auxiliary ones such as reducing excess leverage or achieving a simple interpretation of the coefficients.) These relationships are a property of the data and the phenomena that produced them, so you need the flexibility to choose appropriate re-expressions of each of the variables separately from the others. Specifically, not only is it not a problem to use a log, a root, and a reciprocal, it's rather common. The principle is that there is (usually) nothing special about how the data are originally expressed, so you should let the data suggest re-expressions that lead to effective, accurate, useful, and (if possible) theoretically justified models.
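
To make this concrete, here is a minimal sketch (not part of the original answer) of how mixed re-expressions enter a single logistic regression in R; the column names purchased, age, employ, and income are hypothetical placeholders, not columns of the question's data.

    # Hypothetical sketch: each regressor gets its own re-expression,
    # chosen to straighten its relationship with the logit of the response.
    # 'purchased', 'age', 'employ', and 'income' are made-up column names.
    fit <- glm(purchased ~ sqrt(age) + I(-1/employ) + log(income),
               family = binomial, data = df)
    summary(fit)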

The histograms--which reflect the univariate distributions--often hint at an initial transformation, but are not dispositive. Accompany them with scatterplot matrices so you can examine the relationships among all the variables.
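
For instance (a small sketch that reuses the data frame loaded in the question and assumes its first seven columns are numeric), base R's pairs() with a smoother in each panel makes curvature easy to spot:

    # Scatterplot matrix of the first seven columns, with a lowess smooth
    # added to each panel to help reveal nonlinear relationships.
    pairs(df[1:7], panel = panel.smooth)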


Transformations like $\log(x + c)$ where $c$ is a positive constant "start value" can work--and can be indicated even when no value of $x$ is zero--but sometimes they destroy linear relationships. When this occurs, a good solution is to create two variables. One of them equals $\log(x)$ when $x$ is nonzero and otherwise is anything; it's convenient to let it default to zero. The other, let's call it $z_x$, is an indicator of whether $x$ is zero: it equals 1 when $x = 0$ and is 0 otherwise. These terms contribute a sum

$$\beta \log(x) + \beta_0 z_x$$

to the estimate. When $x \gt 0$, $z_x = 0$ so the second term drops out leaving just $\beta \log(x)$. When $x = 0$, "$\log(x)$" has been set to zero while $z_x = 1$, leaving just the value $\beta_0$. Thus, $\beta_0$ estimates the effect when $x = 0$ and otherwise $\beta$ is the coefficient of $\log(x)$.
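
A minimal sketch of this construction in R (with simulated data; the names x, y, log_x, and z_x are placeholders, not anything from the question):

    # Simulated nonnegative regressor containing exact zeros, plus a binary response.
    set.seed(1)
    x <- c(rep(0, 20), rexp(80, rate = 0.1))
    y <- rbinom(100, 1, plogis(-1 + 0.5 * ifelse(x > 0, log(x), 0) + 1.5 * (x == 0)))

    log_x <- ifelse(x > 0, log(x), 0)  # log(x), defaulting to 0 where x == 0
    z_x   <- as.numeric(x == 0)        # indicator: 1 when x == 0, else 0

    # The coefficient of log_x plays the role of beta; the coefficient of z_x
    # is beta_0, the estimated effect when x = 0.
    fit <- glm(y ~ log_x + z_x, family = binomial)
    summary(fit)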

  • Very helpful description, thanks for the direction and the detail on my subquestion as well. – Brandon Bertelsen Nov 23 '10 at 20:57
  • http://pareonline.net/getvn.asp?v=15&n=12 Osborne (2002) recommends anchoring the minimum value in a distribution at exactly 1.0. http://pareonline.net/getvn.asp?v=8&n=6 – Chris Jun 17 '15 at 22:22
  • @Chris Thank you for the interesting references. I think the reasoning is invalid, though. There is no mathematical basis to the assertion that "similar to square root transformations, numbers between 0 and 1 are treated differently than those above 1.0. Thus a distribution to be transformed via this method should be anchored at 1.00." The logarithm doesn't suddenly change its behavior at $1$! For an example of a more nuanced and statistically appropriate approach to this question, see my answer at http://stats.stackexchange.com/a/30749. – whuber Jun 17 '15 at 22:36
  • The authors are pointing out that log(0.99) is negative while log(1.01) is positive... I will read your post. :) – Chris Jun 17 '15 at 22:41
  • @Chris All Box-Cox transformations transition from negative to positive at $1$, too. That's irrelevant for a nonlinear transformation, though, because it can be followed up by any linear transformation without changing its effects on variance or linearity of a relationship with another variable. Thus, if your client is allergic to negative numbers, just add a suitable constant after the transformation. Adding the constant before the transformation, though, can have a profound effect--and that's why no recommendation always to use $1$ could possibly be right. – whuber Jun 17 '15 at 22:44
  • In one of my datasets that I am working on, I noticed that if I shifted the dependent response variable to anchor at 1 and used a Box-Cox transformation to eliminate the skew, the resulting transformation was weakened, lending credence to your critique. ;) – Chris Jun 17 '15 at 23:07
  • @whuber My previous question was very silly (will probably delete comment). Of course $\beta_0$ pertains to the $z_x$ dummy indicator, and NOT to the constant in the model. Thank you again for the extensive and clear explanations of this setup; very helpful for my work. Overall I prefer this parametrization as opposed to this other, equivalent approach. – landroni Jul 07 '15 at 18:37
  • @whuber "The main objective in these transformations is to achieve linear relationships with the dependent variable [...]. (This objective over-rides auxiliary ones such as [...] achieving a simple interpretation of the coefficients.)" This is also one of the central messages coming from CAR... While I am starting to understand the need for transforming data, I'm still a bit unclear on how to approach the interpretation of resulting coefficients. With log it's simple, as it allows transformation while maintaining relatively straightforward interpretations. – landroni Jul 08 '15 at 08:16
  • @whuber But, to take an example from my real data, what if your variable requires e.g. a Yeo-Johnson transformation with lambda=-2 (cf symbox() and yjPower() in car) to achieve symmetry and near-normality? How do you approach interpretation of the associated coefficients then? Such transformations seem to limit comments strictly to the sign and significance of the estimated effect... (In CAR Fox and Weisberg advocate for the use of effects displays, especially when dealing with models containing transformations and interactions, but I haven't quite gotten the hang of them yet.) – landroni Jul 08 '15 at 08:17
  • @landroni If the data show that (say) a log transform produces a linear relationship between two variables, then that is useful and insightful information. If you were not familiar with the logarithm, you would find that transformation is difficult to interpret, too. But that's a subjective issue--it says nothing about the data. Interpretability is a matter of familiarity. That means understanding the mathematical nature of the transformation. It eventually comes with study and experience. – whuber Jul 08 '15 at 12:56
  • Thanks so much for this clear answer! I am implementing this approach in an analysis I'm conducting, but I find I'm seeing very high multicollinearity as a result; the correlations between the zero-indicator variables and the variables indicating the (log transformed) value of that same variable above 0 are ranging .91 to .99 in my data. Should I just not attempt to interpret the resulting coefficients individually? – Rose Hartman Apr 02 '17 at 02:20
  • @Rose This tends to be a problem with binary variables that are highly skewed (so that one value predominates). The multicollinearity can create large standard errors of the coefficient estimates, making it unreliable to interpret the coefficients. How to deal with that depends on the amount of data you have, the purpose of the regression, theories about the relationships, and much more. – whuber Apr 02 '17 at 17:52
  • @whuber -- how does your "two variables" method with an indicator relate to zero-inflated models? Seems very similar. – abalter Sep 03 '19 at 02:31
  • @abalter They share some features. One difference is that zero-inflated models concern the conditional response whereas the models discussed here concern the regressor variables. This is an important distinction, because the former is random while the latter is determinate. – whuber Sep 03 '19 at 11:32