
I'm running a logistic regression at the company level, using total assets to control for company size. Because the asset data are heavily skewed, I log-transform them. The log-transformed variable is not significant, yet the untransformed variable is highly significant (p < 0.01). What could explain such a result? Does it mean that the relationship does not diminish for extremely high and low values, or does it point to an outlier problem, so that the untransformed result should not be used for interpretation?

mpiktas
Andreas
  • With highly (positively) skewed data, a very small number of extremely high values strongly influence the regression. Until you perform diagnostic analyses of influence in the "significant" regression, its results have to be considered unreliable. – whuber Nov 07 '13 at 20:43
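
To make the comment above concrete, here is a minimal sketch of how much a single extreme value can dominate on the raw scale compared with the log scale. The asset figures are made up for illustration, and the leverage formula is the one for simple linear regression, used here only as a rough proxy; a real check would use the influence diagnostics of the fitted logistic model.

```python
import math

def leverages(x):
    """Simple-linear-regression leverage h_i = 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]

# Hypothetical total assets (in millions); one very large company.
assets = [1, 2, 3, 5, 8, 12, 20, 35, 60, 5000]

lev_raw = leverages(assets)
lev_log = leverages([math.log(a) for a in assets])

# Leverage of the largest company on each scale (1.0 is the maximum possible).
print(f"raw scale: {lev_raw[-1]:.3f}")
print(f"log scale: {lev_log[-1]:.3f}")
```

With these made-up numbers the largest company has leverage close to the maximum of 1 on the raw scale and a noticeably lower value after the log transformation, so a "significant" raw-scale coefficient may rest almost entirely on that one observation.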

2 Answers


I prefer not to treat this as a "choose between two transformations" problem; I would rather estimate the transformation as part of the modeling process. Doing so takes care of multiplicities (possible inflated type I error) by having a parameter in the model for everything we think might be needed. Consider expanding the predictor using a regression spline such as a restricted cubic spline (natural spline). Test for association by doing a "chunk" test of all the parameters that involve that predictor jointly. With a restricted cubic spline this test will have $k-1$ degrees of freedom, where $k$ is the number of knots (join points). Using default knots based on the marginal distribution of the predictor will work fine (this is how the rcs function in the R rms package does it).

Once you fit the spline model you can plot the predicted value vs. the predictor to learn about the estimated shape in the logistic model.
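
For readers who want to see where the $k-1$ degrees of freedom come from, here is a minimal sketch of a restricted cubic spline basis in its truncated-power form: one linear column plus $k-2$ nonlinear columns. It is not the rms implementation; the function name `rcs_basis`, the scaling by the squared knot range, and the knot values below are assumptions made for illustration.

```python
def rcs_basis(x, knots):
    """Return the k-1 restricted cubic spline columns for a single value x."""
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2  # assumed scaling; keeps columns on the scale of x
    gap = t[-1] - t[-2]
    cols = [float(x)]            # linear term
    for j in range(k - 2):       # k-2 nonlinear terms
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        cols.append(term / scale)
    return cols

# Hypothetical knots on a log10(total assets) scale.
knots = [5.0, 6.5, 8.0, 10.0]
print(rcs_basis(7.0, knots))  # 3 columns for 4 knots
```

The columns would then enter the logistic model in place of the single asset variable, and the chunk test asks whether all of their coefficients are jointly zero. The construction forces the fitted curve to be linear beyond the outer knots, which is what keeps the tails from being driven by a few extreme companies.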

Concerning $Y$, make sure that it is truly all-or-nothing and does not represent a dichotomization of an underlying measurement.

Frank Harrell

A. A number of textbooks and statisticians use this method of plugging in different transformations of a variable and using the p-values as evidence about the nature of the relationship between the variable and the outcome. It seems appealing.

The Vittinghoff book "Regression Methods in Biostatistics" has a rather long section where they plug in a predictor as a categorical or continuous variable or both and discuss the p-values.

In practice I haven't found this approach meaningful. If you're doing exploratory analysis, graphical methods, including residual plots, are usually more helpful. In this case I would be reluctant to draw any conclusion from the p-value and would instead consider different graphical approaches.
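
One graphical check of this kind is to bin the predictor and look at the empirical log-odds of the outcome in each bin. The sketch below only prepares the points to plot; the function name `empirical_logits`, the 0.5 continuity correction, and the data are assumptions for illustration, not something taken from the question.

```python
import math

def empirical_logits(x, y, n_bins=4):
    """Mean of x and smoothed log-odds of y within equal-count bins of x."""
    pairs = sorted(zip(x, y))
    size = len(pairs) / n_bins
    points = []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        if not chunk:
            continue
        events = sum(yi for _, yi in chunk)
        non_events = len(chunk) - events
        # Adding 0.5 to both counts avoids log(0) in bins with all 0s or all 1s.
        logit = math.log((events + 0.5) / (non_events + 0.5))
        mean_x = sum(xi for xi, _ in chunk) / len(chunk)
        points.append((mean_x, logit))
    return points

# Hypothetical log-assets and binary outcomes.
log_assets = [4.1, 4.8, 5.5, 6.0, 6.7, 7.3, 8.2, 9.9]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
for mean_x, logit in empirical_logits(log_assets, outcome):
    print(f"{mean_x:.2f}  {logit:+.2f}")
```

If the points fall roughly on a straight line against log-assets, the log scale is a reasonable choice; repeating the exercise with raw assets shows whether the raw scale is plausible as well.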

B. There are a number of common reasons why you might log-transform a variable: (1) you think you should but are not sure why; (2) you want to linearize the relationship between x and y; (3) you're trying to deal with data that are clumped or contain outliers.

You'll get better answers to your question if you're explicit about your reasons for the transformation. Most people, I think, will assume you are motivated by (2), since this is perhaps the most valid reason, but from the question it seems that you might be motivated by (3), which is likely the most common reason.

charles