7

$\newcommand{\rv}[1]{\texttt{#1}}$The Question: The wealth in billions of dollars for $232$ billionaires is given in $\rv{fortune}.$ Determine a transformation on the response to facilitate linear modeling.

Note: This is a significant portion of Problem 14.7 in Linear Models with R, 2nd Ed., by Julian Faraway.

My Work So Far: There is the usual R code:

require(faraway)
data(fortune, package='faraway')
summary(fortune)
     wealth            age         region
 Min.   : 1.000   Min.   :  7.00   A:38  
 1st Qu.: 1.300   1st Qu.: 56.00   E:80  
 Median : 1.800   Median : 65.00   M:22  
 Mean   : 2.684   Mean   : 64.03   O:29  
 3rd Qu.: 3.000   3rd Qu.: 72.00   U:63  
 Max.   :37.000   Max.   :102.00         
                  NA's   :7 

Plotting $\rv{wealth}$ against $\rv{age}$ is instructive:

require(ggplot2)
ggplot(fortune, aes(x=age, y=wealth, shape=region))+geom_point()

with result

[Figure: scatterplot of wealth against age, with point shape indicating region]

I have tried many, many combinations of the form

sumary(lm(wealth~age,fortune))
sumary(lm(log(wealth)~age,fortune))
require(pracma)
sumary(lm(sigmoid(wealth)~age,fortune))
sumary(lm(sigmoid(wealth)~age+I(age^2),fortune))
...

(That is not a typo in the sumary command: it's Faraway's simplified summary function for linear regressions - a bit less verbose.) I have not gotten anything remotely suitable. Realizing that $R^2$ is by no means the only indicator of quality of fit, you would still expect it to be somewhat larger: the largest value of $R^2$ I've gotten is $0.03.$ I tried the Box-Cox method, with the result of $\lambda=-0.7955,$ but it performed no better than any of the other transforms.
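For reference, a Box-Cox estimate of this kind can be obtained roughly as follows. This is only a sketch: the choice of regressors (age and region), the grid of $\lambda$ values, and the name lambda_hat are assumptions made for illustration, so the value it prints need not match the one quoted above. It uses boxcox() from the MASS package and runs only if the faraway package is installed.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R distributions

if (requireNamespace("faraway", quietly = TRUE)) {
  data(fortune, package = "faraway")

  # Assumed model: the regressors are a guess, not necessarily those used above.
  # lm() drops the rows with missing age by default.
  fit <- lm(wealth ~ age + region, data = fortune)

  # Profile the Box-Cox log-likelihood over a grid of lambda values.
  bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

  # Grid value with the highest profile log-likelihood.
  lambda_hat <- bc$x[which.max(bc$y)]
  print(lambda_hat)
}
```

Setting plotit = TRUE instead draws the profile likelihood together with an approximate confidence interval for $\lambda$, which shows how sharply the data pin the power down.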

My Question: What transformation will work on this data so that I can get even a half-decent model? What's odd about this data is how right-skewed the $\rv{wealth}$ variable is (financial variables often are, I understand), and I think that is affecting the regressions. Should I eliminate some outliers or influential points? Check leverages?

  • Your $R^2$ is low since you are not using the explanatory variable "region" – user277126 Jul 15 '22 at 16:41
  • log works fine. The fundamental problem is that there doesn't seem to be a strong relationship between wealth and age and/or region. No transformation is going to change that. – COOLSerdash Jul 15 '22 at 16:46
  • 1
    The marginal distributions of your response need not be normal; a normality assumption would be about the errors. I don't see why you have to transform anything. That your features seem not to be predictive is a separate issue. – Dave Jul 15 '22 at 16:47
  • @user277126 Including region doesn't help the $R^2$ at all. – Adrian Keister Jul 15 '22 at 17:12
  • @COOLSerdash Define "works fine". It doesn't improve $R^2$ much at all. The $\log$ transform does greatly reduce the residual SE, but that doesn't mean I have even a semi-decent model. – Adrian Keister Jul 15 '22 at 17:14
  • @Dave Well, this is a textbook exercise, and I am trying to learn linear regression. I have seen instances where, without a transform, nothing appears workable, but with just the right transform, all of a sudden the regression works pretty well. I'm kinda hoping someone will have that sort of insight. – Adrian Keister Jul 15 '22 at 17:16
  • 2
    FYI, it's not meaningful to compare residual SE between the raw and transformed values (or between different transformations). They are in different units. – mkt Jul 15 '22 at 18:06
  • "you would still expect it [R^2] to be somewhat larger" You are making the assumption that there is a linear relationship. There may be no relationship, or the relationship may be nonlinear (one could imagine wealth declining in the oldest age category, leading to a unimodal relationship). – mkt Jul 15 '22 at 18:08
  • @mkt Thanks for the residual SE comment. As for $R^2,$ what I'm saying is that IF I'm going to have a model that's half-decent, THEN I would expect $R^2$ to be somewhat larger. The textbook exercises for this book have almost always had reasonably large $R^2$ - the lowest I've seen before is in the $0.6-0.7$ range. – Adrian Keister Jul 15 '22 at 18:11
  • 2
    The question doesn't ask you to model these particular three variables. It asks you to give an opinion about how you might initially express the values of wealth were you to use it in a model with any other variables. – whuber Jul 15 '22 at 18:25
  • @whuber That is an interesting take on the question, and entirely possible. The first two parts of the question just ask for plots. The third part is what I have written in the OP. The fourth part asks what is the relationship of age and region to wealth. The fifth part asks to check the assumptions of my model using appropriate diagnostics. I didn't include the other parts, because I hadn't anticipated that they would bear on this part. I think Faraway does want a model, but it may not be a very good one. – Adrian Keister Jul 15 '22 at 18:41
  • 1
    There isn't much of a detectable relationship among the three variables here. Drawing a small multiple of wealth vs. age scatterplots, one per region, will reveal all there is. It's a little clearer when using -1/wealth in those plots, but it's still evident there's not much of a relationship. – whuber Jul 15 '22 at 21:50
  • 1
    @whuber Yeah, that was the conclusion to which I arrived, as well. Thanks for your help! – Adrian Keister Jul 15 '22 at 21:54
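
The small-multiples plot suggested in the comments above (wealth against age, one panel per region, on the $-1/\rv{wealth}$ scale) could be sketched like this. The helper name neg_recip and the object fortune_complete are made up for illustration, and the block assumes the faraway and ggplot2 packages are installed; it does nothing otherwise.

```r
# Hypothetical helper: the reciprocal re-expression mentioned in the comments.
# The minus sign keeps the transformation increasing in wealth.
neg_recip <- function(w) -1 / w

if (requireNamespace("faraway", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  data(fortune, package = "faraway")

  # Drop the rows with missing age so they are not silently omitted from the plot.
  fortune_complete <- fortune[!is.na(fortune$age), ]

  # One scatterplot panel per region.
  p <- ggplot(fortune_complete, aes(x = age, y = neg_recip(wealth))) +
    geom_point() +
    facet_wrap(~ region) +
    labs(y = "-1 / wealth")
  print(p)
}
```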

2 Answers

8

Modeling is often facilitated, in the sense of being simpler or more easily interpreted, when variables have approximately symmetric distributions. (Be careful, though: when the variables are response variables in regressions, usually you want to make their residuals approximately symmetric.) I will say a little more about this at the end.

Among the summary statistics you present are three that stand out in that they are (a) robust (that is, relatively insensitive to outlying data) and (b) indicate the location (typical value) and spread of the data. These are the median and both quartiles, which I reproduce here for the wealth variable.

  Statistic | Value | Transformed (see below)
  --------- | ----- | -----------
  Q1        | 1.3   | 0.231
  M         | 1.8   | 0.444
  Q3        | 3.0   | 0.667

Any self-respecting transformation will be continuously increasing. This means that the quartiles and medians of the transformed values will be the transforms of the original quartiles and medians. One test of symmetry is that the quartiles are equally spaced on either side of the median. This makes it extremely simple to see whether any proposed transformation achieves even approximate symmetry.

Among the simplest transformations to consider are the Box-Cox transformations, $x\to (x^p-1)/p$ (equal to the natural log when $p=0$). We usually try "simple" Box-Cox parameters (powers), such as small integers, small multiples of $1/2,$ and small multiples of $1/3.$ By trial and error (I set up a spreadsheet for quick work) you will quickly arrive at a power of $p=-1,$ where the transformed values are given in the rightmost column of the table. For $p=-1,$ Q1 is $0.444 - 0.231 = 0.214$ less than M while Q3 is $0.667 - 0.444 = 0.222$ greater than M: close enough.

($p=-1.1$ would place the median exactly between the quartiles, but it's not worth fussing over the difference between $-1$ and $-1.1.$ Moreover, $p=-1,$ the reciprocal, is far easier to interpret. You can understand it as converting "billions of dollars per person" into "people per billion dollars.")
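
The spreadsheet trial and error can be reproduced in a few lines of R. This is a minimal sketch that uses only the three quartiles reported in the question; the names box_cox and quartile_skew and the particular list of powers are illustrative choices, not part of any package.

```r
# Quartiles of wealth, taken from summary(fortune) in the question.
quartiles <- c(Q1 = 1.3, M = 1.8, Q3 = 3.0)

# Box-Cox transformation; the natural log is the limiting case p = 0.
box_cox <- function(x, p) if (p == 0) log(x) else (x^p - 1) / p

# Illustrative symmetry measure on the transformed scale:
# (upper gap - lower gap) / (Q3 - Q1). Zero means the quartiles are
# equally spaced about the median; positive means right skew.
quartile_skew <- function(p) {
  tq <- box_cox(quartiles, p)
  unname(((tq["Q3"] - tq["M"]) - (tq["M"] - tq["Q1"])) / (tq["Q3"] - tq["Q1"]))
}

# A few "simple" powers, plus -1.1 for comparison.
powers <- c(1, 1/2, 0, -1/2, -1, -1.1, -2)
data.frame(p = powers, skew = round(sapply(powers, quartile_skew), 3))
```

The power whose skew is closest to zero is the one that best balances the quartiles around the median, which is the criterion used above.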

Thus, with remarkably little work or sophisticated apparatus, we quickly determine that the Box-Cox transformation with power $-1$ -- that is, the reciprocal -- will make the distribution of wealth approximately symmetric.

That would be an attractive place to begin further analysis.

This R code to display histograms of wealth and its Box-Cox transform makes the skewness of the former and near-symmetry of the latter completely obvious.

data(fortune, package='faraway')
hist(fortune$wealth)
hist(-1/fortune$wealth)

[Figure: histograms of wealth and of -1/wealth]


This exploratory data analysis (EDA) is completely agnostic in the sense that it presupposes no particular model of the data, nor does it impose any restrictions on the modeling that you intend to do. The only thing it has done is to take a batch of data in the form they were recorded and express them in a different way. Not only is this completely legitimate and compatible with any form of modeling, experience tells us it gives you a better chance of gaining insight, developing explanations, and creating useful descriptions of the data because the form in which you express the data is based on properties of the data themselves.

Reference

John W. Tukey (1977), Exploratory Data Analysis. Addison-Wesley.

whuber
  • 322,774
  • Supposing our measurements were represented by a collection of random variables(RVs), then any data transformation $f$ on those measurements would also have a corresponding change of random variables. Thinking of RVs as models of measurements, we can be modelling throughout EDA if we choose to. – Galen Jan 25 '23 at 04:31
  • My previous comment makes an assumption of measurability of function of random variables, which can often be satisfied. The link on change of variables supposes smoothness, but this is not required in general. – Galen Jan 25 '23 at 04:34
  • @Galen Thank you for your comments, but I feel that I don't really understand their point. – whuber Jan 25 '23 at 12:03
3

Determine a transformation on the response to facilitate linear modeling

"Facilitate" is vague, possibly intentionally so. In my opinion, a log transformation is appropriate for two reasons.

First, if you want to use ordinary linear regression, it will likely give residuals that are approximately normally distributed, which is needed to get reliable p-values.

Secondly, and maybe more important, log transformation, especially log2, is nicely interpretable in the case of wealth. I think it is sensible to treat wealth on a multiplicative scale since an additional unit should have an effect that depends on the current status. For example, an increase from £20,000 to £40,000 will change your lifestyle, but a change from £1,000,000 to £1,020,000 is barely noticeable.
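
To make the log2 reading concrete, here is the arithmetic for the two changes just mentioned, as a small sketch (the function name doublings is made up for illustration). A difference of 1 on the log2 scale is one doubling.

```r
# Change in log2(wealth) between two values = number of doublings.
doublings <- function(from, to) log2(to / from)

doublings(20000, 40000)      # exactly one doubling
doublings(1000000, 1020000)  # log2(1.02), a small fraction of a doubling
```

By the same logic, in a fit such as lm(log2(wealth) ~ age), a coefficient b on age means that each additional year multiplies the fitted wealth by 2^b.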

I guess it also depends on what you want to do with the model. If you care more about prediction than understanding, maybe interpretability is less relevant.

dariober
  • 4,250
  • The $\log$ transform certainly reduces the residual SE, and that's good. But I still have a very low $R^2$ value. I wouldn't be willing to use this model to do anything at all. I can certainly agree with your a priori reasons for choosing the $\log$ transform, but a posteriori we are still left with an unusable model. – Adrian Keister Jul 15 '22 at 17:17
  • 2
    If your explanatory variables are truly irrelevant, then a low $R^2$ is the appropriate result. It could be that you know beforehand that they have a strong effect, but then I would question whether you are missing some interaction effect or some other confounder. – dariober Jul 15 '22 at 17:24
  • Well, that is an interesting point. I have not tried any interaction terms as yet. Lemme see how that goes... Interaction terms don't help all that much. – Adrian Keister Jul 15 '22 at 17:26
  • 1
    This is all the data I've got, and I can't get any more. So if there is a confounder, I'm stuck trying to work around it. – Adrian Keister Jul 15 '22 at 17:29