
I am trying to do a regression analysis with the level of a chemical in blood as the dependent variable and the age, gender and weight of children as predictors. The sample size is about 5000. Age and weight are highly correlated in children. My questions are:

  1. Should I use z-scores or percentiles for weight rather than raw values?

  2. Should I use some other technique rather than ordinary linear regression?

  3. Do I need to check whether the data are normally distributed at this sample size?

Edit: To clarify what I mean by z-score or percentile: ages are recorded in whole years (5, 6, 7, 8, etc.), with no fractional ages. My idea is to calculate, within each age, the z-score or percentile of each child's weight and use that instead of the raw weight. That would let me answer the question "Does being overweight for one's age have any effect on the blood level of the chemical?" Is this a reasonable argument? Also, this question differs from the earlier question and is not a duplicate: my questions 2 and 3 do not figure in its title.
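The within-age standardization described in the edit above could be computed roughly as follows. This is a minimal sketch on simulated data: the data frame `mydata`, the columns `age`, `wt`, `std_wt` and `pct_wt`, and all the numbers used to generate the data are illustrative assumptions, not the real dataset.

```r
# Simulated stand-in data; names and numbers are assumptions for illustration.
set.seed(1)
n <- 500
mydata <- data.frame(age = sample(5:15, n, replace = TRUE))
# In this toy example both the mean and the spread of weight grow with age.
mydata$wt <- rnorm(n, mean = 3 * mydata$age + 5, sd = 0.4 * mydata$age)

# z-score of weight within each whole-year age group
mydata$std_wt <- ave(mydata$wt, mydata$age,
                     FUN = function(x) (x - mean(x)) / sd(x))

# Percentile of weight within each age group (empirical CDF, in percent)
mydata$pct_wt <- ave(mydata$wt, mydata$age,
                     FUN = function(x) 100 * ecdf(x)(x))

# Check: within every age group the z-scores have mean ~0 and SD 1
aggregate(std_wt ~ age, data = mydata,
          FUN = function(z) c(mean = mean(z), sd = sd(z)))
```

Note that these are sample-based z-scores: they standardize against the children in the dataset, not against an external growth reference.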

Regarding a comment on biological issues by @DLDahly: The ages are 5-15 years. Biologically, I want to determine whether weight is a predictor of the blood level of the chemical, independent of age. The chemical level rises with age, but it is not clear whether being overweight increases it further. In fact, one cannot rule out the possibility that this rise is related mainly to weight and not to age as such.

rnso

1 Answer


The difficulty is in equating, say, an eight-year-old whose weight is two standard deviations above the mean for his age (fat) with a fourteen-year-old whose weight is two standard deviations above the mean for his age (shooting up). And even if you're happy with that for the population, you still need to be happy with it for your sample.

Rather than try to stipulate how age moderates the effect of weight on the blood concentration of some chemical, as you've got 5000 observations you can afford to be more flexible: an additive model with some non-linear terms in age already allows the effect of weight to be controlled for age; including interaction terms allows the slope to vary.

Suppose you were considering $$ \operatorname{E} Y = \beta_0 + \beta_a a + \beta_w w' $$ where $Y$ is blood concentration of the chemical, $a$ is age, $w'$ weight standardized within each age, & the $\beta$s the coefficients;

then the model $$ \operatorname{E} Y = \beta_0 + \beta_{a} a + \dots + \beta_{a^{10}}a^{10} + \beta_w w + \beta_{wa} wa + \dots + \beta_{wa^{10}} w a^{10} $$ where $w$ is unstandardized weight, would include the first as a special case while being much more flexible: it doesn't rigidly assume that it's the number of standard deviations from the mean weight within each age group that counts, while still allowing the slope & intercept for weight to vary from one age group to the next. Of course you likely needn't go up to a 10th-order polynomial for a good fit, & it'd be sensible to allow for non-linearity in the effect of weight as well (I'd suggest a natural spline basis).
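To see why the second model contains the first, here is a short worked derivation. The symbols $\mu_a$ and $\sigma_a$ (mean and standard deviation of weight at age $a$) are notation introduced for this sketch, and the first model's coefficients are written with tildes to keep them apart from the second model's. Substituting $w' = (w - \mu_a)/\sigma_a$ into the first model gives

$$ \operatorname{E} Y = \tilde\beta_0 + \tilde\beta_a a + \tilde\beta_w \frac{w - \mu_a}{\sigma_a} = \underbrace{\left(\tilde\beta_0 + \tilde\beta_a a - \tilde\beta_w \frac{\mu_a}{\sigma_a}\right)}_{f(a)} + \underbrace{\frac{\tilde\beta_w}{\sigma_a}}_{g(a)}\, w $$

so the standardized model is a straight line in raw weight whose intercept $f(a)$ and slope $g(a)$ both depend on age. With ages recorded in whole years from 5 to 15 there are only 11 distinct values of $a$, and any function defined on 11 points can be reproduced exactly by a polynomial of degree at most 10. The second model supplies exactly such polynomials for the intercept and for the slope of $w$, so it can match $f(a)$ and $g(a)$; it can also fit any other pattern of slopes, whereas the standardized model forces the slope to be proportional to $1/\sigma_a$.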

  • Thanks for your advice. Could you clarify, preferably using formula terms, what exactly you mean by additive model with non-linear terms & interaction? Is it something like: lm(y ~ wt * (age+I(age^2)) ) ? – rnso Jun 02 '15 at 15:04
  • This analysis may become very complex if other variables are also to be added. The model may also be more difficult to explain or interpret, and interpretation rather than prediction of future data is my primary aim here. Also, can you provide some information or a good link on using natural splines here? – rnso Jun 03 '15 at 01:03
  • Your ability to interpret the model depends on the process you are trying to model, which still hasn't been elucidated. Also, people can't be expected to contribute to your question if the goalposts are going to be moved (e.g. other variables could be added). That said, the variance in total mass at different ages in human populations is large enough that you almost certainly don't need to worry about collinearity - so the relatively simple linear model given here is probably your best bet. – D L Dahly Jun 03 '15 at 08:39
  • @DLDahly: Very true. I'm not trying to advocate any particular model on such scanty information, just to show that standard empirical modelling procedures allow you to address concerns such as "what if the effect of weight varies with age?" without having to resort to shaky assumptions such as "the effect of weight is inversely proportional to the standard deviation of weight at each age group". – Scortchi - Reinstate Monica Jun 03 '15 at 08:49
  • @rnso: It could go either way: if, as you say is plausible, age per se has no effect, & the mean & variance of weight are very variable for different ages, then using age-standardized weights could necessitate a much more complex model, obfuscating a simple relationship between blood concentrations of the chemical & absolute weight. – Scortchi - Reinstate Monica Jun 03 '15 at 10:38
  • What is the role of the rcs() function of the rms package? Will the following work well: library(rms); ols(y ~ age + gender + rcs(wt), data=mydata) ? – rnso Jun 03 '15 at 12:00
  • @rnso: Useful information on regression splines can be found at Frank Harrell's RMS site, as well as on how to decide how complex a model to fit overall, how to allocate degrees of freedom among predictors, & how to validate the model. – Scortchi - Reinstate Monica Jun 03 '15 at 12:32
  • Which one is better: lm(y ~ gender+ age + std_wt) or lm(y~ gender + age * wt) or ols(y ~ gender + rcs(age) + rcs(wt) )? The interaction of (age * wt) should tell me if wt is important after correction for age. – rnso Jun 03 '15 at 17:17
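The candidate formulas from the comments above can be fitted side by side, as in the rough sketch below. It runs on simulated data, so every number and the data frame `d` are assumptions made for illustration. It also swaps `rms::ols()` with `rcs()` for base `lm()` with `splines::ns()`, which is a different (natural cubic spline) basis, and the choice of `df = 3` is arbitrary.

```r
library(splines)  # ns(): natural cubic spline basis; ships with R

# Simulated stand-in data: every number below is an illustrative assumption.
set.seed(2)
n <- 2000
d <- data.frame(
  age    = sample(5:15, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
d$wt <- rnorm(n, mean = 3 * d$age + 5, sd = 0.4 * d$age)
# Toy truth: the response depends on absolute weight only, not on age as such.
d$y <- 1 + 0.05 * d$wt + rnorm(n, sd = 0.5)

# Weight standardized within each whole-year age group
d$std_wt <- ave(d$wt, d$age, FUN = function(x) (x - mean(x)) / sd(x))

fits <- list(
  std_wt      = lm(y ~ gender + age + std_wt, data = d),
  interaction = lm(y ~ gender + age * wt, data = d),
  additive_ns = lm(y ~ gender + ns(age, df = 3) + ns(wt, df = 3), data = d),
  flexible_ns = lm(y ~ gender + ns(age, df = 3) * ns(wt, df = 3), data = d)
)

# One AIC per fitted model (lower is better on this scale)
sapply(fits, AIC)

# F-test of the spline interaction: does the weight effect vary with age?
anova(fits$additive_ns, fits$flexible_ns)
```

Because the toy response was generated from absolute weight alone, the comparison here only illustrates the mechanics; it says nothing about which formula suits the real data.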