I have a disease rate calculated for people aged <50 (early-onset) and aged 50+ (late-onset), for every US county. The rate ratio is the early-onset rate divided by the late-onset rate for every county. Then I calculated the correlation coefficients (CC) between the rate ratio and 40 county-level risk factors (e.g., obesity prevalence). If the CC is positive, then I think the risk factor has a stronger association with the early-onset rate than with the late-onset rate.

However, I found that all risk factors had a positive association with the rate ratio. This seems strange as I think at least some risk factors should have a stronger association with the late-onset rate.

Then I found that the range and variance of the early-onset rate are much greater than those of the late-onset rate. This is because far fewer people have early-onset disease than late-onset disease, so the early-onset rates are less stable owing to the small number of cases. Thus, for the rate ratio, the numerator has a much higher variance than the denominator. I think that is why all risk factors are positively associated with the rate ratio. In the extreme case where the denominator has no variance, the correlation between the rate ratio and a risk factor would equal the correlation between the early-onset rate and that risk factor.
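The limiting case described above can be checked with a small simulation. This is only a sketch on made-up numbers: the rates, effect sizes, and noise levels are assumptions chosen for illustration, not estimates from real county data.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(1)
n_counties = 3000

# Hypothetical standardized risk factor, one value per county.
risk = [random.gauss(0, 1) for _ in range(n_counties)]

# Assumed noisy early-onset rate that rises with the risk factor.
early = [10.0 + 1.0 * r + random.gauss(0, 1.5) for r in risk]

# Case 1: the late-onset rate is the same in every county (no variance).
late_constant = [50.0] * n_counties
ratio_constant = [e / d for e, d in zip(early, late_constant)]

# Case 2: the late-onset rate tracks the risk factor very closely,
# but its spread is tiny compared with its mean.
late_stable = [50.0 + 0.5 * r + random.gauss(0, 0.1) for r in risk]
ratio_stable = [e / d for e, d in zip(early, late_stable)]

corr_early = pearson(early, risk)
corr_late = pearson(late_stable, risk)
corr_ratio_constant = pearson(ratio_constant, risk)
corr_ratio_stable = pearson(ratio_stable, risk)

print("early rate vs risk:            ", round(corr_early, 3))
print("late rate vs risk:             ", round(corr_late, 3))
print("ratio (constant denom) vs risk:", round(corr_ratio_constant, 3))
print("ratio (stable denom) vs risk:  ", round(corr_ratio_stable, 3))
```

With a constant denominator the ratio is just a rescaled copy of the early-onset rate, so the two correlations coincide. In the second case the late-onset rate is more strongly correlated with the risk factor than the early-onset rate is, yet the ratio still correlates positively, because the numerator dominates the ratio's variation.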

I tried standardizing the early-onset disease rates to the range of the late-onset disease rate (same method in this post). Then they had a similar variance and the same range. The CCs also showed both positive and negative associations between the rate ratio and risk factors. However, I am not sure if this is a valid way to obtain the direction of associations. If it is a valid way, is there any published paper to support it?
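For readers who want to see what such a rescaling looks like, here is a minimal sketch assuming the method is ordinary min-max rescaling; the linked post is not reproduced here, so this may differ from what was actually done. The function name `rescale_to_range` and the example rates are hypothetical.

```python
def rescale_to_range(values, target_min, target_max):
    """Linearly map values so their min/max match target_min/target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant sequence")
    scale = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * scale for v in values]

# Made-up rates per 100,000 for four counties.
early = [0.0, 2.0, 9.0, 30.0]
late = [40.0, 45.0, 52.0, 60.0]

# Force the early-onset rates onto the range of the late-onset rates.
early_rescaled = rescale_to_range(early, min(late), max(late))
print(early_rescaled)
```

A linear rescaling like this leaves the correlation between the early-onset rate itself and any risk factor unchanged; what changes is the ratio, because shifting the numerator changes how it divides by the denominator.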

utobi
  • This paper states that using a rate ratio is problematic (https://journals.sagepub.com/doi/10.1177/1094428118773455). I haven't figured out the suggested solutions from this paper yet. – Weichuan Dong Dec 26 '22 at 03:38

1 Answer

There are several problems with your approach. In no particular order:

  1. Modeling ratios of ratios is problematic. It's best to model probabilities of disease incidence or prevalence directly, for example with a binomial regression, and then use the results of the model to evaluate ratios of ratios if that's what you really need.

  2. It's not clear how you are accounting for differences in county sizes in your model. "The most populous county is Los Angeles County, California, with 10,014,009 residents in 2020... The least populous county is Loving County, Texas, with 64 residents in 2020." (Wikipedia) A proportion for Los Angeles County is much more reliable than one for Loving County. A binomial regression can handle that problem, giving appropriately more weight to larger counties.

  3. The one-risk-factor-at-a-time approach, evaluated with correlation coefficients, can be highly misleading. It's often combinations of risk factors that matter. The one-risk-factor-at-a-time approach runs a risk of omitted-variable bias. You can get false positives for risk factors that just happen to be correlated with other more important risk factors, or miss important risk factors that become significant when other risk factors are also taken into account. Correlation coefficients also can be poorly behaved when you are working with ratios. Again, binomial regression can incorporate all of your risk factors at once, and the regression coefficients for the risk factors will be more reliable for evaluating their contributions to differences among counties.

  4. It's not completely clear what measure you are using for "disease rate" of early-onset and late-onset cases. If you are measuring prevalence of a chronic condition, then the prevalence in those age 50+ will also include those who developed the condition before age 50.
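To put a number on the county-size issue in point 2, the sketch below compares the binomial standard error of an estimated proportion for the two counties quoted there. The prevalence of 1% is an assumption for illustration only, and the populations are total residents rather than age-specific numbers at risk.

```python
import math

def proportion_se(p, n):
    """Standard error of a binomial proportion p estimated from n people."""
    return math.sqrt(p * (1.0 - p) / n)

assumed_p = 0.01            # hypothetical true prevalence, same in both counties
n_los_angeles = 10_014_009  # 2020 residents, as quoted in point 2
n_loving = 64               # 2020 residents, as quoted in point 2

se_la = proportion_se(assumed_p, n_los_angeles)
se_loving = proportion_se(assumed_p, n_loving)
se_ratio = se_loving / se_la

print(f"SE, Los Angeles County: {se_la:.2e}")
print(f"SE, Loving County:      {se_loving:.2e}")
print(f"ratio of SEs:           {se_ratio:.1f}")
```

Because the standard error scales with one over the square root of the population, the estimate for Loving County is roughly 400 times noisier than the one for Los Angeles County, whatever prevalence is assumed. An unweighted correlation treats the two counties as equally informative.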

Once you are sure that you have handled that last issue, binomial regression would be a reasonably straightforward way to analyze the data. You would provide 2 data rows for each county, one for the early-onset "disease rate" as a proportion of individuals at risk and one for the corresponding late-onset rate. Each data row would have the county-level risk factors, a variable indicating if the row is for early- or late-onset cases, and the number of individuals at risk in that age group and county. You model the proportions as a function of the risk factors and their interactions with the early/late indicator variable, using the numbers at risk as weights. The interaction coefficient for a risk factor will then show directly how much its influence differs between early- and late-onset disease.
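The data layout and model just described might look like the following sketch. Everything here is simulated: the single risk factor `x`, the numbers at risk, and the "true" coefficients are assumptions for illustration. The model is fitted with a hand-written iteratively reweighted least squares (IRLS) loop in NumPy so the example is self-contained; in practice one would more likely use a packaged routine such as R's `glm` with `family = binomial` or the binomial GLM in statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_counties = 400

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical standardized risk factor, one value per county.
x = rng.normal(size=n_counties)

# Hypothetical numbers at risk in each age group.
n_early = rng.integers(20_000, 200_000, size=n_counties)
n_late = rng.integers(10_000, 100_000, size=n_counties)

# Assumed truth: the risk factor matters more for early-onset disease
# (log-odds slope 0.5) than for late-onset disease (slope 0.2).
cases_early = rng.binomial(n_early, inv_logit(-6.0 + 0.5 * x))
cases_late = rng.binomial(n_late, inv_logit(-4.0 + 0.2 * x))

# Long format: two rows per county, early rows first, then late rows.
late = np.repeat([0.0, 1.0], n_counties)   # early/late indicator
xx = np.tile(x, 2)                          # same risk factor on both rows
cases = np.concatenate([cases_early, cases_late]).astype(float)
at_risk = np.concatenate([n_early, n_late]).astype(float)

# Columns: intercept, risk factor, late indicator, risk factor x late.
X = np.column_stack([np.ones_like(xx), xx, late, xx * late])

def fit_binomial(X, cases, at_risk, n_iter=100, tol=1e-10):
    """Logistic regression on grouped counts via IRLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = inv_logit(eta)
        w = at_risk * p * (1.0 - p)              # weights grow with numbers at risk
        z = eta + (cases - at_risk * p) / w      # working response
        new_beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

beta = fit_binomial(X, cases, at_risk)
print("early-onset slope:         ", round(beta[1], 3))
print("interaction (late - early):", round(beta[3], 3))
```

The last coefficient estimates the difference between the late-onset and early-onset slopes on the log-odds scale, which is the quantity the rate-ratio correlations were trying to get at. With 40 risk factors, each would get its own main-effect and interaction column.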

A couple of additional warnings. First, make sure that causality is respected in your model. If there's reasonably high prevalence of the condition, then some of the apparent "risk factors" might result from the condition rather than contribute to its incidence/prevalence. Second, it's best to evaluate age flexibly as a continuous predictor rather than with a fixed cutoff like you are using at age 50. I suspect that you only have information available based on this cutoff, but for future work know that continuous modeling is preferable when possible.

EdM