Pardon my ignorance here. When analyzing data pooled from multiple cohorts (7 cohorts), if one cohort contributes a very small number of subjects relative to the other cohorts, is it worth including that cohort in my analysis? What are the disadvantages of including this cohort?

This is how much data I have from each cohort; C4 is the issue.

 C1     C2      C3      C4     C5       C6        C7    
 200    350     1654    17     1101     412       331

The study objective: impact of exposure (A) on child's mental development (Y).

The outcome, child's mental development, is evaluated using Bayley's Mental Development Index (MD124). This is a continuous variable.

The exposures here are mercury, manganese, and cadmium. These are time-varying variables, measured at baseline and at two follow-up visits.

For each child, I have the outcome (Y) and the three exposure measurements (one per metal) at time 1, at time 2, and at time 3. So Cohort 1 contributes data on 200 unique children, Cohort 2 on 350 unique children, and so on.

SubjectID  CohortID    Y    Exposure1  Exposure2 Exposure3  Time       
1          C1          51   12.2       10.5      11.7       Baseline
1          C1          53   12.5       10.4      11.5       Followup1
1          C1          54   12.6       10.2      11.6       Followup2
2          C1          51   12.1       10.1      11.7       Baseline
2          C1          53   12.2       10.2      11.1       Followup1
2          C1          54   12.4       10.3      11.2       Followup2
.          .           .    .          .         .          . 
.          .           .    .          .         .          . 
.          .           .    .          .         .          . 
.          .           .    .          .         .          . 
1          C7          51   11.2       12.5      11.7       Baseline
1          C7          53   11.5       11.4      11.5       Followup1
1          C7          54   10.6       9.2       11.6       Followup2
2          C7          51   11.1       12.1      11.7       Baseline
2          C7          53   12.2       12.2      11.1       Followup1
2          C7          54   9.4        9.3       11.2       Followup2

I am planning to include cohort id in the model to estimate the cohort effect.

  • 3
    The answer depends a lot on the nature of your study and the hypotheses that you want to test. The question doesn't include enough information on those matters to allow for a helpful answer. Please edit your question to say more about the study, the cohorts, and your underlying scientific question. Please provide that information by editing the question, as comments are easy to overlook and can be deleted. – EdM Mar 15 '22 at 16:06
  • 1
    Ultimately, I would say the issue is whether the question you want to answer involves including that cohort. While it may not be obvious, the analyses including & excluding that cohort answer different questions. Which one is yours? – gung - Reinstate Monica Mar 15 '22 at 16:27
  • @EdM, thanks EdM, I have updated my question with more information. – Ahir Bhairav Orai Mar 15 '22 at 16:45
  • @gung-ReinstateMonica, I am planning to include CohortId as an independent variable to estimate cohort effects. – Ahir Bhairav Orai Mar 15 '22 at 17:27

1 Answer

With only 17 children in cohort 4 out of more than 4000 in total (about 0.4%), the results are unlikely to differ much either way. But you also have substantial differences among the other cohort sizes, with about 8 times as many in cohort 3 as in cohort 1.

Using cohort as a random effect in a mixed model comes to mind. That's particularly the case if you see the cohorts as a sample of potential cohorts from an underlying population rather than as objects of interest in themselves. Individuals would be modeled as random effects within the cohorts. This answer provides a useful introduction to the benefits of proceeding this way, which would provide a "partial pooling" among cohorts, weighted somewhat by numbers of individuals:

Partial pooling means that, if you have few data points in a group, the group's effect estimate will be based partially on the more abundant data from other groups. This can be a nice compromise between estimating an effect by completely pooling all groups, which masks group-level variation, and estimating an effect for all groups completely separately, which could give poor estimates for low-sample groups.
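To make the "weighted somewhat by numbers of individuals" part concrete, here is a minimal sketch of how much weight a simple random-intercept model would put on each cohort's own mean, using the cohort sizes from the question. The variance components `SIGMA2` and `TAU2` are hypothetical values chosen for illustration, not estimates from these data, and the sketch ignores covariates and the repeated measures within children.

```python
# Cohort sizes taken from the question.
cohort_sizes = {"C1": 200, "C2": 350, "C3": 1654, "C4": 17,
                "C5": 1101, "C6": 412, "C7": 331}

# Hypothetical variance components (assumptions, not estimated from the data):
SIGMA2 = 225.0  # outcome variance among children within a cohort
TAU2 = 9.0      # variance of true cohort means around the overall mean


def pooling_weight(n, sigma2=SIGMA2, tau2=TAU2):
    """Weight given to a cohort's own mean under a random-intercept model.

    The partially pooled estimate is
        weight * cohort_mean + (1 - weight) * overall_mean.
    """
    return tau2 / (tau2 + sigma2 / n)


weights = {cohort: pooling_weight(n) for cohort, n in cohort_sizes.items()}

for cohort, w in weights.items():
    print(f"{cohort}: n={cohort_sizes[cohort]:5d}  weight on own mean={w:.3f}")
```

Under these assumed variances, cohort 4 gets a weight of about 0.40 on its own mean, so its estimate is pulled substantially toward the overall mean, while cohort 3 gets about 0.99 and is left almost unchanged.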

Whether to do that and how to proceed depend on what information you want to get from your attempt to "estimate the cohort effect."

The "cohort effect" could be something as simple as differences in baseline outcome values among cohorts, under the assumption that the effects of exposures on outcomes are the same in all cohorts. You could estimate that by including cohort as a fixed effect without a random effect for cohort (while accounting for correlations of measurements within individuals with a repeated-measures approach). If differences in estimated baseline values are what you mean by "cohort effect" and you are OK with "poor estimates for low-sample groups," then I don't see much to lose by including that cohort. Modeling cohort with a random intercept in a mixed model would be an alternative.
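As a rough illustration of what "poor estimates for low-sample groups" means for a fixed-effect cohort term, the sketch below compares the standard error of a simple cohort mean across the seven cohort sizes. The outcome standard deviation `OUTCOME_SD` is a hypothetical value, and the calculation treats each child as contributing a single independent observation, which is a simplification of the repeated-measures design.

```python
import math

# Cohort sizes taken from the question.
cohort_sizes = {"C1": 200, "C2": 350, "C3": 1654, "C4": 17,
                "C5": 1101, "C6": 412, "C7": 331}

OUTCOME_SD = 15.0  # hypothetical within-cohort SD of the outcome (assumption)


def standard_error(n, sd=OUTCOME_SD):
    """Standard error of a cohort mean based on n independent children."""
    return sd / math.sqrt(n)


se = {cohort: standard_error(n) for cohort, n in cohort_sizes.items()}

for cohort, value in se.items():
    print(f"{cohort}: n={cohort_sizes[cohort]:5d}  SE of cohort mean={value:.2f}")

# How much less precise cohort 4 is than the largest cohort.
print(f"SE ratio C4 / C3 = {se['C4'] / se['C3']:.1f}")
```

Whatever the true standard deviation is, the ratio depends only on the sample sizes: the cohort 4 mean is nearly 10 times less precise than the cohort 3 mean.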

A "cohort effect" could, however, be something as complicated as different associations of changes of different exposures with changes of outcomes among the cohorts. That type of "cohort effect" would require incorporating interaction terms into the model. If that's what you mean by "cohort effect," my sense is that the large differences in cohort sizes would argue for modeling cohort with corresponding random effects for the associated "slopes" as well.
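One way to write down such a model, as a sketch rather than a recommendation for these data, is shown here. The notation is assumed for illustration: $j$ indexes cohorts, $i$ children within cohorts, $t$ visits, and $X_{kijt}$ is the $k$-th exposure (mercury, manganese, cadmium).

```latex
Y_{ijt} = \beta_0 + \sum_{k=1}^{3} \beta_k X_{kijt}
          + u_{0j} + \sum_{k=1}^{3} u_{kj} X_{kijt}
          + v_{i(j)} + \varepsilon_{ijt},
\qquad
(u_{0j}, u_{1j}, u_{2j}, u_{3j})^{\top} \sim N(\mathbf{0}, \Sigma_u),
\quad v_{i(j)} \sim N(0, \sigma_v^2),
\quad \varepsilon_{ijt} \sim N(0, \sigma^2).
```

Here $u_{0j}$ is the random intercept for cohort $j$, $\beta_k + u_{kj}$ is the cohort-specific slope for exposure $k$, and $v_{i(j)}$ is the random intercept for child $i$ nested in cohort $j$. Setting all $u_{kj} = 0$ for $k \ge 1$ recovers the simpler random-intercept model.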

I'm a bit worried about the apparent modeling of outcome with respect to current exposures. Heavy-metal effects typically develop over time and can differ depending on an individual's age at exposure. Those concerns, however, don't have to do with the question about differing cohort sizes.

EdM
  • Thanks EdM, you not only answered my question but also highlighted some drawbacks in my approach. I agree with you that a random-effects model is the way forward in order to estimate the cohort effect based on the pooling mechanism you mentioned above. I am now thinking about the bigger issues that you have highlighted. Thanks again. – Ahir Bhairav Orai Mar 15 '22 at 22:52