1

I have a dataset I'm working with containing measurements of several thousand proteins per person, with about 150 people split pretty evenly between two groups (cases and controls). Case status is split evenly within men and women, but gender is split unevenly overall, with about 50 men and 100 women. Everyone has a measurement for every protein.

I'm ultimately interested in what proteins are associated with/predictive of case status. I've evaluated this in the entire group using a lasso regression model, which zeroed out all but about 50 proteins (which is fine, just providing more information).

I want to know if there is a difference in the proteins that predict case status in men versus women. Specifically, I want to know:

  • Are there different proteins predicting case status in men versus women?

  • Are there differences in the strength or direction of association of matching predictive proteins between men and women?

  • Are there proteins that predict case status well in both men and women? If so, do they do so in the same way and to the same degree in men and women?

  • Do proteomic models (either the overall model or in models specifically fitted by gender) fit better among men versus among women?


Edited for clarity:

I'm currently stuck on how to answer these questions statistically. I'm deliberating between two approaches and would like to get input on the advantages and disadvantages of each method (assuming both are valid) and any suggestions of alternative method if you have any. The approaches and some of my thoughts about them are:

  1. Split the data into two sets; one with all men, one with all women. Repeat the lasso regression model in each of the two subsets, then directly compare the proteins and coefficients that come from each gendered model and the overall model. I'm uncertain whether the information gleaned from this approach would accurately be considered the interaction between protein and gender on case status in the strict statistical sense. I think I would be learning about model-level associations/interactions, not protein-level associations/interactions (but I am very unsure on this, so please correct me if I'm wrong).

  2. Run a new model on the full dataset, adding the interaction between gender and each protein to the formula from my overall model, and evaluate the interaction with gender by which gender x protein coefficients remain in the resulting model. But I don't know if that answers the question I'm asking; or how I would interpret the interaction (the non-simple/complex effect) in the resulting model. Could I expect it to predict case status equally well among men as among women?

Edited to add:

I gather what an interaction would mean for any single protein x gender coefficient. However, I don't understand what it would mean for the entire proteomic model, or combination of proteins, and gender. It's possible that what I'm referring to and confused about can only be illuminated by comparing the overall and the interaction model, or perhaps not, but that's a large part of what I'm trying to ask.

Further points of consideration:

  • On multiple testing and power: Approach 1. uses fewer tests per model, but creates two models. Approach 2. uses doubly many tests in one model, but that is the only model. The total number of participants included is the same in the end, but in Approach 1., they are included across two models, one with ~50 people, and one with ~100 people, and in Approach 2., the only model has ~150 people. Is there a definite answer to which approach minimizes multiple testing concerns and maximizes power? If so, are they both the same approach, or does one minimize multiple testing concerns where the other maximizes power?

  • If I were to use Approach 1., fitting to men and women separately, would that result in differing fits by gender? I would imagine so, as one model has only one fit (perhaps at best, the fit of the model in Approach 2. would reflect an average of the fit for men and the fit for women, if not a different fit altogether). If this is the case, would I validly be able to compare the gendered models to each other, or even to the overall model, considering differences in fit and sample size?

End Edit


There also remains the possibility that neither approach is suitable, or appropriate, to answer the questions I am asking. If that is the case, please do let me know what you would suggest as an alternative.

With that, what approach would you use to examine the difference in the relationship between protein and case status by gender, and why?

Any thoughts or suggestions are greatly appreciated!

Edited to add:

In reference to this answer as a suggested answer to this question, I appreciate the suggestion and have previously reviewed this answer, but it does not answer my question. It partially addresses what I've written with respect to interactions for what in this case would be single specific protein x gender coefficients, but not what the interaction would be between the proteomic model and gender. Also, as it only really reflects Approach 2., it does not address why one would use Approach 1. versus Approah 2., or essentially, what are the benefits and drawbacks of each approach?

Sam
  • 25
  • [About model with interaction] "I imagine it would tell me about the association of any given protein with gender, but would it tell me about the difference in the association of the protein with case status by gender?" I assume your model is predicting case status as the response. If so, the answer to this question is yes: an interaction between gender and protein will tell you about the difference between genders in how the protein is associated with case status. – mkt Aug 01 '22 at 21:16
  • 1
    I'd favour using a model with interactions over splitting the dataset in any case I can think of (which does not mean such situations don't exist). I think you'd probably benefit from reading some of the threads here about how to interpret interaction terms. – mkt Aug 01 '22 at 21:21
  • Take a look at this thread and see if it addresses your questions: https://stats.stackexchange.com/questions/49002/interaction-in-generalized-linear-model – mkt Aug 01 '22 at 21:26
  • @mkt I feel insane trying to think this through because I know the answer to these questions in simpler sets with fewer (way fewer) variables, but the number of variables in this set is throwing me off. I'm grappling between the reduction of power if I split the dataset versus the insane multiple testing if I don't, which, really, it's more tests in total if it's done on three 'different' sets, but it's never as many in one model at one time, and I don't know which is better/worse between the two. – Sam Aug 01 '22 at 21:30
  • 2
    It would help if you edited your question to focus on the precise portion that is still unclear, if that thread did not clear it up for you. There is no difference in how to interpret 2-way interactions in models with just a few predictors vs. those with many predictors. – mkt Aug 02 '22 at 05:48
  • Does this answer your question? Interaction in generalized linear model – mkt Aug 02 '22 at 12:28
  • @mkt I have edited the post for clarification and additional details. Is it sufficiently clear now? If not, I can try to clarify further; just let me know what part is still unclear. – Sam Aug 02 '22 at 20:10

1 Answers1

2

Approach 3: instead of modeling case/control against proteins and gender, model the set of protein expression levels against case/control and gender. That's been a standard approach dating back to the days of spotted microarrays for measuring mRNA expression. Standard software tools for that are provided for example by the Bioconductor limma package, which can be adapted to several different types of expression data. It uses appropriately pooled error estimates and takes multiple comparisons into account.

With predictors in the model of gender, case/control, and their interaction, that will tell you specifically which proteins are associated with case/control regardless of gender, which are associated with gender regardless of case/control status, and which differ depending on the combination of case/control and gender.

EdM
  • 92,183
  • 10
  • 92
  • 267
  • Extremely helpful, thank you! So, from my understanding, what you're saying is that rather than using case status as outcome and gender, protein, and gender x protein as predictors, instead, I should use protein level as outcome (which would change this from a binary to a continuous outcome model) and use case status, gender, and gender x case status as binary predictors. To clarify, when you say to use the set of protein expression levels as outcome is this in some way aggregating all, say, 8000 protein measurements per person into one composite outcome (if so, how?), or making 8000 models? – Sam Aug 03 '22 at 05:26
  • Oops, forgot to @EdM – Sam Aug 03 '22 at 05:55
  • Could you say more about how this solves the problem? There's a giant multiple comparisons problem that is being addressed by regularization in OP's model, but would be ignored in the approach you describe. Some sort of multiple comparisons correction could be used, but aren't we back at square one then, just with many models instead of one? – mkt Aug 03 '22 at 08:31
  • Or are you thinking of a multivariate model instead? Though that would seem to address a somewhat different question. – mkt Aug 03 '22 at 08:32
  • 1
    @mkt the Lima package I suggest, and others like it, are extensions of multivariate (multiple outcome) models for this type of data, with all genes treated as outcomes in a single model. Empirical Bayes error estimates are combined with procedures for multiple comparisons. – EdM Aug 03 '22 at 13:16
  • @EdM Thanks, that is helpful. I'm still not clear about whether/how this constitutes an improvement over modelling case_status as function of proteins & gender (with regularisation), especially when that is the implied causal direction. – mkt Aug 03 '22 at 13:55
  • @Sam software like limma is a multivariate (multiple-outcome) analysis that does 8000 models all at once. It uses pooled error estimates (among proteins in your case) to smooth out the vagaries of individual measurements and to incorporate into the model how errors change with expression levels. It then can generate rank-ordered lists of proteins most associated with the gender/case/interaction "predictors," taking multiple-comparison corrections into account. Pooled errors help prevent being fooled by individual proteins that just happen to have low individual standard errors in your data. – EdM Aug 03 '22 at 17:10
  • 1
    @mkt the improvement is that the pooled error estimates on protein expression mean that a protein that just happens to have a low standard error in this particular data set won't be singled out erroneously as "important." Think about it as a type of penalization on the outcomes, I suppose. From my perspective as a biologist, the causal direction is from gender/case status to the protein expression; you might then be able to use the protein expression to "predict" case status. In practice, the case status is often known first. – EdM Aug 03 '22 at 17:18
  • Interesting - thanks! – mkt Aug 03 '22 at 17:51