Statistical comparison of mean protein expression data from imaging mass cytometry (single cell)

Question

I acquired some protein expression data of one protein marker (e.g. CD45, expression range between 0-4, not necessarily integer and on a continuous scale) in a Control group and a Treated group. This is single cell imaging cell mass cytometry data, so I have the information for each individual cell. Each group contains 6 samples, each sample comprising 1000 cells.

We want to compare the expression of our protein marker between the 6,000 cells of the Control group and the 6,000 cells of the Treated group.

For each image separately, CD45 & other markers were normalised to the measured expression of a gas in the Hyperion machine. Then all 12 images were grouped together and the total dataset was scaled to 99th percentile.

How can I know whether expression data from these 2 groups is significantly different ? I am aware that a t-test can compare two groups but am wondering if I need to carry out something more sophisticated to factor in variance across multiple samples?

The 12 samples are all independent of each other.

For context I will be running this analysis in R or Python

I have included a reduced dataset with 10 cells per sample with info on sample number, treatment and mean CD45 expression to show what the data is like. The expression values have already been normalised across samples.

structure(list(sample = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 
12L, 12L, 12L, 12L, 12L), MI_CD45 = c(0.1545877, 0.1793125, 0.1445191, 
0.1019348, 0.08286311, 0.6044133, 0.3096269, 0.2132869, 0.08715424, 
0.1460735, 0.08738686, 0.2297806, 0.365499, 0.1289996, 0.2122718, 
0.101198, 0.121693, 0.1089632, 0.2279845, 0.1451246, 0.425867, 
0.1139834, 0.6634371, 0.7761338, 0.1474155, 0.06559177, 0.1553516, 
0.09389775, 0.3503682, 0.1704765, 0.3370313, 0.2912913, 0.2387781, 
0.6796798, 0.07864064, 0.1668305, 0.463418, 0.3439344, 0.2285349, 
0.2273838, 0.1829566, 0.09678843, 0.07209037, 0.5453814, 0.8523626, 
0.07520846, 0.2545249, 0.204567, 0.09239703, 0.05353002, 0.1567345, 
0.1373628, 0.07440487, 0.3590844, 0.2259703, 0.08887248, 0.1052557, 
0.1162133, 0.7015316, 0.1174095, 0.392122, 0.1412689, 0.1081716, 
0.09501558, 0.06310009, 0.1514402, 0.2839504, 0.2825152, 0.1443816, 
0.3718398, 0.4008126, 0.2265992, 0.2609764, 0.1345574, 0.1566612, 
0.1411953, 0.1564614, 0.1828695, 0.1337754, 0.2459216, 0.166302, 
0.2635781, 0.4181116, 0.1849425, 0.7578567, 0.1300071, 0.110335, 
0.2298646, 0.3006229, 0.1730745, 0.1535573, 0.2113298, 0.1487033, 
0.1209101, 0.2045534, 0.2157926, 0.1669114, 0.187058, 0.2326026, 
0.2871616, 0.368213, 0.3813162, 0.6728588, 0.2095521, 0.3196558, 
0.2208793, 0.2146631, 0.2767163, 0.375026, 1.384382, 0.1599959, 
0.5988706, 0.4100114, 0.5025848, 0.2143242, 0.3332356, 0.544148, 
0.5055255, 0.1230738, 0.1806344), treatment = c("Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated", "Treated", "Treated", 
"Treated", "Treated", "Treated", "Treated")), row.names = c(32376L, 
31483L, 26108L, 9290L, 39029L, 13975L, 8904L, 7043L, 30854L, 
28262L, 61943L, 62499L, 59538L, 44945L, 49756L, 48593L, 57793L, 
51017L, 51676L, 61591L, 74993L, 80774L, 75583L, 80449L, 66765L, 
86362L, 73016L, 86895L, 70177L, 74650L, 98852L, 97693L, 121342L, 
99156L, 111652L, 115560L, 117963L, 105585L, 112108L, 116307L, 
142252L, 127436L, 124329L, 132183L, 127607L, 141529L, 127357L, 
128249L, 133640L, 123272L, 160252L, 173527L, 174041L, 161181L, 
158121L, 168710L, 170645L, 166045L, 169275L, 170357L, 185508L, 
184176L, 176645L, 177849L, 181230L, 179293L, 177097L, 185926L, 
176611L, 186402L, 191758L, 196205L, 191362L, 206817L, 187058L, 
197708L, 189013L, 193915L, 197749L, 188210L, 214891L, 214353L, 
211243L, 216867L, 215698L, 211359L, 211004L, 208673L, 217491L, 
208737L, 219605L, 225965L, 219584L, 227512L, 228393L, 219070L, 
226765L, 230550L, 225558L, 225561L, 236993L, 239767L, 253503L, 
242694L, 253704L, 244407L, 248366L, 247806L, 244830L, 240108L, 
263171L, 273190L, 279033L, 266438L, 281753L, 280685L, 274369L, 
280756L, 266414L, 266935L), class = "data.frame")

Do you want to test for a difference in means? Roughly, what is the range of the values? — David Smith, Jan 29 '24 at 16:45
What is your dependent variable here? It says you want to test differences in protein expression data. Is this a single DV or multiple DVs? — Shawn Hemelstrand, Jan 30 '24 at 00:37
@DavidSmith Yes I want to test for a difference in means and the range of values is roughly 0-4. — MeganCole, Jan 30 '24 at 17:46
@ShawnHemelstrand There is one treated group which contains six samples and this group is the DV — MeganCole, Jan 30 '24 at 17:47
Two groups with n=6 independent (? You don't really say.) observations. Student's t-test is a good start. You may want to scale the protein expression values logarithmically to control heteroscedasticity. Any zero or below threshold values will need further consideration before the analysis. (What do you mean "multiple samples"? Something like measurements in triplicate?) — Michael Lew, Jan 30 '24 at 20:08
Since this question is closed for now, I suggest you revise it to include the information you have given in response to comments. — David Smith, Jan 31 '24 at 03:15
Hi ! Here is some information you should include : "mean protein expression data" : I suppose this represents an average of counts your protein of interest in the 1000 cells of on sample, am I right ?

how did you acquire this data, is it mass spectrometry ? — CaroZ, Jan 31 '24 at 12:11
Also : - do you have a way to know the level of expression in one individual cell ? I would assume not. — CaroZ, Jan 31 '24 at 12:19
Please do provide more information about the nature of your study. Technically, the CD45 levels would be considered the "dependent variable" while the control/treatment would be the "independent variable" in this situation. It's also not clear whether there are 6 pairs of control/treatment samples, or if all 12 samples are independent. Are your CD45 measurements just integer values (0,1,2,3,4), or can they take on values continuously in that range? Are those values per cell or per sample? @CaroZ CD45 is often evaluated by flow cytometry, one cell at a time; might be the case here. — EdM, Jan 31 '24 at 22:11
@CaroZ I have added extra detail that this is imaging mass cytometry data which is single cell so yes I have the information per cell — MeganCole, Feb 01 '24 at 15:41
@EdM, sorry someone further up asked about dependent and independent variables and as I am very new to this I thought they had to be categorised this way. The 12 samples are all independent of each other. The values are mostly not integer e.g., can be 0 or can be 0.0156 expression. I have the expression values per cell — MeganCole, Feb 01 '24 at 15:46
Could you maybe upload a boxplot or even better, a dotplot of the readouts in the control and treated groups ? — CaroZ, Feb 01 '24 at 15:48
@CaroZ I have included a small example dataset labelling sample number, treatment group and mean expression of CD45 per cell I hope that helps — MeganCole, Feb 01 '24 at 16:52
I do not understand what th mean expression per cell is : I thought you had 1 sample = 1000 cells so mean expression = mean of 1000 cells. Am I wrong ? — CaroZ, Feb 01 '24 at 22:07
@CaroZ The data is from imaging mass cytometry, so the expression is measured in pixels orignally, as it uses heavy metal tags attached to antibodies to capture this expression. Once we have identified the individual cells during cell segmentation we then take the mean expression of those pixels as the expression value for that cell — MeganCole, Feb 02 '24 at 10:15
So in this piece of dataset, "1L", "2L", ... are individual cells ? If yes, 1L should be the sample ID, but each individual cell should have a unique ID. Once I know that I can give you a proper answer ! — CaroZ, Feb 02 '24 at 12:55
@CaroZ Each row in the data frame is an individual cell, I have just reduced the dataset size here to show an example of the data. 1L to 12L under the sample column are samples, so this should be sample ID. I have shared the CD45 expression of 10 cells for each sample, so in total 120 cells over the 12 samples, giving 120 rows. I hope that makes sense — MeganCole, Feb 02 '24 at 13:00
One last question : why is it bound between 0 and 4 ? Is it bound by the detection limit of the machine, or is there a transformation you apply to the data ? — CaroZ, Feb 02 '24 at 13:03
@CaroZ It is mainly bound to the detection limit of the machine and then we have scaled to 99th percentile across samples — MeganCole, Feb 02 '24 at 13:23
What precisely was done when you "scaled to 99th percentile across samples"? Were all values scaled to the 99th percentile over all values? Or was that done within treatment groups? Or within individual samples? When you get a chance, please include that information by editing the question. It's important with respect to how to proceed, and if it's only in a comment it will be easy for other visitors to this site to overlook. Please remember that this site tries to be a repository of answers useful to others, not just to the person who asked the question originally. — EdM, Feb 02 '24 at 15:15

EdM · Accepted Answer · 2024-02-02T18:22:19.340

First, look at the data. I saved your sample data in a data frame CD45 and looked at boxplots:

CD45$sample <- as.factor(CD45$sample)
# that avoids problems with using numeric values for sample labels
library(ggplot2)
ggplot(data=CD45,mapping=aes(x=treatment,y=MI_CD45,color=sample)) + geom_boxplot()

The values seem to have substantial skew, with some extreme outliers in most samples. My guess is that these data wouldn't readily meet the assumptions required for t-tests or the more complicated models that could take the grouping by samples into account. This is pretty typical of biological gene-transcript or protein expression data, for which errors tend to be approximately proportional to the values instead of independent of the values.

As these aren't count data (at least after your normalization), the negative binomial model suggested in another answer isn't appropriate for what you show.* I couldn't get the suggested Gamma model to work on your data with default settings (although I think that some playing with initial parameter values would probably work).

What often works in practice with such data is to use log-transformed values instead. I used a log2 transformation, so that a 1-unit difference corresponds to a doubling. The calculations now evaluate the mean values of the log2-transformed data. I also examined the distribution of values among cells within treatments.

CD45[,"log2CD45"] <- log2(CD45[,"MI_CD45"])
ggplot(data=CD45,mapping=aes(x=treatment,y=log2CD45,color=sample)) + geom_boxplot()
ggplot(data=CD45,mapping=aes(x=log2CD45,group=treatment,color=treatment)) + stat_ecdf()

The box plots look much better, and the distribution of log2CD45 values is clearly shifted to higher values in the Treated group. With only 10 data points here per sample, I couldn't reasonably do this further by sample. With your full data set it would make sense to examine the distribution among cells by both sample and treatment. What follows doesn't do any modeling of the potentially rich data about the distribution of values among cells within samples. In general, treatments might affect not only the mean (or mean-log) outcome values, but also their distributions among cells. We're ignoring that here.

Then the question becomes what you consider the fundamental unit of observation. The most conservative is to treat the individual cells as technical replicates and the samples as the unit of observation. Find the average of the log-transformed values per sample, and then do a linear model; in this case, it's equivalent to a t-test among samples.

meanLogAggregate <- with(CD45, aggregate(log2CD45~treatment + sample,FUN=mean))
lmLogAgg <- lm(log2CD45~treatment,data=meanLogAggregate)
summary(lmLogAgg)
# many lines omitted
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       -2.5140     0.1346 -18.673 4.19e-09 
# treatmentTreated   0.4887     0.1904   2.566   0.0281

Even in this most conservative analysis, there is a "statistically significant" difference in log2CD45 as a function of treatment. Apply your understanding of the subject matter to assess the practical significance of having Treated values about 1.4 times the Control values $(2^{0.4887}=1.40)$.

With your nicely balanced sample data, the following alternative approaches all provide the same estimates for (Intercept) and the treatmentTreated coefficient, the estimated treatment-associated difference in mean log2CD45 values. The alternatives can give different estimates for the standard errors of the coefficients.

The least conservative analysis is to ignore the grouping by treatment and treat all single-cell observations as independent. As you sense, that might be a risky choice in general.

lmIndep <- lm(log2CD45~treatment,data=CD45)
summary(lmIndep)
# many lines omitted
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       -2.5140     0.1163 -21.623  < 2e-16 
# treatmentTreated   0.4887     0.1644   2.972  0.00359

The standard errors are lower; together with the much larger number of observations (now by cell instead of by sample) the p-values are much lower.

Treating the samples as grouping factors is a reasonable compromise. There are two general ways to deal with this.

One is to start with the model based on independence and to adjust the standard errors to account for the grouping. An example of this approach is to use a "sandwich" estimator, used internally by the coeftest() function in the lmtest package.

library(lmtest)
library(sandwich)
coeftest(lmIndep, vcov.=vcovCL(lmIndep,cluster=~sample))
#                  Estimate Std. Error  t value  Pr(>|t|)    
#  (Intercept)      -2.51402    0.07004 -35.8941 < 2.2e-16 
#  treatmentTreated  0.48865    0.18231   2.6804  0.008406

Accounting for the grouping by sample this way slightly increased the standard error of the treatmentTreated coefficient.

The second general way to handle grouping is to model the groups directly, in a "mixed model" with both fixed and random terms. In your case, this would mean allowing the estimated intercepts to vary among samples (the "random" effect), with the distribution of intercepts among samples usually fit to a zero-mean Gaussian distribution. The overall (Intercept) and treatment are the "fixed" effects. That's often done this way:

library(lmerTest)
lmer1 <- lmer(log2CD45 ~ treatment + (1|sample), data = CD45)
summary(lmer1)
# many lines omitted
# Fixed effects:
#                  Estimate Std. Error      df t value Pr(>|t|)    
# (Intercept)       -2.5140     0.1346 10.0000 -18.673 4.19e-09 
# treatmentTreated   0.4887     0.1904 10.0000   2.566   0.0281

In this case, the standard error and p-values for these fixed effects in the mixed model are identical to those of the original model based solely on the mean values per sample!

As you have 100 times as many observations as in this sample data set, there won't be much practical difference in which estimates of standard errors and degrees of freedom you use. It seems that there will be no question about the "statistical significance" of the effect of treatment. In other circumstances, however, you might need to pay attention to the different ways of adjusting for grouping of observations and their corresponding assumptions.

*A negative binomial model might be appropriate for the non-normalized data, as the original data are presumably counts of protein fragments having masses mapped to CD45 fragments. That type of model is often used, for example, with RNAseq expression data.

score 0 · Answer 2 · answered Feb 02 '24 at 13:46

First, you should check the distribution of your protein expression data. I do suspect it will be negative binomial or Gamma.

I would give one individual ID to each cell in an extra column. The sample ID should be included as a random factor in your model. Then I would use the package glmmTMB to run a GLMM with either a negative binomial distribution or a Gamma distribution (and check which one is the most appropriate with the plot(simulateResiduals(model)) of the DHARMa package. THe formula in R could look something like : model1<-glmmTMB(CD45 expression~treatment+(1|sample), family=nbinom1, data=yourdata) or model2<-M1<-glmmTMB(CD45 expression~treatment+(1|sample),data=yourdata,family=Gamma(link=log))

The family and link should be adjusted according to your data. But you could start like this and let us know how this goes.

Statistical comparison of mean protein expression data from imaging mass cytometry (single cell)

2 Answers2