Testing for significant difference in mortality rate between 3 groups across 3 stages

Question

I am looking for a way to check whether there is a significant difference in embryo mortality across three groups of eggs: DYF1 (double yolk with one embryo), DYF2 (double yolk with two embryos), and CONTROL (normal egg with one yolk and one embryo).

Mortality in my study is divided into 3 stages: EARLY (days 1-7), MID (days 8-14), LATE (days 15-21).

In each cell I wrote the number of embryos that died at each incubation stage (early, mid, late). I would like to compare these groups so that I can state, for example, that late mortality in DYF2 eggs is significantly higher than in CONTROL, and so on. From what I understand so far, this will need post hoc tests (for multiple comparisons).

I read about proportion tests, the chi-square test, and the Fisher exact test, but I am not sure which is appropriate for my analysis. I ran a chi-square test and a Fisher test, and both show a significant difference in the table, but I would like to confirm that I am using the right test before starting the post hoc analysis.

Should I keep the numbers of embryos, or convert them to percentages of dead embryos out of all embryos in a given group? Should I split the 3x3 table into smaller ones to check for significance?

I would be grateful for help getting on the right track.

My table looks like this

observed_table <- matrix(c(24, 30, 51, 12, 50, 60, 2, 2, 10), 
                         nrow = 3, ncol = 3, byrow = TRUE)
rownames(observed_table) <- c('DYF1', 'DYF2', 'control')
colnames(observed_table) <- c('early', 'mid', 'late')
observed_table
#number of dead embryos by incubation stage (early, mid, late) and group
         early mid late
DYF1       24  30   51
DYF2       12  50   60
control     2   2   10
total number of embryos in DYF1 eggs - 213 (213 eggs)
total number of embryos in DYF2 eggs - 122 (60 eggs)
total number of embryos in SY eggs (control) - 200 (200 eggs)

Welcome to Cross Validated! Did you start with the same number of eggs of each type? Or is there additional information on the number of eggs of each type that hatched successfully? Please provide that information by editing the question, as comments are easy to overlook and can be deleted. — EdM, Nov 29 '22 at 21:32
Hey, I edited my original post and added the total number of eggs and embryos (hatched and dead) for each category. — Maria, Dec 01 '22 at 17:22
Thanks for updating. The observed_table whose values you show doesn't agree with the matrix that you set up in the first line of your code. All the "control" values are different, as is the "DYF2, mid" value. I'm working on an answer now; please check whether the original matrix or the displayed version of the table is correct. — EdM, Dec 01 '22 at 19:16
I am sorry for this mistake. The values from the displayed version of the table are correct; I have edited the post. — Maria, Dec 02 '22 at 15:20

EdM · Answer 1 · 2022-12-02T04:06:07.227

The key here is that you need to evaluate some type of "mortality rate." So you can't just evaluate the numbers of deaths at each stage, as that doesn't provide a rate that takes into account how many embryos were at risk of dying.

You also can't just use the "percentage of dead embryos of all in a given group," for two reasons. First, the statistical reliability of a percentage depends on the number of observations. You need to keep track of the total numbers involved. Second, only embryos that make it through all the earlier developmental stages are at risk of dying at a later stage. So the "rate" of dying at a particular stage must be based on the number still at risk because they had gotten up to that stage.
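As a rough illustration of the second point, the sketch below uses the DYF1 counts from the question to compare the naive percentage (deaths divided by all embryos that started) with the stage-specific rate (deaths divided by embryos still at risk). The variable names are made up for this example.

```r
# DYF1 counts from the question; variable names are illustrative only
start <- 213
died  <- c(early = 24, mid = 30, late = 51)

# Embryos still alive at the start of each stage
at_risk <- start - c(0, cumsum(died)[-length(died)])
names(at_risk) <- names(died)

naive       <- died / start    # ignores deaths at earlier stages
conditional <- died / at_risk  # rate among embryos that reached the stage

round(rbind(at_risk, naive, conditional), 3)
```

The early-stage figures agree, but for the late stage the naive figure (51/213, about 24%) understates the rate among embryos that actually reached that stage (51/159, about 32%).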

As you want to evaluate differences among all groups and stages, it's best to do a binomial regression model that can make such detailed estimates. A Fisher test or chi-square test only tells you whether there are some types of differences, but not directly where those differences occur. You might think of binomial regression as a generalization of a simple proportion test.
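To show that connection on a small scale, here is a minimal sketch restricted to the early stage, where every embryo is at risk. It runs a classical test of equal proportions and then a binomial regression on the same counts. The data frame early and the object names are assumptions made for this example only.

```r
# Early-stage counts from the question (all embryos are at risk at this stage)
early <- data.frame(
  group = factor(c("control", "DYF1", "DYF2"),
                 levels = c("control", "DYF1", "DYF2")),
  risk  = c(200, 213, 122),
  died  = c(2, 24, 12)
)

# Classical test of equal proportions across the three groups (omnibus only)
pt <- prop.test(early$died, early$risk)
pt$p.value

# The same question posed as a binomial regression with group as predictor
fit <- glm(cbind(died, risk - died) ~ group, data = early, family = binomial)
anova(fit, test = "Chisq")  # likelihood-ratio test for any group effect

# The fitted probabilities reproduce the raw proportions died / risk
cbind(early, raw = early$died / early$risk, fitted = fitted(fit))
```

Both approaches say that the groups differ somewhere; the regression has the advantage that it extends naturally to several stages and to specific pairwise comparisons.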

Let's take your tabular display as giving the correct number of deaths at each stage, add a column for the number of embryos at the start (and thus at risk of dying at the early stage), and rename the columns to clarify what each represents.

observed_table <- matrix(c(213, 24, 30, 51, 122, 12, 50, 60, 200, 2, 2, 10), nrow = 3, byrow = TRUE)
rownames(observed_table) <- c('DYF1', 'DYF2', 'control')
colnames(observed_table) <- c('early_risk','early_died', 'mid_died', 'late_died')
observed_table
#         early_risk early_died mid_died late_died
# DYF1           213         24       30        51
# DYF2           122         12       50        60
# control        200          2        2        10

Convert that to a data frame to allow easier calculations of those at risk of death at each stage. Also add a "group" column representing the type of egg.

observed_table <- data.frame(observed_table)
observed_table[,"mid_risk"] <- observed_table$early_risk - observed_table$early_died
observed_table[,"late_risk"] <- observed_table$mid_risk - observed_table$mid_died
observed_table[,"group"] <- rownames(observed_table)

The regression model will use both stage and group as predictors of the probability of death at each stage, given that an embryo has already survived up until then. That requires some reformatting of the data, so that each row has the number at risk and the number that died for each combination of stage and group. That can be done pretty simply with the tools of the tidyr package. It's worth learning how to use those tools.

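library(tidyr)   # pivot_longer(), pivot_wider(); also re-exports the %>% pipe
library(readr)   # parse_factor()
dataFormatted <- observed_table %>% 
  # one row per group/stage/status; names like "early_risk" are split at the underscore
  pivot_longer(cols = !group,
    names_to = c("stage","status"),
    names_pattern = "(.*)_(.*)", values_to = "number", 
    # make stage a factor ordered early < mid < late rather than alphabetically
    names_transform =
      list(stage = ~readr::parse_factor(.x,  
        levels = c("early","mid","late")))) %>%
  # spread status back out into separate "risk" and "died" columns
  pivot_wider(names_from = status, values_from = number)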

That looks complicated at first, but in outline it just means that the result dataFormatted is produced by starting with your observed_table, making it longer (pivot_longer) to get separate rows for each stage and associated status (risk or died), and then making it a bit wider again (pivot_wider) by setting up separate columns for each of risk and died for each combination of group and stage.

dataFormatted
# A tibble: 9 × 4
  group   stage  risk  died
  <chr>   <fct> <dbl> <dbl>
1 DYF1    early   213    24
2 DYF1    mid     189    30
3 DYF1    late    159    51
4 DYF2    early   122    12
5 DYF2    mid     110    50
6 DYF2    late     60    60
7 control early   200     2
8 control mid     198     2
9 control late    196    10
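For readers who prefer not to use tidyr, roughly the same long-format table can be assembled in base R. This is only a sketch of an alternative; deaths, start, and long are names chosen for this example and are not objects from the code above.

```r
# Deaths per stage (rows: groups, columns: stages), as in the question
deaths <- matrix(c(24, 30, 51,
                   12, 50, 60,
                    2,  2, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("DYF1", "DYF2", "control"),
                                 c("early", "mid", "late")))
start <- c(DYF1 = 213, DYF2 = 122, control = 200)  # embryos at the outset

# Deaths that occurred before each stage, then embryos still at risk
died_before <- t(apply(deaths, 1, cumsum)) - deaths
at_risk     <- start - died_before  # start[i] is recycled along row i

# One row per group/stage combination, stages nested within groups
long <- data.frame(
  group = rep(rownames(deaths), each = ncol(deaths)),
  stage = factor(rep(colnames(deaths), times = nrow(deaths)),
                 levels = colnames(deaths)),
  risk  = as.vector(t(at_risk)),
  died  = as.vector(t(deaths))
)
long
```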

Now the data are in a format that allows binomial regression. With family = binomial, glm() defaults to the logit link, that is, logistic regression, which models the log-odds of death. The outcome can be given as a two-column matrix of events (deaths) and non-events (survivors), put together with the cbind() function.

glm1 <- glm(cbind(died, risk-died) ~ group*stage,
              data = dataFormatted, family = binomial)

That's what's called a saturated model: with the * interaction symbol it includes both predictors and all of their combinations, so it has one coefficient for each of the 9 group-by-stage cells.

You could explore the glm1 object if you wish, and you should learn about it eventually. The coefficients of a logistic regression with interactions, however, aren't immediately easy to interpret. Each coefficient represents a difference of the log-odds of a more complicated situation from a simpler situation. You probably want to show the results as estimates of the probabilities of death for each stage and group.
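To make the coefficient interpretation a little more concrete, the sketch below refits the same saturated model on a hand-typed copy of the data and converts sums of coefficients back to probabilities with plogis(). The names dat and fit are used only here, and making control the reference level is an assumption of this example.

```r
# Hand-typed copy of the at-risk/died table; control is set as the reference level
dat <- data.frame(
  group = factor(rep(c("control", "DYF1", "DYF2"), each = 3),
                 levels = c("control", "DYF1", "DYF2")),
  stage = factor(rep(c("early", "mid", "late"), times = 3),
                 levels = c("early", "mid", "late")),
  risk  = c(200, 198, 196, 213, 189, 159, 122, 110, 60),
  died  = c(2, 2, 10, 24, 30, 51, 12, 50, 60)
)

# glm() may warn about fitted probabilities of 0 or 1 (the DYF2/late cell);
# any such warning is suppressed here for brevity
fit <- suppressWarnings(
  glm(cbind(died, risk - died) ~ group * stage, data = dat, family = binomial)
)
b <- coef(fit)

# Intercept: log-odds of death in the reference cell (control, early)
p_control_early <- plogis(b[["(Intercept)"]])

# Adding the DYF1 main effect gives the log-odds for DYF1 at the early stage
p_dyf1_early <- plogis(b[["(Intercept)"]] + b[["groupDYF1"]])

# A non-reference cell needs both main effects plus the interaction term
p_dyf1_mid <- plogis(b[["(Intercept)"]] + b[["groupDYF1"]] +
                     b[["stagemid"]] + b[["groupDYF1:stagemid"]])

c(control_early = p_control_early, DYF1_early = p_dyf1_early, DYF1_mid = p_dyf1_mid)
```

These three values should match the raw proportions 2/200, 24/213, and 30/189, which is the bookkeeping that a post-modeling tool handles automatically.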

The emmeans package provides a simple way to do that. It's worth learning how to use for post-modeling analysis of many different types of models.

library(emmeans)
emm1 <- emmeans(glm1, ~group|stage, type="response")
emm1
# stage = early:
#  group        prob         SE  df asymp.LCL asymp.UCL
#  control 0.0100000 0.00703562 Inf 0.0025024 0.0390817
#  DYF1    0.1126761 0.02166542 Inf 0.0766746 0.1626047
#  DYF2    0.0983607 0.02696170 Inf 0.0567097 0.1652438
# 
# stage = mid:
#  group        prob         SE  df asymp.LCL asymp.UCL
#  control 0.0101010 0.00710633 Inf 0.0025277 0.0394675
#  DYF1    0.1587302 0.02658070 Inf 0.1132620 0.2179646
#  DYF2    0.4545455 0.04747572 Inf 0.3640969 0.5480966
# 
# stage = late:
#  group        prob         SE  df asymp.LCL asymp.UCL
#  control 0.0510204 0.01571710 Inf 0.0276686 0.0922118
#  DYF1    0.3207547 0.03701701 Inf 0.2528802 0.3971627
#  DYF2    1.0000000 0.00000012 Inf 0.0000000 1.0000000
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the logit scale 
#

The prob values are the probabilities of an embryo dying in each situation, given that it was still at risk of dying. The SE is the standard error of that estimate. The Inf values for df (degrees of freedom) come from the model having been fit by maximum likelihood, with coefficient estimates having an asymptotic (asymp) multivariate normal distribution. The LCL and UCL are the 95% lower and upper confidence limits for the probability estimates.

Note in particular that the probability of an embryo dying at stage = late in group = DYF2 is 1: all of the embryos in that group that got past stage = mid died. Nevertheless, the corresponding LCL is a probability of 0. That's because when you can make a perfect prediction for some situation in a logistic regression, the phenomenon of "perfect separation" means that confidence limits can't be calculated in the simple way used here.
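One possible workaround for that single cell, which is a different technique from the model-based intervals shown above, is an exact (Clopper-Pearson) binomial interval from binom.test() in base R. This is a sketch for the 60 deaths out of 60 DYF2 embryos at risk in the late stage; the object name is made up for this example.

```r
# Exact binomial confidence interval for 60 deaths among 60 embryos at risk
ci_dyf2_late <- binom.test(x = 60, n = 60)$conf.int
ci_dyf2_late

# With x equal to n, the exact lower limit has a closed form, (alpha/2)^(1/n)
0.025^(1 / 60)
```

The upper limit is 1 and the lower limit is roughly 0.94, which is a far more sensible summary than a lower limit of 0.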

You also can do pairwise comparisons among all groups at each stage.

contrast(emm1, "pairwise")
# stage = early:
#  contrast       odds.ratio       SE  df null z.ratio p.value
#  control / DYF1     0.0795 5.91e-02 Inf    1  -3.407  0.0019
#  control / DYF2     0.0926 7.16e-02 Inf    1  -3.078  0.0059
#  DYF1 / DYF2        1.1640 4.35e-01 Inf    1   0.407  0.9128
# 
# stage = mid:
#  contrast       odds.ratio       SE  df null z.ratio p.value
#  control / DYF1     0.0541 3.99e-02 Inf    1  -3.953  0.0002
#  control / DYF2     0.0122 9.01e-03 Inf    1  -5.981  <.0001
#  DYF1 / DYF2        0.2264 6.25e-02 Inf    1  -5.378  <.0001
# 
# stage = late:
#  contrast       odds.ratio       SE  df null z.ratio p.value
#  control / DYF1     0.1139 4.17e-02 Inf    1  -5.930  <.0001
#  control / DYF2     0.0000 0.00e+00 Inf    1  -0.001  1.0000
#  DYF1 / DYF2        0.0000 1.00e-07 Inf    1  -0.001  1.0000
# 
# P value adjustment: tukey method for comparing a family of 3 estimates 
# Tests are performed on the log odds ratio scale

The "perfect separation" means the p-values calculated for comparisons involving DYF2 at stage = late can't be correctly calculated this way, but with no DYF2 embryos hatching at all that isn't a practical problem for explaining your results to others.
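If a p-value for those two late-stage comparisons is nevertheless wanted, one option is a Fisher exact test on each 2x2 table of late-stage deaths and survivors. This is a swapped-in technique rather than part of the model above, the Holm adjustment used here is not the Tukey adjustment reported by emmeans, and the helper late_table() is hypothetical.

```r
# Late-stage counts: deaths and survivors among embryos still at risk
late <- data.frame(
  group    = c("DYF1", "DYF2", "control"),
  died     = c(51, 60, 10),
  survived = c(159 - 51, 60 - 60, 196 - 10)
)

# Hypothetical helper: 2x2 table (group by outcome) for two chosen groups
late_table <- function(g1, g2) {
  rows <- late[match(c(g1, g2), late$group), ]
  as.matrix(rows[, c("died", "survived")])
}

# Fisher exact tests for the two comparisons affected by perfect separation
p_raw <- c(
  DYF2_vs_control = fisher.test(late_table("DYF2", "control"))$p.value,
  DYF2_vs_DYF1    = fisher.test(late_table("DYF2", "DYF1"))$p.value
)

# Holm adjustment for the two comparisons made here
p_adj <- p.adjust(p_raw, method = "holm")
p_adj
```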

I really appreciate your help! I did not expect such a detailed and professional answer; thank you for your time. Do you think it would be correct to include plots made with plot(emm1), or should I make plots from my raw data? Another simple question: in the emm1 object, does the probability of 0.0100000 for the early-stage control group simply mean that there is a 1% chance for an embryo to die at this stage? — Maria, Dec 02 '22 at 15:46
@Maria the 0.01 probability is what you think it is. Look at the original data to confirm: 200 embryos to start in control, and 2 died at stage = early. With this saturated model having only categorical predictors, the point estimates of probability from the emm1 object and the raw data should agree. The emm1 object can also provide confidence limits (CL), which are good to show in general--although the CL calculated this way will be uninterpretable for DYF2 at the late stage. — EdM, Dec 02 '22 at 17:07
Thank you once again! I finally understand how to handle data like this. — Maria, Dec 03 '22 at 07:35
@Maria if this has answered your question, please consider marking it as the accepted answer by clicking on the checkmark near the top of the answer. Although you don't yet have enough reputation to "up-vote" you can still accept an answer to your own question. — EdM, Dec 03 '22 at 13:49
