The key here is that you need to evaluate some type of "mortality rate." So you can't just evaluate the numbers of deaths at each stage, as that doesn't provide a rate that takes into account how many embryos were at risk of dying.
You also can't just use the "percentage of dead embryos of all in a given group," for two reasons. First, the statistical reliability of a percentage depends on the number of observations. You need to keep track of the total numbers involved. Second, only embryos that make it through all the earlier developmental stages are at risk of dying at a later stage. So the "rate" of dying at a particular stage must be based on the number still at risk because they had gotten up to that stage.
As you want to evaluate differences among all groups and stages, it's best to do a binomial regression model that can make such detailed estimates. A Fisher test or chi-square test only tells you whether there are some types of differences, but not directly where those differences occur. You might think of binomial regression as a generalization of a simple proportion test.
Let's take your tabular display as being the correct number of deaths at each stage, add to it a column for the number of embryos to start and thus at risk for dying at early stage, and rename your columns to clarify what each represents.
observed_table <- matrix(c(213, 24, 30, 51, 122, 12, 50, 60, 200, 2, 2, 10), nrow = 3, byrow = TRUE)
rownames(observed_table) <- c('DYF1', 'DYF2', 'control')
colnames(observed_table) <- c('early_risk','early_died', 'mid_died', 'late_died')
observed_table
# early_risk early_died mid_died late_died
# DYF1 213 24 30 51
# DYF2 122 12 50 60
# control 200 2 2 10
Convert that to a data frame to allow easier calculations of those at risk of death at each stage. Also add a "group" column representing the type of egg.
observed_table <- data.frame(observed_table)
observed_table[,"mid_risk"] <- observed_table$early_risk - observed_table$early_died
observed_table[,"late_risk"] <- observed_table$mid_risk - observed_table$mid_died
observed_table[,"group"] <- rownames(observed_table)
The regression model will use both stage and group as predictors of the probability of death at each stage, given that an embryo has already survived up until then. That requires some reformatting of the data, so that each row has the number at risk and the number that died for each combination of stage and group. That can be done pretty simply with the tools of the tidyr package. It's worth learning how to use those tools.
library(tidyr)
library(readr)
dataFormatted <- observed_table %>%
pivot_longer(cols = !group,
names_to = c("stage","status"),
names_pattern = "(.*)_(.*)", values_to = "number",
names_transform =
list(stage = ~readr::parse_factor(.x,
levels = c("early","mid","late")))) %>%
pivot_wider(names_from = status, values_from = number)
That looks complicated at first, but in outline it just means that the result dataFormatted is produced by starting with your observed_table, making it longer (pivot_longer) to get separate rows for each stage and associated status (risk or died), and then making it a bit wider again (pivot_wider) by setting up separate columns for each of risk and died for each combination of group and stage.
dataFormatted
# A tibble: 9 × 4
group stage risk died
<chr> <fct> <dbl> <dbl>
1 DYF1 early 213 24
2 DYF1 mid 189 30
3 DYF1 late 159 51
4 DYF2 early 122 12
5 DYF2 mid 110 50
6 DYF2 late 60 60
7 control early 200 2
8 control mid 198 2
9 control late 196 10
Now the data are in a format that allows binomial regression. The default is logistic regression, modeling the log-odds of death. The outcome values can be a matrix of events (deaths) and non-events, put together by the cbind() function.
glm1 <- glm(cbind(died, risk-died) ~ group*stage,
data = dataFormatted, family = binomial)
That's what's called a saturated model, as it includes all combinations of both your predictors with the * interaction symbol.
You could explore the glm1 object if you wish, and you should learn about it eventually. The coefficients of a logistic regression with interactions, however, aren't immediately easy to interpret. Each coefficient represents a difference of the log-odds of a more complicated situation from a simpler situation. You probably want to show the results as estimates of the probabilities of death for each stage and group.
The emmeans package provides a simple way to do that. It's worth learning how to use for post-modeling analysis of many different types of models.
library(emmeans)
emm1 <- emmeans(glm1, ~group|stage, type="response")
emm1
# stage = early:
# group prob SE df asymp.LCL asymp.UCL
# control 0.0100000 0.00703562 Inf 0.0025024 0.0390817
# DYF1 0.1126761 0.02166542 Inf 0.0766746 0.1626047
# DYF2 0.0983607 0.02696170 Inf 0.0567097 0.1652438
#
# stage = mid:
# group prob SE df asymp.LCL asymp.UCL
# control 0.0101010 0.00710633 Inf 0.0025277 0.0394675
# DYF1 0.1587302 0.02658070 Inf 0.1132620 0.2179646
# DYF2 0.4545455 0.04747572 Inf 0.3640969 0.5480966
#
# stage = late:
# group prob SE df asymp.LCL asymp.UCL
# control 0.0510204 0.01571710 Inf 0.0276686 0.0922118
# DYF1 0.3207547 0.03701701 Inf 0.2528802 0.3971627
# DYF2 1.0000000 0.00000012 Inf 0.0000000 1.0000000
#
# Confidence level used: 0.95
# Intervals are back-transformed from the logit scale
#
The prob values are the probabilities of an embryo dying in each situation, given that it was still at risk of dying. The SE is the standard error of that estimate. The Inf values for df (degrees of freedom) come from the model having been fit by maximum likelihood, with coefficient estimates having an asymptotic (asymp) multivariate normal distribution. The LCL and UCL are the 95% lower and upper confidence limits for the probability estimates.
Note in particular that the probability of an embryo dying at stage = late in group = DYF2 is 1: all of the embryos in that group that got past stage = mid died. Nevertheless, the corresponding LCL is a probability of 0. That's because when you can make a perfect prediction for some situation in a logistic regression, the phenomenon of "perfect separation" means that confidence limits can't be calculated in the simple way used here.
You also can do pairwise comparisons among all groups at each stage.
contrast(emm1, "pairwise")
# stage = early:
# contrast odds.ratio SE df null z.ratio p.value
# control / DYF1 0.0795 5.91e-02 Inf 1 -3.407 0.0019
# control / DYF2 0.0926 7.16e-02 Inf 1 -3.078 0.0059
# DYF1 / DYF2 1.1640 4.35e-01 Inf 1 0.407 0.9128
#
# stage = mid:
# contrast odds.ratio SE df null z.ratio p.value
# control / DYF1 0.0541 3.99e-02 Inf 1 -3.953 0.0002
# control / DYF2 0.0122 9.01e-03 Inf 1 -5.981 <.0001
# DYF1 / DYF2 0.2264 6.25e-02 Inf 1 -5.378 <.0001
#
# stage = late:
# contrast odds.ratio SE df null z.ratio p.value
# control / DYF1 0.1139 4.17e-02 Inf 1 -5.930 <.0001
# control / DYF2 0.0000 0.00e+00 Inf 1 -0.001 1.0000
# DYF1 / DYF2 0.0000 1.00e-07 Inf 1 -0.001 1.0000
#
# P value adjustment: tukey method for comparing a family of 3 estimates
# Tests are performed on the log odds ratio scale
The "perfect separation" means the p-values calculated for comparisons involving DYF2 at stage = late can't be correctly calculated this way, but with no DYF2 embryos hatching at all that isn't a practical problem for explaining your results to others.
observed_tablewhose values you show doesn't agree with the matrix that you set up in the first line of your code. All the "control" values are different, as is the "DF2,mid" value. I'm working on an answer now; please check whether the original matrix or the displayed version of the table is correct. – EdM Dec 01 '22 at 19:16