
This is a dummy dataframe resembling my real-life data:

structure(list(cond = c("WT", "WT", "WT", "WT", "WT", "WT", "WT", 
"WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", 
"WT", "WT", "WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO", "KO", 
"KO", "KO", "KO", "KO", "KO", "KO", "KO", "KO", "KO", "KO", "KO", 
"KO", "KO", "KO", "KO"), class = c("N", "N", "N", "N", "N", "N", 
"Y", "Y", "Y", "Y", "N", "N", "N", "N", "Y", "Y", "N", "Y", "N", 
"N", "Y", "N", "Y", "N", "N", "N", "Y", "Y", "Y", "N", "N", "Y", 
"N", "N", "N", "Y", "Y", "N", "N", "N", "N", "N", "N", "N"), 
    lattice = c(72.4394527831179, 70.1486049154573, 71.2024262282001, 
    70.095774734531, 73.1687587160835, 73.4725521658284, 71.1213324059112, 
    69.4426450566097, 67.7407461727878, 67.3598397689386, 69.5170866395342, 
    68.751790570905, 73.2734806165999, 72.0386374169852, 70.293510845974, 
    68.9576642114016, 69.4472846093111, 70.8520303262601, 69.967844969872, 
    69.7957750105144, 76.3165495002798, 70.8237308152673, 70.5087804854601, 
    70.0768856496865, 49.4569953395058, 52.0898768763027, 44.3112116723351, 
    53.0069841435797, 49.6755152863985, 50.3014101181505, 49.0856479592249, 
    48.3511098818039, 50.0812079766985, 50.4035212282794, 54.0992908724316, 
    43.4055868143946, 50.1834254159389, 54.7298925145524, 55.1516389972744, 
    51.4685454381875, 52.253317158648, 52.8558395390657, 51.5377616093217, 
    57.7792154694597)), row.names = c(NA, -44L), class = "data.frame")

There are two experimental conditions ("WT" and "KO"). Within each condition, every observation is classified as "Y" or "N" depending on whether the organism exhibits a measured trait.

I would like to compare the two groups (experimental conditions) and infer whether there is a statistically significant difference in the number of observations classified as "Y" between them and, if so, whether "KO" has more or fewer "Y"s than "WT".

I do not know which statistical test is most appropriate for this task: Fisher's exact test, the chi-squared test, or something else.

Info: "lattice" is another feature of the dataset. I am comparing this parameter between conditions using the Wilcoxon rank-sum test, so it can be ignored for this question. I included the column only to show the full structure of the data frame.

EDITS:

  1. The experiment does not have fixed marginals.
  2. The data were collected within a time window.
  3. The conditions are independent (these were cell-sorting experiments).
  4. As far as I know, this cannot be addressed with McNemar's test, which applies to dependent (paired) observations, so the question was incorrectly closed as a duplicate.
ramen
  • Welcome to CV, ramen! Does the answer to the question "McNemar test with multiple scores for the same subject" answer your needs? – Alexis Mar 24 '23 at 16:46
  • Did all the subjects undergo both experimental conditions? I.e. are your variables paired? – utobi Mar 24 '23 at 17:20
  • This question was closed as a duplicate of one dealing with paired observations, as might be addressed with McNemar's test. However, there's no indication in the question that this was the design used, and the data presented don't reflect such a design. In addition, the cited question deals with more than two conditions, as in Cochran's Q test. – Sal Mangiafico Mar 25 '23 at 16:25
  • @Alexis unfortunately not, since the conditions are independent. – ramen Apr 06 '23 at 11:59
  • @utobi - nope. To be more specific: each row in the dataframe represents single reading from the cytometer. This was the cell sorting experiment in which cells were grown in 2 separate Petri dishes. They were then subjected to cytometry, counted and sorted for each phenotype (condition) separately. – ramen Apr 06 '23 at 12:02
  • @ramen I am not following your comment: the OP describes two repeated measures, which is (1) canonically what paired (dependent) observations are, and (2) what Cochran's $Q$ test is for. – Alexis Apr 06 '23 at 18:07
  • @Alexis - but I am the OP of that question. And I performed the experiment. The observations are independent. Nothing was repeated here. Cochran's test is applicable for 3 or more related groups, which is not the case here. Here I have 2 conditions (groups) which yield dichotomous result ("Y" or "N"). I could just stick to the chi-square but I have serious doubts. In real life data, the "N" prevails and the "Y" is incidental, so I am not sure whether chi-square approximates distribution correctly. – ramen Apr 07 '23 at 09:08
  • Hi ramen, sorry for missing that you are the OP! :) I thought Cochran's $Q$ was appropriate for you because "In each condition ["WT" and "KO"], observations might be classified as "Y" or "N" depending on whether the organism exhibits some measured trait or not." This sounds like you are measuring WT and KO in each organism, hence I labeled it "repeated measures" (i.e. I measured the organism for WT, and then I measured the same organism for KO). ←That's a repeated measures design (i.e. paired data). Dependency arises from the same organism part. Can you clarify your study design? – Alexis Apr 07 '23 at 15:56
  • @Alexis sure thing. The experiment concerns 2 cell sets, which are kept in 2 separate containers under the same conditions (e.g. growing medium). The cells are from the same species. The first population, "WT", consists of wild-type cells, similar to those which could be encountered in the environment (they have unaltered genetic makeup). The second population ("KO") is genetically engineered. One of the genes involved in controlling cell size, divisions & flagellar motility is knocked-down in these cells. This experiment is meant to check whether the genetic modification was successful (to be continued) – ramen Apr 09 '23 at 19:21
  • @Alexis ... and whether the cells are affected by loss of this gene (for instance, they should be smaller, some cell structures should look malformed, the flagella should be located improperly and should be less flexible, they should divide slower than their "normal" counterparts etc.). The cells of both types were kept for n days in the same conditions, and after that the entire cultures were collected and transferred to machine (flow cytometer) in which they were measured, counted & characterized (i.e. their morphology was assessed). – ramen Apr 09 '23 at 19:28
  • @Alexis ... and this was performed for each of the conditions (WT & KO) until the entire cultures were assessed. The "Y" or "N" column in the dummy example records whether the cell is currently undergoing division or not. So my null hypothesis should be that there are no differences between the cultures in terms of the content of dividing cells (amount/percentage). However, if such a difference is present and statistically significant, I would like to know the direction (whether there are more dividing cells in WT or in KO). And that is the design. – ramen Apr 09 '23 at 19:38
  • @ramen I came across this question that I previously answered and one thing is not clear to me. What is the 'lattice' variable? Is your experiment to observe whether there is a difference in cell 'division' (Y and N) among the WT and KO cells, or is it about a difference in 'lattice'? – Sextus Empiricus May 15 '23 at 06:00
  • Did you start with a single cell of each type WT/KO and counted how many cells you had at the end? Is your test about testing a difference in the total amount of cells of each type WT/KO, or a difference in the ratio Y/N? Your marginals are not fixed, but an interesting effect is that you do have some potential influences on the marginals as the number of cells could be determined stochastically by growth. (The no margins fixed case assumes a multinomial distribution) – Sextus Empiricus May 15 '23 at 06:03
  • I have voted to reopen because this question seems to have nothing to do with McNemar test which is about multiple scores within the same subject. – Sextus Empiricus May 15 '23 at 06:08
  • "the entire cultures were collected and transferred to machine (flow cytometer) in which they were measured, counted & characterized (i.e. their morphology was assessed)" do you only have the N/Y characteristic, or did you measure multiple aspects on a single cell? In that case some test that combines these multiple aspects as a whole might be better (and you get close to something like McNemar test). – Sextus Empiricus May 15 '23 at 06:15
  • @SextusEmpiricus thank you kind soul. So, there are also other aspects assessed for each cell (e.g. the position of the flagellum). If it were possible to test at once whether the ratio of Y to N in a given condition is somehow related to the lattice (following the provided example), it would be awesome. For now I decided to stick to Fisher's test, since in multiple instances I have fewer Y observations than required for Chi2. – ramen May 18 '23 at 19:11
  • It is a bit different question (example here), but you could possibly model that with binomial regression. You consider the state Y/N as binomial distributed where $p$ is a function of the type WT/KO and also the lattice (I do not know what lattice means). – Sextus Empiricus May 18 '23 at 19:35

1 Answer


It is easier to present these data in a 2x2 contingency table.

$$\begin{array}{c|cc|c} &\text{N}&\text{Y} & \text{marginal sum}\\ \hline \text{WT} & 15 & 9 & 24\\ \text{KO} & 14 & 6 & 20 \\\hline \text{marginal sum} & 29 & 15 & 44 \\ \end{array}$$
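The counts can be checked by cross-tabulating the two categorical columns of the posted data frame. The following is only a sketch in base R: it rebuilds `cond` and `class` from run lengths read off the `dput()` output in the question, and the names `runs`, `cls` and `tab` are my own, not from the original post.

```r
# Rebuild the two categorical columns of the posted data frame.
# Run lengths were read off the dput() output (assumption: transcribed correctly).
cond <- rep(c("WT", "KO"), times = c(24, 20))
runs <- c(6, 4, 4, 2, 1, 1, 2, 1, 1, 1, 3, 3, 2, 1, 3, 2, 7)
cls  <- rep(rep(c("N", "Y"), length.out = length(runs)), times = runs)

# Cross-tabulate condition (rows) against class (columns), WT first
tab <- table(cond  = factor(cond, levels = c("WT", "KO")),
             class = factor(cls,  levels = c("N", "Y")))

# Show the table together with its marginal sums
print(addmargins(tab))
```

With the real data frame in hand, `table(df$cond, df$class)` would do the same in one step (assuming the data frame is called `df`).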

The type of test to perform depends on the boundary conditions (whether one or more marginals are fixed, e.g. whether the experiment selected a fixed number of WT/KO and/or Y/N cases) and on the stopping rule (whether the experiment had a fixed total of 44 cases, or whether it continued until some number of a particular class had been observed).

You can read about this in the article by Lydersen, Fagerland and Laake, "Recommended tests for association in 2×2 tables", and in many other places, including questions already asked here.

Depending on the number of marginals that are fixed:

  • both marginals fixed: the values have one degree of freedom which follows a hypergeometric distribution.

    You can perform Fisher's exact test.

    Example: the lady tasting tea experiment

  • one marginal fixed: the values have two degrees of freedom, which follow two independent binomial distributions.

    You can perform several types of tests, for instance Barnard's test. A z-test for a difference in proportions is also commonly used.

    Example: A/B testing.

  • no marginal fixed: the values follow a multinomial distribution.

    You can perform a chi-squared test, which approximates the multinomial distribution with a multivariate normal distribution. The null hypothesis is that the cell probabilities are the product of the marginal (row and column) class probabilities, i.e. that the two variables are independent.

    Example: an observational study where both of the two variables are not controlled.
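As a rough illustration of the three cases above, here is a sketch in base R applied to the dummy data. The counts are the ones obtained by tabulating the `dput()` data in the question; the object names (`tab`, `ft`, `pt`, `chi`) are my own. Barnard's test is not part of base R and would need an add-on package, so it is left out here.

```r
# 2x2 counts tabulated from the dput() data in the question
# (rows: condition, columns: class)
tab <- matrix(c(15, 9,
                14, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(cond = c("WT", "KO"), class = c("N", "Y")))

# Observed proportion of "Y" per condition, to read off the direction
print(prop.table(tab, margin = 1))

# Both marginals fixed: Fisher's exact test (conditional, hypergeometric)
ft <- fisher.test(tab)
print(ft)

# One marginal fixed: test for equal proportions of "Y" in WT and KO;
# correct = FALSE drops the continuity correction
pt <- prop.test(x = tab[, "Y"], n = rowSums(tab), correct = FALSE)
print(pt)

# No marginal fixed: Pearson chi-squared test of independence
chi <- chisq.test(tab, correct = FALSE)
print(chi)

# Expected counts under independence; small values here are a warning
# that the chi-squared approximation may be poor
print(chi$expected)
```

Without the continuity correction, `prop.test` and `chisq.test` report the same X-squared statistic for a 2x2 table, so in practice the choice matters mostly between the exact and the approximate tests. When "Y" is rare, as described in the comments for the real data, checking `chi$expected` is a quick way to see whether the approximation is trustworthy.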

A situation based on a stopping rule.