2

I have a dataset containing angles. They represent the bending angle that a seedling makes to grow toward light. I have three groups: A, B and C. Genotype A is the wild type (WT) and C is the one we are testing: we removed a PKS gene, which is responsible for the ability to bend toward light. However, there are several other PKS genes that can compensate for the one we deleted, which explains why A and C behave quite similarly. B is the negative control and lacks all the photoreceptors, so B cannot perceive light at all: most of its seedlings are not bent at all, and some randomly bend toward or away from the light (this is not the case for A and C), which explains the high variance. The data is neither normal nor homoscedastic. Here is a boxplot showing how the data is spread among groups: [boxplot of bending angles by group]

Also, the Levene test gives a p-value of 9.419191e-13 and the Bartlett test a p-value < 2.2e-16. My data is also not normal; here is a QQ plot of the residuals: [QQ plot of the residuals]. The Shapiro–Wilk test p-value is also < 2.2e-16. Here is my data if it can be useful: https://gist.github.com/marius894/b2bc17a55e25eb62ccc239e110a056bc So I tried a permutation test instead of a one-way ANOVA, and I get a value of 0.001. Here is a histogram of F* (F* is the set of simulated F values from the permutation test), where the red line shows the observed F computed from the real dataset: [histogram of F* with the observed F marked in red]

We can see that the observed F value is really far from the F* distribution. In fact, if I remove the observed F from the plot, we see the F* distribution better, and it shows that the observed F is very far from the simulated distribution: [histogram of F* without the observed F]

So my question is: is this test valid, or is there something wrong because of how far the observed F is from F*? Also, I do not really understand what this value of 0.001 means: it represents the probability that a simulated F value is at least as big as my observed F value, but I struggle to understand how that relates to the probability that my result is due to chance. So can I use this test, or should I use another non-parametric test? I also forgot to mention that my data is unbalanced, and I wonder if I can use a Tukey test even with the issues my data has. As I thought Tukey would not be robust to the non-normality and heteroscedasticity, I did non-parametric Mann-Whitney tests with Bonferroni correction on all pairs of groups. Could that work?
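
To be concrete, the permutation test I mean is of this general form (a sketch, not my exact code; the data frame `dat` and the column names `angle` and `genotype` are simplified placeholders):

```r
## Sketch of a permutation F-test for a one-way design.
## `dat` has columns `angle` (response) and `genotype` (factor A/B/C).
set.seed(1)
f_of <- function(d) summary(aov(angle ~ genotype, data = d))[[1]][1, "F value"]

f_obs  <- f_of(dat)                      # observed F from the real data
B      <- 1000                           # number of permutations
f_star <- replicate(B, f_of(transform(dat, genotype = sample(genotype))))

## proportion of permuted F values at least as large as the observed F
p_perm <- (sum(f_star >= f_obs) + 1) / (B + 1)

hist(f_star, breaks = 50)
abline(v = f_obs, col = "red")           # observed F (the red line)
```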

  • This needs a cross-reference to https://stats.stackexchange.com/questions/629805/transformations-to-meet-heteroscedasticity – Nick Cox Oct 28 '23 at 09:09
  • Can you say more about the genotypes? It's interesting that the genotypes A and C have very similar response, and B on the other hand is very different. Depending on what you know about the genotype and the phenotype some tests / comparisons might make more sense than others. – dipetkov Oct 28 '23 at 14:21
  • Yes, genotype A is the WT and C is the one we are testing. We removed a PKS gene, which is responsible for the ability to bend toward light. However, there are several other PKS genes that can compensate for the one we deleted, which explains why A and C behave quite similarly. B is the negative control and lacks all the photoreceptors, so B cannot perceive light at all: most of its seedlings are not bent at all, and some randomly bend toward or away from the light (this is not the case for A and C), which explains the high variance. – Marius Audenis Oct 28 '23 at 15:54
  • Thanks for the clarification. I suppose the overall ANOVA (or a permutation test) tells us something even though that something isn't particularly interesting. As the difference between (A + C) vs B is so obvious, the scientific exposition doesn't really need a p-value; a graph, or an estimate (with SE) of the average bending angle for each genotype, would do a better job. The A vs C comparison on the other hand seems interesting & as @NickCox points out you shouldn't ignore the Plate factor when making that comparison. PS: I have trouble visualizing a bending angle of 90° towards the light. – dipetkov Oct 28 '23 at 16:21

2 Answers

6

Recall that the null hypothesis behind ANOVA is that a sample statistic (the mean) is the same across groups. Already from the plot you can very clearly see that your outcome differs greatly in its distribution across groups: barring one or two points, the largest value of group B is smaller than any other observation in any other group. I am therefore not particularly surprised that you are able to reject the hypothesis that their underlying distribution (or a statistic thereof) is the same.

You haven't really specified how you did your permutation test, but running some other non- or semi-parametric tests on your data gives very similar results: a Kruskal-Wallis test gives $\chi^2\approx 105$ and an ordinal regression (recommended in a link to an answer to your previous post) has a likelihood ratio $\chi^2\approx 185$, both on two degrees of freedom. For reference, here's the density of a 2-df $\chi^2$:

[density of a chi-square distribution on 2 df]

I truncated this plot at 20 because there's less than 0.005% of the density beyond that, which should give you an idea of the kind of P-values the above statistics result in.
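
For reference, these comparisons can be run along the following lines in R; `dat`, `angle` and `genotype` are placeholder names, and `rms::orm` is only one possible implementation of the ordinal regression:

```r
## Kruskal-Wallis test: chi-square statistic on 2 df for three groups
kruskal.test(angle ~ genotype, data = dat)

## Ordinal (proportional-odds) regression; rms::orm accepts a continuous
## outcome directly and reports a likelihood-ratio chi-square
library(rms)
orm(angle ~ genotype, data = dat)

## Chi-square density on 2 df, truncated at 20 as in the figure above;
## less than 0.005% of the probability lies beyond 20
curve(dchisq(x, df = 2), from = 0, to = 20, ylab = "density")
pchisq(20, df = 2, lower.tail = FALSE)   # about 4.5e-05
```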

To interpret the F value of 0.001 that you mention we would have to know exactly how the test was done; a wild guess is that you did 1,000 permutations and at most one of those had a more extreme test statistic than the original sample. Unbalanced data is not necessarily a problem for a Tukey test, but keep in mind that there are other distributional assumptions behind it.

PBulls
  • Thank you! I did 1000 permutations, but my question also was whether I can run the Tukey test with this abnormal distribution and heteroscedasticity. As I thought this would not be robust because of this, I did a non-parametric Mann-Whitney test with Bonferroni correction on all my pairs of groups. Could it work? – Marius Audenis Oct 28 '23 at 09:51
  • Yes, Tukey would suffer from those, and a Mann-Whitney U with Bonferroni should be a valid (but conservative) alternative. Perhaps you can also have a look at Dunn's test as a post-hoc for Kruskal-Wallis. – PBulls Oct 28 '23 at 10:40
  • Okay, thank you very much! But then I do not really see the advantage of using a Dunn's test, as it also uses a correction, so it is conservative too. Is there still an advantage of using it? – Marius Audenis Oct 28 '23 at 11:36
  • Bonferroni is as conservative as you get, and the advantage of Dunn's is that it uses the same (across-group) estimates as Kruskal-Wallis whereas MWU only looks at the two groups being compared. – PBulls Oct 28 '23 at 12:01
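
A minimal sketch of the two post-hoc options discussed in these comments, again with placeholder names; the `dunn.test` package is one possible implementation of Dunn's test:

```r
## Pairwise Mann-Whitney (Wilcoxon rank-sum) tests with Bonferroni correction
pairwise.wilcox.test(dat$angle, dat$genotype, p.adjust.method = "bonferroni")

## Dunn's test as a post hoc to Kruskal-Wallis, also Bonferroni-adjusted
library(dunn.test)
dunn.test(dat$angle, dat$genotype, method = "bonferroni")
```
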
3

Thanks for posting the data. This isn't really an answer statistically, just a series of comments, but given the graphs it wouldn't fit into comments in the Stack Exchange sense.

Detail. In your dataset one treatment is DMASO and the others are all DMSO. Either that's a typo, or you can't say much reliably about one observation. I've not excluded the oddity from what follows.

Graphics. I don't ever find box plots with jittered data points especially clear or the best possible display. When the focus is on analysis of variance it is particularly odd that means are not plotted, leaving the reader to guess where they lie given the medians and quartiles.

Here I've drawn quantile plots and added means. Unfortunately I have not achieved very much beyond a design I like better, but here are the results.

[quantile plots of the angles with group means added]

More on the plots: The quantile plots show the values in sort order (the quantiles are here all the values) against cumulative probability (precisely, plotting position using the Galton-Hazen recipe (unique rank - 0.5) / sample size). Many people in biology use instead an (empirical (cumulative)) distribution (function) plot, now often called an (E)CDF plot, which has the axes reversed, and often a joined line, not a series of points. It's the same information either way, although there is scope for small discussions about what works better. The quantile plot could also be described as a quantile-quantile plot, as the horizontal axis shows quantiles for a uniform (rectangular, flat) distribution on [0, 1].
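
Since how to draw such a plot in R comes up in the comments below, here is a rough base-R sketch of a quantile plot with (rank - 0.5) / n plotting positions and group means added; the column names are placeholders and this is not the code used for the figures above:

```r
## Quantile plot: ordered values against (rank - 0.5) / n for each group,
## with the group mean marked as a dashed horizontal line.
## `dat` has columns `angle` and `genotype` (placeholder names).
grp  <- factor(dat$genotype)
cols <- setNames(seq_along(levels(grp)), levels(grp))

plot(NULL, xlim = c(0, 1), ylim = range(dat$angle),
     xlab = "cumulative probability", ylab = "bending angle")
for (g in levels(grp)) {
  y <- sort(dat$angle[grp == g])
  p <- (seq_along(y) - 0.5) / length(y)    # Galton-Hazen plotting position
  points(p, y, col = cols[g], pch = 16)
  abline(h = mean(y), col = cols[g], lty = 2)
}
legend("bottomright", legend = levels(grp), col = cols, pch = 16)
```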

As already pointed out, it is clear that B is quite different from A and C. Quite what testing adds to that is not so clear, but reviewers or examiners are likely to want some decoration of results with significance testing.

But it's far from just a matter of comparing genotypes A, B and C. Your plates (4708 etc.) differ quite a lot too in both level (e.g. mean) and spread (e.g. SD or IQR). That needs a story and to be brought into the analysis. If plate is confounded with genotype, that is not so good.

Angles. Angles (circular data) are bounded. Here your convention seems to be that angles are measured on the interval $(-180^\circ, 180^\circ)$ where you don't need to worry what you would do with either extreme $-180^\circ, 180^\circ$ if you don't observe it. Statistically, angles are not expected to be normally distributed except as a loose approximation if they cover only part of the possible range. More generally, ANOVA is not obviously natural for circular data, which no-one has pointed out yet. It is far from obvious, however, what would work much better here.
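
As a small illustration of why ordinary arithmetic summaries can be awkward for angles, a circular mean can be computed from the mean sine and cosine (a base-R sketch; angles assumed to be in degrees):

```r
## Circular (directional) mean of angles in degrees, via atan2 of the
## mean sine and cosine
circ_mean <- function(deg) {
  rad <- deg * pi / 180
  atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi
}
circ_mean(c(170, -170))   # 180, whereas the arithmetic mean() gives 0
```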

Modality. It's pretty clear that angles close to $90^\circ$ are often but not always popular. I am no kind of biologist or experimental scientist but even a statistical person needs to hear about the experimental protocol. Also, how far are the seedlings grown separately or are they in any sense competing with each other, for light, or for any other resource? The detail of plates suggests that they are grown in groups.

Biology. So I have to guess that an informed analysis of this data needs to be based on more biological information about the experiment and about how plants grow. The question is in essence what to do about heteroscedasticity -- and the underlying question is what to do about heterogeneity. But that's what you're asking!

Nick Cox
  • Very much agree with this, there's also the issue that plate seems to be entirely confounded with genotype (bad design?) and, depending on what the angle measures exactly, I'm not even sure if negative ones make sense - are they actively growing away from the light or just making a left-hand 180° instead of right? – PBulls Oct 28 '23 at 10:37
  • So genotype and plate are confounded because we ignore the plate effect and do not use the plate data for this experiment. The only groups are the genotypes. And the negative angles mean that the plants grow away from the light. We included those in the data because it would be biased to remove them, as the plants do this by themselves, so that reflects information about the behaviour of this genotype. – Marius Audenis Oct 28 '23 at 11:08
  • And the DMASO was a typo I made when copying the data to GitHub; I just corrected it. To be clearer, the treatment is the same for every individual here and the ANOVA is just comparing the genotypes, as DMSO was shown in other papers to have no significant effect on the plants' behaviour. The seedlings do not compete and are grown in groups. The question here is whether the genotype has an effect on the ability of the seedlings to bend toward light. – Marius Audenis Oct 28 '23 at 11:09
  • I have a last question: what does the cumulative probability mean in those graphs? They look kind of like quantile-quantile plots. I do not know how to compute them in R, but that would be useful, as we clearly see the differences in variance among groups! – Marius Audenis Oct 28 '23 at 11:38
  • As said, it is a quantile plot. I will add more detail to the answer. I am sure that you could plot something similar in R; if not, you need a new favourite software. – Nick Cox Oct 28 '23 at 13:52