
I've been working on the stats for a paper which involves comparing various pairs of groups on a particular score that is heavily skewed (and so I don't feel it would be well suited to a comparison of means). The scientists I'm working with are rather keen to have p-values alongside my lovely CIs. No problem, in general; chuck in a Mann-Whitney U. The issue arises with one comparison they'd like to make, which is partially repeated measures: around half of the participants have contributed scores in both conditions, while the rest have contributed a score in only one.

I'm a bit stuck on how to test this. It would be nice from my point of view if it could be a test on the Mann-Whitney U, as that would sit well with my other tests (and CIs), and I like how U captures the probability of superiority (which my CIs are also aimed at). But if that's not possible, I could consider alternatives.

I've presently done it by simulating U under the null, using the correlation matrix of the data as the covariance matrix for simulated normal data (so that both conditions come from the same distribution, but with correlation), and then deleting cases to reproduce the non-repeated cases; a rough sketch of what I mean is below. I feel like this is probably "close enough", but it feels hacky, and of course it incorporates the observed correlation etc. into the null. The p-values are tiny however I calculate them, so it doesn't feel that important, but I'm afraid of it getting ripped up later down the line.
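To make that concrete, a minimal sketch of the kind of simulation I mean (the function name, counts and correlation here are made up for illustration, not my real data):

    # Sketch only: simulate U under a null where both conditions share a
    # distribution, with correlation rho for the paired participants.
    library(MASS)  # for mvrnorm

    sim_U_null <- function(n_pairs, n_only1, n_only2, rho, B = 10000) {
      replicate(B, {
        paired <- mvrnorm(n_pairs, mu = c(0, 0),
                          Sigma = matrix(c(1, rho, rho, 1), 2))
        g1 <- c(paired[, 1], rnorm(n_only1))  # append the unpaired cases
        g2 <- c(paired[, 2], rnorm(n_only2))
        wilcox.test(g1, g2)$statistic         # U (W in R's naming)
      })
    }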

Can anyone suggest a better alternative? Would a permutation test work in this case? I'm struggling to wrap my head around how one would apply to repeated data.


Clearer understanding of the data:

The academics I'm working with have put out a questionnaire to dog/cat owners. So the data are scale scores (the sum of a large number of Likert items), answered once for a dog that they own and once for a cat that they own. Respondents who own both will answer both; those who own only one will answer just one. I'd like to compare the total scores between dogs and cats, without removing the scores from respondents who own only one animal. There is just one total score for each animal; a toy version of the layout is below.
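Something like this, where the scores are hypothetical and just show the structure:

    # Hypothetical illustration of the layout, not the real data:
    # NA where the respondent owns only one animal.
    scores <- data.frame(
      id  = 1:5,
      dog = c(42, 35, NA, 51, NA),  # total scale score for their dog
      cat = c(38, NA, 27, 44, 30)   # total scale score for their cat
    )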

justme
  • I think I have an idea, but I am not totally clear about the structure of the data and hypothesis. Can you add some more detail on the data structure? – kjetil b halvorsen May 03 '20 at 22:32

1 Answer


So far just some ideas. According to this paper, also referenced in this book by Frank Harrell, the Mann-Whitney test is equivalent (in some sense) to ordinal logistic regression. In the case of two groups, that is, the Wilcoxon test is equivalent to binary logistic regression. This is somewhat approximate: the equivalence is not with the likelihood ratio test, which is usually used, but with the score test.

An idea is to exploit this in the case with repeated measures (since repeated measures modeled via mixed models do not have problems with missing observations in some groups). So use a repeated-measures logistic model, the usual one if you have two groups, or proportional odds logistic regression in the case of more groups; a rough sketch is below. I can try to flesh that out more if you can give some more details.
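Roughly something like the following, where the long-format data frame and its column names are hypothetical placeholders:

    # Sketch only: a proportional-odds (cumulative link) mixed model with
    # a random intercept per participant to handle the repeated measures.
    # 'long' with columns score, group, id is a hypothetical data frame.
    library(ordinal)
    long$score <- ordered(long$score)  # response must be an ordered factor
    fit <- clmm(score ~ group + (1 | id), data = long)
    summary(fit)  # the 'group' coefficient tests the group difference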

As a test of the equivalence, I did some simulations in R:

    k <- 50                        # observations per group
    set.seed(7*11*13)              # my public seed
    x <- rep(0:1, each=k)          # group indicator
    B <- 1000                      # number of simulated datasets
    result <- t(replicate(B, {
        y <- rbinom(2*k, 1, rep(c(0.3, 0.55), each=k))  # binary response
        pval.wilcox <- wilcox.test(y ~ x)$p.value
        # analysis-of-deviance chi-square test from logistic regression
        pval.glm    <- anova(glm(y ~ x, family=binomial),
                             test="Chi")$`Pr(>Chi)`[2]
        c(pval.wilcox=pval.wilcox, pval.glm=pval.glm)
        }, simplify=TRUE))

Looking at the results, the p-value from wilcox.test is never smaller than the p-value from logistic regression, but the largest difference is only about $-0.005$. So this seems to work well!
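For reference, that agreement can be inspected directly from the simulated result matrix:

    # distribution of per-simulation differences between the two p-values
    summary(result[, "pval.wilcox"] - result[, "pval.glm"])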

  • Thank you so much for taking an interest here! The issue is that your $y$ is binary, which isn't the case with my data. So I think this wouldn't apply? I've added a section to my question with more specific details. – justme May 04 '20 at 09:36
  • I do think some other version will apply ... – kjetil b halvorsen May 04 '20 at 09:40
  • Ahhhh I see, so I could possibly use a cumulative link mixed model (ordinal regression with random effects)? My DV has 69 levels, which feels like a lot of parameters to estimate, but that might just work! It's a very large dataset. – justme May 04 '20 at 09:49
  • "In the case of two groups, that is, the Wilcoxon test is equivalent to logistic regression": you mean two groups as the DV, right? – ttnphns May 04 '20 at 09:55
  • Two groups as the dependent variable, right. – kjetil b halvorsen May 04 '20 at 11:35
  • So, I have 69 "groups" (levels) to my DV, which feels like a lot for a PO regression. It is a large sample (at least 500 in each group), though, so I do think it is feasible to estimate this using a CLMM. Do we think this is the best option? Better than my crude simulation of U taking account of correlations? All input is very much appreciated here; I've been having a meltdown trying to figure out the right way forward... – justme May 04 '20 at 12:46
  • The book by Frank Harrell referenced above proposes a version of ordinal regression even for continuous data, and there is an implementation in his R packages ... – kjetil b halvorsen May 04 '20 at 12:56
  • Fantastic, thank you! I am looking over the book right now. – justme May 04 '20 at 13:00
  • (I think I'm right in saying his models are fixed-effects models and don't have random factors? If so, I guess I need to fall back on the CLMM (PO mixed model) to deal with the repeated measures? Does that sound like a reasonable thing to do with 69 levels?) – justme May 04 '20 at 13:17
  • CRAN package ordinal also has mixed models, but how well do they work with 69 levels? – kjetil b halvorsen May 04 '20 at 14:41
  • Oh, I missed this! Yes, ordinal is the package I used. It works and gives me p<.001 like everything else I've tried, but it feels a bit clumsy doing that with 69 levels? I feel like every option I've tried so far feels a bit clumsy in one way or another, but this is the most "standard" one... (vs. simulating U, or a hotch-potch randomization test)... My academics don't really want me to flood the paper with too many technical details. I'm wondering, alternatively, about showing that "no difference" lies outside a 99.9% bootstrapped CI on my effect sizes; might that be enough? – justme May 05 '20 at 09:45
  • I might say approximate p-values calculated by comparison to 95%, 99% and 99.9% bootstrapped CIs (a rough sketch of that bootstrap is below). – justme May 05 '20 at 09:50
  • @kjetil The paper link is dead. It's not on web.archive.org. If you find an alternative link, perhaps archive it as well. – Glen_b Nov 05 '23 at 21:50
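For what it's worth, a minimal sketch of the bootstrapped probability-of-superiority CI mentioned in the comments, assuming the toy scores layout from the question (resampling whole respondents keeps paired dog/cat scores together):

    # Sketch only: percentile bootstrap CI for P(dog > cat) + 0.5*P(tie),
    # using the hypothetical 'scores' data frame from the question.
    library(boot)
    prob_sup <- function(d, idx) {
      b  <- d[idx, ]                 # resample respondents, keeping pairs
      dg <- na.omit(b$dog)
      ct <- na.omit(b$cat)
      mean(outer(dg, ct, ">") + 0.5 * outer(dg, ct, "=="))
    }
    bt <- boot(scores, prob_sup, R = 10000)
    boot.ci(bt, conf = 0.999, type = "perc")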