
Please be kind; it's been a long time since I have done any statistics!

I have some sexual orientation data for three teams, which are very different in size, ranging from an Organisation (A) to a Directorate (B) and a smallish Team (C).

My null hypothesis is that the three sets of data are not statistically different.

There are obviously differences in the data, but I would like to determine whether these are statistically significant when the team sizes are taken into consideration.

Could you please explain to me what the best tool would be to do this?

Thank you, Andy

[Data here]

  • Do you mean that you want to test if the A, B, and C groups have the same distributions of the seven orientations in your chart? // Your chart says that the subjects form a population, not a sample. If you have the population, there is no notion of hypothesis testing. Do you have a sample, or are you just interested in the 6,493 people you observed? – Dave Jun 17 '21 at 21:29
  • Too lazy to do the test but looks like they're not. Definitely not close to a $10^{-10}$ p-value – David Jun 17 '21 at 21:30
  • Hmm, I think so! For example, for the first orientation listed, Team C appears to have a higher incidence compared to A and B. Is it genuinely different, or is the apparent difference related to the small population size? I'm sorry if that doesn't clarify things! – Andy_J_UK Jun 17 '21 at 21:34
  • In that case, this seems like a good candidate for the chi-squared test. (I would use something called the G-test, but the theory involves more sophisticated statistical notions like likelihood, and the results will be similar to results produced by a chi-squared test.) – Dave Jun 17 '21 at 21:37
  • Thanks Dave much appreciated! – Andy_J_UK Jun 17 '21 at 21:39
  • Are the smaller teams subsets of the larger ones, meaning that the same person appears more than once in the table? Also, it is important how the teams were formed: if membership is at least partly personal choice, some of the people involved know each other, and this influences which team they belong to, then the independence assumption required for the chi-squared test (as well as the G-test) is in doubt. – Christian Hennig Jun 17 '21 at 22:12
  • @Dave I had not seen the G-test before. I like the property of its proportionality to the Kullback-Leibler divergence; very nice. – Galen Jun 17 '21 at 22:36
  • I like it because of how it can be extended to include covariates and experimental design, much like ANOVA can be extended to larger regression models. – Dave Jun 17 '21 at 22:54
  • @Lewian Good point - yes, the smaller teams are also included in the larger numbers; I guess I could subtract these from the larger teams. The teams are departments in a real organisation - to my knowledge, staff didn't choose a team to work in based upon the factors being examined. – Andy_J_UK Jun 17 '21 at 23:21
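A minimal sketch of the subtraction idea in the last comment, assuming C is nested inside B and B inside A (the counts are taken from the table in the answer below; object names such as a_only are illustrative):

a = c(3361, 28, 21, 41,  9, 67, 1448)   # Organisation (includes B and C)
b = c(1101,  9, 10, 12,  6, 21,  435)   # Directorate (includes C)
c = c(  84,  3,  1,  0,  1,  3,   29)   # Team
a_only = a - b           # members of A outside B
b_only = b - c           # members of B outside C
rbind(a_only, b_only, c) # three disjoint groups for testing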

1 Answer


From the discussion, I assume that B and C are disjoint. To the extent that B and C can be taken as samples from larger populations, it seems reasonable to do a chi-squared test on B and C.

With some rearrangements of response categories, I put all of your data into contingency table TBL.

a = c(3361, 28, 21, 41,  9, 67, 1448)
b = c(1101,  9, 10, 12,  6, 21,  435)
c = c(  84,  3,  1,  0,  1,  3,   29)
TBL = rbind(a,b,c);  TBL

  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
a 3361   28   21   41    9   67 1448
b 1101    9   10   12    6   21  435
c   84    3    1    0    1    3   29

Restricting attention to B and C, we have

TABbc = TBL[c(2,3),];  TABbc
  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
b 1101    9   10   12    6   21  435
c   84    3    1    0    1    3   29

A standard chi-squared test runs into difficulties because of the small counts.

chisq.test(TABbc)
    Pearson's Chi-squared test

data:  TABbc
X-squared = 8.9769, df = 6, p-value = 0.1749

Warning message:
In chisq.test(TABbc) : Chi-squared approximation may be incorrect

As implemented in R, it is possible to simulate a more trustworthy P-value (sim=T below abbreviates the argument simulate.p.value=TRUE); the test does not reject the null hypothesis that the proportions in the various response categories are equal.

chisq.test(TABbc, sim=T)
    Pearson's Chi-squared test 
    with simulated p-value 
    (based on 2000 replicates)

data:  TABbc
X-squared = 8.9769, df = NA, p-value = 0.1689

A more traditional approach is to collapse the table to get fewer cells with larger counts.

The reason for the warning message is that several of the expected counts (computed from row and column totals, assuming the null hypothesis to be true) are smaller than $5,$ so that the chi-squared statistic may not have approximately a chi-squared distribution. Expected counts are as follows; it is the small counts in columns 2-6 that are causing most of the trouble.

chisq.test(TABbc)$exp
        [,1]       [,2]       [,3]       [,4]      [,5]      [,6]      [,7]
b 1101.39359 11.1533528 10.2239067 11.1533528 6.5061224 22.306706 431.26297
c   83.60641  0.8466472  0.7760933  0.8466472 0.4938776  1.693294  32.73703
Warning message:
 In chisq.test(TABbc) : Chi-squared approximation may be incorrect
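These expected counts are simply (row total x column total) / grand total; as a quick check (added here, not part of the original answer):

outer(rowSums(TABbc), colSums(TABbc)) / sum(TABbc)   # reproduces the matrix above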

a1 = c(3361, 28+21+41+9, 67, 1448)
b1 = c(1101,  9+10+12+6, 21,  435)
c1 = c(  84,  3+ 1+ 0+1,  3,   29)
TBL1 = cbind(a1,b1,c1);  TBL1
       a1   b1 c1
[1,] 3361 1101 84
[2,]   99   37  5
[3,]   67   21  3
[4,] 1448  435 29

chisq.test(TBL1[c(2,3),])

    Pearson's Chi-squared test

data:  TBL1[c(2, 3), ]
X-squared = 0.32154, df = 2, p-value = 0.8515

Warning message:
In chisq.test(TBL1[c(2, 3), ]) : Chi-squared approximation may be incorrect

We still get a warning message, but this time only two of the expected counts are below $5,$ and neither is below $3,$ which many statisticians would find good enough.

chisq.test(TBL1[c(2,3),])$exp
            a1    b1       c1
[1,] 100.88793 35.25 4.862069
[2,]  65.11207 22.75 3.137931
Warning message:
In chisq.test(TBL1[c(2, 3), ]) : 
 Chi-squared approximation may be incorrect

The bottom line is that we find no significant differences among the responses of these groups.
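For comparison, Dave's comments above mention the G-test (a likelihood-ratio test). A minimal hand-computed sketch on the B-versus-C table, added here and not part of the original answer; as Dave notes, the result should be similar to the chi-squared result above.

obs   = TABbc
expct = suppressWarnings(chisq.test(TABbc)$expected)   # same expected counts shown earlier
G     = 2 * sum(obs[obs > 0] * log(obs[obs > 0] / expct[obs > 0]))   # zero cells contribute nothing
pchisq(G, df = (nrow(obs) - 1) * (ncol(obs) - 1), lower.tail = FALSE)   # approximate P-value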

BruceET