
I have a dataset of continuous data. From 1000 observations, the binned dataset has only 29 rows and two columns: the range (the starting point of each bin) and the count.

Pearson-Weldon crabs dataset

range count
0.5835 1
0.5875 3
0.5915 5
0.5955 2
0.5995 7
0.6035 10
0.6075 13
0.6115 19
0.6155 20
0.6195 25
0.6235 40
0.6275 31
0.6315 60
0.6355 62
0.6395 54
0.6435 74
0.6475 84
0.6515 86
0.6555 96
0.6595 85
0.6635 75
0.6675 47
0.6715 43
0.6755 24
0.6795 19
0.6835 9
0.6875 5
0.6915 0
0.6955 1

total count = 1000, rows = 29
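As a quick sanity check on the table (a sketch in Python, though the comments below discuss R), the totals and the grouped moment estimates follow directly from the left edges and counts; the grouped mean and SD also make natural starting values for the maximum-likelihood fit discussed in the comments. The bin width of 0.004 is read off the range column.

```python
import numpy as np

# Left edges and counts copied from the table above (Pearson-Weldon crabs).
edges = 0.5835 + 0.004 * np.arange(29)   # 29 bins of width 0.004
counts = np.array([1, 3, 5, 2, 7, 10, 13, 19, 20, 25, 40, 31, 60, 62, 54,
                   74, 84, 86, 96, 85, 75, 47, 43, 24, 19, 9, 5, 0, 1])

mids = edges + 0.002                                    # bin midpoints
n = counts.sum()                                        # total observations
mean = (mids * counts).sum() / n                        # grouped mean
sd = np.sqrt(((mids - mean) ** 2 * counts).sum() / n)   # grouped SD

print(n, len(counts))                 # 1000 29
print(round(mean, 4), round(sd, 4))   # about 0.6487 0.0191
```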

Which test is most appropriate?

  • Pearson Chi-Square
  • Shapiro-Wilk
  • Shapiro-Wilk does not apply, and if you do apply it, it will tell you definitively that the data are non-Normal, almost no matter what the counts are. Instead, as described at https://stats.stackexchange.com/a/17148/919, estimate the Normal parameters using Maximum Likelihood and apply the standard chi-squared test. BTW, if these are real data, you can see already (even without knowing what is in the remaining 24 rows) that they are non-Normal! – whuber Jun 23 '22 at 20:25
  • Thanks. So I cannot apply Shapiro-Wilk? Even if I add a uniform jitter in every interval to simulate rounding of the numbers? (No, the numbers are just garbage that I typed as an example.) – frhack Jun 23 '22 at 20:29
  • Adding the jitter will affect the result of Shapiro-Wilk, making its results impossible to interpret. – whuber Jun 23 '22 at 20:35
  • @whuber OK, thanks. Can I use your code here? Is it appropriate? https://stats.stackexchange.com/questions/34882/how-to-estimate-the-third-quartile-of-binned-data/34894#34894 – frhack Jun 23 '22 at 20:42
  • You need to use nlm instead of optim because there are two parameters to estimate. You also need to take care with any low-count bins: they can screw up the chi-squared approximation. Handle this by using chisq.test with its simulate.p.value option to perform the test once you have obtained the MLE and used it to compute the estimated bin counts. – whuber Jun 23 '22 at 20:57
  • Thanks. The dataset I'm working on is the Weldon-Pearson crabs data:

    https://rdrr.io/cran/MixtureInf/man/pearson.html

    I have now edited the question to include the real data.

    – frhack Jun 23 '22 at 21:00
  • @whuber Is it better to extend the tails to infinity (left and right) in order to group the first two and the last two rows (those with count < 5)? – frhack Jun 23 '22 at 22:28
  • Yes: you need your bins to cover the full possible range of values. – whuber Jun 24 '22 at 12:09
  • @whuber By avoiding bins with count < 5? – frhack Jun 24 '22 at 12:24
  • Certainly not! You have to accommodate all the data. When you have many bins with expected counts less than 5, the chi-squared approximation will be poor and you will need to work harder to get a correct p-value. – whuber Jun 24 '22 at 14:19
  • OK, thanks. So in the dataset above I can just aggregate the low-count rows on the left and right sides and replace them with two open intervals, (−∞, start] and [end, +∞). Correct? – frhack Jun 24 '22 at 14:39
  • @whuber I implemented what you suggested, check it out

    https://rpubs.com/frapas/911603

    – frhack Jun 28 '22 at 19:31
  • @whuber I rearranged the bins; now the observed counts are [11 134 451 389 15] and the theoretical counts are [4.84 127.82 506.90 326.66 33.25]. Is it OK to compute the Pearson chi-square test p-value? – frhack Jun 28 '22 at 20:07
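The pipeline discussed in the comments (R's nlm for the two-parameter MLE, then a chi-squared goodness-of-fit test) can be sketched in Python with scipy as a language-neutral illustration. The merged-bin cut points below are my reconstruction from the table: the quoted observed counts [11, 134, 451, 389, 15] match groups cut at 0.5995, 0.6275, 0.6555, and 0.6835 with open tails. This fits the Normal to the five merged bins, so its expected counts will not exactly reproduce the theoretical counts quoted above, which presumably come from a fit to the full 29-bin table; treat it as a sketch, not the linked rpubs code.

```python
import numpy as np
from scipy import optimize, stats

# Merged bins reconstructed from the table above (an inference, not stated
# explicitly in the thread): open tails plus four interior cut points.
cuts = np.array([-np.inf, 0.5995, 0.6275, 0.6555, 0.6835, np.inf])
obs = np.array([11, 134, 451, 389, 15])
n = obs.sum()

def neg_loglik(theta):
    """Multinomial negative log-likelihood of Normal(mu, sigma) over the bins."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    p = np.diff(stats.norm.cdf(cuts, loc=mu, scale=sigma))
    p = np.clip(p, 1e-300, 1.0)   # guard against log(0)
    return -(obs * np.log(p)).sum()

# Two-parameter optimisation (the role nlm plays in the R discussion).
res = optimize.minimize(neg_loglik, x0=[0.65, 0.02], method="Nelder-Mead")
mu_hat, sigma_hat = res.x

# Expected bin counts under the fitted Normal, then the Pearson statistic.
exp_counts = n * np.diff(stats.norm.cdf(cuts, loc=mu_hat, scale=sigma_hat))
x2 = ((obs - exp_counts) ** 2 / exp_counts).sum()
# df = bins - 1 - (estimated parameters) = 5 - 1 - 2 = 2
pval = stats.chi2.sf(x2, df=len(obs) - 1 - 2)
print(mu_hat, sigma_hat, x2, pval)
```

Note that one of the quoted expected counts (4.84) is below 5, which is exactly the situation whuber warned about: the chi-squared approximation to the null distribution becomes unreliable, and his suggestion was to simulate the null instead, as R's chisq.test(..., simulate.p.value = TRUE) does for the supplied bin probabilities.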

0 Answers