
Every animal farm must be verified by its city council at a regular interval, which could be every 6 months or every 12 months. Each month we report how many farms nationwide fail the verification. If I want to compare one city's failure rate against the nationwide rate (excluding that city) each month, what test should I use? In this case, I would like to compare city A (2.5%) against the overall rate (14.6%).

I am a bit confused. First, I am dealing with the whole population rather than a sample, since every farm needs the verification, although some are verified every 6 months and some every 12 months. Also, I doubt the data are normally distributed. In addition, some percentages could be small, e.g. below 3%, and some cities' farm counts could be small as well.


Ian
  • Welcome to CV, Ian. Could you explain why you need a test at all? Why not just report some measure of the difference between each city and the national average? You could also include a margin of error developed by supposing the failures in each city are random and independent, but with varying underlying failure rates within the cities. – whuber Jun 26 '23 at 20:50
  • Thanks, whuber. Some cities' rates are low while others are high. They are interested in whether the difference from the national overall rate is statistically significant. BTW, what did you mean by "measure" in your reply "Why not just report some measure of the difference"? – Ian Jun 26 '23 at 21:09
  • It could be the arithmetic difference, ratio, or some other metric. "Significant" requires a statistical model. With entire populations, no such model is needed and any nonzero difference is just that: a difference. You likely want to view these data as observations of an ongoing process, as modeled by city-specific annual failure rates. You should be principally concerned about the large number of comparisons you are making and compensate for them. – whuber Jun 26 '23 at 21:39
  • Many thanks Whuber. – Ian Jun 27 '23 at 01:56

1 Answer


Not all questions require statistical analysis, and not all analyses benefit from statistics. Statistics usually knows nothing. Is 23 more than 21? Yes, of course. Is it significantly more? Well, that depends on how much the difference matters.

Statistics can tell you things like how large an observed difference is compared to what a statistical model expects. Sometimes that is useful, but such a test relies on the model being a reasonably good match to the data-generating and sampling mechanisms of the real data.

Your data are counts, and it is pretty easy to apply a statistical model that assumes there is a fixed probability of any particular farm failing, and that this probability might differ depending on the city to which a farm belongs. You can estimate that probability for each city, and an interval around the estimate as a binomial confidence interval. If that interval does not overlap with the interval for another city, the difference is certainly notable (and it will be 'statistically significant' at some level), but you would need to provide context and understanding to interpret that difference.
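As a rough illustration of the binomial interval idea, here is a minimal Python sketch. It uses the Wilson score interval, which is only one of several binomial intervals and is my choice here, not something the question specifies. The counts (1 failure out of 40 farms, 146 out of 1000) are hypothetical, picked only so that they reproduce the 2.5% and 14.6% quoted in the question; the real counts are not given.

```python
import math


def wilson_interval(failures, farms, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if farms <= 0:
        raise ValueError("farms must be positive")
    p = failures / farms
    denom = 1 + z**2 / farms
    centre = (p + z**2 / (2 * farms)) / denom
    half = z * math.sqrt(p * (1 - p) / farms + z**2 / (4 * farms**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts (assumed, not from the question's data).
city_a = wilson_interval(failures=1, farms=40)        # 2.5% observed
others = wilson_interval(failures=146, farms=1000)    # 14.6% observed

print("City A interval:     ", city_a)
print("Other cities interval:", others)
print("Intervals overlap:   ", intervals_overlap(city_a, others))
```

With small counts the interval for a single city can be very wide, so whether the intervals overlap depends heavily on how many farms the city actually has.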

Another statistical analysis that might be interesting to you would be possible if you had the percentage failures for all of the cities in the 'Other cities' category. Then you could plot a distribution of those percentages to see where the cities of interest fall. You could determine their rank within the 'population' of cities.

My last suggestion is that you might find a relationship between the probability of a farm failing and the number of farms in each city. To see if there is something there you should start by plotting the number of fails versus the number of farms and see if there is a pattern. (This would also require you to have the data for each 'other' city.)
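When reading such a plot, it may help to know how much the observed percentage is expected to vary by chance alone for a city of a given size. The sketch below computes approximate 95% "funnel" limits around the nationwide rate using the normal approximation to the binomial. The 14.6% rate is from the question; the city sizes are hypothetical, and the normal approximation is crude for very small cities.

```python
import math


def funnel_limits(overall_rate, farms, z=1.96):
    """Approximate limits within which a city's observed failure rate would
    fall by chance if its true rate equalled overall_rate (normal approx.)."""
    half = z * math.sqrt(overall_rate * (1 - overall_rate) / farms)
    return max(0.0, overall_rate - half), min(1.0, overall_rate + half)


overall_rate = 0.146  # nationwide rate quoted in the question

# Hypothetical city sizes, for illustration only.
for farms in (10, 40, 200, 1000):
    low, high = funnel_limits(overall_rate, farms)
    print(f"{farms:>5} farms: {low:.3f} to {high:.3f}")
```

The limits narrow as the number of farms grows, so extreme percentages among small cities are not, on their own, evidence of a real relationship between city size and failure probability.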

Michael Lew
  • +1 for the well articulated advice. But there is a subtle trap with one of your later recommendations: a small area fallacy lurks there. When you examine the distribution of fractions, the extremes of that distribution will tend to be populated by the cities with fewer farms, resulting in a spurious "pattern." – whuber Jun 27 '23 at 14:01
  • May I use a z-test or a t-test, please? – Ian Jun 27 '23 at 21:10
  • Perhaps, provided you correct it for the multiple comparisons appropriately. An ANOVA followed by Tukey's HSD might be more insightful. (If you are familiar with the popular Consumer Reports ratings, you already understand the HSD, because they are based on it.) – whuber Jun 28 '23 at 13:14
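
For readers wondering what a z-test with a multiple-comparisons correction might look like in practice, here is a minimal Python sketch. It uses a pooled two-proportion z-test and a Bonferroni adjustment; Bonferroni is a simpler, more conservative stand-in for the ANOVA plus Tukey's HSD route suggested above, not an implementation of it. The city names and counts are hypothetical, and the normal approximation behind the z-test can be poor when a city has few farms or few failures.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (x1 + x2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        raise ValueError("test is undefined when no farm (or every farm) fails")
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical (failures, farms) per city; each city is compared with
# all remaining cities pooled together.
cities = {"A": (1, 40), "B": (30, 150), "C": (12, 90)}
total_fail = sum(f for f, _ in cities.values())
total_farms = sum(n for _, n in cities.values())

raw = []
for name, (fail, farms) in cities.items():
    _, p = two_proportion_z_test(fail, farms,
                                 total_fail - fail, total_farms - farms)
    raw.append(p)

for name, p, p_adj in zip(cities, raw, bonferroni(raw)):
    print(f"City {name}: raw p = {p:.4f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

If this is repeated every month, the number of comparisons (and so the size of the correction) grows accordingly.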