
This is for a student-selected class project. I am examining incidents in which wildlife hits an aircraft over the last three decades. The sample was drawn at random from the population data (N = 392,341) in Excel using the Data Analysis sampling tool. I originally planned to run an ANOVA until I realized my data are frequencies. I recalled that I could use a chi-square test, but several cells have counts < 5, and I have more groups than a 2x2 Fisher's exact test will allow. What would be the best way to proceed?

Should I remove the unknown incidents and run another simple random selection? I could still end up with cells that have counts < 5.

[Image: contingency table of strike-count categories by decade]

An a priori sample size calculation was done initially for the ANOVA: n = 729, alpha = 0.001.

  • My first question is: why sample from the data? Why not analyze all 400,000 observations? Second, are you thinking about Decade as an independent variable? And if so, would you want to treat the decades as groups, as in ANOVA? – Sal Mangiafico Dec 16 '22 at 14:38
  • @SalMangiafico, I am not able to use the population for this assignment; it is strictly raw sample data, and the professor needs to be able to run the test themselves if needed. Right now I believe the dependent variable is wildlife strikes, and the independent variable is the safety policies and Bird/Wildlife Aircraft Strike Hazard programs implemented over the last three decades. – Logan Innes Dec 17 '22 at 02:11
  • Please add the "self-study" tag to your post, and if you would, mention that this is for a class assignment (if it is). See also: stats.stackexchange.com/tags/self-study. – Sal Mangiafico Dec 18 '22 at 02:47
  • Here's my opinion. How you approach the analysis for this problem depends on how you see the question practically. 1) You could do as you suggest and use a chi-square test of association. Probably 40% of the cells in your table have counts < 5. Because of this, a standard chi-square test may not be the best approach. In this case, I would probably recommend doing the analysis by Monte Carlo simulation, which is easy in some software packages. – Sal Mangiafico Dec 18 '22 at 16:52
  • However, I would approach the problem differently. a) I would pull out the data for "unknown" and look at that separately. But it might depend upon what "unknown" actually means. b) I would treat the number of strikes as an ordinal variable. b1) You could also treat the decades as an ordinal variable, and then use an analysis for tables of ordinal variables, such as linear-by-linear association or Kendall's tau-c. b2) However, I would probably treat the decades as a nominal grouping variable, and use something like the Kruskal-Wallis test to see if one decade had a higher number of strikes. – Sal Mangiafico Dec 18 '22 at 16:58
  • Of course, how you choose to approach the problem depends on the context of the course. – Sal Mangiafico Dec 18 '22 at 16:59
  • Practically speaking, I would look at the proportions for the cells within each column. That is, for any decade, 1 Strike is > 70% of the observations, and 2-10 Strikes is > 15%, leaving fewer observations for higher strike numbers. ... And then what do you say about Unknown? – Sal Mangiafico Dec 18 '22 at 17:25
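
For anyone who wants to try the Monte Carlo suggestion from the comments, here is a minimal sketch in plain Python. It is only one way to do it, not necessarily what any particular software package does: it shuffles the decade labels across observations (a permutation test that keeps both margins fixed) and counts how often the simulated chi-square statistic is at least as large as the observed one. The function names and the counts in `example` are made up for illustration; they are not the asker's data, and the row and column layout (strike categories by decade) is an assumption.

```python
import random

def chi_square_stat(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat

def monte_carlo_chi_square(table, n_sim=2000, seed=0):
    """Permutation p-value for the chi-square test of association."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per observation.
    row_labels, col_labels = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            row_labels.extend([i] * table[i][j])
            col_labels.extend([j] * table[i][j])
    observed_stat = chi_square_stat(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(col_labels)  # break any row/column association
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            sim[i][j] += 1
        if chi_square_stat(sim) >= observed_stat - 1e-9:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed_stat, (hits + 1) / (n_sim + 1)

def column_proportions(table):
    """Share of each cell within its column (assumes no empty columns)."""
    col_tot = [sum(col) for col in zip(*table)]
    return [[cell / col_tot[j] for j, cell in enumerate(row)] for row in table]

# HYPOTHETICAL counts: rows are strike categories, columns are three decades.
example = [
    [180, 175, 170],  # 1 strike
    [40, 45, 50],     # 2-10 strikes
    [3, 2, 4],        # 11-100 strikes (made-up category)
    [1, 2, 0],        # more than 100 strikes (made-up category)
    [20, 18, 19],     # unknown
]

stat, p_value = monte_carlo_chi_square(example)
print(f"chi-square = {stat:.3f}, Monte Carlo p = {p_value:.4f}")
for row in column_proportions(example):
    print([round(x, 3) for x in row])
```

Because the p-value comes from simulation, it changes slightly with `seed` and `n_sim`; a larger `n_sim` gives a more stable estimate. The `column_proportions` helper prints the within-decade shares described in the last comment.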