In the attached image, the x-axis is the proportion of children at a school who passed at least 5 GCSE exams that year, and the y-axis is a count of the schools. The different colours are whether the school is in a rural or urban area.

I want to do a regression to see how the school being in a rural vs. urban area affects this measure of attainment: 5_GCSEs ~ rural.

I've read that if your dependent variable is a proportion (as mine is), it can be advisable to convert this into a count of successes out of the total number of trials (i.e. # pupils obtaining 5 GCSEs / # pupils sitting GCSEs) for each school and run a binomial logistic regression. However, you need to look at the data distribution first to determine if this is necessary.
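
As a minimal sketch of that successes-out-of-trials setup: the school counts below are made up for illustration, and the names `schools`, `pooled_counts` and `logit` are hypothetical. With a single binary predictor, the maximum-likelihood fit of a binomial logistic regression has a closed form: the intercept is the pooled log-odds of passing in urban schools, and the `rural` coefficient is the log odds ratio between the two pooled groups.

```python
import math

# Hypothetical per-school records (illustrative numbers only):
# (pupils sitting GCSEs, pupils passing at least 5 GCSEs, rural flag)
schools = [
    (120, 78, 0),
    (200, 120, 0),
    (80, 60, 1),
    (40, 30, 1),
]

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def pooled_counts(rows, rural_flag):
    """Total successes and trials over all schools with the given flag."""
    successes = sum(k for _, k, r in rows if r == rural_flag)
    trials = sum(n for n, _, r in rows if r == rural_flag)
    return successes, trials

urban_pass, urban_n = pooled_counts(schools, 0)
rural_pass, rural_n = pooled_counts(schools, 1)

# Closed-form estimates for the model  passes / pupils ~ rural  (logit link)
intercept = logit(urban_pass / urban_n)          # log-odds for urban schools
slope = logit(rural_pass / rural_n) - intercept  # log odds ratio, rural vs urban
odds_ratio = math.exp(slope)

# Model-based standard error of the slope; this assumes pupils are
# independent binomial trials with no extra school-to-school variation.
slope_se = math.sqrt(
    1 / urban_pass + 1 / (urban_n - urban_pass)
    + 1 / rural_pass + 1 / (rural_n - rural_pass)
)

print(f"intercept={intercept:.3f} slope={slope:.3f} "
      f"odds_ratio={odds_ratio:.3f} se={slope_se:.3f}")
```

Note that larger schools automatically carry more weight here, because the fit works on pupil counts rather than on one proportion per school. If pass rates vary between schools more than the binomial model allows, the standard error above will be too small.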

histogram

This looks quite normally distributed other than the second small peak around 1. Does this mean I can just use linear regression?

I've also attached the residual plots and QQ plots. However, as the only independent variable is binary, I'm unsure how to interpret these.

residual plot

QQ plot

Grateful for any suggestions about what the distribution of my data suggests about the type of regression analysis I should do.

Jess
  • You didn't attach any plots. However, they would be irrelevant to the question, as no regression model makes assumptions about the marginal distribution, only the conditional distribution. – Tim Aug 05 '23 at 13:24
  • Thanks Tim. The only way I can think of to represent the conditional distribution would be a boxplot. Do you know what features I should look for there to decide whether to run linear or binomial logistic regression? – Jess Aug 05 '23 at 13:40
  • First, I would not dichotomize the number of GCSEs passed. I would use it as a count and do a count regression (probably negative binomial).

    Second, for proportions, I like beta regression.

    – Peter Flom Aug 05 '23 at 13:59
  • @PeterFlom But what if the original numerators and denominators that give the proportions are available? – Dave Aug 05 '23 at 14:46
  • Unfortunately I don't have data on the number of GCSEs passed. The only data I have access to is aggregated to the percentage of pupils passing at least 5 GCSEs, plus the number of pupils taking GCSEs at each school. So I can work out the number of successes and the number of trials for that binary outcome, but I cannot do a straightforward count of GCSEs passed. Based on your suggestions, I think I will try running a beta regression. – Jess Aug 05 '23 at 14:53
  • Beta regression is ad hoc and would be difficult to justify unless all the schools have similar numbers of children. A GLM based on a count response would be a better initial choice. – whuber Aug 05 '23 at 16:51
  • This question is related, probably the same data (duplicate?): https://stats.stackexchange.com/questions/623913/can-i-use-quasi-binomial-regression-on-proportion-data-in-this-way – kjetil b halvorsen Aug 14 '23 at 15:55
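
One possible reading of the school-size concern raised in the comments: under a binomial model, an observed pass proportion from a small school is much noisier than one from a large school, so treating every proportion as equally informative discards that information. The sketch below illustrates this with an assumed pass probability of 0.6 and hypothetical cohort sizes; `proportion_se` is an illustrative helper, not part of any library.

```python
import math

def proportion_se(p, n):
    """Binomial standard error of a pass proportion p observed on n pupils."""
    return math.sqrt(p * (1 - p) / n)

p = 0.6  # assumed underlying pass probability, for illustration only
for n in (25, 100, 400):  # hypothetical numbers of pupils sitting GCSEs
    print(f"n={n:>3}  se={proportion_se(p, n):.3f}")
```

Quadrupling the number of pupils halves the standard error, which is why a model fitted to successes and trials weights schools differently from one fitted to bare proportions.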
