
"[Myth] that exams are objectively graded. Daniel Stark and Edward Elliot sent two English essays to 200 high school teachers for grading. They got back 142 grades. For one paper, the grades ranged from 50 to 99; for the other, the grades went from 64 to 99. But English is not an "objective" subject, you say. Well, they did the same thing for an essay answer in mathematics and got back grades ranging from 28 to 95. Though most of the grades they received in both cases fell into the middle ground, it was evident that a good part of any grade was the result of who marked the exam and not of who took it"

Source: https://www.nyu.edu/projects/ollman/docs/why_exams.php

The study the author is referring to is: "Starch D, Elliott EC. Reliability of grading work in mathematics." http://www.jstor.org/stable/1076246?seq=1#page_scan_tab_contents

This is pretty consistent with my experience, both as a grader and a student. The cited experiment was done in 1913, however. I am curious about more recent, broader studies about testing the variance of grading - especially for math exams such as those we give in college algebra or calculus.

The same studies are referenced here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4041495/

The only recent reproduction I have found was of the Starch/Elliott experiment on English papers, not math exams: http://pareonline.net/getvn.asp?v=16&n=17

Elle Najt
  • I don't have an answer to the actual question, but it would be highly surprising if the variation in scores were small. Even if we all agree on what counts as right and wrong, there are very many ways to weight a mistake. Did anyone ever expect universally comparable grades? It seems like a strawman position. – Adam Dec 29 '17 at 05:40
  • Even if the grading were consistent among teachers, there would still be tremendous variability in test questions. The important thing is that teachers grade their own tests consistently, and for that many people use a rubric. – Amy B Dec 29 '17 at 05:59
  • Multiple choice is the format that minimizes variance (to zero), and it is also the least time-intensive to grade, but it has many other drawbacks. Because there is wide variation in how different people grade the same work, it's worthwhile to avoid having multiple people grade different problems or, if that is necessary, to use a common grading scale that reduces variance between graders as much as possible. – Michael Joyce Dec 29 '17 at 15:07
  • @Adam Regardless of what we expect (as educators), we treat grades as if they were universally comparable; for example, employers and educators look at GPA when considering candidates. (And because of the way they are used, students learn to see grades as universally comparable.) – Elle Najt Dec 29 '17 at 18:32
  • @AreaMan I don't think that employers or educators do treat the same grade from different institutions the same. Within the same institution, perhaps. – Adam Dec 29 '17 at 19:49
  • @Adam If you read the original article (jstor link), you will see that they get significant (though less) variation even within the same institution. – Elle Najt Dec 29 '17 at 20:10

1 Answer


Educational Testing Service has done a lot of work on essay grading for AP exams and similar tests (the SAT when it included an essay, etc.). They put a fair amount of work into developing a detailed key and building support mechanisms: essays are graded twice, large variances are examined, and essays are graded together with a supervisor available to handle questions that come up (e.g. an unusual solution method, questions from the graders). They used to have a lot of methodology articles on their website.

From personal experience, having a detailed key with partial credit spelled out helps drive better results. It is perhaps most easily justified for a monster exam with many graders (e.g. TAs for a college chemistry exam), but it is probably still good practice even for an individual teacher grading a 30-person section. [Like war planning: even if you don't anticipate everything, the effort of planning out a key helps when running the grading and dealing with questions that arise as you grade exams/fight the enemy.]

P.S. How about giving us the standard deviation or the 90% interval instead of (or at least in addition to) the range? While much of mathematics is logical and rigorous, much of education is statistical in nature and even Bayesian (in the sense that we don't know all the variables or the actual pdf being sampled). Saying "OMG, what a range" is not how to think about educational methodology, which is a practical art with cost-benefit tradeoffs and uncertainties (not a Euclidean proof). Instead, thinking about the typical difference (e.g. SD/mean) is a more intuitive way to gauge the scale of the issue.
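To make the range-versus-SD point concrete, here is a minimal simulation (all numbers hypothetical: graders modeled as normal noise around a "true" score of 80 with SD 8). The observed range keeps widening as more graders are sampled, while the standard deviation settles near the true spread:

```python
import random
import statistics

random.seed(0)

def summarize(n, mu=80.0, sigma=8.0):
    """Simulate n graders scoring the same paper: each grade is the
    'true' score mu plus normal grader noise, clipped to [0, 100]."""
    grades = [min(100.0, max(0.0, random.gauss(mu, sigma))) for _ in range(n)]
    return max(grades) - min(grades), statistics.stdev(grades)

for n in (10, 50, 200):
    rng, sd = summarize(n)
    print(f"n={n:3d}  range={rng:5.1f}  sd={sd:4.1f}")
```

With only 10 graders the range looks modest; with 200 it looks alarming, even though the underlying grading behavior is identical. The SD is the statistic that stays comparable across studies of different sizes.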

guest
  • If you look at the original article (the jstor article that I linked to), you'll see much more detailed statistics. (The quote is part of a polemic, not a statistical analysis.) – Elle Najt Dec 29 '17 at 18:35
  • Oh... that is a little better. So more of the issue is with the polemic than with you. But my criticism of framing the topic first with the range remains: it's not the right framework to even start talking about this. (Sort of a student-type error, or maybe, for adults, it reflects a mindset used to mathematical rigor rather than scientific exploration.) And pedagogy research lives in the latter world, not the former. Normalized standard deviation should be an immediate first consideration: it shows you how tight the distribution is and scales the problem. – guest Dec 29 '17 at 19:49
  • Frankly, I think range is the issue here. Attacking the objectivity of grading because of a 70% difference in scores is convincing, for most people anyway. We've all had many exams, and the difference that percentage difference makes doesn't need an explanation of scale. (Since it is the difference between failure and an A.) – Elle Najt Dec 29 '17 at 20:08
  • Frankly, I think it is a flaw in how to think about problems in the real world. It's like on sports blogs when people cite Tom Brady as an example of how having a 1st-round draft choice is not helpful for getting a good quarterback. There is a world that exists between certainty (the same x always leads to the same y) and pure chance (coin flips). Citing a range is a common misconception and is "sound bite-y". Correlations, normalized standard deviations, ratios: this is how to think about the social sciences. Pedagogy is a social science, one with lots of complex factors. It is NOT math. – guest Dec 29 '17 at 20:57
  • I apologize for being contentious. I am just repeating myself. Have a different point, will make in comments since not enough to be its own answer – guest Dec 29 '17 at 21:14
  • The issue of subjective rating is one that has some literature. It's not a math concern, nor even a purely academic one. I have encountered it in six sigma classes (but there are more rigorous sources). The question is how to use imperfect subjective measures of an output function (product quality or what have you). I think it is covered in applied statistics, market research, and polling. There are techniques to drive better results (in terms of standard deviation reduction, repeatability, etc.): training, reference standards, etc. – guest Dec 29 '17 at 21:16
  • Here is a review paper (I think more from medical field): https://www.ncbi.nlm.nih.gov/pubmed/12569049 I have only read the abstract (paywall) but it is well cited. A google search on "subjective rating scale" will give more good leads. – guest Dec 29 '17 at 21:21
  • I think there is a lot of HR literature on this too. Journal of Applied Pyschology covers this sort of thing a lot in terms of hiring evaluations. (But from a standpoint of relative performance, with no expectation of flawlessness.) – guest Dec 29 '17 at 21:40
  • One other insight (besides normalized SD versus range) is to consider the importance and intent of the test. For example, a final exam, qualification exam, or hiring essay may have different stakes (so variance is more of a concern) than a weekly quiz. Also, you can have evaluations where you are more worried about errors high or low than about plain variation around the mean (or reference standard); examples: missing the diamonds in the rough in sports scouting, versus hiring a bad pilot to fly your airplane. – guest Dec 29 '17 at 21:45
  • I think that each problem in the real world requires its own analysis about which statistical tools are relevant. I don't feel convinced that range is not relevant for grading, especially a range between failure and A. For example, students doing poorly in an exam can be devastating for their class score, which can prevent them from advancing in the program, or from keeping their scholarship, or destroy their confidence in their ability to succeed. This is a lot like the problem of misdiagnosing someone, except that with grades the disease (I think) is a fiction. I appreciate your input. – Elle Najt Dec 29 '17 at 23:33
  • In this situation I think the entropy and the range would be the most interesting summary statistics. – Elle Najt Dec 29 '17 at 23:35
  • Actually, I take that back - entropy doesn't really say much here. In any case, they plot the distribution in the article. – Elle Najt Dec 29 '17 at 23:53
  • The range is a really bad way to think about this problem because it varies with the number of observations instead of the distribution. You could have a tight distribution but, with many observations, get a wide range, and then conclude that method was worse than one with a loose distribution whose range is small only because of few observations. It's just not a good way to even start to think about the issues. – guest Dec 30 '17 at 01:22
  • Okay, that's a fair point; on the other hand, if one is trying to break the assumption that a grade is "reasonably" objective, the observation that one teacher might fail you and another give you full marks is meaningful. I think this is a situation in which people assume that there aren't such extreme outliers, so the observation that there are is shocking. In any case, the distributions they plotted make it clear that it wasn't one exceptional teacher -- how would you prefer to capture that information? – Elle Najt Dec 30 '17 at 01:54
  • I think it's important to clarify that I'm asking from a place where I dislike grades and think we should get rid of them. I'm only interested in finding 'better' ways to grade students as a temporary stopgap for an institutional problem. For the latter question, I agree that measures like standard deviation or confidence intervals are more appropriate. For the former, anything that undermines the myth of the objectivity of grades is good. And even a 90% interval is still a big difference: 45–84 when 70 is passing, or 50–85 when 75 is passing. – Elle Najt Dec 30 '17 at 02:16
  • Standard deviation or normalized standard deviation is an intuitive measurement to quickly describe the spread. Given that they already have an intuitive 100 point scale, I would be fine with the unnormalized SD. – guest Dec 30 '17 at 02:16
  • Additionally, please consider the difference between the spread on an essay PROBLEM and an essay TEST. Say we have a 10-problem test with each problem worth 10 points. Per your comments, the range for a single problem is 2.8 to 9.5. We don't know the SD, but let's speculate it is... dunno... "2". This sounds bad if you think of an overall test having an average of 80 and a one-sigma interval of 60 to 100! But if you think of an individual problem that should be graded "8" and the spread is 6 to 10, that does not sound as bad. [break] – guest Dec 30 '17 at 02:21
  • Well, it is not so bad if the grades are drawn independently, but I don't think that will be true -- graders who assign an abnormally low (or high) grade on one problem will likely do that on others. – Elle Najt Dec 30 '17 at 02:29
  • [resume] And if the errors are independent from question to question, the overall exam result will be 80, but the one sigma interval will be much, much tighter. [Now that is probably an ideal assumption of complete independence! You will not have complete independence. Some graders are inherently easier than others. But it is a factor that will at least partially help reduce the spread on multi-problem tests.] – guest Dec 30 '17 at 02:32
  • Finally, I sense a sort of surprise that the world is not perfectly fair or clockwork. I see this with job candidates who are aghast that companies frequently reject very good candidates or accept bad ones. But companies are not in the business of determining to a Euclidean certainty whether a specific candidate passes the bar for hiring. All they want is some reasonable statistical success. Some correlation at all! Hopefully better than random chance, or even inverse correlation. I would actually say grades in school are more accurate than hiring. But still, not perfect. Big deal. – guest Dec 30 '17 at 02:37
  • This is why it comes down to the question of the purpose of education; hence the relevance of the polemic I linked to above (nyu link). Here is my idealism: the purpose of the education system should be to nurture minds so that they can solve communal problems. The purpose should not to be to function as a business testing their products for reliability, or a state controlling behavior through surveillance. It's no secret that schools function in the later way; the myth of objectivity (and meritocracy) is part of what creates consent for the brutal treatment of kids and college students. – Elle Najt Dec 30 '17 at 02:49
  • There are a lot of different purposes of tests: screening for exit, screening for entrance, motivation to learn outside the test, the actual "practice" during the test, as well as the learning during the test review. – guest Dec 30 '17 at 03:11
  • In a different thread, I semijokingly gave one more reason which was psychological hardening. The flip side of "brutalization". (Now I actually had graded boxing in PE in school...with the grade driving real combat...that had some element of scariness...but I saw as very positive to learn to go toe to toe, even when outmatched.) But really...I don't think a grade valid or invalid is the end of the world. There are a lot of problems, tests, courses, and schools in an individual's life. Or if I want to look at from society's POV, there are a lot of individuals! – guest Dec 30 '17 at 03:15
  • Psychological hardening creates soldiers and sociopaths; that's not a good thing for humankind. As for the other reasons, a lot of them are treated in the ncbi article or the polemic. Thanks for the boxing story; that's... horrifying. Though this is getting quite off topic, this is the kind of thing I think we should be teaching kids to do (learning through engaging with problems in the community, not preparing for exams): http://therealnews.com/t2/index.php?option=com_content&task=view&id=31&Itemid=74&jumival=20729#pop1 – Elle Najt Dec 30 '17 at 03:45
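The per-problem versus whole-test point debated in the comments above can be checked with a quick simulation (all numbers hypothetical: 10 problems worth 10 points each, a "deserved" score of 8 per problem, per-problem grader SD of 2, and, in the correlated case, a shared per-grader leniency bias with SD 1.5). With independent errors the total-score SD is roughly sqrt(10)·2 ≈ 6.3 rather than 20; a shared grader bias inflates it considerably:

```python
import random
import statistics

random.seed(0)

N_PROBLEMS = 10   # problems per exam, each worth 10 points (hypothetical)
TRUE_SCORE = 8.0  # "deserved" score per problem (hypothetical)
SIGMA = 2.0       # per-problem grading SD (hypothetical)
BIAS_SD = 1.5     # SD of a per-grader leniency bias (hypothetical)

def total_score(bias=0.0):
    """One grader's total for the exam, with an optional shared bias
    added to every problem's score."""
    return sum(random.gauss(TRUE_SCORE + bias, SIGMA) for _ in range(N_PROBLEMS))

# Independent per-problem errors: total SD ~ sqrt(10) * 2 ~ 6.3
ind = [total_score() for _ in range(10_000)]
# Correlated errors via a shared leniency bias per grader:
# total SD ~ sqrt(10 * SIGMA**2 + (10 * BIAS_SD)**2) ~ 16.3
corr = [total_score(bias=random.gauss(0, BIAS_SD)) for _ in range(10_000)]

print(round(statistics.stdev(ind), 1), round(statistics.stdev(corr), 1))
```

This illustrates both sides of the exchange: independence does tighten the whole-test spread well below the naive 10× per-problem figure, but a systematically lenient or harsh grader erodes much of that benefit.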