How to judge which test is more difficult?

Question

I used a computer to automatically select 30 questions from a database and sent them to 20 students. Then I reviewed their answers, using 1 to represent a correct answer and 0 an incorrect one. For each student, I got a 1-0 sequence of length 30 (for instance, 11100011111111011110111101111). I denote the 20 1-0 sequences as Seq1, Seq2, ..., Seq20.

After that, I used the computer to generate another 30 questions and ran the test again. I got another 20 1-0 sequences, denoted as Seq21, Seq22, ..., Seq40.

I would like to know how to judge whether the first test and the second test differ significantly in difficulty, based on Seq1, ..., Seq40 (the outcomes of the two tests). What statistical tool or hypothesis test should I use to solve this problem?

What do you mean by significant difficulty? You need to define that to be able to design a test. Also, your question title does not match your actual question. — Richard Hardy, Mar 18 '15 at 10:42
Thanks, Richard. I mistyped the word; it should be "significantly difficult". I want to know which test is more difficult for the students to answer, based on the correctness of their answers. Is a binomial test applicable in this situation? — ericjp, Mar 18 '15 at 10:51
A binomial test is not suitable unless the individual questions within a test are equally difficult. — Glen_b, Mar 18 '15 at 12:25

Tim · Accepted Answer · 2021-11-20T12:08:27.160

Was it the same students that took both tests? If so, just compare the average scores. Nothing more needs to be done: since the students had the same abilities, you can assume that the difference in scores is due to the test difficulty. Of course, the tests should measure the same trait (e.g. both should cover similar topics in mathematics). Note, however, that in this case you assume the difference between the tests is due only to the difference in difficulty, while in practice other factors could play a role as well: students could get bored or tired, which would affect their scores, or they could have learned something before the second test (e.g. taking the first test gave them some new ideas).
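As a minimal sketch of that comparison, assuming the same students took both tests and that row i of each list belongs to the same student, one could compute each student's total score on both tests and summarize the paired differences. The function name `paired_score_comparison` and the toy data are made up for illustration, and the paired t statistic is just one common way to summarize such differences, not something prescribed here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_score_comparison(test1, test2):
    """Compare total scores of the same students on two tests.

    test1[i] and test2[i] are the 0/1 answer lists of student i
    (hypothetical input layout, assumed for this sketch).
    Returns the mean paired difference and the paired t statistic.
    """
    scores1 = [sum(seq) for seq in test1]
    scores2 = [sum(seq) for seq in test2]
    diffs = [a - b for a, b in zip(scores1, scores2)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    # The t statistic is undefined if every student has the same difference.
    t_stat = mean_diff / (sd / sqrt(n)) if sd > 0 else float("nan")
    return mean_diff, t_stat


# Toy data: 4 students, 5 questions per test (invented for illustration).
toy_test1 = [[1, 1, 1, 0, 1], [1, 1, 0, 0, 1], [1, 1, 1, 1, 1], [0, 1, 1, 0, 1]]
toy_test2 = [[1, 0, 1, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 0, 0], [0, 1, 0, 0, 1]]

mean_diff, t_stat = paired_score_comparison(toy_test1, toy_test2)
print(mean_diff, t_stat)
```

A positive mean difference means the students scored higher on the first test, which suggests the second one was harder. The t statistic would be compared against a t distribution with n - 1 degrees of freedom, keeping in mind the caveats about fatigue and learning mentioned above.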

If different students took the tests, then you should also account for differences in student abilities. This can be done with methods based on Item Response Theory (IRT). One of them, the simplest and most robust, is the Rasch model

$$ P(X_{ij} = 1) = \frac{\exp(\theta_i - \beta_j)}{1+\exp(\theta_i - \beta_j)} $$

where the individual responses $X_{ij}$ are modeled as a function of student ability $\theta_i$ and item difficulty $\beta_j$. So you get estimates of both abilities and difficulties. The test with the more difficult items is the more difficult one. In R there are several packages for estimating IRT models; you can check the ltm (Rizopoulos, 2006) and mirt (Chalmers, 2012) packages and their documentation for further information. IRT models can also be estimated in other software, e.g. Mplus, MIRT (by Cees A.W. Glas), etc.
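To make the formula concrete, here is a small sketch that only evaluates the Rasch response probability and the expected total score of a student on a set of items. It does not estimate anything: the difficulty values below are hypothetical numbers chosen for illustration, standing in for estimates that a package such as ltm or mirt would produce, and the function names are made up.

```python
from math import exp


def rasch_prob(theta, beta):
    """P(X = 1) for ability theta and item difficulty beta (Rasch model).

    Algebraically the same as exp(theta - beta) / (1 + exp(theta - beta)).
    """
    return 1.0 / (1.0 + exp(-(theta - beta)))


def expected_score(theta, betas):
    """Expected number of correct answers for a student with ability theta."""
    return sum(rasch_prob(theta, b) for b in betas)


# Hypothetical item difficulties for two short tests (not real estimates).
betas_test1 = [-1.0, -0.5, 0.0, 0.5]
betas_test2 = [0.0, 0.5, 1.0, 1.5]

# The same hypothetical student (theta = 0) is expected to score lower
# on the test whose items have higher difficulties.
print(expected_score(0.0, betas_test1))
print(expected_score(0.0, betas_test2))
```

Comparing expected scores at a fixed ability is one way to read "the test with the more difficult items is the more difficult one": the ability is held constant, so any gap comes from the item difficulties alone.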
