I used a computer to automatically select 30 questions from a database and sent them to 20 students. I then graded their answers, coding a correct answer as 1 and an incorrect answer as 0. For each student, this gives a 1-0 sequence of length 30 (for instance, 11100011111111011110111101111). I denote the 20 1-0 sequences as Seq1, Seq2, ..., Seq20.
After that, I used the computer to generate another 30 questions and ran the test again. This gave another 20 1-0 sequences, denoted Seq21, Seq22, ..., Seq40.
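For concreteness, here is a minimal sketch of how the two sets of sequences could be summarized before any testing. The sequences below are short made-up placeholders rather than the real data, and the names `score_sequences`, `test1` and `test2` are hypothetical. In the real setting each list would hold 20 strings of length 30.

```python
def score_sequences(seqs):
    """Return (per-student correct counts, overall proportion correct)."""
    # Each sequence is a string of '1' (correct) and '0' (incorrect).
    counts = [s.count("1") for s in seqs]
    total_items = sum(len(s) for s in seqs)
    return counts, sum(counts) / total_items


# Hypothetical toy data: 3 students, 6 questions per test.
test1 = ["110101", "111100", "011111"]  # stands in for Seq1..Seq20
test2 = ["100101", "110000", "010111"]  # stands in for Seq21..Seq40

counts1, p1 = score_sequences(test1)
counts2, p2 = score_sequences(test2)

print(counts1, round(p1, 3))
print(counts2, round(p2, 3))
```

This reduces each test to one score per student and one overall proportion correct. Whether those summaries are the right basis for a comparison is part of what I am asking.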
I would like to judge whether the difficulty of the first test differs significantly from that of the second, based on Seq1, ..., Seq40 (the outcomes of the two tests). What statistical tool or hypothesis test should I use for this problem?