
There is a multiple-choice test, and for every possible answer option my algorithm gives a score of how likely it is to be the right one, then picks the option with the maximal score as the answer. If there are 4 options A, B, C and D and the corresponding scores are, for example, 3.34, 4.01, 2.78 and 3.01, then the answer should be B. On the other hand, if the scores are, say, 3.34, 100.01, 2.78 and 3.01, then the answer is still B, but the algorithm looks more “confident”. I was wondering how to numerically capture this confidence, so that I can, for example, debug the algorithm and see if it is improving. Something like the maximal score minus the second biggest one for a single question, and the average of that value over the whole set of questions. So I wanted to ask if there is some “standard”, commonly-used approach for this.
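For concreteness, here is a minimal sketch of the “max minus runner-up” idea in Python (the function names are just illustrative), plus a softmax variant that first turns the raw scores into pseudo-probabilities, so the confidence is bounded and comparable across questions:

```python
import numpy as np

def margin_confidence(scores):
    """Top score minus runner-up for one question's option scores."""
    top_two = np.sort(scores)[-2:]          # two largest scores, ascending
    return top_two[1] - top_two[0]

def softmax_confidence(scores):
    """Turn raw scores into pseudo-probabilities and take the winner's
    probability; unlike the raw margin, this lies in (0, 1]."""
    e = np.exp(scores - np.max(scores))     # shift for numerical stability
    p = e / e.sum()
    return p.max()

# The two example questions from above
q1 = np.array([3.34, 4.01, 2.78, 3.01])
q2 = np.array([3.34, 100.01, 2.78, 3.01])

print(margin_confidence(q1), margin_confidence(q2))    # 0.67 vs 96.67
print(softmax_confidence(q1), softmax_confidence(q2))  # ~0.46 vs ~1.0

# Average margin over a whole set of questions
test = [q1, q2]
print(np.mean([margin_confidence(q) for q in test]))
```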

  • Multiple-choice tests usually yield an overall score, after which the individual item data are largely or completely ignored. You seem to be asking about individual items. Are you interested in an overall score or individual items? In any case, how many items do you have? – Joel W. Jul 27 '14 at 14:26
  • Yes, the overall score is the most important metric - but imagine that on certain data sets the overall score is 100% - it can't be higher. In this case I want to distinguish the algorithm which clearly separates right answers from wrong ones for each individual test question from the algorithm with which the right answers happen to have just a slightly better coefficient - such an algorithm is less likely to perform well on other, unseen data. – Sasha Aug 08 '14 at 03:50
  • Are there answers to each question that are objectively right or wrong? – Joel W. Aug 08 '14 at 21:40
  • Yes, there are right answers available. Think of a test on English synonyms, for example. – Sasha Aug 12 '14 at 07:49
  • Perhaps you are considering these two methods of grading each question for each algorithm tested: (1) did the algorithm predict the correct answer, and (2) how confident is the algorithm of its answer? Your post asks about #2. In typical multiple-choice testing situations, such as high school graduation tests, there is no collection of data such as you have (i.e., data that reflect the test takers' confidence that each answer they choose is correct). So you will have to really dig around to find such an existing metric, and it may be that one does not yet exist. – Joel W. Aug 12 '14 at 19:23

1 Answer


I use a system for estimating my students' confidence in multiple-choice exams: a risk-return scheme in which each question is answered at one of 3 confidence levels. The student gives an answer together with a confidence level, e.g. A,3 or B,2. Level 1 scores +1 for a right answer and 0 for a wrong one; level 2 scores +1.5 for a right answer and -0.5 for a wrong one; level 3 scores +2 for a right answer and -2 for a wrong one.

You could use this marking system together with roulette-wheel selection of the answer from your AI, i.e., sum the scores and then pick a random number from that total range. For your first example, B would be selected with weight 4.01 out of about 13, and for the second with 100.01 out of about 109. If you then answer with the confidence level the algorithm generates (a winning share below 50% is guessing, between 50% and 80% is middle confidence, and above 80% is full confidence), you should be able to have it learn from its lack of confidence in the answer.
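Here is a rough sketch of how that could look in code (Python; the mapping from the winner's share of the total score to the three levels is just my reading of the thresholds above, and the helper names are made up). It assumes the scores are non-negative:

```python
import random

def pick_confidence_level(scores):
    """Map the winning option's share of the total score to the
    three risk/return levels described above."""
    share = max(scores) / sum(scores)
    if share < 0.5:
        return 1   # guessing: +1 right, 0 wrong
    elif share < 0.8:
        return 2   # middle:   +1.5 right, -0.5 wrong
    return 3       # full:     +2 right, -2 wrong

def roulette_pick(scores):
    """Roulette-wheel selection: each option's chance of being picked
    is proportional to its score."""
    r = random.uniform(0, sum(scores))
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if r <= running:
            return i

# (reward for a right answer, penalty for a wrong one) per level
PAYOFF = {1: (1.0, 0.0), 2: (1.5, -0.5), 3: (2.0, -2.0)}

def grade(scores, correct_index):
    level = pick_confidence_level(scores)
    reward, penalty = PAYOFF[level]
    return reward if roulette_pick(scores) == correct_index else penalty

# Example: the second question from the post, where B dominates,
# so the share is ~0.92 and the answer is graded at full confidence
print(grade([3.34, 100.01, 2.78, 3.01], correct_index=1))
```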