Confidence Scores for Exam Questions

Recorded: May 28, 2026, 9:03 p.m.

Original

Summarized

Confidence Scores for Exam Questions - No Magic Pill

Guessing correctly doesn't necessarily indicate knowledge.

No Magic Pill, May 20, 2026

The Problem With Guessing

Neither multiple-choice exams (MCEs) nor free-response exams (FREs) necessarily show how confident the student is in their answers; they show only the final answer or the thought process, respectively. That means a student can guess and get it right, even if they don’t know for sure that the answer or thought process is correct!

Take MCEs. For a four-choice question, the student could improve their odds of guessing correctly from 25% to 33% or 50% just by eliminating one or two answer choices. While this demonstrates some knowledge about what the answer isn’t, it isn’t necessarily proof that they know what it is, even if they choose correctly.

Or FREs. A student may have some idea of what the question is asking, but could take a guess at which formula or process to apply and still get it correct. (This is much less likely than on an MCE, but still possible based on personal experience.)

This is unfair to the students who truly know the answer but get scored the same as the guesser. What would happen if students had to note down how confident they are that they have the correct answer?

Brier Scores for Exams

Introducing Brier scores (BS). The formula for a BS (as related to exams) is:

$$BS = \frac{1}{N} \sum_{t=1}^{N} (p_t - o_t)^2$$

where:

$N$ is the number of questions on the exam
$t$ is the question number
$p_t$ is the student’s prediction that they got question $t$ correct, ranging from 0 (“I definitely got it wrong”) to 1 (“I definitely got it right”)
$o_t$ is the result of their answer; values are either 0 (incorrect) or 1 (correct)

A perfect Brier score is 0, since 1 - 1 = 0 (“I think I got it right and I did get it right”) and 0 - 0 = 0 (“I think I got it wrong and I did get it wrong”).
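
As a minimal sketch, assuming the standard form of the Brier score (the mean of the squared differences between each prediction p_t and its outcome o_t), the calculation could look like this in Python. The function name `brier_score` and the sample numbers are illustrative and do not come from the post.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted confidence and actual result."""
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("need one prediction per outcome, and at least one question")
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical three-question exam.
predictions = [1.0, 0.8, 0.6]  # p_t: stated confidence of being correct
outcomes = [1, 1, 0]           # o_t: 1 = correct, 0 = incorrect
print(round(brier_score(predictions, outcomes), 4))
```

In this made-up example the third question, answered wrongly at 60% confidence, accounts for most of the penalty.
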

One issue here is that if the Brier score were all that mattered, exams could easily be gamed by intentionally giving the wrong answer and being confident that it is wrong (a prediction of 0), which is generally easier than getting the right answer. The question becomes: how are correct, confident answers rewarded?

Implementation

Predictions must be at least 0.5 (50%) because the goal is to get the answer correct. If the student is less than 50% confident that their answer is right, then they should change their answer to one they are more confident in or be penalized accordingly. This gives each question two sections: section A, for the actual answer choices, and section B, for the student’s confidence level in their chosen answer (50%, 60%, 70%, 80%, 90%, or 100%).

Scantron forms or tests can be modified to have two sets of answers per question so that the BS for each question can be calculated. The total BS is then calculated to give the student their score, with the lowest score being the best (alternatively, the BS can be inverted so that the highest score wins).

Benefits

Exams that sort people by skill should rarely produce perfect scores because the signal gets lost, at least to some extent. One person getting a perfect LSAT or perfect SAT signals that the person is incredibly intelligent; 1% of the test-taking population getting perfect scores indicates that they’re pretty smart and the test is too easy.

Adding confidence scores lets further sorting happen. If Ernest, Jameson, Douglas, Alistair, Tannatt, and Raymond all get question X correct, but their confidences are 50%, 60%, 70%, 80%, 90%, and 100%, respectively, then their scores for that question will be 0.25, 0.16, 0.09, 0.04, 0.01, and 0.00, indicating Raymond is top dog amongst his peers.
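
The six per-question scores quoted above can be checked with a few lines of Python. This is only a sketch: it assumes each question is scored as the squared difference between the stated confidence and the outcome (1 for a correct answer, 0 for an incorrect one), and `question_score` is a helper name introduced here for illustration. The "wrong" column is an addition that shows what the same confidence would have cost had the answer been incorrect.

```python
def question_score(confidence, correct):
    """Squared error for a single question; lower is better."""
    outcome = 1 if correct else 0
    return (confidence - outcome) ** 2


# Names and confidences from the example; everyone answered question X correctly.
students = {
    "Ernest": 0.5, "Jameson": 0.6, "Douglas": 0.7,
    "Alistair": 0.8, "Tannatt": 0.9, "Raymond": 1.0,
}

for name, confidence in students.items():
    if_right = question_score(confidence, correct=True)
    if_wrong = question_score(confidence, correct=False)
    print(f"{name:<9} {confidence:.0%}  right: {if_right:.2f}  wrong: {if_wrong:.2f}")
```

Under this assumption the penalty for a wrong answer rises from 0.25 at 50% confidence to 1.00 at 100%, so overstating confidence is costly while well-founded confidence is rewarded.
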

This finer sorting is done without needing to make the test more difficult, simply by figuring out how correct each person really is.

Combine this with a more challenging test and the stars will really shine.

Literature

After writing all of the above, I found that literature already exists under the name “confidence-based marking”. There truly is very little that’s new left under the sun.

A.R. Gardner-Medwin discusses it in their “Confidence-Based Marking - towards deeper learning and better exams”, whose abstract states:

... A critical point is that they [students] benefit either by finding reasons to place greater reliance on an answer or by seeing reasons for reservation. This places a premium on careful thinking, and on checks and the tying together of different facets of knowledge, thereby encouraging deeper learning. In exams it generates higher quality data than conventional scores, with greater statistical reliability and validity as a measure of knowledge, and less contamination from chance factors associated with weak and uncertain knowledge. The puzzle remains, why this seemingly sensible strategy for objectively marked tests is so readily embraced by students and yet so little used by teachers.

I’d guess that high performers are fans because it helps distinguish them from the low performers, and low performers aren’t fans because they are more likely to get worse scores than in the binary correct-incorrect scoring system.

I won’t go through every study I can find, but Gardner-Medwin’s paper has some good references.


A correct guess on a multiple-choice or free-response exam does not necessarily reflect a student's actual knowledge. These formats reveal only the final answer or the thought process, so students can reach correct results by chance, even without genuine understanding. For multiple-choice questions, eliminating options improves the odds of a correct guess; this demonstrates knowledge of what the answer is not, but is not proof of knowing the answer itself. Similarly, in free-response questions, a student might guess which formula or process to apply and still get the answer right. This is unfair to those who truly know the material but are scored the same as the guessers. Confidence scores are proposed to address this by measuring the certainty behind an answer.

Brier scores for exams are proposed as a method to evaluate the quality of student responses. The concept involves a Brier score calculated based on a student's prediction regarding their correctness and the actual outcome of their answer. A perfect Brier score is zero, which occurs when the prediction matches the result, such as predicting a correct answer and actually getting it correct, or predicting an incorrect answer and actually getting it incorrect. However, a potential pitfall of relying solely on the Brier score is that it might incentivize students to intentionally provide confident incorrect answers, as it is often easier to secure a wrong answer confidently than a correct one.

To mitigate this, predictions must be at least fifty percent, because the goal is to get the answer correct. A student who is less than fifty percent confident in a response should switch to an answer they are more confident in or be penalized accordingly. Each question therefore has two components: the actual answer choices and a section where the student indicates their confidence level, from fifty to one hundred percent in ten-point steps. Scantron forms or other tests can be adapted in this way so that a Brier score is calculated for each question. The student's total score is then computed from these per-question scores, with the lowest total being the best result.
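
One way such a two-part answer sheet might be scored is sketched below in Python. Everything in it is an assumption made for illustration: the `score_sheet` function, the (answer, confidence) tuple layout, and the choice to reject confidence values outside the allowed set rather than penalize them are not specified in the post.

```python
# Confidence bubbles offered in the second section of each question, in percent.
ALLOWED_CONFIDENCES = {50, 60, 70, 80, 90, 100}


def score_sheet(sheet, answer_key):
    """Return the exam Brier score for a list of (answer, confidence_percent) pairs."""
    if len(sheet) != len(answer_key) or not sheet:
        raise ValueError("sheet and answer key must cover the same questions")
    total = 0.0
    for (answer, confidence_pct), correct_answer in zip(sheet, answer_key):
        if confidence_pct not in ALLOWED_CONFIDENCES:
            raise ValueError(f"confidence must be one of {sorted(ALLOWED_CONFIDENCES)}")
        outcome = 1 if answer == correct_answer else 0
        total += (confidence_pct / 100 - outcome) ** 2
    return total / len(sheet)


# Hypothetical four-question sheet: chosen answer, then confidence in that answer.
sheet = [("B", 100), ("D", 70), ("A", 50), ("C", 90)]
answer_key = ["B", "D", "C", "C"]
print(round(score_sheet(sheet, answer_key), 4))
```

Storing confidence as whole percentages keeps the validity check exact; a real form might instead apply a penalty for out-of-range values, as the text suggests.
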

The benefit of incorporating confidence scores lies in enabling more nuanced sorting of individuals based on skill. If students achieve correct answers with varying degrees of confidence, these scores allow for a finer distinction between performers. For instance, comparing multiple individuals who all answer a question correctly, but with varying confidence levels, allows for relative ranking among their peers based on the certainty of their knowledge. This method facilitates sorting based on actual competence without requiring the test itself to be made more difficult.

Literature on this idea already exists under the name confidence-based marking. A.R. Gardner-Medwin discusses the approach in their paper on confidence-based marking, emphasizing that students benefit either by finding reasons to place greater reliance on an answer or by seeing reasons for reservation. This places a premium on careful thinking and on tying together different facets of knowledge, thereby encouraging deeper learning. In exams, Gardner-Medwin states, it generates higher-quality data than conventional scores, with greater statistical reliability and validity and less contamination from chance factors associated with weak and uncertain knowledge. The puzzle Gardner-Medwin raises is why this seemingly sensible strategy for objectively marked tests is so readily embraced by students yet so little used by teachers. The post's author guesses that high performers favor the system because it distinguishes them from low performers, while low performers dislike it because they are more likely to receive worse scores than under binary correct-incorrect scoring.