SimGrade: Using Code Similarity Measures for More Accurate Human Grading
Sonja Johnson-Yu, Nicholas Bowman, Mehran Sahami, Chris Piech
Jun 30, 2021 20:40 UTC+2
—
Session PS1
—
Gather Town
Keywords: similarity, code embeddings, embeddings, assessment, grading, human, simgrade, grade, human in the loop, exam, free response, accuracy
Abstract:
In computer science courses, grading free-response exam problems can be a difficult and inconsistent process, especially when grading is distributed across a large course staff. In this paper, we show how AI techniques for recognizing similar submissions have the potential to improve the human grading process. Ideally, different graders assessing the same student submission would assign the same score, but analysis of historical grading patterns shows that this happens less often than desired, an issue that is not commonly acknowledged in the context of programming problems. These inconsistencies can raise questions of fairness and can negatively impact students' experiences in the course, motivating methods that ensure more consistent grading. Through analysis of historical exam data, we demonstrate that graders assign a score to a student submission more accurately when they have previously seen another submission similar to it. As a result, we hypothesize that we can improve exam grading accuracy by ensuring that each submission a grader sees is similar to at least one submission they have previously seen. We propose several algorithms for (1) assigning student submissions to graders and (2) ordering submissions to maximize the probability that a grader has previously seen a similar solution, leveraging distributed representations of student code to measure similarity between submissions. We demonstrate that these algorithms achieve higher grading accuracy than is achieved by randomly assigning and ordering submissions.
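The ordering idea described in the abstract can be sketched as a greedy procedure over pairwise embedding similarities: repeatedly show the grader the unseen submission that is most similar to something they have already graded. The sketch below is illustrative only, not the authors' implementation; it assumes precomputed code embeddings as NumPy vectors, and the function names `cosine_similarity` and `order_by_similarity` are invented for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_by_similarity(embeddings):
    """Greedy grading order: start with the first submission, then
    repeatedly pick the unseen submission whose embedding is most
    similar to any submission the grader has already seen."""
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        best = max(
            remaining,
            key=lambda j: max(
                cosine_similarity(embeddings[j], embeddings[i]) for i in order
            ),
        )
        order.append(best)
        remaining.remove(best)
    return order
```

A greedy ordering like this does not guarantee that every submission follows a close neighbor, but it cheaply approximates the paper's stated goal of maximizing the chance that each submission resembles one the grader has seen before.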