On the Limitations of Human-Computer Agreement in Automated Essay Scoring
Afrizal Doewes, Mykola Pechenizkiy
Jul 02, 2021 21:05 UTC+2
—
Session I1
Keywords: Automated Essay Scoring, Testing Scenarios, Reliability and Validity
Abstract:
Scoring essays is an exhausting and time-consuming task for teachers. Automated Essay Scoring (AES) makes the scoring process faster and more consistent. The most natural way to assess the performance of an automated scorer is to measure its score agreement with human raters. However, we provide empirical evidence that an essay scorer that performs well under this quantitative evaluation can still be too risky to deploy. We propose several input scenarios to evaluate the reliability and the validity of the system, such as off-topic essays, gibberish, and paraphrased answers. We demonstrate that automated scoring models with high human-computer agreement fail to perform well on two of the three test scenarios. We also discuss strategies to improve the performance of the system.
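For context, human-computer score agreement in AES is commonly reported with quadratic weighted kappa (QWK), where 1.0 means perfect agreement and 0.0 means chance-level agreement. The abstract does not name the exact metric used, so the following pure-Python sketch is illustrative, not the authors' implementation:

```python
def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """Quadratic weighted kappa between two raters on an ordinal scale.

    1.0 = perfect agreement; 0.0 = agreement expected by chance alone.
    """
    n = max_rating - min_rating + 1
    # Observed confusion matrix: rows = human scores, columns = machine scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h - min_rating][m - min_rating] += 1
    # Marginal histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    total = len(human)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2      # quadratic disagreement weight
            expected = hist_h[i] * hist_m[j] / total  # count expected by chance
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den
```

Identical score vectors yield a QWK of 1.0, and any disagreement lowers it; the paper's point is that a high value of such a metric alone does not guarantee the scorer behaves sensibly on off-topic, gibberish, or paraphrased inputs.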