The CommonLit Ease of Readability (CLEAR) Corpus
Scott Crossley, Aron Heintz, Joon Choi, Jordan Batchelor, Mehrnoush Karimi, Agnes Malatinszky
Jun 30, 2021 20:40 UTC+2
—
Session PS1
—
Gather Town
Keywords: Corpus Linguistics, Text Readability, Natural Language Processing
Abstract:
In this paper, we introduce the CommonLit Ease of Readability (CLEAR) corpus. The corpus provides researchers within the educational data mining community with a resource from which to develop and test readability metrics and to model text readability. The CLEAR corpus offers a number of improvements over previous readability corpora, including its size (N = ~5,000 reading excerpts), the breadth of the excerpts available, which cover over 250 years of writing in two different genres, and the readability criterion used (teachers’ ratings of text difficulty for their students). This paper discusses the development of the corpus and presents reliability metrics as well as initial analyses of readability.