educationaldatamining.org

The Best Publicly Available Educational Data Set Prize

The Best Publicly Available Educational Data Set Prize is given annually to the data set that has led to, or has the potential to lead to, the most significant progress in the scientific field and the community of practice. Nominees are considered individually, with attention to both past and future impact. The prize-winning data set is chosen by a committee of leaders in the field, whose members are selected by the Board of Directors of the International Educational Data Mining Society. Current committee members are ineligible to receive the award; former committee members are eligible.

Award winners receive a prize of $2,000 and free registration to attend and potentially present an award talk at the 2027 International Conference on Educational Data Mining (EDM 2027).

Nominations are currently closed, and a form will be posted here when they reopen. Nomination announcements are sent to the IAIED list and/or edm-announce, and to the Learning Engineering Google Group. Please contact Anna Rafferty (arafferty@carleton.edu) if you have questions about the nomination process.

Award Winners

2026. Misconceptions in Mathematics Education Dataset. Data Set provided by Bethany Rittle-Johnson, Rebecca Adler, Kelley Durkin, L Burleigh, Jules King, and Scott Crossley.

The MIME dataset consists of student explanations of their responses to a variety of mathematics questions across grades 4-8, with expert annotation of student responses for specific misunderstandings. The dataset is augmented with synthetic explanations, crafted to be indistinguishable from real explanations to experts. The committee was impressed by the detailed annotation of misunderstandings, noting the frequent challenges of obtaining such data. A number of teams participated in a recent Kaggle competition using the data, demonstrating community interest, and the dataset has the potential to facilitate significant further research into mathematical misunderstandings and ways of addressing these misunderstandings. One compelling aspect of the dataset for the committee was that it has the potential to lead to insights and research on questions relevant to both pedagogy and machine learning.

2025. NCTE Transcript Data. Data Set provided by Dora Demszky and Heather Hill.

The NCTE Transcript Data focuses on observations of teachers and students in 4th and 5th grade elementary mathematics classrooms. The anonymized transcripts include significant metadata, such as annotation of discourse moves for each turn, information about teacher background and classroom practices, and demographics. It has been used in a variety of applications, including exploring how features of teacher and teacher-student language promote inquiry and growth mindset as well as evaluating and building generative AI systems.

2024. PERSUADE (Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements). Data Set provided by Perpetual Baffour, Terry Yu Tian, Meg Benner, Ulrich Boser, and Scott Crossley.

The PERSUADE dataset provides labels for more than 14,000 essays, covering the argumentative and rhetorical elements contained within each essay response. It also includes effectiveness ratings for these discourse elements, holistic quality scores for the essay responses, and student demographic information that includes grade level, race/ethnicity, economic background, and more. The dataset was developed as part of the Feedback Prize project, an initiative by Georgia State University and The Learning Agency Lab. The goal of the prize is to spur the development of open-source algorithms for assisted writing feedback tools and to help struggling students dramatically improve their writing. The dataset was used in a series of Kaggle challenges, which collectively drew over 10,000 participants. Because the data was publicly shared for research and other uses after the competition series, researchers have been able to develop further algorithms for writing feedback outside of competition. Additionally, because the PERSUADE dataset included strong diversity and fairness measures, it has been sought out by the Gemini AI team at Google to improve the help students receive when they submit essays through its chatbot.

2023. OULAD – Open University Learning Analytics Dataset. Data Set provided by Knowledge Media Institute, The Open University.

The OULAD data set contains information from courses presented at the Open University (OU). What makes the dataset unique is that it combines demographic data with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE, represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license. It is described in the paper "Open University Learning Analytics Dataset" (Kuzilek J., Hlosta M., Zdrahal Z., 2017), published in the Nature journal Scientific Data.

2022. Exploring Common Trends in Online Educational Experiments Data. Data Set provided by ASSISTments.

Data on thousands of students participating in one of 88 different assignment-level randomized controlled experiments performed within ASSISTments. Students’ clickstream data is provided in its raw form as well as aggregated into problem-level and assignment-level summaries of performance for each student who participated in an experiment. Additionally, information is provided on each student’s prior performance. So far, this data has been used to perform a meta-analysis of findings across similar experiments.

2021. The NeurIPS 2020 Education Challenge. Data Set provided by Eedi.

Data on millions of students’ answers to mathematics questions. The data was used in a scientific competition by dozens of researchers to predict student responses, determine question quality, and identify a personalized set of questions for each student.