Metrics for Evaluation of Student Models

Radek Pelánek



Researchers use many different metrics to evaluate the performance of student models. This paper provides an overview of commonly used metrics, discusses their properties, advantages, and disadvantages, summarizes current practice in educational data mining, and offers guidance for evaluating student models. The discussion covers the relation of metrics to parameter fitting and the impact of student models on student practice (over-practice, under-practice), and points out connections to related work on the evaluation of probability forecasters in other domains. We also provide an empirical comparison of metrics. One conclusion of the paper is that some commonly used metrics should not be used (mean absolute error, MAE) or should be used more critically (area under the ROC curve, AUC).
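The abstract does not spell out why MAE is discouraged. One well-known property that may be relevant is that, for binary outcomes, the expected absolute error is minimized by an extreme prediction (0 or 1) rather than by the true probability, whereas the expected squared error (the basis of RMSE and the Brier score) is minimized by the true probability. The following is a minimal sketch of that property only. The function names, the grid search, and the value 0.7 are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): compare which prediction q
# minimizes the expected error when a student answers correctly with
# probability p_true. All names and values here are hypothetical.

def expected_absolute_error(p_true, q):
    # Outcome is 1 with probability p_true, 0 otherwise.
    return p_true * abs(1 - q) + (1 - p_true) * abs(0 - q)

def expected_squared_error(p_true, q):
    return p_true * (1 - q) ** 2 + (1 - p_true) * (0 - q) ** 2

def best_prediction(score, p_true, steps=100):
    # Grid search over predictions 0.00, 0.01, ..., 1.00.
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda q: score(p_true, q))

p_true = 0.7  # assumed true probability of a correct answer
best_mae = best_prediction(expected_absolute_error, p_true)
best_sq = best_prediction(expected_squared_error, p_true)

print("prediction minimizing expected absolute error:", best_mae)
print("prediction minimizing expected squared error:", best_sq)
```

Under these assumptions the absolute error rewards the overconfident prediction 1.0, while the squared error rewards the honest prediction 0.7. This is one reason to be cautious about MAE when a model is meant to output probabilities.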





