Beyond Accuracy: Embracing Meaningful Parameters in Educational Data Mining
Napol Rachatasumrit
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
napol@cmu.edu
Paulo F. Carvalho
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
pcarvalh@cs.cmu.edu
Kenneth R. Koedinger
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
koedinger@cmu.edu

ABSTRACT

What does it mean for a model to be a better model? One common conceptualization in Educational Data Mining is that a better model is one that fits the data better, that is, one with higher prediction accuracy. However, models that maximize prediction accuracy often do not provide meaningful parameter estimates, making them less useful for building theory and informing practice. Here we argue that models that provide meaningful parameters are better models and, indeed, often also provide higher prediction accuracy. To illustrate our argument, we investigate the Performance Factors Analysis (PFA) model and the Additive Factors Model (AFM). PFA often has higher prediction accuracy than AFM, but PFA’s parameter estimates are ambiguous and confounded. We propose more interpretable models (AFMh and PFAh) designed to address the confounded parameters and use synthetic data to demonstrate PFA’s parameter interpretability issues. The results of an experiment with 27 real-world datasets also support our claims and show that more interpretable models can also produce better predictions.

Keywords

Additive Factors Model, Performance Factors Analysis, Student Modeling, Model Comparison, Knowledge Tracing

1. INTRODUCTION

In Educational Data Mining (EDM), the conventional wisdom suggests that a superior model is one that fits the data better. However, this perspective overlooks a critical aspect: models that prioritize prediction accuracy sometimes fall short of providing interpretable and meaningful parameter estimates. Yet, interpretable and meaningful model parameters are crucial for scientific and practical applications of the models we develop. For example, Koedinger et al. observed irregular slopes in learning curves for area planning, which led to the discovery of a better Knowledge Component (KC) model [6]. For such purposes, prediction accuracy is merely a means to an end, not the goal itself. An exception might be black-box models used purely for their enhanced predictive capabilities, for example within recommender systems, where they can produce great practical outcomes.

Unfortunately, recent trends in EDM research have predominantly concentrated on model prediction accuracy, often neglecting the meaningfulness of model parameters. Our goal in this paper is to demonstrate that meaningful parameter estimation is not a necessary consequence of more accurate model prediction. One prominent example is the Deep Knowledge Tracing (DKT) model [12], a knowledge tracing model based on Recurrent Neural Networks (RNNs) [15], which has been shown to achieve high prediction accuracy on many datasets even though the parameters in its networks are nearly uninterpretable. In this work, we perform this demonstration in the context of two popular models of student learning: Performance Factors Analysis (PFA) [11] and the Additive Factors Model (AFM) [2]. While PFA tends to produce better predictions than AFM, PFA’s parameter estimates are not meaningful because their interpretation is ambiguous. As we explain in more detail below, the slope parameters in PFA are difficult to interpret: they could reflect individual differences in learning rates, differences in prior knowledge or in the difficulty of specific student-KC combinations, different learning rates for successful versus unsuccessful attempts, or even “unlearning” from errors. Conversely, AFM’s slope is consistently and unambiguously interpretable as a learning rate [4].

To demonstrate how PFA’s parameters are confounded, we proposed and evaluated two alternative models (AFMh and PFAh) designed to unconfound the interaction between KCs and students. We demonstrated the capabilities of these alternative models on synthetic data generated from different models and configurations. We then conducted an experiment with 27 real-world datasets from DataShop [3] and found that PFA outperforms AFM in 17 datasets, but our further analysis with the new alternative models showed that PFA’s parameters are indeed difficult to interpret. We also argue for the importance of parameter interpretability by comparing AFM and PFA with the alternative models AFMh and PFAh, whose meaningful interpretations lead to potential insights and applications. In particular, we are interested in these research questions: (RQ1) Are PFA’s slope parameters confounded between student-KC interactions and separate learning rates for successful and failed attempts? (RQ2) Can models with meaningful, interpretable parameters fit the data as well as, or better than, the standard models?

2. RELATED WORK

2.1 DataShop

In this work, we use a variety of real-world datasets across different domains from the DataShop repository [3]. DataShop, an open data repository of the Pittsburgh Science of Learning Center (http://learnlab.org/datashop), provides educational data with associated visualization and analysis tools, and holds data from thousands of students derived from interactions with online course materials and intelligent tutoring systems, such as those built with CTAT [1].

In DataShop terminology, KCs represent pieces of knowledge, concepts, or skills that students need to solve problems or particular steps in problems [5]. When a specific set of KCs is mapped to a set of instructional tasks (usually steps in problems), they form a KC Model, which is a specific kind of student model.

2.2 AFM and PFA

The Additive Factors Model (AFM) [2] is a logistic regression that extends item response theory by incorporating a growth or learning term. The model gives the probability \(p_{ij}\), expressed in log-odds, that a student \(i\) will get a problem step \(j\) correct, where the KCs \(k\) related to the step are specified by the indicator \(q_{jk}\). The prediction is based on the student’s baseline ability (\(\theta _i\)), the baseline difficulty of the related KCs on the problem step (\(\beta _k\)), and the learning rate of each KC (\(\gamma _k\)). The learning rate represents the improvement on a KC with each additional practice opportunity, so it is multiplied by the number of practice opportunities (\(T_{ik}\)) that the student has already had on the KC:

\begin {equation} log(\frac {p_{ij}}{1-p_{ij}}) = \theta _i + \Sigma _{k}(q_{jk}\beta _k + q_{jk}\gamma _{k}T_{ik}) \ \end {equation}
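To make the notation concrete, the following sketch (ours, not from the original AFM implementation; all function and variable names are illustrative) computes the AFM log-odds and probability for a single student-step observation:

```python
import numpy as np

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def afm_log_odds(theta_i, beta, gamma, T_i, q_j):
    """Eq. 1: log-odds that student i answers step j correctly under AFM.

    theta_i : float        student i's baseline ability
    beta    : array (K,)   baseline difficulty parameter of each KC
    gamma   : array (K,)   learning rate of each KC
    T_i     : array (K,)   student i's prior practice opportunities per KC
    q_j     : array (K,)   0/1 indicators of the KCs involved in step j
    """
    return theta_i + np.sum(q_j * (beta + gamma * T_i))

# Example: one student, three KCs, step j involves KC 0 only.
p_ij = sigmoid(afm_log_odds(0.2,
                            np.array([-0.5, 0.1, 0.3]),
                            np.array([0.15, 0.20, 0.10]),
                            np.array([3, 0, 1]),
                            np.array([1, 0, 0])))
```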

Performance Factors Analysis (PFA) [11] is an extension of the AFM model that splits the number of practice opportunities (\(T_{ik}\)) into the number of successful opportunities (\(s_{ik}\)), where students successfully complete the problem steps, and the number of failed opportunities (\(f_{ik}\)), where students make errors. Each count has its own slope, \(\gamma _k\) for successes and \(\rho _k\) for failures:

\begin {equation} log(\frac {p_{ij}}{1-p_{ij}}) = \theta _i + \Sigma _{k}q_{jk}(\beta _k + \gamma _{k}s_{ik} + \rho _{k}f_{ik}) \ \end {equation}
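Continuing the illustrative sketch above, PFA only changes the practice term, replacing \(\gamma _k T_{ik}\) with separate slopes for prior successes and failures:

```python
import numpy as np

def pfa_log_odds(theta_i, beta, gamma, rho, s_i, f_i, q_j):
    """Eq. 2: like AFM, but prior practice is split into successful (s_i)
    and failed (f_i) opportunity counts per KC, each with its own per-KC
    slope (gamma for successes, rho for failures)."""
    return theta_i + np.sum(q_j * (beta + gamma * s_i + rho * f_i))
```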

While PFA tends to produce better predictions than AFM, its parameters are not particularly meaningful [8], largely because the interpretation of its slopes is ambiguous. One interpretation, consistent with the intention of PFA, is that these parameters capture individual differences in student mastery that are particular to KCs (i.e., student-KC interactions). Namely, students who make more errors on a KC than otherwise expected will master that KC more slowly than otherwise expected.

An alternative, and perhaps more straightforward, interpretation is that the success slope (S-slope; \(\gamma _k\)) and failure slope (F-slope; \(\rho _k\)) represent different learning rates for successful versus failed prior practice opportunities. An indication supporting this notion is the occasional occurrence of a negative F-slope, which, under this second interpretation, suggests that students fail to learn (or even “unlearn”) from unsuccessful attempts [8]. This interpretation is problematic since it implies that a true novice does not learn (or even unlearns) from making errors, which seems unlikely given modeling and empirical evidence that making errors can contribute significantly to positive learning, as long as feedback is provided [9, 13, 16].

In this work, we aim to demonstrate how the parameters in PFA are confounded, and we propose an extension of the existing models designed to unconfound the interactions between KCs and students from PFA’s slopes.

3. AFMh AND PFAh MODELS

In order to unconfound the student-KC interaction from the success and failure slopes, we need to add variables to the models that capture the student-KC interaction. A straightforward approach is to add a variable for each student-KC pair, but this can lead to overparameterization. Instead, we introduce a success-history variable (\(h_{ik}\)), the ratio between the number of successful past attempts on a KC (\(s_{ik}\)) and the total number of past attempts on that KC (\(t_{ik}\)). The intuition behind the success-history variable is that a student with better prior knowledge of a particular KC will have a higher success rate on that KC. We formulated \(h_{ik}\) so that its value is \(0.5\) at the first opportunity, because \(h_{ik}\) should remain distinguishable across consecutive failed attempts at the beginning. If \(h_{ik}\) started at \(0\), its value would remain \(0\) regardless of the number of initial failed attempts, which could be problematic for the model:

\begin {equation} h_{ik} = \frac {s_{ik} + 1}{t_{ik} + 2} \ \end {equation}
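A quick check of this definition (a hypothetical snippet mirroring Eq. 3) shows the intended behavior at and after the first opportunity:

```python
def success_history(s_ik, t_ik):
    """Eq. 3: smoothed success ratio; equals 0.5 before any attempts."""
    return (s_ik + 1) / (t_ik + 2)

print(success_history(0, 0))  # 0.5 at the first opportunity
print(success_history(0, 3))  # 0.2: three initial failures push h below 0.5
print(success_history(3, 3))  # 0.8: three initial successes push h above 0.5
```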

We incorporated the \(h_{ik}\) variable into the AFM and PFA models, adding the term \(q_{jk}\eta _{k}h_{ik}\), to create the AFMh and PFAh models. The equations for AFMh (Eq. 4) and PFAh (Eq. 5) are below.

\begin {equation} log(\frac {p_{ij}}{1-p_{ij}}) = \theta _i + \Sigma _{k}q_{jk}(\beta _k + \gamma _{k}T_{ik} + \eta _{k}h_{ik}) \ \end {equation}

\begin {equation} log(\frac {p_{ij}}{1-p_{ij}}) = \theta _i + \Sigma _{k}q_{jk}(\beta _k + \gamma _{k}s_{ik} + \rho _{k}f_{ik} + \eta _{k}h_{ik}) \ \end {equation}
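Since all four models are logistic regressions over engineered features, they can be fit with standard tools. Below is a minimal sketch of fitting PFAh (Eq. 5) under simplifying assumptions we introduce for illustration: single-KC steps, a long-format table with columns student, kc, s, f, and correct, and a recent scikit-learn (version 1.2 or later, for penalty=None). This is not the implementation used in this paper.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_pfah(df):
    """Fit PFAh (Eq. 5) as a plain logistic regression.

    df: one row per student-step attempt, with columns
    student, kc, s (prior successes), f (prior failures), correct (0/1).
    """
    df = df.copy()
    df["h"] = (df["s"] + 1) / (df["s"] + df["f"] + 2)  # Eq. 3, with t = s + f
    X = pd.concat(
        [
            pd.get_dummies(df["student"], prefix="theta"),  # student intercepts
            # Drop one KC intercept: full student and KC dummy sets are collinear.
            pd.get_dummies(df["kc"], prefix="beta", drop_first=True),
            pd.get_dummies(df["kc"], prefix="gamma").mul(df["s"], axis=0),  # S-slopes
            pd.get_dummies(df["kc"], prefix="rho").mul(df["f"], axis=0),    # F-slopes
            pd.get_dummies(df["kc"], prefix="eta").mul(df["h"], axis=0),    # h slopes
        ],
        axis=1,
    ).astype(float)
    model = LogisticRegression(penalty=None, max_iter=5000, fit_intercept=False)
    model.fit(X, df["correct"])
    return model, list(X.columns)
```

Dropping the \(\eta _k h_{ik}\) columns yields PFA; replacing the separate success and failure counts with a single opportunity count yields AFM and AFMh.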

Table 1: The expected best-fitting model for each dataset configuration. PFA is expected to be the best-fitting model when there are different learning rates and no student-KC interactions, but if there are strong student-KC interactions, PFAh is expected to be the best-fitting model. Similarly, if there is a single learning rate and no student-KC interactions, AFM is expected to be the best-fitting model, but if there are strong student-KC interactions, AFMh is expected to be the best-fitting model.

                                      No Interaction   With Interaction
1 slope (i.e., 1 learning rate)       AFM              AFMh
2 slopes (i.e., 2 learning rates)     PFA              PFAh

Table 2: BIC scores of all 4 models for each synthetic dataset with interaction SD = 0.2. The Best column indicates the best-fitting model. AFM is always the best-fitting model when the generating model is AFM, regardless of student-KC interactions. Similarly, PFA is always the best-fitting model when the generating model is PFA, regardless of student-KC interactions.

Students  KCs  Generation  Interaction  AFM        PFA        AFMh       PFAh       Best
10        8    AFM         No           1590.290   1627.946   1598.361   1636.017   AFM
10        8    AFM         Yes          1630.425   1662.996   1634.406   1669.617   AFM
10        8    PFA         No           2091.749   1436.743   1538.479   1444.813   PFA
10        8    PFA         Yes          2072.443   1514.381   1607.171   1522.153   PFA
10        16   AFM         No           3818.870   3880.883   3827.027   3885.613   AFM
10        16   AFM         Yes          3808.662   3868.290   3817.426   3877.054   AFM
10        16   PFA         No           4010.223   2807.466   2893.398   2815.151   PFA
10        16   PFA         Yes          3949.252   2840.803   2913.090   2849.557   PFA
10        32   AFM         No           6114.022   6196.097   6121.329   6205.297   AFM
10        32   AFM         Yes          6042.236   6125.623   6051.586   6135.080   AFM
10        32   PFA         No           7925.592   6382.965   6676.408   6392.397   PFA
10        32   PFA         Yes          7823.461   6348.209   6673.301   6357.680   PFA
20        8    AFM         No           4791.102   4837.957   4799.797   4846.721   AFM
20        8    AFM         Yes          4601.883   4653.242   4610.647   4662.006   AFM
20        8    PFA         No           6755.818   6403.026   6700.326   6411.790   PFA
20        8    PFA         Yes          6728.999   6445.256   6715.965   6453.907   PFA
20        16   AFM         No           6520.145   6597.033   6529.602   6606.491   AFM
20        16   AFM         Yes          6334.954   6405.390   6342.483   6410.950   AFM
20        16   PFA         No           9840.107   8331.947   8969.829   8338.121   PFA
20        16   PFA         Yes          10059.017  8498.802   9050.723   8508.260   PFA
20        32   AFM         No           10894.995  10989.292  10905.136  10999.442  AFM
20        32   AFM         Yes          10614.447  10714.491  10624.598  10723.488  AFM
20        32   PFA         No           17967.629  14766.013  15470.549  14776.163  PFA
20        32   PFA         Yes          18373.613  14781.398  15415.666  14791.548  PFA
50        8    AFM         No           7752.478   7813.250   7762.159   7822.930   AFM
50        8    AFM         Yes          7465.130   7529.155   7474.811   7538.835   AFM
50        8    PFA         No           8978.669   6766.349   7572.593   6776.029   PFA
50        8    PFA         Yes          9386.140   7121.818   8032.094   7131.499   PFA
50        16   AFM         No           17436.148  17535.014  17446.522  17545.388  AFM
50        16   AFM         Yes          17380.842  17468.669  17390.404  17478.980  AFM
50        16   PFA         No           23980.442  17452.077  19262.037  17462.450  PFA
50        16   PFA         Yes          23881.545  17732.729  19555.968  17743.103  PFA
50        32   AFM         No           28246.575  28398.769  28257.642  28409.835  AFM
50        32   AFM         Yes          28505.827  28648.146  28515.574  28658.121  AFM
50        32   PFA         No           33787.825  30985.826  31862.632  30996.893  PFA
50        32   PFA         Yes          35348.852  32002.575  32923.707  32013.642  PFA

4. EXPERIMENTS

We conducted two experiments, on synthetic data and real student data, to evaluate the performance of the new models (AFMh and PFAh) against the standard models (AFM and PFA). We used the Bayesian information criterion (BIC) [10] as the main metric to compare model performance. Our hypothesis is that if there are strong student-KC interactions, the \(h\) models will outperform the standard models; if there are different learning rates for successful and failed attempts, the PFA-based models (i.e., PFA and PFAh) will fit better; and if there is a single learning rate (i.e., slope), the AFM-based models will perform better. These hypotheses are summarized in Table 1. Additionally, if PFA parameters are indeed confounded between student-KC interactions and two learning rates, we expect PFA to outperform AFM in configurations with either student-KC interactions or two learning rates (or both), that is, in all configurations except a single learning rate with no interaction. Consequently, if the \(h\) variable is in fact able to unconfound them by capturing the student-KC interactions, PFAh and AFMh will outperform PFA in their corresponding configurations.
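For reference, we compute BIC in its standard form [10], \(\mathrm {BIC} = k\ln (n) - 2\ln (\hat {L})\), where \(\hat {L}\) is the maximized likelihood of the fitted model, \(k\) is the number of estimated parameters, and \(n\) is the number of observations; lower BIC indicates a better fit after penalizing model size.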

4.1 Experiment 1: Synthetic Data

4.1.1 Methods

In this experiment, we aim to validate the efficacy of the newly developed models in capturing the interaction dynamics between students and KCs. To achieve this, we evaluate the models on synthetic data with known characteristics, sampling model parameters such as student intercepts, KC intercepts, and KC slopes from normal distributions with statistical properties similar to those observed in real-student data. We generated synthetic datasets based on either the AFM or PFA model, which serves as the ground truth for student error rates and correctness [14]. Specifically, AFM generates datasets that assume a single learning rate (i.e., slope), whereas PFA generates datasets that assume different learning rates for successful and failed attempts. To emulate the student-KC interactions observed in real-world scenarios, we augmented datasets with student-KC interaction effects, sampling values from a normal distribution to reflect the variance in student performance specific to each KC. Overall, we created 18 dataset groups, varying the number of students (10, 20, and 50), the number of KCs (8, 16, and 32), and the strength of the student-KC interactions (SD = 0.2 and 1.2); each configuration was used to generate 4 datasets, one from each generating model (AFM, PFA, AFM+Interaction, and PFA+Interaction), forming the 2x2 design corresponding to Table 1. The standard deviations used to simulate student-KC interactions were selected based on the standard deviation of student intercepts across all real students in our datasets, estimated using AFM. We used this value as an estimate of the likely amount of variation in student intercepts in a dataset, which serves as a proxy for reasonable variation in student-KC interactions. We evaluated all four models (AFM, PFA, AFMh, and PFAh) on each dataset. Table 2 and Table 3 show the BIC scores for each model on each dataset in this experiment and summarize the best-fitting models by BIC score.
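As an illustration of this generation procedure, the sketch below simulates responses from AFM with an optional student-KC interaction term; the specific distributions and constants are placeholders, not the exact values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_afm(n_students, n_kcs, n_opps, interaction_sd=0.0):
    """Simulate correct/incorrect responses from AFM (Eq. 1), optionally
    adding a per-(student, KC) intercept to induce student-KC interactions."""
    theta = rng.normal(0.0, 1.0, n_students)        # student intercepts
    beta = rng.normal(0.0, 1.0, n_kcs)              # KC intercepts
    gamma = np.abs(rng.normal(0.3, 0.1, n_kcs))     # KC learning rates
    inter = rng.normal(0.0, interaction_sd, (n_students, n_kcs))
    rows = []
    for i in range(n_students):
        for k in range(n_kcs):
            for t in range(n_opps):                 # t = prior opportunities T_ik
                logit = theta[i] + beta[k] + gamma[k] * t + inter[i, k]
                p = 1.0 / (1.0 + np.exp(-logit))
                rows.append((i, k, t, int(rng.random() < p)))
    return rows  # (student, kc, opportunity, correct) tuples

# e.g., one of the "AFM+Interaction" configurations:
data = simulate_afm(n_students=10, n_kcs=8, n_opps=10, interaction_sd=1.2)
```

The PFA-based generator differs only in tracking prior successes and failures per student-KC pair and applying \(\gamma _k\) and \(\rho _k\) to them separately.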

Table 3: BIC scores of all 4 models for each synthetic dataset with interaction SD = 1.2. The Best column indicates the best-fitting model. AFM is always the best-fitting model when the generating model is AFM without student-KC interactions, but AFMh is the best-fitting model when there are student-KC interactions. Similarly, PFA is always the best-fitting model when the generating model is PFA without student-KC interactions, but PFAh is usually the best-fitting model when there are student-KC interactions.

Students  KCs  Generation  Interaction  AFM        PFA        AFMh       PFAh       Best
10        8    AFM         No           1051.481   1094.670   1059.552   1102.728   AFM
10        8    AFM         Yes          1117.250   1110.974   1095.651   1121.092   AFMh
10        8    PFA         No           2086.542   1736.834   1768.927   1744.905   PFA
10        8    PFA         Yes          2442.974   1779.640   1788.976   1778.851   PFAh
10        16   AFM         No           2209.120   2267.256   2217.864   2276.020   AFM
10        16   AFM         Yes          2412.882   2359.565   2333.930   2359.085   AFMh
10        16   PFA         No           3741.063   3585.428   3684.478   3594.192   PFA
10        16   PFA         Yes          4298.942   3809.425   3870.989   3807.412   PFAh
10        32   AFM         No           6362.627   6444.527   6371.700   6453.985   AFM
10        32   AFM         Yes          7290.315   6785.575   6770.986   6784.784   AFMh
10        32   PFA         No           10103.516  8081.974   8434.942   8091.431   PFA
10        32   PFA         Yes          10653.994  8404.126   8559.083   8410.545   PFA
20        8    AFM         No           2387.151   2438.373   2395.171   2447.137   AFM
20        8    AFM         Yes          2811.167   2698.942   2661.740   2695.280   AFMh
20        8    PFA         No           5208.531   4508.708   4661.685   4515.641   PFA
20        8    PFA         Yes          5448.687   4676.877   4718.731   4649.611   PFAh
20        16   AFM         No           5605.182   5687.103   5614.639   5696.560   AFM
20        16   AFM         Yes          6109.225   5905.782   5833.515   5876.967   AFMh
20        16   PFA         No           10155.346  7978.861   8504.096   7988.318   PFA
20        16   PFA         Yes          11099.476  8051.809   8196.360   8011.967   PFAh
20        32   AFM         No           11602.318  11720.229  11612.225  11730.379  AFM
20        32   AFM         Yes          12897.355  11902.381  11796.091  11832.277  AFMh
20        32   PFA         No           18625.785  14559.687  15284.133  14569.251  PFA
20        32   PFA         Yes          20953.855  14522.347  14889.161  14501.333  PFAh
50        8    AFM         No           9270.245   9337.691   9279.925   9347.372   AFM
50        8    AFM         Yes          10248.059  9472.805   9301.816   9334.143   AFMh
50        8    PFA         No           13377.323  10083.043  10708.542  10092.723  PFA
50        8    PFA         Yes          14207.732  9690.340   9895.612   9638.426   PFAh
50        16   AFM         No           16027.836  16120.648  16038.208  16130.733  AFM
50        16   AFM         Yes          17820.780  16525.557  16326.445  16361.036  AFMh
50        16   PFA         No           19711.027  15708.241  16163.369  15718.614  PFA
50        16   PFA         Yes          23266.309  16106.685  16374.813  15996.808  PFAh
50        32   AFM         No           24554.830  24708.746  24565.897  24719.813  AFM
50        32   AFM         Yes          27686.058  25585.924  25288.177  25326.152  AFMh
50        32   PFA         No           47960.208  38961.412  40581.090  38972.479  PFA
50        32   PFA         Yes          52031.370  40238.448  40847.476  40038.740  PFAh

4.1.2 Results

As shown in Table 2, when the student-KC interaction is weak (SD = 0.2), AFM and PFA are the best-fitting models in all datasets, depending on the generating model (i.e., AFM is the best-fitting model when the generating model is AFM, and PFA is the best-fitting model when the generating model is PFA). However, when the student-KC interaction is strong (SD = 1.2), the model corresponding to the generating method (e.g., AFMh for AFM+Interaction) is the best-fitting model in all datasets except one (students = 10, KCs = 32, method = PFA+Interaction), as shown in Table 3. In other words, when there is a reasonably strong interaction between students and KCs, the models with the \(h\) variable consistently outperform the standard models. Moreover, the results show that PFA consistently outperforms AFM when there are student-KC interactions, even when the base generating model is AFM, in which case AFMh in turn consistently outperforms PFA. This supports our hypothesis that PFA parameters are confounded between the student-KC interactions and two learning rates, and that the \(h\) variable is able to unconfound them by capturing the student-KC interactions. Overall, these results also demonstrate the capability of the \(h\) models to capture the dynamics of student-KC interactions.

4.2 Experiment 2: Real Student Data

4.2.1 Methods

We conducted an experiment with 27 real-world datasets from DataShop across different domains (e.g., geometry, fractions, physics, statistics, English articles, Chinese vocabulary), educational levels (e.g., grades 5 to 12, college, adult learners), and settings (e.g., in class vs. out of class as homework). We evaluated all four models (AFM, PFA, AFMh, and PFAh) on each dataset. Table 4 shows the BIC score obtained when fitting each model on each dataset in this experiment.

4.2.2 Results

Table 4 shows the BIC score of each model on each real-student dataset. When comparing AFM and PFA, PFA outperforms AFM in 17 out of 27 datasets, replicating prior evidence. However, when comparing all four models, PFA is the best-fitting model in only one dataset (where the difference in BIC score is relatively small), while AFM is the best-fitting model in 4 datasets. AFMh and PFAh are the best-fitting models in 11 datasets each. Among the 17 datasets in which PFA outperforms AFM, AFMh is the best-fitting model in 5. In fact, AFMh outperforms PFA in 24 out of 27 datasets, in contrast to PFAh, which outperforms PFA in only 13 out of 27 datasets. Overall, the results demonstrate that the \(h\) models usually fit the data better than the standard models: they are the best-fitting models in 22 out of 27 datasets.

5. DISCUSSION

5.1 RQ1: Confounding Parameters in PFA

From both the synthetic datasets and the real-student datasets, we demonstrated that PFA is usually a better-fitting model than AFM: 45 out of 72 synthetic datasets (63%) and 17 out of 27 real-student datasets (63%). However, we argued that the interpretation of the parameters in PFA is not meaningful because their slope interpretation is ambiguous between individual differences in student mastery that are particular to KCs (i.e., student-KC interactions) and different learning rates for successes and failures, which in turn makes PFA’s superiority questionable. The results from both experiments and our alternative models support this hypothesis.

Figure 1: SD of Residuals vs \(\eta _k\). The residuals and \(\eta _k\) are positively correlated.

In the synthetic data experiment, we demonstrated the capability of AFMh and PFAh to capture the interactions between students and KCs: these models outperform standard AFM and PFA when interactions are incorporated into the synthetic datasets. In particular, PFAh effectively handles the confounded slopes in PFA because the added \(\eta _{k}\) captures the interactions while the slopes capture different rates of learning from errors and successes. It is worth noting that PFA also outperforms AFM in all strong-interaction datasets generated by anything other than plain AFM without interaction, including AFM with interaction. In other words, PFA is the better-fitting model whenever the generating method includes either student-KC interactions or separate slopes for errors and successes (or both), which attests that the PFA parameters are indeed confounded.

This claim is further validated by the experiment with the real-student datasets. Of the 27 datasets, PFA produces better predictions than AFM on 17 of them; so, indeed, PFA is generally a more predictive model even if it is less interpretable than AFM. However, for 16 of these 17 datasets, one of the new, more meaningful models, AFMh (5 out of 17) or PFAh (11 out of 17), yields better predictions than PFA. In other words, PFA is rarely the best-fitting model when we compare it with models designed to separately capture the student-KC interactions. Moreover, even though PFA outperforms AFM in the majority of the datasets, when compared with PFAh and AFMh it is the best model in only one of those 17 datasets (6%). On the contrary, AFM is the best model in four of the ten datasets in which it outperforms PFA (40%). Generally, the results also show that it is possible for a model to be both interpretable and produce better predictions, as evidenced by AFMh and PFAh.

Table 4: BIC scores of all 4 models on 27 real-student datasets. The better-fitting model between AFM and PFA can be read by comparing their columns; the Best column indicates the best-fitting model among all 4 models.

DS    AFM        PFA        AFMh       PFAh       Best
99    14568.873  14564.965  14506.087  14522.619  AFMh
104   6965.241   6978.620   6957.865   6987.335   AFMh
115   20752.969  20612.962  20722.641  20622.806  PFA
253   14598.394  14585.407  14563.883  14585.933  AFMh
271   1277.940   1305.424   1283.093   1309.691   AFM
308   3072.037   3115.442   3079.713   3120.485   AFM
1980  6920.579   6944.683   6917.875   6951.888   AFMh
372   6283.754   6213.442   6207.816   6222.314   AFMh
1899  5541.982   5555.805   5534.952   5564.308   AFMh
392   29177.451  29005.429  29006.499  28994.564  PFAh
394   5580.649   5557.175   5550.959   5565.836   AFMh
445   4964.794   4971.661   4945.798   4978.275   AFMh
562   57459.694  56460.229  56410.123  56355.453  PFAh
563   58377.219  57007.220  56876.034  56840.820  PFAh
564   67622.473  66165.224  66035.163  65999.477  PFAh
565   60111.965  57395.729  57057.449  56987.445  PFAh
566   64040.573  63603.997  63459.030  63470.794  AFMh
567   49015.532  48010.910  48117.234  48009.947  PFAh
605   3355.982   3381.284   3361.952   3388.193   AFM
1935  8034.666   8052.826   8027.439   8060.300   AFMh
1330  49749.563  49698.893  49623.904  49622.238  PFAh
447   87354.605  85040.246  84523.160  84499.571  PFAh
531   110398.18  106320.62  106032.06  105714.36  PFAh
1943  127785.50  120277.02  118027.78  117993.15  PFAh
1387  3298.273   3324.936   3300.726   3330.990   AFM
1007  3720.511   3738.319   3688.687   3723.710   AFMh
4555  36957.404  36506.379  36365.781  36349.639  PFAh

5.2 RQ2: Meaningful Parameters

We return to the claim that the meaningfulness and interpretability of model parameters supersede goodness-of-fit or prediction accuracy. The results with real-student datasets demonstrate that AFMh and PFAh are usually better-fitting models than standard AFM or PFA, but the question remains: do these models hold meaningful interpretations, particularly concerning the parameter \(\eta _k\) associated with the \(h\) variable?

It is essential to distinguish between the \(h_{ik}\) variable and its associated estimated parameter, \(\eta _k\). Defined in Eq. 3, the \(h\) variable denotes the (smoothed) ratio of successful past attempts to total past attempts, positing that students with higher prior knowledge of a specific KC exhibit comparatively higher \(h\) values; \(h_{ik}\) is deterministically calculated from the data. Its parameter \(\eta _k\), on the other hand, is estimated by fitting the model to the data and indicates the relative influence of the variable on predicting the outcome.

In a meaningful model, parameter estimates typically offer clear interpretations. For instance, in AFM, the student intercept represents the student’s prior knowledge, while the KC intercept reflects the difficulty of the KC. But what insights does \(\eta _k\) offer?

To answer this question, we investigated the relationship between \(\eta _k\) and the residuals, the differences between the actual outcomes and the model predictions, for each student on the corresponding KCs. In particular, we investigated the ds99 dataset, where \(\eta _k\) ranges from -0.46 to 3.95 (\(\mu =1.12\)). Consider first the \(h_{ik}\) variable. When a KC has a strong variance in its student-KC interactions, meaning that some students are very strong while others are very weak on the KC, we also expect a high variance in \(h_{ik}\) for that KC. In contrast, when the student-KC interactions have a weak variance, \(h_{ik}\) is also expected to have a low variance. As a result, \(\eta _{k}\) should be correlated with the variance of the corresponding student-KC interactions. The result from the real-student data, shown in Fig. 1, supports this hypothesis: the standard deviation of the residuals and \(\eta _{k}\) are indeed positively correlated.
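A sketch of this analysis follows (hypothetical names: df holds one row per attempt with the model’s prediction p_hat, and eta maps each KC to its fitted \(\eta _k\)); it is illustrative of the computation behind Fig. 1, not our exact code:

```python
import numpy as np
import pandas as pd

def eta_vs_residual_sd(df, eta):
    """Correlate each KC's fitted eta_k with the standard deviation of
    that KC's residuals (actual outcome minus predicted probability)."""
    df = df.assign(residual=df["correct"] - df["p_hat"])
    sd_by_kc = df.groupby("kc")["residual"].std()   # per-KC residual SD
    etas = pd.Series(eta).reindex(sd_by_kc.index)   # align eta_k to KCs
    return np.corrcoef(etas, sd_by_kc)[0, 1]
```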

Consequently, \(\eta _{k}\) can be interpreted as representing the variance of the student-KC interactions for the associated KC. In other words, when \(\eta _{k}\) is high, some students are very good at the KC while other students are not. For example, number-letter is a KC with a relatively high \(\eta _{k}\) from the English Article Tutor. The number-letter KC describes a skill that involves selecting an English article (i.e., "a" or "an") to fill in the blank. Examples of problems with the number-letter KC are "This is the first time that I’ve received ___ ’99’ on a test." and "My name begins with ___ ’L’.". Some, perhaps otherwise struggling, students may learn this skill faster because they happen to focus on the sound of the letter in the following noun and whether it is a vowel or consonant sound. Other, perhaps otherwise strong, students may learn this skill more slowly because they focus on the written letter and whether it is a vowel or consonant. This latter encoding sometimes works, so it is non-trivial to reject during early induction if a learner thinks of it. However, it produces errors and slows down learning overall. On the other hand, when \(\eta _{k}\) is low, most students are similarly good at the given KC, so the differences in their performance depend on their overall characteristics, such as student intercepts (prior knowledge). A corollary of this finding is that when \(\eta _{k}\) is low, students perform as expected from the model’s prediction due to the small variance of the residuals (Fig. 2). Conversely, students do not perform as expected on KCs where \(\eta _{k}\) is large (Fig. 3). Taken together, these results demonstrate that the \(h\) models are not only better-fitting models, but their parameters are also meaningful and interpretable. To illustrate the usefulness of these meaningful interpretations, the above suggests a change in the KC model and associated instruction so that the number-letter KC becomes unambiguous and the variance in students’ learning is reduced.

The implications of an interpretable knowledge tracing model with better predictive power are immense, especially for practical applications. For example, Liu and Koedinger demonstrate that meaningful interpretations of AFM parameters (e.g., the learning rates in knowledge components’ slopes) can lead to new scientific insights (e.g., improved cognitive model discovery) and result in useful practical applications (e.g., an intelligent tutoring system redesign) [7]. Similarly, our work has many potential practical applications, such as improved ITS design, better student tracing, and overall improvements to the use of model parameters to make decisions about student learning and mastery.

Figure 2: Actual Outcomes vs Predicted Outcomes (\(\eta _k\)=0.16). When \(\eta _k\) is low, students are performing as expected from the model’s prediction.
Figure 3: Actual Outcomes vs Predicted Outcomes (\(\eta _k\)=3.35). When \(\eta _k\) is high, students are not performing as expected from the model’s prediction.

6. CONCLUSIONS AND FUTURE WORK

In this work, we argued that models with high prediction accuracy do not necessarily exhibit meaningful parameter estimates, which are important for scientific and practical applications. We demonstrated this claim in the context of PFA using both synthetic data and real-student data. The results supported our hypothesis that while PFA is a better-fitting model than AFM, the interpretation of its parameters is ambiguous. Further, we proposed the new models AFMh and PFAh, which add a success-history variable (\(h_{ik}\)) designed to capture student-KC interactions to the existing models. We evaluated their capabilities with both synthetic data and real-student data and demonstrated that the new models are both more interpretable and better fitting than PFA.

While \(h_{ik}\) works reasonably well as a proxy for student-KC interactions, in future work it might be important to test a model with explicit student-KC interaction terms, though this might require an intractable number of parameters. In addition, other possible formulations of the \(h_{ik}\) variable could be interesting to experiment with, such as centering \(h_{ik}\) at 0 instead of 0.5 or using a logarithmic form.

7. ACKNOWLEDGEMENTS

The author(s) disclosed the following financial support for the research, authorship, and/or publication of this article: The preparation of this manuscript was partially supported by National Science Foundation grant #2301130.

8. REFERENCES

  1. V. Aleven, B. M. McLaren, J. Sewall, and K. R. Koedinger. The cognitive tutor authoring tools (ctat): Preliminary evaluation of efficiency gains. In Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006. Proceedings 8, pages 61–70. Springer, 2006.
  2. H. Cen, K. Koedinger, and B. Junker. Learning factors analysis–a general method for cognitive model evaluation and improvement. In International conference on intelligent tutoring systems, pages 164–175. Springer, 2006.
  3. K. R. Koedinger, R. S. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper. A data repository for the edm community: The pslc datashop. Handbook of educational data mining, 43:43–56, 2010.
  4. K. R. Koedinger, P. F. Carvalho, R. Liu, and E. A. McLaughlin. An astonishing regularity in student learning rate. Proceedings of the National Academy of Sciences, 120(13):e2221311120, 2023.
  5. K. R. Koedinger, A. T. Corbett, and C. Perfetti. The knowledge-learning-instruction framework: Bridging the science-practice chasm to enhance robust student learning. Cognitive science, 36(5):757–798, 2012.
  6. K. R. Koedinger, J. C. Stamper, E. A. McLaughlin, and T. Nixon. Using data-driven discovery of better student models to improve student learning. In Artificial Intelligence in Education: 16th International Conference, AIED 2013, Memphis, TN, USA, July 9-13, 2013. Proceedings 16, pages 421–430. Springer, 2013.
  7. R. Liu and K. R. Koedinger. Closing the loop: Automated data-driven cognitive model discoveries lead to improved instruction and learning gains. Journal of Educational Data Mining, 9(1):25–41, 2017.
  8. C. Maier, R. S. Baker, and S. Stalzer. Challenges to applying performance factor analysis to existing learning systems. In Proceedings of the 29th International Conference on Computers in Education, 2021.
  9. J. Metcalfe. Learning from errors. Annual review of psychology, 68:465–489, 2017.
  10. A. A. Neath and J. E. Cavanaugh. The bayesian information criterion: background, derivation, and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2):199–203, 2012.
  11. P. I. Pavlik Jr, H. Cen, and K. R. Koedinger. Performance factors analysis–a new alternative to knowledge tracing. Online Submission, 2009.
  12. C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. Advances in neural information processing systems, 28, 2015.
  13. N. Rachatasumrit, P. F. Carvalho, S. Li, and K. R. Koedinger. Content matters: A computational investigation into the effectiveness of retrieval practice and worked examples. In International Conference on Artificial Intelligence in Education, pages 54–65. Springer, 2023.
  14. N. Rachatasumrit and K. R. Koedinger. Toward improving student model estimates through assistance scores in principle and in practice. International Educational Data Mining Society, 2021.
  15. R. M. Schmidt. Recurrent neural networks (rnns): A gentle introduction and overview. arXiv preprint arXiv:1912.05911, 2019.
  16. D. Weitekamp, Z. Ye, N. Rachatasumrit, E. Harpstead, and K. Koedinger. Investigating differential error types between human and simulated learners. In Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Proceedings, Part I 21, pages 586–597. Springer, 2020.