Parametric Constraints for Bayesian Knowledge Tracing from First Principles
Denis Shchepakin, Sreecharan Sankaranarayanan, and Dawn Zimmaro
Amazon.com, Inc., USA
{dshch, sreeis, dzimmaro}@amazon.com

ABSTRACT

Bayesian Knowledge Tracing (BKT) is a probabilistic model of a learner’s state of mastery for a knowledge component. The learner’s state is a “hidden” binary variable updated based on the correctness of the learner’s responses to questions corresponding to that knowledge component. The parameters used for this update are inferred/learned from historical ground truth data. For this, BKT is often represented as a Hidden Markov Model, and the Expectation-Maximization algorithm is used to infer the parameters. The algorithm can, however, suffer from issues including settling into local optima, producing degenerate parameter values (such as stating that learners who do not know the skill are more likely to answer correctly than those who do), and a high computational cost during fitting. To address these, we take a “from first principles” approach to derive necessary constraints that can be imposed on the BKT parameter space. Starting from the basic mathematical truths of probability and using conceptual behaviors expected of the BKT parameters in real systems, we derive succinct constraints for the BKT parameter space. As necessary conditions, applying the constraints prior to fitting reduces computational cost and mitigates the issues emerging from the EM procedure. We further introduce a novel algorithm for estimating BKT parameters subject to the newly defined constraints. While the issue of degenerate parameters has been reported previously, this paper is the first, to the best of our knowledge, to derive necessary constraints from first principles and also present an algorithm that respects those constraints.

Keywords

Adaptive Learning Systems, Student Modeling, Learner Modeling, Bayesian Knowledge Tracing, Hidden Markov Model, Expectation-Maximization

1. INTRODUCTION

Bayesian Knowledge Tracing (BKT) [6] was introduced as a way to model the changing knowledge states of students interacting with an adaptive learning system for skill acquisition. To this day, it remains the most widespread way to model student learning, predominantly owing to it being sufficiently expressive for many use cases [7, 16, 11, 15]. The model considers the learner’s state of mastery as a “hidden” or latent binary variable with two possible states – Mastery and Non-Mastery (sometimes called Proficient and Not-yet-proficient [4, 14], or Knows and Does-not-know [10]). It then uses four parameters – the initial probability of mastery, the transition probability from non-mastery to mastery over a single learning opportunity, the probability of a correct answer while the learner is in the non-mastery state (guess), and the probability of an incorrect answer while the learner is in the mastery state (slip) – to “predict” (i.e., calculate) whether a given learner is in the mastery state or not. In order to learn the values of these parameters, BKT is most often represented as a Hidden Markov Model [3], and the parameters are determined using an Expectation-Maximization (EM) algorithm [8, 5] on historical data.

As with any other EM algorithm, this may result in multiple sets of (highly dissimilar) parameters that fit the data equally well [3], which, in the case of BKT, affects interpretability. Further, the parameters may be degenerate, i.e., fit the data but violate the conceptual meaning, such as by stating that a learner is more likely to answer correctly if they don’t know the skill than if they do [1]. This can lead to incorrect decision-making in real systems. Finally, if the algorithm needs to be re-run after it is post-hoc determined to produce degenerate parameters, then that becomes computationally expensive. Several approaches have been suggested that can partially resolve the issue, including determining the starting values that lead to degenerate parameters [17] (and avoiding them), computing Dirichlet priors for each parameter and using those to bias the search [24], clustering parameters across similar skills [21, 12], and machine-learned models for some of the parameters [1]. Approaches that provably avoid degenerate parameters have also been discussed in the literature, but they instead sacrifice the precision provided by the EM algorithm [10]. Thus, in this paper, we first derive parametric constraints for the BKT parameters from first principles that, if satisfied, necessarily avoid degenerate parameters. Then, we present a novel EM algorithm that respects the derived constraints, thus allowing them to be used in practice.

While similar constraints have previously emerged from studying fixed points of the BKT model [23], here we derive them from first principles applied to the conceptual meaning of the modeled process. Moreover, we prove that these constraints, though less strict than the ones derived in [23], are sufficient. Finally, we also present a novel EM algorithm that respects these constraints.

2. DEFINING THE BKT MODEL

BKT assumes that for each knowledge component (KC), a learner can be in either the Proficient or the Not-yet-proficient state at a given point in time. After attempting an assessment, the learner receives feedback, either explicitly or by inferring it from whether their response was marked correct or incorrect. Each attempt is thus an opportunity to become proficient in the corresponding knowledge component. If the learner learns successfully, they transition from the not-yet-proficient state to the proficient state for that KC. Once the learner becomes proficient in a KC, they cannot transition back to the not-yet-proficient state (note that variations of the BKT model that incorporate “forgetting” allow this transition [13]). A BKT model is then constructed and applied for each KC independently.

Let \(L^{(d)}_t\) be the event that learner \(d\) is proficient after receiving \(t\) rounds of feedback; \(C^{(d)}_t\) the event that learner \(d\) answers assessment \(t\) correctly; \(G^{(d)}_t\) the event that learner \(d\) guesses the correct answer for assessment \(t\) while not being proficient; \(S^{(d)}_t\) the event that learner \(d\) makes a mistake (“slips”) at assessment \(t\) while being proficient; and \(R^{(d)}_t\) the event that a non-proficient learner \(d\) transitions to the proficient state after receiving \(t\) rounds of feedback. The classic BKT model assumes that the probabilities of guess, slip, and transition events are independent of the learner and the assessment and depend only on the learner’s proficiency state. Moreover, the initial proficiency probability \(P(L^{(d)}_0)\) is also assumed to be independent of the learner; rather, the proportion of learners in the population who are proficient before attempting their first assessment is used. We will therefore omit the redundant upper and lower indices for \(G\), \(S\), \(R\), and \(L_0\) events and their probabilities. See Figure 1 for an outline of the model.

\(P(L_0)\) (Prior Proficiency): the probability the learner is in the proficient state for the KC prior to the first feedback.
\(P(G)\) (Guess): the probability of a correct guess at an assessment while not being proficient at the corresponding KC.
\(P(S)\) (Slip): the probability of a slip (mistake) at an assessment while being proficient at the corresponding KC.
\(P(R)\) (Transition): the probability of a learner becoming proficient in the KC after making an attempt at an assessment and reading the feedback.
\(P(L^{(d)}_t)\) (Proficiency): the probability of learner \(d\) being in a proficient state after receiving \(t\) rounds of feedback.
\(P(C^{(d)}_t)\) (Correctness of an Attempt): the probability of learner \(d\) answering assessment \(t\) correctly.
Figure 1: BKT Parameters and Notation

The BKT model defines \(P(C^{(d)}_{t+1})\) thus – \begin {equation} \label {eq:bkt:correct_answer} P(C^{(d)}_{t+1}) = P(L^{(d)}_t) \cdot (1 - P(S)) + ( 1- P(L^{(d)}_t) ) \cdot P(G) \end {equation} Using Bayes’ rule, we get – \begin {equation} \label {eq:bkt:bayes_update} \begin {array}{c} P(L^{(d)}_t|C^{(d)}_{t+1}) = \frac {P(L^{(d)}_t) \cdot ( 1 - P(S) ) }{ P(L^{(d)}_t) \cdot ( 1 - P(S) ) + ( 1 - P(L^{(d)}_t) ) \cdot P(G)},\\ \\ P(L^{(d)}_t|\overline {C^{(d)}_{t+1}}) = \frac {P(L^{(d)}_t) \cdot P(S) }{ P(L^{(d)}_t) \cdot P(S) + ( 1 - P(L^{(d)}_t) ) \cdot ( 1 - P(G) )} \end {array} \end {equation} where \(\overline {C^{(d)}_t}\) is the event complementary to \(C^{(d)}_t\), i.e., the event that learner \(d\) answers assessment \(t\) incorrectly. After an attempt at the assessment and receiving feedback, the learner has a chance of transitioning if they are not already proficient – \begin {equation} \label {eq:bkt:transitioning} P(L^{(d)}_{t+1}) = P(L^{(d)}_t | \boldsymbol {\cdot } ) + P(R) \cdot ( 1 - P(L^{(d)}_t | \boldsymbol {\cdot } ) ) \end {equation} where \(P(L^{(d)}_t | \boldsymbol {\cdot } )\) is either \(P(L^{(d)}_t|C^{(d)}_{t+1})\) or \(P(L^{(d)}_t|\overline {C^{(d)}_{t+1}})\) depending on the collected data for the learner.

Knowing the values of all parameters of the BKT model allows us to predict the probability of learner \(d\) being proficient in the KC, \(P(L^{(d)}_t)\) (which we will refer to simply as the “proficiency” of learner \(d\)).
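For concreteness, the update rules above take only a few lines of code. The following minimal sketch (our illustration; the function name and the parameter values are not from the paper) applies the Bayes update from (\eqref{eq:bkt:bayes_update}) followed by the transition step from (\eqref{eq:bkt:transitioning}) –

```python
def bkt_update(p_L, correct, p_G, p_S, p_R):
    """One BKT step: Bayes update on the observed answer, then transition."""
    if correct:
        # P(L_t | C_{t+1}), first case of the Bayes update
        posterior = p_L * (1 - p_S) / (p_L * (1 - p_S) + (1 - p_L) * p_G)
    else:
        # P(L_t | complement of C_{t+1}), second case of the Bayes update
        posterior = p_L * p_S / (p_L * p_S + (1 - p_L) * (1 - p_G))
    # Transition: a non-proficient learner may become proficient
    return posterior + p_R * (1 - posterior)

p_L = 0.45                        # P(L_0), an illustrative prior
for answer in [1, 0, 1, 1]:       # observed correctness of four attempts
    p_L = bkt_update(p_L, answer, p_G=0.25, p_S=0.1, p_R=0.3)
    print(round(p_L, 3))
```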

3. RESTRICTIONS ON THE BKT PARAMETERS

Prior to estimating the BKT parameters, we need to place some restrictions on them to maintain the conceptual meaning of the modeled process when used in real systems. The results obtained in this section are not learner-specific. Thus, for the sake of readability, we will omit all learner-specific indices in this section, e.g., we will use \(L_t\) instead of \(L^{(d)}_t\).

First, it does not make sense for \(P(S)\) or \(P(G)\) to be \(0\), since that would eliminate their use as parameters entirely. \(P(G)\) being \(1\) would mean that a learner in the non-mastery state would necessarily guess and get the question right each time, which is unrealistic. Similarly, \(P(S)\) being equal to \(1\) would mean that a learner in the mastery state would necessarily slip and get the question wrong each time, which contradicts the very definition of mastery. Thus, our first constraint is that \(P(S)\) and \(P(G)\) both lie strictly between \(0\) and \(1\). Next, \(P(R)\) is also strictly between \(0\) and \(1\). If \(P(R) = 0\), then learners cannot transition, and the learning experience is a priori useless. If \(P(R) = 1\), then the learning experience has a 100% success rate. Neither situation can be guaranteed in practice. Next, if \(P(L_0) = 0\), then from (\eqref{eq:bkt:bayes_update}) - (\eqref{eq:bkt:transitioning}), all \(P(L_t) = 0\), which is an uninteresting scenario to consider. Moreover, if \(P(L_t) = 0\), then from (\eqref{eq:bkt:transitioning}), it follows that \(P(L_{t - 1} | \boldsymbol {\cdot }) = 0\). And from (\eqref{eq:bkt:bayes_update}), we can see this is possible only if \(P(L_{t-1}) = 0\). Therefore, by induction, \(P(L_t) = 0\) for some \(t\) if and only if \(P(L_0) = 0\). Similarly, \(P(L_t) = 1\) if and only if \(P(L_0) = 1\). Thus, we can assume –

  1. \(0 < P(G) < 1\),
  2. \(0 < P(S) < 1\),
  3. \(0 < P(R) < 1\),
  4. \(0 < P(L_t) < 1\) for all \(t=0,\cdots ,T\).

There are some additional restrictions we can add for the BKT parameters. Namely, we want correct answers to increase our estimate of the learner’s proficiency both before and after the transition. Similarly, incorrect answers should lower the estimate. \begin {equation} \label {eq:bkt:condition_pre-transition_general} P(L_t | C_{t+1}) \geq P(L_t) \geq P(L_t | \overline {C_{t+1}}) \end {equation} and \begin {equation} \label {eq:bkt:condition_post-transition_general} P(L_{t + 1} | C_{t+1}) \geq P(L_t) \geq P(L_{t + 1} | \overline {C_{t+1}}) \end {equation}

The inequalities in (\eqref{eq:bkt:condition_pre-transition_general}) yield a natural restriction on the parameters – \begin {equation} 1 - P(S) \geq P(G) \end {equation} that is, the probability of answering an assessment correctly is higher if a learner is proficient. Moreover, this restriction is also sufficient for the left inequality in (\eqref{eq:bkt:condition_post-transition_general}) to be true. Let us prove these statements. Proof. Let us consider the first three inequalities.

  1. \(P(L_t|C_{t+1}) \geq P(L_t)\) \[\frac {P(L_t) \cdot ( 1 - P(S) ) }{ P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) )} \geq P(L_t),\] \[\frac { 1 - P(S) }{ P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) )} \geq 1,\] \[1 - P(S) \geq P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) ),\] \[1 - P(S) - P(G) \geq P(L_t) \cdot ( 1 - P(S) -P(G)),\] \[(1 - P(S) - P(G)) \cdot (1 - P(L_t)) \geq 0,\] which is always true if and only if \(1 - P(S) - P(G) \geq 0\).
  2. \(P(L_t) \geq P(L_t|\overline {C_{t+1}})\): \[P(L_t) \geq \frac {P(L_t) \cdot P(S)}{P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) )},\] \[1 \geq \frac {P(S)}{P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) )},\] \[P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) ) \geq P(S),\] \[(1 - P(L_t)) \cdot (1 - P(G) - P(S)) \geq 0,\] which is always true if and only if \(1 - P(S) - P(G) \geq 0\).
  3. \(P(L_{t + 1} | C_{t+1}) \geq P(L_t)\): \begin {multline*} P(L_{t+1} | C_{t+1}) = P(L_t | C_{t+1} ) \\+ P(R) \cdot ( 1 - P(L_t | C_{t+1} ) ) \geq P(L_t), \end {multline*} \[P(L_t | C_{t+1} ) \cdot (1 - P(R)) + P(R) \geq P(L_t),\] where left-hand side is \[\frac {P(L_t) \cdot ( 1 - P(S) ) \cdot (1 - P(R))}{ P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) )} + P(R)\] \[=\frac {P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) ) \cdot P(R)}{ P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) )}.\] It follows \begin {multline*} P(L_t) \cdot ( 1 - P(S) ) + P(G) \cdot ( 1 - P(L_t) ) \cdot P(R)\\ \geq P(L_t)^2 \cdot ( 1 - P(S) ) + P(G) \cdot P(L_t) \cdot ( 1 - P(L_t) ), \end {multline*} and can be further simplified to \begin {multline*} P(L_t) \cdot (1 - P(S)) \cdot (1 - P(L_t)) \\+ (1 - P(L_t)) \cdot P(G) \cdot (P(R) - P(L_t)) \geq 0, \end {multline*} \[P(L_t) \cdot (1 - P(S)) + P(G) \cdot (P(R) - P(L_t)) \geq 0,\] \[P(L_t) \cdot (1 - P(S) - P(G)) + P(G) \cdot P(R) \geq 0,\] which is true if \(1 - P(S) - P(G) \geq 0\).

\(\square\)
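A quick numerical spot-check of (\eqref{eq:bkt:condition_pre-transition_general}), with illustrative values of our choosing that satisfy \(1 - P(S) - P(G) \geq 0\) –

```python
# With 1 - P(S) - P(G) >= 0, a correct answer should raise the proficiency
# estimate and an incorrect one should lower it.
p_L, p_G, p_S = 0.5, 0.25, 0.1
up = p_L * (1 - p_S) / (p_L * (1 - p_S) + (1 - p_L) * p_G)
down = p_L * p_S / (p_L * p_S + (1 - p_L) * (1 - p_G))
assert down <= p_L <= up
print(round(down, 3), p_L, round(up, 3))  # 0.118 0.5 0.783
```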

The final inequality in (\eqref{eq:bkt:condition_post-transition_general}) yields a non-trivial restriction – \begin {equation} \label {eq:bkt:steady_state_general} P(L_t) \geq \dfrac {( 1 - P(G) ) \cdot P(R)}{1 - P(S) - P(G)}. \end {equation} Proof. Similar to the previous proof, from \[P(L_t) \geq P(L_{t + 1} | \overline {C_{t+1}})\] we have \[P(L_t) \geq \frac {P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) ) \cdot P(R)}{P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) )},\] which leads to \begin {multline*} P(L_t)^2 \cdot P(S) + P(L_t) \cdot ( 1 - P(L_t) ) \cdot ( 1 - P(G) )\\ \geq P(L_t) \cdot P(S) + ( 1 - P(L_t) ) \cdot ( 1 - P(G) ) \cdot P(R), \end {multline*} and further simplifies to \begin {multline*} P(L_t) \cdot P(S) \cdot (1 - P(L_t)) +\\ (1 - P(L_t)) \cdot (1 - P(G)) \cdot (P(R) - P(L_t)) \leq 0, \end {multline*} \[P(L_t) \cdot P(S) + (1 - P(G)) \cdot (P(R) - P(L_t)) \leq 0,\] \[P(L_t) \cdot (1 - P(G) - P(S)) \geq (1 - P(G)) \cdot P(R).\] Note that \(1 - P(G) - P(S) \neq 0\); otherwise, the last inequality would force \(P(G) = 1\) or \(P(R) = 0\), both excluded above. Therefore, \[P(L_t) \geq \dfrac {( 1 - P(G) ) \cdot P(R)}{1 - P(S) - P(G)}.\] \(\square\)

Let us define the value on the right-hand side of (\eqref{eq:bkt:steady_state_general}) as \(P^*\). It can be shown that if \(P^* < P(L_{t^*}) < 1\), then \(P^* < P(L_t) < 1\) for any \(t > t^*\) and any sequence of attempts. Namely, the following is true –

Theorem 1. In a sequence of all failed attempts, \(P(L_t)\) asymptotically approaches \(P^*\) from the right, and in a sequence of all successful attempts, \(P(L_t)\) asymptotically approaches \(1\) from the left.

Proof. Let us consider a sequence of only failed attempts \(F = (F_1, F_2, \cdots , F_T)\). And let \[P(F_t) = P^* + \frac {\varepsilon }{(1 - P(S) - P(G))}\] for some \(0 < \varepsilon < (1 - P(S) - P(G)) \cdot (1 - P^*)\) and some \(t\). Note that using definition for \(P^*\) we get \begin {equation} \label {eq:bkt:proof_ss_low:Ft} P(F_t) = \frac {(1 - P(G)) \cdot P(R) + \varepsilon }{(1 - P(S) - P(G))}. \end {equation} Next, we write \(P(F_{t + 1})\) in the following form \[P(R) + \frac {P(F_t) \cdot P(S) \cdot (1 - P(R))}{P(F_t) \cdot P(S) + (1 - P(F_t)) \cdot (1 - P(G))}\] \begin {align*} =& P(R) + \frac {P(F_t) \cdot P(S) \cdot (1 - P(R))}{- P(F_t) \cdot (1 - P(S) - P(G)) + (1 - P(G))}\\&\\ =& P(R) + \frac {\frac {(1 - P(G)) \cdot P(R) + \varepsilon }{1 - P(S) - P(G)} \cdot P(S) \cdot (1 - P(R))}{- (1 - P(G)) \cdot P(R) - \varepsilon + (1 - P(G))}\\&\\ =& P(R) \\&+ \frac {((1 - P(G)) \cdot P(R) + \varepsilon ) \cdot P(S) \cdot (1 - P(R))}{(1 - P(S) - P(G)) \cdot ((1 - P(G)) \cdot (1 - P(R)) - \varepsilon )} \\&\\ =& P(R) \\&+ \frac {P(S) \Big [ P(R) \cdot ((1 - P(G)) \cdot (1 - P(R)) - \varepsilon ) + \varepsilon \Big ]}{(1 - P(S) - P(G)) \cdot ((1 - P(G)) \cdot (1 - P(R)) - \varepsilon )}\\&\\ =& P(R) + \frac {P(R) \cdot P(S)}{1 - P(S) - P(G)} \\&+ \frac {\varepsilon \cdot P(S)}{(1 - P(S) - P(G)) \cdot ((1 - P(G)) \cdot (1 - P(R)) - \varepsilon )}\\&\\ =& P^* \\&+ \frac {\varepsilon }{(1 - P(S) - P(G))} \cdot \frac {P(S)}{(1 - P(G)) \cdot (1 - P(R)) - \varepsilon }, \end {align*}

where it can be easily shown from (\eqref{eq:bkt:proof_ss_low:Ft}) and the fact that \(P(F_t) < 1\) that \[0 <\frac {P(S)}{(1 - P(G)) \cdot (1 - P(R)) - \varepsilon } < 1.\] Therefore, \[P^* < P(F_{t + 1}) < P^* + \frac {\varepsilon }{1 - P(S) - P(G)} = P(F_t).\] This proves that if \(P(F_0) > P^*\), then the sequence \((P(F_1), P(F_2), \cdots , P(F_T))\) asymptotically approaches \(P^*\) from the right.

Let us now consider a sequence of only successful attempts \(U = (U_1, U_2, \cdots , U_T)\). We know from the previous discussion that the sequence of proficiencies \((P(U_1), P(U_2), \cdots , P(U_T))\) is increasing and cannot reach \(1\). Let us show that it asymptotically approaches \(1\). Let \(P(U_t) = 1 - \epsilon \) for some \(0 < \epsilon < 1\) and some \(t\). Then we can write \(P(U_{t + 1})\) in the following form: \begin {multline} \label {eq:bkt:proof_ss_high} \frac {P(U_t) \cdot (1 - P(S)) \cdot (1 - P(R))}{P(U_t) \cdot (1 - P(S)) + (1 - P(U_t)) \cdot P(G)} + P(R)\\\\ = \dfrac {P(U_t) \cdot (1 - P(S)) + (1 - P(U_t)) \cdot P(G) \cdot P(R)}{P(U_t) \cdot (1 - P(S)) + (1 - P(U_t)) \cdot P(G)}\\\\ = \dfrac {(1 - \epsilon ) \cdot (1 - P(S)) + \epsilon \cdot P(G) \cdot P(R)}{(1 - \epsilon ) \cdot (1 - P(S)) + \epsilon \cdot P(G)}\\\\ = 1 - \dfrac {\epsilon \cdot P(G) \cdot (1 - P(R))}{(1 - \epsilon ) \cdot (1 - P(S)) + \epsilon \cdot P(G)}. \end {multline}

Note that – \[(1 - \epsilon ) \cdot (1 - P(S) - P(G)) + P(R) \cdot P(G) > 0,\] \[(1 - \epsilon ) \cdot (1 - P(S)) - P(G) + \epsilon \cdot P(G) + P(R) \cdot P(G) > 0,\] \[(1 - \epsilon ) \cdot (1 - P(S)) + \epsilon \cdot P(G) > P(G) \cdot (1 - P(R)),\] \[\dfrac {P(G) \cdot (1 - P(R))}{(1 - \epsilon ) \cdot (1 - P(S)) + \epsilon \cdot P(G)} < 1.\] Using (\eqref{eq:bkt:proof_ss_high}) gives us \[P(U_t) = 1 - \epsilon < P(U_{t+1}) < 1.\] Therefore, if \(P(U_0) < 1\), then the sequence \((P(U_1), P(U_2), \cdots , P(U_T))\) asymptotically approaches \(1\) from the left.

Finally, for any sequence of attempts \(L = (L_1, L_2, \cdots , L_T)\), if \(P(L_0) = P(F_0) = P(U_0)\), then the sequence \(F\) is a lower bound for \(L\), and the sequence \(U\) is an upper bound for \(L\). \(\square\)
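Theorem 1 is easy to check numerically. The sketch below (illustrative parameter values of our choosing) drives the proficiency through 30 consecutive failures and, separately, 30 consecutive successes –

```python
def step(p, correct, p_G=0.25, p_S=0.1, p_R=0.3):
    """One BKT step: Bayes update, then transition."""
    if correct:
        post = p * (1 - p_S) / (p * (1 - p_S) + (1 - p) * p_G)
    else:
        post = p * p_S / (p * p_S + (1 - p) * (1 - p_G))
    return post + p_R * (1 - post)

p_star = (1 - 0.25) * 0.3 / (1 - 0.1 - 0.25)  # P* = 0.3461...

p = 0.9
for _ in range(30):
    p = step(p, correct=False)
print(p_star, p)  # p stays above P* and converges to it from the right

p = 0.5
for _ in range(30):
    p = step(p, correct=True)
print(p)          # p approaches 1 from the left without reaching it
```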

Thus, we arrive at the following succinct list of restrictions on the BKT model parameters – \begin {equation} \label {eq:bkt:final_cond_G} 0 < P(G) < 1, \end {equation}

\begin {equation} \label {eq:bkt:final_cond_S} 0 < P(S) < 1, \end {equation}

\begin {equation} \label {eq:bkt:final_cond_R} 0 < P(R) < 1, \end {equation}

\begin {equation} \label {eq:bkt:final_cond_SG} 1 - P(S) - P(G) \geq 0, \end {equation}

\begin {equation} \label {eq:bkt:final_cond_ss} \dfrac {( 1 - P(G) ) \cdot P(R)}{1 - P(S) - P(G)} < P(L_0) < 1. \end {equation}
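A small helper (ours, not from the paper) that tests whether a given parameter set lies in the non-degenerate space defined by this list –

```python
def is_non_degenerate(p_L0, p_G, p_S, p_R):
    """Check the five restrictions on the BKT parameters listed above."""
    if not (0 < p_G < 1 and 0 < p_S < 1 and 0 < p_R < 1):
        return False
    denom = 1 - p_S - p_G
    if denom <= 0:
        # denom < 0 violates the fourth restriction; denom == 0 makes P*
        # undefined, so the fifth restriction cannot hold either
        return False
    p_star = (1 - p_G) * p_R / denom
    return p_star < p_L0 < 1

print(is_non_degenerate(0.45, 0.25, 0.1, 0.3))  # True
print(is_non_degenerate(0.45, 0.70, 0.4, 0.3))  # False: 1 - P(S) - P(G) < 0
```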

4. ESTIMATING THE PARAMETERS

Let \(X = (X^{(1)}, X^{(2)}, \cdots , X^{(D)})\) be a hidden process of learners’ states of proficiency, where \(X^{(d)} = (X^{(d)}_1, X^{(d)}_2, \cdots , X^{(d)}_{T^{(d)}})\) is the sequence corresponding to learner \(d\). \(X^{(d)}_t\) takes value \(0\) if learner \(d\) is not proficient after \(t\) rounds of feedback, and value \(1\) if learner \(d\) is proficient after \(t\) rounds of feedback. Analogously, let \(Y= ( Y^{(1)}, Y^{(2)}, \cdots , Y^{(D)} )\) be an observable process of learners’ attempts at assessments, where \(Y^{(d)}= ( Y_1^{(d)} , Y_2^{(d)} , \cdots , Y^{(d)}_{T^{(d)}})\) is the sequence of attempts corresponding to learner \(d\). Similarly, \(Y^{(d)}_t\) takes values \(0\) and \(1\) for incorrect and correct answers by learner \(d\) at assessment \(t\), respectively. Let \(y\) be a realization of \(Y\), i.e., an available dataset. We cannot directly observe a corresponding realization of the latent process \(X\). For maximum likelihood estimation of the parameters, we would need to maximize the likelihood function marginalized over \(X\), which is intractable to do directly. Let us consider approaches to estimate the parameters of the BKT model for a given KC: \(P(L_0)\), \(P(G)\), \(P(S)\), and \(P(R)\).
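The generative process just described is straightforward to simulate. The following sketch (the function name and parameter values are our illustration) draws one learner's hidden states and observed answers –

```python
import random

def simulate_learner(T, p_L0, p_G, p_S, p_R, rng):
    """Return (x, y): hidden proficiency states and observed correctness."""
    x = 1 if rng.random() < p_L0 else 0    # initial state ~ P(L_0)
    xs, ys = [], []
    for _ in range(T):
        # Emission: correct with probability 1 - P(S) if proficient, P(G) if not
        y = int(rng.random() < ((1 - p_S) if x else p_G))
        xs.append(x)
        ys.append(y)
        if x == 0 and rng.random() < p_R:  # chance to transition after feedback
            x = 1
    return xs, ys

rng = random.Random(0)
xs, ys = simulate_learner(10, p_L0=0.45, p_G=0.25, p_S=0.1, p_R=0.3, rng=rng)
```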

4.1 Expectation-Maximization Algorithm

The Expectation-Maximization (EM) algorithm [8] is an iterative algorithm for finding local maximum likelihood estimates of parameters in a model with unobserved latent variables. EM is not guaranteed to converge to a globally optimal solution, but rather to a locally optimal one. EM is used when direct computation of the likelihood function is intractable due to the presence of latent variables. Starting with a random initial guess for the parameters, the algorithm improves the estimates in each iteration by guaranteeing that the new parameter values correspond to higher values of the log-likelihood function without explicitly computing it. Let \(\theta \) denote the vector of all model parameters, and \(\theta ^*\) the parameter estimates at the current step of EM. During each step, the function \(Q(\theta ) = Q(\theta | \theta ^*)\) is constructed as the expected value of the log-likelihood function of \(\theta \) with respect to the current conditional distribution of \(X\) given the data \(y\) and \(\theta ^*\) – \begin {equation} \label {eq:EM:Q_general} Q(\theta | \theta ^*) = \mathbb {E}_{X|y, \theta ^*} \left [ \log P(y, X | \theta ) \right ]. \end {equation} The key property of EM states that – \begin {equation} \label {eq:EM:main_property} \forall \theta : \log P (y | \theta ) - \log P (y | \theta ^*) \geq Q(\theta | \theta ^*) - Q(\theta ^* | \theta ^*), \end {equation} that is, any \(\theta \) that increases the value of \(Q\) over \(Q(\theta ^* | \theta ^*)\) also improves the value of the corresponding marginal log-likelihood function. EM defines the next value of \(\theta ^*\) as – \begin {equation} \label {eq:EM:argmax_Q} \theta ^*_{\text {next}} = \argmax \limits _\theta Q(\theta | \theta ^*). \end {equation}

For a discrete case, like the BKT model, (\eqref{eq:EM:Q_general}) becomes – \begin {equation} \label {eq:EM:Q_discrete} Q(\theta | \theta ^*) = \sum \limits _{x \in \mathcal {X}} \log \left [ P(y, x | \theta )\right ] \cdot P(x | y, \theta ^*), \end {equation} where \(\mathcal {X}\) is the set of all possible \(X\). Because \(P(y, x | \theta ^*) = P(x | y, \theta ^*) \cdot P(y | \theta ^*)\) and \(P(y | \theta ^*)\) is a constant with respect to \(\theta \), maximization of (\eqref{eq:EM:Q_discrete}) is equivalent to maximization of – \begin {equation} \label {eq:EM:Q_hat_discrete} \widehat {Q}(\theta | \theta ^*) = \sum \limits _{x \in \mathcal {X}} \log \left [ P(y, x | \theta )\right ] \cdot P(y, x | \theta ^*). \end {equation} Sometimes, maximization of (\eqref{eq:EM:Q_hat_discrete}) is more convenient than that of (\eqref{eq:EM:Q_discrete}).

BKT can be modeled as a Hidden Markov Model and, therefore, a special case of the EM algorithm, the Baum-Welch algorithm [2], can be used. The Baum-Welch algorithm provides closed forms for \(\theta ^*_{\text {next}}\) based on the \(\theta ^*\) values. It is fully described in Appendix A.

Note that the Baum-Welch algorithm does not guarantee that (\eqref{eq:bkt:final_cond_SG}) - (\eqref{eq:bkt:final_cond_ss}) are satisfied. To avoid cases where Baum-Welch converges to unsuitable (i.e., degenerate) parameters, we propose a different approach.

4.1.1 Novel EM Algorithm using the Interior-Point Method

We want an algorithm that will always yield meaningful parameters for the BKT model by satisfying conditions (\eqref{eq:bkt:final_cond_G}) - (\eqref{eq:bkt:final_cond_ss}). That can be achieved if, instead of just maximizing \(\widehat {Q}\) in (\eqref{eq:EM:argmax_Q}), we maximize it under the restrictions (\eqref{eq:bkt:final_cond_G}) - (\eqref{eq:bkt:final_cond_ss}). Note that the corresponding log-likelihood function will still increase due to property (\eqref{eq:EM:main_property}). Conditions (\eqref{eq:bkt:final_cond_G}) - (\eqref{eq:bkt:final_cond_R}) and the right-hand side of (\eqref{eq:bkt:final_cond_ss}) are satisfied automatically due to the form of the \(\log \) functions in \(\widehat {Q}\), see (\eqref{eq:BKT:Q}). Finally, (\eqref{eq:bkt:final_cond_SG}) and the left-hand side of (\eqref{eq:bkt:final_cond_ss}) can be combined into a single inequality, resulting in the following non-linear optimization problem – \begin {equation} \label {eq:EM_interior:KKT_problem} \begin {array}{lcl} \theta ^*_{\text {next}} & = & \argmax \limits _\theta \widehat {Q}(\theta | \theta ^*),\\ & \text {s.t.} & (1 - P(S) - P(G)) \cdot P(L_0) \\ & &\hspace {40pt}- (1 - P(G)) \cdot P(R) \geq 0, \end {array} \end {equation} where \( \widehat {Q}\) has form (\eqref{eq:BKT:Q}). We will use the interior-point method on (\eqref{eq:EM_interior:KKT_problem}). The goal is to find the maximum of the barrier function – \begin {equation} \label {eq:EM_interior:B} B(\theta , \mu ) = \widehat {Q}(\theta | \theta ^*) + \mu \cdot \log c(\theta ), \end {equation} where \(c(\theta )\) is the left-hand side of the constraint from (\eqref{eq:EM_interior:KKT_problem}), and \(\mu \) is a so-called barrier parameter. We will iterate through a decreasing sequence of values of the barrier parameter, \(\mu _1 > \mu _2 > \cdots > \mu _W = 0\), finding a maximum of \(B\) in each iteration. As \(\mu \) approaches \(0\), the maximum of \(B\) converges to the solution of (\eqref{eq:EM_interior:KKT_problem}). Next, a dual variable \(\lambda \) is introduced, defined as \(c(\theta ) \cdot \lambda = \mu \). To find the extremum point of \(B\) we need to find a zero of the following vector function – \begin {equation} F = \left [ \begin {array}{c} \nabla B\\ \lambda \cdot c(\theta ) - \mu \end {array}\right ]. \end {equation} We will use Newton’s method to find the zero of \(F\). We start with some initial guesses \(\theta _1\) and \(\lambda _1\) and update them by solving – \begin {equation} \label {eq:EM_interior:Newton_method} J_F(\theta _k, \lambda _k) \times \left [ \begin {array}{c} \Delta \theta \\ \Delta \lambda \end {array}\right ] = - F(\theta _k, \lambda _k), \end {equation} where \(J_F\) is the Jacobian of \(F\); and then updating – \begin {equation} \label {eq:EM_interior:Newton_update} \begin {array}{clc} \theta _{k+1} = \theta _k + \nu \cdot \Delta \theta ,\\ \lambda _{k+1} = \lambda _k + \nu \cdot \Delta \lambda , \end {array} \end {equation} where \(\nu \) is small enough that the updated \(\theta _{k+1}\) and \(\lambda _{k+1}\) satisfy \(c(\theta _{k+1}) \geq 0\) and \(\lambda _{k+1} \geq 0\).
Next, \begin {equation} \label {eq:EM_interior:F} F = \left [ \begin {array}{ccl} \dfrac {\partial \widehat {Q}}{\partial P(L_0)} &+& \lambda \cdot (1 - P(S) - P(G))\\\\ \dfrac {\partial \widehat {Q}}{\partial P(G)} &+& \lambda \cdot (P(R) - P(L_0))\\\\ \dfrac {\partial \widehat {Q}}{\partial P(S)} &-& \lambda \cdot P(L_0)\\\\ \dfrac {\partial \widehat {Q}}{\partial P(R)} &-& \lambda \cdot (1 - P(G))\\\\ \lambda \cdot (1 - P(S) - P(G)) \cdot P(L_0)\\ - \lambda \cdot (1 - P(G)) \cdot P(R) - \mu \end {array}\right ], \end {equation} where partial derivatives of \(\widehat {Q}\) are given by (\eqref{eq:BKT:dQdL_0}) - (\eqref{eq:BKT:dQdR}). And the Jacobian \(J_F\) has the following form \begin {equation} \label {eq:EM_interior:JF} \left [ \begin {array}{ccccc} \dfrac {\partial ^2 \widehat {Q}}{\partial P(L_0)^2} & -\lambda & -\lambda & 0 & 1 - P(S) - P(G) \\\\ - \lambda & \dfrac {\partial ^2 \widehat {Q}}{\partial P(G)^2} & 0 & \lambda & P(R) - P(L_0)\\\\ -\lambda & 0 & \dfrac {\partial ^2 \widehat {Q}}{\partial P(S)^2} & 0 & -P(L_0)\\\\ 0 & \lambda & 0 & \dfrac {\partial ^2 \widehat {Q}}{\partial P(R)^2} & -1+P(G)\\\\ \lambda \cdot \dfrac {\partial c}{\partial P(L_0)} & \lambda \cdot \dfrac {\partial c}{\partial P(G)} & \lambda \cdot \dfrac {\partial c}{\partial P(S)} & \lambda \cdot \dfrac {\partial c}{\partial P(R)} & c, \end {array}\right ] \end {equation} where – \begin {equation} \label {eq:EM_interior:dc} \begin {array}{lcl} \dfrac {\partial c}{\partial P(L_0)} & = & 1 - P(S) - P(G),\\ \dfrac {\partial c}{\partial P(G)} & = & P(R) - P(L_0),\\ \dfrac {\partial c}{\partial P(S)} & = & -P(L_0),\\ \dfrac {\partial c}{\partial P(R)} & = & - 1 + P(G). \end {array} \end {equation} Note from (\eqref{eq:BKT:dQdL_0}) - (\eqref{eq:BKT:dQdR}) that each first partial derivative of \(\widehat {Q}\) has the following form – \begin {equation} \label {eq:EM_interior:dQ} \dfrac {\partial \widehat {Q}}{\partial P(\boldsymbol {\cdot })} = \dfrac {A}{P(\boldsymbol {\cdot })} - \dfrac {B}{1 - P(\boldsymbol {\cdot })} \end {equation} with some values of \(A\) and \(B\) independent of \(P(\boldsymbol {\cdot })\). Therefore, all corresponding second partial derivatives have the following form – \begin {equation} \label {eq:EM_interior:dQ2} \dfrac {\partial ^2 \widehat {Q}}{\partial P(\boldsymbol {\cdot })^2} = -\dfrac {A}{P(\boldsymbol {\cdot })^2} - \dfrac {B}{(1 - P(\boldsymbol {\cdot }))^2}. \end {equation}

To summarize, we begin with \(\mu =\mu _1\) and find a zero of the function \(F(\mu _1)\), starting with some random initial guesses \((\theta _1(\mu _1), \lambda _1(\mu _1))\) and updating them using rule (\eqref{eq:EM_interior:Newton_update}) and formulae (\eqref{eq:EM_interior:Newton_method}), (\eqref{eq:EM_interior:F}) - (\eqref{eq:EM_interior:dQ2}), (\eqref{eq:BKT:dQdL_0}) - (\eqref{eq:BKT:dQdR}) until they converge to values \((\theta _k(\mu _1), \lambda _k(\mu _1))\) for some \(k\). Then we apply the same procedure to find a zero of the function \(F(\mu _2)\) using \((\theta _k(\mu _1), \lambda _k(\mu _1))\) as initial guesses. We continue until we converge to \((\theta _{k'}(\mu _W), \lambda _{k'}(\mu _W))\), the solution of \(F(\mu _W) = 0\), which maximizes (\eqref{eq:EM_interior:B}) and is the solution of (\eqref{eq:EM_interior:KKT_problem}).
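As a rough illustration of the constrained M-step, the sketch below swaps the hand-rolled interior-point Newton iteration for SciPy's off-the-shelf SLSQP solver (a plainly different optimization technique with the same goal). It assumes the expected counts \(A\) and \(B\) from (\eqref{eq:EM_interior:dQ}) have already been computed by the E-step; all names and values here are ours –

```python
import numpy as np
from scipy.optimize import minimize

def constrained_m_step(stats, theta0):
    """Maximize Q-hat subject to the combined constraint c(theta) >= 0.

    stats: list of (A_k, B_k) expected counts, in the order
           P(L_0), P(G), P(S), P(R); Q-hat is then
           sum_k [A_k log p_k + B_k log(1 - p_k)] up to a constant.
    """
    def neg_q(theta):
        return -sum(a * np.log(p) + b * np.log(1 - p)
                    for (a, b), p in zip(stats, theta))

    constraint = {  # (1 - P(S) - P(G)) * P(L_0) - (1 - P(G)) * P(R) >= 0
        "type": "ineq",
        "fun": lambda th: (1 - th[2] - th[1]) * th[0] - (1 - th[1]) * th[3],
    }
    bounds = [(1e-6, 1 - 1e-6)] * 4  # keep all parameters strictly in (0, 1)
    result = minimize(neg_q, theta0, method="SLSQP",
                      bounds=bounds, constraints=[constraint])
    return result.x

# Toy expected counts; the unconstrained optimum of each p_k is A / (A + B).
stats = [(140.0, 160.0), (90.0, 270.0), (30.0, 270.0), (120.0, 280.0)]
print(constrained_m_step(stats, theta0=[0.5, 0.2, 0.1, 0.3]))
```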

5. DEMONSTRATING THE EM-NEWTON ALGORITHM ON SIMULATED DATA

The simulated data used in this section is made available along with the paper [22]. In this section, we compare the performance of the method proposed in this paper, which we call the EM-Newton algorithm, with the classical Baum-Welch algorithm. First, let us provide an example where the Baum-Welch algorithm yields degenerate parameters, i.e., ones not satisfying (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}). We simulated 100 datasets using the following parameter values: \(P(L_0)=0.45\), \(P(R)=0.3\), \(P(S)=0.1\), and \(P(G)=0.25\) (the same values for all datasets). All simulated datasets contained 300 learners answering 10 questions each. We fit each dataset using both the EM-Newton and Baum-Welch algorithms starting from the same random initial parameter guesses. The Baum-Welch algorithm yielded two clusters of fitted parameters: the “good estimates”, the ones close to the true parameters (80 cases), and the “bad estimates”, the ones far from the true parameters (20 cases). Moreover, the true parameters and all “good estimates” satisfied the conditions (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}), while all “bad estimates” violated at least one of the conditions. Interestingly, the EM-Newton algorithm was able to rescue the “bad estimates” for the corresponding “bad datasets” and produce valid parameter estimates. While both algorithms yielded very similar parameter values for the “good datasets”, close to the true values and satisfying the conditions, the EM-Newton algorithm did not fail even for the “bad datasets”, forcing the parameters into the non-degenerate space. See Fig. 2.

Figure 2: BKT model fit to 100 simulated datasets using the classical Baum-Welch algorithm (\(\times \) and \(+\)) and the proposed EM-Newton method (\(\Circle \) and \(\circ \)). The true parameter values were: \(P(L_0)=0.45\), \(P(R)=0.3\), \(P(S)=0.1\), and \(P(G)=0.25\). The Baum-Welch algorithm produced two clusters of estimates: the ones satisfying conditions (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}) and close to the true parameter values (\(\times \); Baum-Welch [C]), and the degenerate ones not satisfying the conditions and far from the true parameter values (\(+\); Baum-Welch [D]). The same datasets were used to produce parameter estimates using the EM-Newton algorithm (\(\Circle \); EM-Newton [C] and \(\circ \); EM-Newton [D], respectively). Note that the EM-Newton algorithm produced only a single cluster of parameter estimates, all close to the true parameter values. Also note that all EM-Newton parameter estimates satisfied conditions (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}), i.e., are non-degenerate, by design. Left: the prior and transition parameters. Right: the guess and slip parameters.

Next, we repeated the same experiment, but instead of fitting 100 datasets once, we randomly selected one of the “good datasets” and fitted it 100 times with both algorithms using 100 random initial parameter guesses (again, the same initial guesses for both algorithms). Again, the Baum-Welch algorithm produced two clusters of parameter estimates with the same properties as in the previous experiment (80 “good estimates”, 20 “bad estimates”), and the EM-Newton algorithm was able to rescue the “bad” cluster. See Fig. 3. Interestingly, this experiment indicates that the simulated datasets are likely not inherently “bad” or “good”, since the same dataset produced both “good” and “bad” estimates at the same rate as a series of different datasets. Although the conditions under which the Baum-Welch algorithm produces degenerate results are out of scope of this work, they appear to depend on the initial parameter guesses. This is consistent with the local nature of the Baum-Welch algorithm. It also explains prior work that attempts to solve this problem by determining the starting values that lead to degenerate parameter values in order to avoid them [17].

Figure 3: Experiment similar to the one in Fig. 2, but instead of 100 different datasets, the same dataset was fit 100 times using different initial parameter guesses. The outcome of the experiment is virtually identical, except that the estimates are grouped much closer. Nevertheless, it is easy to see that the “bad cluster” consists exclusively of a subset of Baum-Welch estimates not satisfying conditions (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}) (\(+\); Baum-Welch [D]) (and, which is harder to see, it is the only cluster where they are present).

After providing some examples of situations where the Baum-Welch algorithm produces degenerate results, we turned to a more systematic comparison of the two algorithms for different combinations of the true parameter values. We randomly sampled 100 different sets of parameters from the space of non-degenerate parameters defined by conditions (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}), see Fig. 4. Then, for each set of parameters, we repeated the first experiment of the section. We compared the Sum of Squared Errors (SSE) for the estimates yielded by the EM-Newton and Baum-Welch algorithms. As expected, the distribution of the EM-Newton SSE was shifted to the left compared to the Baum-Welch one: the outputs of EM-Newton were either similar to the Baum-Welch ones (for “good” cases) or closer to the true parameter values (for rescued “bad” cases), see Figure 5, Top. This was further supported by the EM-Newton SSE having lower variability than the Baum-Welch SSE: the single-clustered output of the EM-Newton algorithm had a lower range of SSE for each given dataset than the potentially multi-clustered output of the Baum-Welch algorithm, see Figure 5, Bottom. This experiment demonstrated that the EM-Newton algorithm has both higher average accuracy (the SSE distribution is shifted to the left) and higher precision (the SSE has lower variation within a dataset).
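Here, SSE is understood (our reading) as the squared distance between a fitted parameter vector and the true one –
\[ \mathrm{SSE} = \big(\widehat{P}(L_0) - P(L_0)\big)^2 + \big(\widehat{P}(G) - P(G)\big)^2 + \big(\widehat{P}(S) - P(S)\big)^2 + \big(\widehat{P}(R) - P(R)\big)^2, \]
where \(\widehat{P}(\boldsymbol{\cdot})\) denotes a fitted value.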

Figure 4: The non-degenerate parameter space defined by (\eqref{eq:bkt:final_cond_G}) – (\eqref{eq:bkt:final_cond_ss}). Note that there are four parameters, so a 4D plot would be required to actually display the space. Instead, we can rewrite condition (\eqref{eq:bkt:final_cond_ss}) with respect to \(P(R) / P(L_0)\), and utilize a 3D plot instead. The non-degenerate parameter space is underneath the shown surface depicted from two angles (with additional restriction of all parameters being between \(0\) and \(1\)). The straight lines on the surface are the contour lines. The sampled non-degenerate parameters are marked as dots.
Figure 5: Top. Distribution of SSE for parameter estimates produced by the Baum-Welch and EM-Newton algorithms. Bottom. For each dataset and algorithm, we found the range of produced SSE (minimum SSE subtracted from maximum SSE). Note that the variation in SSE is always lower for EM-Newton than for Baum-Welch.

6. DISCUSSION

The paper first derives a list of constraints on the BKT parameter space following from the conceptual meaning of the modeled process. One question that may arise concerns the validity of the constraints imposed on the BKT parameters in practice. While the justifications for the constraints are given in the text, they also assume that the questions are “well-designed”. It follows, therefore, that this process can also address the complementary issue of identifying poorly performing KCs as those for which these constraints are violated, flagging them to learning designers with appropriately recommended fixes. For example, \(P(R) = 1\) could mean that the learning experience is not connected to the KC, since it leads to proficiency regardless of mastery. And \(1 - P(S) < P(G)\) could tell us that the question is worded in a way that leads to overthinking, i.e., skillful learners are less likely to answer it correctly than unskillful learners guessing the answer by chance.

Ultimately, we derived an algorithm that converges to a set of parameters that is guaranteed to meet the constraints. Additionally, we compared our algorithm to the classic Baum-Welch algorithm used to estimate parameters of Hidden Markov Models, including BKT. We demonstrated that both algorithms converge to similar parameter values in cases where those values satisfy the derived conditions. We also demonstrated that the Baum-Welch algorithm occasionally converges to parameter values that are neither close to the true values nor satisfy the conditions, with our algorithm being able to rescue those cases. Although a single run of the Baum-Welch algorithm is less computationally heavy than a single run of our algorithm (ours requires Newton’s method to converge on each iteration), the Baum-Welch algorithm is often run multiple times with different initial conditions after post-hoc finding degenerate parameters. Our algorithm can be run once and can, therefore, be less computationally heavy overall.

Finally, note that the derivation approach described in this paper can be followed to devise an algorithm subject to a different set of constraints as well, so long as that set of constraints remains tractable. The approach can, therefore, be extended to BKT extensions such as the addition of individual item difficulty [19], individualization [18, 25], time between attempts [20], or forgetting [13].

7. CONCLUSION AND FUTURE WORK

This paper derives succinct constraints that can be imposed on the BKT parameter space from first principles. Then, a new Expectation-Maximization algorithm using the Interior-Point Method is introduced that produces parameters subject to those constraints and is, therefore, guaranteed to produce valid, i.e., non-degenerate, parameters. While the computational cost savings may not be dramatic for 4-parameter BKT as presented, they become increasingly significant for extensions of BKT that can still use Expectation-Maximization, such as the addition of individual item difficulty [19], individualization [18, 25], time between attempts [20], or forgetting [13] parameters, which we will present in future work. Experiments with real-time adaptive learning systems are also currently in progress and will be reported in future work. More complex extensions such as BKT+ [4] incorporate many of these but start to require commensurately more complex methods, such as Markov Chain Monte Carlo (MCMC) or even deep learning [9], which could make such a first-principles derivation untenable.

8. REFERENCES

  1. R. S. d. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008, Proceedings 9, pages 406–415. Springer, 2008.
  2. L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.
  3. J. E. Beck and K.-m. Chang. Identifiability: A fundamental problem of student modeling. In International Conference on User Modeling, pages 137–146. Springer, 2007.
  4. S. Bhatt, J. Zhao, C. Thille, D. Zimmaro, and N. Gattani. Evaluating Bayesian knowledge tracing for estimating learner proficiency and guiding learner behavior. In Proceedings of the Seventh ACM Conference on Learning @ Scale, pages 357–360, 2020.
  5. K.-m. Chang, J. Beck, J. Mostow, and A. Corbett. A Bayes net toolkit for student modeling in intelligent tutoring systems. In Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006, Proceedings 8, pages 104–113. Springer, 2006.
  6. A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4:253–278, 1994.
  7. S. D. Craig, X. Hu, A. C. Graesser, A. E. Bargagliotti, A. Sterbinsky, K. R. Cheney, and T. Okwumabua. The impact of a technology-based mathematics after-school program using ALEKS on student’s knowledge and behaviors. Computers & Education, 68:495–504, 2013.
  8. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  9. T. Gervet, K. Koedinger, J. Schneider, T. Mitchell, et al. When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining, 12(3):31–54, 2020.
  10. W. J. Hawkins, N. T. Heffernan, and R. S. Baker. Learning Bayesian knowledge tracing parameters with a knowledge heuristic and empirical probabilities. In Intelligent Tutoring Systems: 12th International Conference, ITS 2014, Honolulu, HI, USA, June 5-9, 2014, Proceedings 12, pages 150–155. Springer, 2014.
  11. T. Kabudi, I. Pappas, and D. H. Olsen. AI-enabled adaptive learning systems: A systematic mapping of the literature. Computers and Education: Artificial Intelligence, 2:100017, 2021.
  12. T. Käser, S. Klingler, A. G. Schwing, and M. Gross. Dynamic Bayesian networks for student modeling. IEEE Transactions on Learning Technologies, 10(4):450–462, 2017.
  13. M. Khajah, R. V. Lindsey, and M. C. Mozer. How deep is knowledge tracing? International Educational Data Mining Society, 2016.
  14. Y. Kim, S. Sankaranarayanan, C. Piech, and C. Thille. Variational temporal IRT: Fast, accurate, and explainable inference of dynamic learner proficiency. International Educational Data Mining Society, 2023.
  15. T. Liu. Knowledge tracing: A bibliometric analysis. Computers and Education: Artificial Intelligence, page 100090, 2022.
  16. S. Minn. AI-assisted knowledge assessment techniques for adaptive learning environments. Computers and Education: Artificial Intelligence, 3:100050, 2022.
  17. Z. Pardos and N. Heffernan. Navigating the parameter space of Bayesian knowledge tracing models: Visualizations of the convergence of the expectation maximization algorithm. In Educational Data Mining 2010, 2010.
  18. Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In User Modeling, Adaptation, and Personalization: 18th International Conference, UMAP 2010, Big Island, HI, USA, June 20-24, 2010, Proceedings 18, pages 255–266. Springer, 2010.
  19. Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption and Personalization: 19th International Conference, UMAP 2011, Girona, Spain, July 11-15, 2011, Proceedings 19, pages 243–254. Springer, 2011.
  20. Y. Qiu, Y. Qi, H. Lu, Z. A. Pardos, and N. T. Heffernan. Does time matter? Modeling the effect of time with Bayesian knowledge tracing. In EDM, pages 139–148, 2011.
  21. S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle. Reducing the knowledge tracing space. International Working Group on Educational Data Mining, 2009.
  22. D. Shchepakin, S. Sankaranarayanan, and D. Zimmaro. Inferring Bayesian knowledge tracing parameters using classical expectation-maximization and a novel Newton method-based approach, 2023.
  23. B. van de Sande. Properties of the Bayesian knowledge tracing model. Journal of Educational Data Mining, 5(2):1–10, 2013.
  24. Y. Wang and J. Beck. Class vs. student in a Bayesian network student model. In Artificial Intelligence in Education: 16th International Conference, AIED 2013, Memphis, TN, USA, July 9-13, 2013, Proceedings 16, pages 151–160. Springer, 2013.
  25. M. V. Yudelson, K. R. Koedinger, and G. J. Gordon. Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education: 16th International Conference, AIED 2013, Memphis, TN, USA, July 9-13, 2013, Proceedings 16, pages 171–180. Springer, 2013.

APPENDIX

A. BAUM-WELCH ALGORITHM

We have – \begin {equation*} \begin {array}{ll} P(y, x | \theta ) = \displaystyle \prod \limits _{d=1}^D P(y^{(d)}, x^{(d)} | \theta )\\ = \displaystyle \prod \limits _{d=1}^D \Biggl (& (1 - P(L_0))^{(1 - x^{(d)}_1)} \cdot P(L_0)^{x^{(d)}_1}\\ &\begin {array}{rl} \times \displaystyle \prod \limits _{t=1}^{T^{(d)}} \Biggl \{ &\left [ P(G)^{y^{(d)}_t} \cdot (1 - P(G))^{1 - y^{(d)}_t} \right ]^{1 - x^{(d)}_t}\\ \times &\left [ P(S)^{1 - y^{(d)}_t} \cdot (1 - P(S))^{y^{(d)}_t} \right ]^{x^{(d)}_t}\Biggr \} \end {array} \\ &\begin {array}{rl} \times \displaystyle \prod \limits _{t=1}^{T^{(d)} - 1} \Biggl \{ &\left [ (1 - P(R))^{1 - x^{(d)}_{t+1}} \cdot P(R)^{x^{(d)}_{t+1}} \right ]^{1 - x^{(d)}_t}\\ \times & \left [ x^{(d)}_{t+1} \right ]^{x^{(d)}_t} \Biggr \}\Biggr ). \end {array} \end {array} \end {equation*} Thus, (\eqref{eq:EM:Q_hat_discrete}) becomes – \begin {equation} \label {eq:BKT:Q} \begin {array}{rl} \widehat {Q}(\theta | \theta ^*) = \displaystyle \sum \limits _{x\in \mathcal {X}} \sum \limits _{d=1}^D &\Biggl [ \biggl ( (1 - x^{(d)}_1) \cdot \log (1 - P(L_0)) + x^{(d)}_1 \cdot \log P(L_0) \biggr )\\ +& \displaystyle \sum \limits _{t = 1}^{T^{(d)}} \biggl ( (1 - x^{(d)}_t) \cdot y^{(d)}_t \cdot \log P(G) + (1 - x^{(d)}_t) \cdot (1 - y^{(d)}_t) \cdot \log (1 - P(G)) \biggr ) \\ +& \displaystyle \sum \limits _{t=1}^{T^{(d)}} \biggl ( x^{(d)}_t \cdot (1 - y^{(d)}_t) \cdot \log P(S) + x^{(d)}_t \cdot y^{(d)}_t \cdot \log (1 - P(S)) \biggr ) \\ +& \displaystyle \sum \limits _{t=1}^{T^{(d)} - 1} \biggl ( (1 - x^{(d)}_t) \cdot (1 - x^{(d)}_{t+1}) \cdot \log (1 - P(R)) + (1 - x^{(d)}_t) \cdot x^{(d)}_{t+1} \cdot \log P(R) \biggr ) \\ +& \displaystyle \sum \limits _{t=1}^{T^{(d)} - 1} x^{(d)}_t \cdot \log x^{(d)}_{t+1} \Biggr ] \cdot P(y, x | \theta ^*) \end {array} \end {equation}

The maximum of the function can be found by finding extremum of \(\widehat {Q}(\theta | \theta ^*)\). For \(P(L_0)\) we have \begin {multline} \label {eq:BKT:dQdL_0} \frac {\partial \widehat {Q}}{\partial P(L_0)} =\\ \frac {\partial }{\partial P(L_0)} \Biggl [ \sum \limits _{x \in \mathcal {X}} \sum \limits _{d=1}^D \biggl ( (1 - x^{(d)}_1) \cdot \log (1 - P(L_0)) \\ + x^{(d)}_1 \cdot \log P(L_0) \biggr ) \cdot P(y, x | \theta ^*) \Biggr ] \end {multline} \begin {multline} = \frac {\partial }{\partial P(L_0)} \Biggl [ \sum \limits _{d=1}^D \log (1 - P(L_0)) \cdot P(x^{(d)}_1 = 0, y | \theta ^*)\\ + \sum \limits _{d=1}^D \log P(L_0) \cdot P(x^{(d)}_1 = 1, y | \theta ^*) \Biggr ] \\ = \frac {\sum \limits _{d=1}^D P(x^{(d)}_1 = 1, y | \theta ^*)}{P(L_0)} - \frac {\sum \limits _{d=1}^D P(x^{(d)}_1 = 0, y | \theta ^*)}{1 - P(L_0)}. \end {multline} For \(P(G)\) we have – \begin {multline} \label {eq:BKT:dQdG} \frac {\partial \widehat {Q}}{\partial P(G)} =\\ \frac {\partial }{\partial P(G)} \Biggl [ \sum \limits _{x \in \mathcal {X}} \sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} (1 - x^{(d)}_t) \cdot \biggl ( y^{(d)}_t \cdot \log P(G)\\ + (1 - y^{(d)}_t) \cdot \log (1 - P(G)) \biggr ) \Biggr ]\\ = \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} y^{(d)}_t \cdot P(x^{(d)}_t = 0, y | \theta ^*)}{P(G)} \\ - \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} (1 - y^{(d)}_t) \cdot P(x^{(d)}_t = 0, y | \theta ^*)}{1 - P(G)}. \end {multline} For \(P(S)\) we have – \begin {multline} \label {eq:BKT:dQdS} \frac {\partial \widehat {Q}}{\partial P(S)} =\\ \frac {\partial }{\partial P(S)} \Biggl [ \sum \limits _{x \in \mathcal {X}} \sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} x^{(d)}_t \cdot \biggl ( (1 - y^{(d)}_t) \cdot \log P(S) \\ + y^{(d)}_t \cdot \log (1 - P(S)) \biggr ) \Biggr ]\\ = \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} (1 - y^{(d)}_t) \cdot P(x^{(d)}_t = 1, y | \theta ^*)}{P(S)} \\ - \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} y^{(d)}_t \cdot P(x^{(d)}_t = 1, y | \theta ^*)}{1 - P(S)}. \end {multline} And for \(P(R)\) we have – \begin {multline} \label {eq:BKT:dQdR} \frac {\partial \widehat {Q}}{\partial P(R)} =\\ \frac {\partial }{\partial P(R)} \Biggl [ \sum \limits _{x \in \mathcal {X}} \sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} (1 - x^{(d)}_t) \cdot \biggl ( (1 - x^{(d)}_{t+1}) \cdot \log (1 - P(R))\\ + x^{(d)}_{t+1} \cdot \log P(R) \biggr ) \Biggr ] = \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} P(x^{(d)}_t = 0, x^{(d)}_{t+1} = 1, y | \theta ^*)}{P(R)} \\ - \frac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} P(x^{(d)}_t = 0, x^{(d)}_{t+1} = 0, y | \theta ^*)}{1 - P(R)}. 
\end {multline} Setting partial derivatives (\eqref{eq:BKT:dQdL_0}) - (\eqref{eq:BKT:dQdR}) to zero yields a closed-form solution for the parameters: \begin {equation} \label {eq:BKT:Baum-Welch_general} \begin {array}{lcl} P(L_0) & = & \dfrac {\sum \limits _{d=1}^D P(x^{(d)}_1 = 1, y | \theta ^*)}{\sum \limits _{d=1}^D P(y | \theta ^*)}\\ & = & \dfrac {1}{D} \sum \limits _{d=1}^D P(x^{(d)}_1 = 1 | y^{(d)}, \theta ^*),\\\\ P(G) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} y^{(d)}_t \cdot P(x^{(d)}_t = 0 | y^{(d)}, \theta ^*)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} P(x^{(d)}_t = 0 | y^{(d)}, \theta ^*)}, \\\\ P(S) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} (1 - y^{(d)}_t) \cdot P(x^{(d)}_t = 1 | y^{(d)}, \theta ^*)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} P(x^{(d)}_t = 1 | y^{(d)}, \theta ^*)},\\\\ P(R) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} P(x^{(d)}_t = 0, x^{(d)}_{t+1} = 1 | y^{(d)}, \theta ^*)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} P(x^{(d)}_t = 0 | y^{(d)}, \theta ^*)}. \end {array} \end {equation}

Let us now describe an algorithm to find the probabilities in (\eqref{eq:BKT:Baum-Welch_general}). The hidden proficiency process \(X\) is Markov; therefore, we can define a transition matrix \(A=\{a_{ij}\}=\{P(X^{(d)}_t = j | X^{(d)}_{t - 1} = i)\}\) for \(i=0,1\) and \(j=0,1\): \[A = \left [ \begin {array}{cc} 1 - P(R) & P(R)\\ 0 & 1 \end {array} \right ],\] a so-called emission matrix \(B=\{b_j(i)\}=\{P(Y^{(d)}_t = i | X^{(d)}_t = j)\}\) for \(i=0,1\) and \(j=0,1\): \[B = \left [ \begin {array}{cc} 1 - P(G) & P(G)\\ P(S) & 1 - P(S) \end {array} \right ],\] and a vector of initial states \(\pi = \{\pi _i\}=\{P(X^{(d)}_0=i)\}\) for \(i=0,1\): \[ \pi = \left [ \begin {array}{c} 1 - P(L_0)\\ P(L_0) \end {array}\right ].\]

Starting with some random initial guess for the parameters \(P(S)\), \(P(G)\), \(P(R)\), \(P(L_0)\), denoted as the vector \(\theta \), we compute a so-called Forward Procedure for \(d=1,\cdots ,D\) and \(i=0,1\): \begin {multline} \label {eq:BKT:alpha_general} \alpha ^{(d)}_i(t) = P\bigl (Y^{(d)}_1 = y^{(d)}_1, \\ Y^{(d)}_2 = y^{(d)}_2, \cdots , Y^{(d)}_t = y^{(d)}_t, X^{(d)}_t=i \big | \theta \bigr ), \end {multline} by recursive formulae – \begin {gather} \label {eq:BKT:alpha_recursion} \begin {array}{rcl} \alpha ^{(d)}_i(1) & = & \pi _i \cdot b_i(y^{(d)}_1),\\\\ \alpha ^{(d)}_i(t + 1) & = & b_i(y^{(d)}_{t+1}) \cdot \left (\alpha ^{(d)}_0(t) \cdot a_{0i} + \alpha ^{(d)}_1(t) \cdot a_{1i}\right ), \end {array} \raisetag {2.5\baselineskip } \end {gather} and a so-called Backward Procedure – \begin {multline} \label {eq:BKT:beta_general} \beta ^{(d)}_i(t) = P \bigl ( Y^{(d)}_{t+1} = y^{(d)}_{t+1}, \\ Y^{(d)}_{t+2} = y^{(d)}_{t+2}, \cdots , Y^{(d)}_T = y^{(d)}_T \big | X^{(d)}_t=i, \theta \bigr ), \end {multline} by recursive formulae – \begin {equation} \label {eq:BKT:beta_recursion} \begin {array}{rcl} \beta ^{(d)}_i(T) & = & 1,\\\\ \beta ^{(d)}_i(t) & = & \beta ^{(d)}_0(t+1) \cdot a_{i0} \cdot b_0(y^{(d)}_{t+1}) \\\\ & & + \beta ^{(d)}_1(t+1) \cdot a_{i1} \cdot b_1(y^{(d)}_{t+1}). \end {array} \end {equation} Then we can define \begin {multline} \label {eq:BKT:gamma} \gamma ^{(d)}_i(t) = P\left (\left . X^{(d)}_t = i \right | y^{(d)}, \theta \right ) = \dfrac {P\left (\left . X^{(d)}_t = i, y^{(d)} \right | \theta \right )}{P\left (\left . y^{(d)} \right | \theta \right )}\\ = \dfrac {\alpha ^{(d)}_i(t) \cdot \beta ^{(d)}_i(t)}{\sum \limits _{k=0}^1 \alpha ^{(d)}_k(t) \cdot \beta ^{(d)}_k(t)}, \end {multline} and \begin {multline} \label {eq:BKT:xi} \xi ^{(d)}_{ij}(t) = P\left (\left . X^{(d)}_t = i, X^{(d)}_{t+1} = j \right | y^{(d)}, \theta \right ) \\\\ = \dfrac {P\left (\left . X^{(d)}_t = i, X^{(d)}_{t+1} = j, y^{(d)}\right | \theta \right )}{P\left (\left . y^{(d)}\right | \theta \right )} \\\\ =\dfrac {\alpha ^{(d)}_i(t) \cdot a_{ij} \cdot b_j(y^{(d)}_{t+1}) \cdot \beta ^{(d)}_j(t+1)}{\sum \limits _{k=0}^1 \sum \limits _{w=0}^1 \alpha ^{(d)}_k(t) \cdot a_{kw} \cdot b_w(y^{(d)}_{t+1}) \cdot \beta ^{(d)}_w(t+1)}. \end {multline}

Therefore, using (\eqref{eq:BKT:alpha_general}) - (\eqref{eq:BKT:xi}), solution (\eqref{eq:BKT:Baum-Welch_general}) becomes \begin {equation} \label {eq:BKT:Baum-Welch} \begin {array}{lcl} P(L_0) & = & \dfrac {1}{D} \sum \limits _{d=1}^D \gamma ^{(d)}_1 (1),\\\\ P(G) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} y^{(d)}_t \cdot \gamma ^{(d)}_0(t)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} \gamma ^{(d)}_0(t)},\\\\ P(S) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} (1 - y^{(d)}_t) \cdot \gamma ^{(d)}_1(t)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)}} \gamma ^{(d)}_1(t)},\\\\ P(R) & = & \dfrac {\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} \xi ^{(d)}_{01}(t)}{\sum \limits _{d=1}^D \sum \limits _{t=1}^{T^{(d)} - 1} \gamma ^{(d)}_0(t)}. \end {array} \end {equation}
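For reference, the forward-backward recursions and the update formulae above fit in a compact sketch (ours; it uses unscaled \(\alpha \) and \(\beta \), so it is suitable only for short sequences, long ones need the usual rescaling) –

```python
import numpy as np

def baum_welch_step(sequences, p_L0, p_G, p_S, p_R):
    """One Baum-Welch iteration over a list of 0/1 answer sequences."""
    A = np.array([[1 - p_R, p_R], [0.0, 1.0]])      # transition matrix
    B = np.array([[1 - p_G, p_G], [p_S, 1 - p_S]])  # emission matrix
    pi = np.array([1 - p_L0, p_L0])                 # initial distribution

    g1 = gG_num = gG_den = gS_num = gS_den = xi01 = g0 = 0.0
    for y in sequences:
        T = len(y)
        alpha = np.zeros((T, 2))
        beta = np.ones((T, 2))
        alpha[0] = pi * B[:, y[0]]                  # forward procedure
        for t in range(1, T):
            alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)
        for t in range(T - 2, -1, -1):              # backward procedure
            beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
        gamma = alpha * beta                        # gamma, normalized per t
        gamma /= gamma.sum(axis=1, keepdims=True)

        g1 += gamma[0, 1]
        gG_num += sum(y[t] * gamma[t, 0] for t in range(T))
        gG_den += gamma[:, 0].sum()
        gS_num += sum((1 - y[t]) * gamma[t, 1] for t in range(T))
        gS_den += gamma[:, 1].sum()
        for t in range(T - 1):                      # xi_01 terms
            num = alpha[t, 0] * A[0, 1] * B[1, y[t + 1]] * beta[t + 1, 1]
            den = sum(alpha[t, i] * A[i, j] * B[j, y[t + 1]] * beta[t + 1, j]
                      for i in range(2) for j in range(2))
            xi01 += num / den
            g0 += gamma[t, 0]

    D = len(sequences)
    return g1 / D, gG_num / gG_den, gS_num / gS_den, xi01 / g0

seqs = [[1, 0, 1, 1], [0, 0, 1, 0, 1]]
print(baum_welch_step(seqs, 0.4, 0.2, 0.1, 0.3))
```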