Faster Confidence Intervals for Item Response Theory via an Approximate Likelihood Profile

Paaßen, Benjamin; Göpfert, Christina; Pinkwart, Niels

doi:10.5281/zenodo.6852950

Benjamin Paaßen

German Research Center for

Artificial Intelligence

Berlin, Germany

benjamin.paassen@dfki.de

Christina Göpfert

Faculty of Technology

Bielefeld University

Bielefeld, Germany

Niels Pinkwart

Insitute of Computer Science

Humboldt-University of Berlin

Berlin, Germany

ABSTRACT

Item response theory models the probability of correct student responses based on two interacting parameters: student ability and item difficulty. Whenever we estimate student ability, students have a legitimate interest in knowing how certain the estimate is. Confidence intervals are a natural measure of uncertainty. Unfortunately, computing confidence intervals can be computationally demanding. In this paper, we show that confidence intervals can be expressed as the solution to a feature relevance optimization problem. We use this insight to develop a novel solver for confidence intervals and thus achieve speedups by 4-50x while retaining near-indistinguishable results to the state-of-the-art approach.

Keywords

item response theory, confidence intervals, relevance intervals, approximation

1. INTRODUCTION

Item response theory (IRT) is a well-established method to model the responses of students on a test [1, 5, 8]. The basic version models the probability of student $i$ to answer correctly on item $j$ as $p_{i, j} = 1 ∕ (1 + \exp [b_{j} - 𝜃_{i}])$ , where $𝜃_{i}$ is a parameter representing the ability of student $i$ , and $b_{j}$ is a parameter representing the difficulty of item $j$ . If a student’s ability exceeds the item’s difficulty, the success probability is larger than $0.5$ , and vice versa. Numerous extensions have been developed over the years, such as parameters for item discrimination (two-parameter IRT), base success chance due to random guessing (three-parameter IRT) [1, 5], IRT for multiple skills [10], performance factor analysis, which describes the increased success chance for repeated attempts [12], or combinations with machine learning [11, 13].

Whenever we use IRT to estimate student ability, students have a legitimate interest in making sure that our model does not underestimate them (in line with the European Union’s concept of a right to explanation [6, 9]). Accordingly, it is important that we not only estimate each student’s ability, but also our uncertainty. We can quantify uncertainty in IRT models via confidence intervals [3, 4]. Roughly speaking, an $α$ -confidence interval describes a range of possible parameter values such that the ‘true’ value is outside the range with probability at most $α$ . For example, Wald’s method assumes that parameters are Gaussian-distributed and uses the Gaussian’s standard deviation to estimate confidence intervals [3]. While this is efficient, the Gaussian approximation assumes symmetry of the distribution and is only valid for large numbers of items in the test[4]. This is often unrealistic, which is why Wald’s method tends to provide inaccurate confidence intervals in practice [4]. Doebler et al. have reacted by providing more exact methods under the assumption of constant item parameters but the resulting confidence intervals are disconnected regions [4], which are challenging to interpret.

More recently, Chalmers et al. [3] suggested the likelihood profile method, which is based on a likelihood ratio test. In particular, we look for a parameter range in which the log-likelihood is at least the optimal log-likelihood minus half the $1 - α$ -quantile of the $χ^{2}$ -distribution with one degree of freedom. Because this method does not assume symmetry and works on the ’true’ likelihood, it improves the accuracy of confidence intervals considerably. However, the computation requires a nested optimization scheme for every single parameter, which is computationally demanding, especially for large numbers of students.

In this paper, we provide a new perspective by showing that likelihood profile confidence intervals are equivalent to feature relevance intervals, which can be found via an optimization problem [7]. Second, we utilize this theoretical insight to develop a novel solver for confidence intervals which is considerably faster (by factors of 4-50x), while the resulting confidence intervals are almost indistinguishable from direct computation. We evaluate our proposed solver on a range of synthetic experimental conditions from 30-500 students and tests with 10 or 20 items. The experimental code can be found at https://github.com/bpaassen/ability_bounds.

2. METHOD

Our goal is to develop a faster solver for confidence bounds for ability parameters of IRT models via the likelihood profile technique [3]. For an illustration of the technique, consider Fig. 1. The blue curve illustrates the best negative log-likelihood (NLL) $ℓ$ we can achieve when fixing the ability parameter $𝜃_{i}$ to the value on the $x$ axis. The likelihood profile technique now asks for the values $𝜃_{i}^{−}$ and $𝜃_{i}^{+}$ , such that the NLL is exactly $ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ , where $ℓ^{∗}$ is the overall minimum NLL and $χ_{α}^{2, - 1}$ is the $1 - α$ quantile of the $χ^{2}$ distribution with one degree of freedom. Our $α$ confidence interval is, then, given as $[𝜃_{i}^{−}, 𝜃_{i}^{+}]$ .

A line plot of minimum NLL versus theta. Lower and upper bounds fo — Figure 1: Illustration of the optimal NLL we can achieve for a certain value of $𝜃_{i}$ in a single-student, single-item setup with $C = 1$ and $y_{1, 1} = 1$ . The dashed line shows the loss bound for $α = . 05$ , the solid lines mark the corresponding confidence interval

More precisely, the NLL of an IRT model on a dataset with $m$ students and $n$ items is given as

\begin{align} ℓ (\vec{𝜃}, \vec{b}) = & \sum_{i = 1}^{m} ∑_{j = 1}^{n} - y_{i, j} \cdot \log [p_{i, j}] - (1 - y_{i, j}) \cdot \log [1 - p_{i, j}] \\ + \frac{1}{2 C} \cdot (∥ \vec{𝜃} ∥^{2} + ∥ \vec{b} ∥^{2}), & (1) \end{align}

where $y_{i, j}$ is $1$ if student $i$ answered correctly on item $j$ and is $0$ , otherwise, $p_{i, j} = 1 ∕ (1 + \exp [b_{j} - 𝜃_{i}])$ , and $C$ is the variance of a Gaussian prior on the ability parameters $\vec{𝜃}$ and item parameters $\vec{b}$ [1] (chapter 7).

Now, let ${\vec{𝜃}}_{\neg i}$ denote the vector $\vec{𝜃}$ without its $i$ th element and let $ℓ_{i} (𝜃_{i}) : = \min_{{\vec{𝜃}}_{\neg i}, \vec{b}} ℓ (\vec{𝜃}, \vec{b})$ . In other words, $ℓ_{i} (𝜃_{i})$ is the minimum NLL value we can achieve if we hold $𝜃_{i}$ fixed but optimize all other parameters. Then, the likelihood profile method solves the equation $ℓ_{i} (𝜃_{i}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ , where $ℓ^{∗} = \min_{\vec{𝜃}, \vec{b}} ℓ (\vec{𝜃}, \vec{b})$ . Because $ℓ$ is convex in $𝜃_{i}$ , this equation has exactly two solutions, which correspond to our interval bounds. The drawback of the likelihood profile method is its computational demand. For each student $i$ , we need to solve a nested optimization problem, where the outer optimization searches for a solution to the equation $ℓ_{i} (𝜃_{i}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ , and the inner optimization computes $ℓ_{i} (𝜃_{i})$ for eachvalue of $𝜃_{i}$ probed by the outer optimization. We look for a way to speed up this computation.

Our key inspiration is the concept of feature relevance intervals (FRI) proposed by Göpfert et al. [7]. The FRI for some parameter $𝜃$ is defined as the interval between the minimum and maximum value that still retains a loss of at most $(1 - δ)$ times the optimal loss, where $δ$ is a user-defined hyperparameter. More precisely, given some loss function $ℓ$ of some parameter vector $\vec{𝜃}$ and some specific parameter $𝜃_{i}$ , the FRI for $𝜃_{i}$ is computed by solving the following minimization/maximization problem:

\min_{\vec{𝜃}} ∕ \max_{\vec{𝜃}} 𝜃_{i} s.t. ℓ (\vec{𝜃}) \leq ℓ^{∗} \cdot (1 + δ) .

(2)

Note that this concept is more general than confidence intervals and is mostly intended for classifiers in machine learning [7]. However, for IRT, confidence intervals via the likelihood profile method and FRIs happen to be equivalent.

Theorem 1. Let $ℓ^{∗} = \min_{\vec{𝜃}, \vec{b}} ℓ (\vec{𝜃}, \vec{b})$ for the NLL (1) and some $C > 0$ . Then, for any $α \in (0, 1)$ and any $i \in {1, \dots, m}$ , it holds: Problem (3) is convex with a global optimum ${\vec{𝜃}}^{±}, {\vec{b}}^{±}$ , such that $ℓ ({\vec{𝜃}}^{±}, {\vec{b}}^{±}) = ℓ_{i} (𝜃_{i}^{±}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ .

\min_{\vec{𝜃}, \vec{b}} ∕ \max_{\vec{𝜃}, \vec{b}} 𝜃_{i} s.t. ℓ (\vec{𝜃}, \vec{b}) \leq ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1} .

(3)
Proof. As a notational shorthand, let

L_{α} : = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}

. Note that the objective function of (3) is linear and hence convex. Now, consider the feasible set

𝒳 = {(\vec{𝜃}, \vec{b}) | ℓ (\vec{𝜃}, \vec{b}) \leq L_{α}}

. Let

x, y \in 𝒳

and let

z_{γ} = γ \cdot x + (1 - γ) \cdot y

for some

γ \in [0, 1]

, that is,

z_{γ}

is on the connecting line between

x

and

y

. Then, it must hold:

ℓ (z_{γ}) \leq γ \cdot ℓ (x) + (1 - γ) \cdot ℓ (y)

, because

ℓ

is convex with respect to

\vec{𝜃}

and

\vec{b}

. Further, because

x, y \in 𝒳

,

ℓ (x) \leq L_{α}

and

ℓ (y) \leq L_{α}

. Therefore,

ℓ (z_{γ}) \leq L_{α}

, implying that

z_{γ} \in 𝒳

. Hence,

𝒳

is convex and (3) is convex.

Next, note that, for any $α \in (0, 1)$ , $χ_{α}^{2, - 1}$ is strictly larger zero. Therefore, the feasible set $𝒳$ is not empty (it contains at least the minimizer of $ℓ$ ). In turn, problem (3) must have a optimum $({\vec{𝜃}}^{±}, {\vec{b}}^{±})$ with $ℓ ({\vec{𝜃}}^{±}, {\vec{b}}^{±}) = L_{α}$ because the objective function is strictly increasing.

Finally, we need to show that $ℓ_{i} (𝜃_{i}^{±}) = L_{α}$ . Assume this is not the case. If $ℓ_{i} (𝜃_{i}^{±}) > L_{α}$ , then $\min_{{\vec{𝜃}}_{\neg i}, \vec{b}} ℓ (\vec{𝜃}, \vec{b}) > ℓ ({\vec{𝜃}}^{±}, {\vec{b}}^{±})$ , which is a contradiction. If $ℓ_{i} (𝜃_{i}^{±}) < L_{α}$ , we need to inspect the behavior of $ℓ_{i}$ in more detail. Let $(𝜃^{∗}, b^{∗})$ be the solution to the unconstrained problem $\min_{\vec{𝜃}, \vec{b}} ℓ (\vec{𝜃}, \vec{b})$ . Then, we know that $ℓ_{i}$ attains its minimum at $𝜃_{i}^{∗}$ with $ℓ_{i} (𝜃_{i}^{∗}) = ℓ^{∗}$ . Further, because $ℓ_{i}$ is defined as the component-wise minimum of a convex function, it is convex itself [2] (p. 87). In turn, we know that for any $𝜃_{i} > 𝜃_{i}^{∗}$ , $ℓ_{i}$ is increasing and for any $𝜃_{i} < 𝜃_{i}^{∗}$ , $ℓ_{i}$ is decreasing. Because $C > 0$ , it is also guaranteed that $ℓ_{i} (𝜃_{i})$ exceeds $L_{α}$ for sufficiently large/small $𝜃_{i}$ , e.g., $ℓ_{i} (\pm \sqrt{2 \cdot C \cdot L_{α}}) > \frac{1}{2 C} {\sqrt{2 \cdot C \cdot L_{α}}}^{2} = L_{α}$ . Now, consider the minimization version of (3). In that case, we certainly have $𝜃_{i}^{−} \leq 𝜃_{i}^{∗}$ , otherwise $𝜃_{i}^{−}$ would not be minimal. Now, because $ℓ_{i} (𝜃_{i}^{−}) < L_{α}$ ,because $ℓ_{i}$ is decreasing for all values $𝜃_{i} < 𝜃_{i}^{∗}$ , and because $ℓ_{i}$ is continuous, there must exist some value $𝜃_{i} < 𝜃_{i}^{−}$ with $ℓ_{i} (𝜃_{i}^{−}) < ℓ_{i} (𝜃_{i}) \leq L_{α}$ . Therefore, $({\vec{𝜃}}^{−}, {\vec{b}}^{−})$ is not a solution for the minimization version of (3), which is a contradiction. The argument for the maximization version is analogous. __

Fig. 1 provides a graphical intuition for the proof: We can find the two solutions to $ℓ_{i} (𝜃_{i}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ by starting at the minimum $𝜃_{i}^{∗}$ of $ℓ$ and then decreasing/increasing $𝜃_{i}$ as much as possible while maintaining a loss that is at most $ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ . To do so, we automatically need to adjust all other parameters to allow for as much slack as possible to increase $𝜃_{i}$ .

The key insight for our solver is that problem (3) becomes much more efficient if we only optimize over $𝜃_{i}$ and not over any remaining parameters. In particular, we can re-write the NLL (1) as:

\begin{align} {\tilde{ℓ}}_{i} (𝜃_{i}) = & \sum_{j = 1}^{n} - y_{i, j} \cdot \log [p_{i, j}] - (1 - y_{i, j}) \cdot \log [1 - p_{i, j}] \\ + \frac{1}{2 C} \cdot 𝜃_{i}^{2} + ℓ_{\neg i}, & (4) \end{align}

where $ℓ_{\neg i}$ is a constant term that does not depend on $𝜃_{i}$ . Computing ${\tilde{ℓ}}_{i}$ only requires $𝒪 (n)$ computations, whereas the full NLL (1) requires $𝒪 (m \cdot n)$ . Overall, we obtain the simplified problem:

\min_{𝜃_{i}} \pm 𝜃_{i} s.t. {\tilde{ℓ}}_{i} (𝜃_{i}) \leq ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1} .

(5)

By extension of Theorem 1, solving (5) is equivalent to solving the equation ${\tilde{ℓ}}_{i} (𝜃_{i}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ , which we can do efficiently via standard nonlinear equation solvers.

Importantly, our solution of (5) will be sub-optimal according to the original problem (3) because ${\tilde{ℓ}}_{i}$ only approximates $ℓ_{i}$ . Therefore, we update ${\tilde{ℓ}}_{i}$ by keeping $𝜃_{i}$ fixed and minimizing over all other parameters ${\vec{𝜃}}_{\neg i}$ and $\vec{b}$ . Then, we can solve (5) again, and so forth, until convergence. This is our proposed alternating optimization (AO) solver.

Figure 2 shows an illustration of the algorithm. To infer the upper bound $𝜃_{i}^{+}$ , we start at $𝜃_{0, i}^{+} = 𝜃_{i}^{∗}$ , i.e. the optimal value according to the NLL. We obtain the next estimate $𝜃_{1, i}^{+}$ by solving the $- 𝜃_{i}$ version of (5) with respect to the current surrogate NLL ${\tilde{ℓ}}_{0, i}^{+}$ . Then, we update all other parameters ${\vec{𝜃}}_{\neg i}$ and $\vec{b}$ by minimizing the NLL (1), yielding a new surrogate ${\tilde{ℓ}}_{1, i}^{+}$ . Solving (5) again yields the next estimate $𝜃_{2, i}^{+}$ . In this example, $𝜃_{2, i}^{+}$ would already be indistinguishable from the optimal $𝜃_{i}^{+}$ according to (3) because ${\tilde{ℓ}}_{1, i}^{+}$ and $ℓ_{i}$ strongly overlap. In general, we can prove that $𝜃_{t, i}^{+}$ converges to $𝜃_{i}^{+}$ .

A line plot of surrogate NLL and actual NLL versus theta for successive iteration — Figure 2: Illustration of the alternating optimization algorithm. We start with the optimum value for $𝜃_{0, i}^{+} = 𝜃_{1}^{∗}$ , then maximize $𝜃_{i}$ via (5), yielding $𝜃_{1, i}^{+}$ , then update ${\tilde{ℓ}}_{i}$ , then maximize $𝜃_{i}$ via (5) again, and so forth.

Theorem 2. Let $𝜃_{0, i}^{±} = 𝜃_{i}^{∗}$ , where $({\vec{𝜃}}^{∗}, {\vec{b}}^{∗})$ minimizes the NLL $ℓ (\vec{𝜃}, \vec{b})$ . Let ${\tilde{ℓ}}_{t, i}^{±}$ be the surrogate NLL (4) for parameters $\min_{{\vec{𝜃}}_{\neg i}, \vec{b}} ℓ (\vec{𝜃}, \vec{b})$ for fixed $𝜃_{i} = 𝜃_{t, i}^{±}$ , and let $𝜃_{t + 1, i}^{±}$ be the solutions of the $∓$ versions of (5) for ${\tilde{ℓ}}_{i} = {\tilde{ℓ}}_{t, i}^{±}$ . Then, for $C > 0$ and $t \to \infty$ , $𝜃_{t, i}^{±}$ converge to solutions of the minimization/maximization version of (3).

Proof. As a notational shorthand, let $L_{α} : = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ . For simplicity, we only consider $𝜃_{t, i}^{+}$ here; the proof for $𝜃_{t, i}^{−}$ is analogous. First, observe that, for all $t$ , we have ${\tilde{ℓ}}_{t, i}^{+} (𝜃_{t, i}^{+}) = ℓ_{i} (𝜃_{t, i}^{+})$ by construction, and we have ${\tilde{ℓ}}_{t, i}^{+} (𝜃_{t + 1, i}^{+}) = L_{α}$ , otherwise $𝜃_{t + 1, i}^{+}$ would not be a solution to (5). Further, by definition, we have ${\tilde{ℓ}}_{t, i}^{+} (𝜃_{i}) \geq ℓ_{i} (𝜃_{i})$ for all $t$ and all $𝜃_{i} \in ℝ$ . Therefore, we obtain ${\tilde{ℓ}}_{t, i}^{+} (𝜃_{t, i}^{+}) = ℓ_{i} (𝜃_{t, i}^{+}) \leq {\tilde{ℓ}}_{t - 1, i}^{+} (𝜃_{t, i}^{+}) = L_{α}$ . Accordingly, whenever we solve (5) in step $t + 1$ , $𝜃_{t, i}^{+}$ is a feasible initial point. Therefore, $𝜃_{t + 1, i}^{+} \geq 𝜃_{t, i}^{+}$ .

Whenever $𝜃_{t + 1, i}^{+} = 𝜃_{t, i}^{+}$ , we have ${\tilde{ℓ}}_{t + 1, i}^{+} = {\tilde{ℓ}}_{t, i}^{+}$ and, thus, $𝜃_{t, i}^{+} = 𝜃_{t + 1, i}^{+} = 𝜃_{t + 2, i}^{+} = \dots$ , that is, we have a fixed point. Further, we have $ℓ_{i} (𝜃_{t + 1, i}^{+}) = {\tilde{ℓ}}_{t + 1, i} (𝜃_{t + 1, i}^{+}) = {\tilde{ℓ}}_{t + 1, i} (𝜃_{t, i}^{+}) = L_{α}$ , which is a solution to the maximization version of (3). Before the fixed point, the sequence $𝜃_{0, i}^{+}, 𝜃_{1, i}^{+}, \dots$ is strictly increasing. Since $ℓ_{i}$ is convex with a minimum at $𝜃_{0, i}^{+}$ , the sequence $ℓ_{i} (𝜃_{0, i}^{+}), ℓ_{i} (𝜃_{1, i}^{+}), \dots$ is also increasing. For $C > 0$ , it would eventually grow beyond bounds due to the regularization term, but since it is upper-bounded by $L_{α}$ , it must converge to $L_{α}$ and, hence, $𝜃_{t, i}^{+}$ must converge to $𝜃_{i}^{+}$ . __

While Theorem 2 only holds in the limit, very few steps $t$ suffice for a close-to-optimal solution in practice (e.g., Figure 2). We investigate this issue in more detail in our experiments.

3. EXPERIMENTS

In our experiments, we simulate synthetic data from a ground-truth IRT model with ability and difficulties sampled from a standard normal distribution. We varied the number of students $m$ in the range ${30, 50, 100, 500}$ and the number of items $n$ in the range ${10, 20}$ (similar to the protocol of [3]). For each model, we repeated the sampling $10$ times to asses variation. In each repeat, we computed confidence intervals for $𝜃_{i}$ at $α = 0.05$ via Wald’s method, the likelihood profile-method via solving $ℓ_{i} (𝜃_{i}) = ℓ^{∗} + \frac{1}{2} χ_{α}^{2, - 1}$ for each $i$ (LP), a log-barrier solver (barrier) for (3), and our proposed alternating optimization scheme (Theorem 2) for $t = 1$ (AO(1)), $t = 2$ (AO(2)), and $t = 3$ (AO(3)) steps. All experimental code is available at https://github.com/bpaassen/ability_bounds.

Table 1: Coverage rates of all methods across experimental conditions
$m$	$n$	Wald	LP	barrier	AO(1)	AO(2)	AO(3)

$30$	$10$	$0.983 \pm 0.022$	$0.943 \pm 0.054$	$0.910 \pm 0.056$	$0.943 \pm 0.054$	$0.943 \pm 0.054$	$0.943 \pm 0.054$
$30$	$20$	$1.000 \pm 0.000$	$0.943 \pm 0.021$	$0.083 \pm 0.050$	$0.940 \pm 0.025$	$0.943 \pm 0.021$	$0.943 \pm 0.021$
$50$	$10$	$0.998 \pm 0.006$	$0.926 \pm 0.041$	$0.884 \pm 0.047$	$0.920 \pm 0.041$	$0.926 \pm 0.041$	$0.926 \pm 0.041$
$50$	$20$	$0.998 \pm 0.006$	$0.942 \pm 0.039$	$0.098 \pm 0.039$	$0.938 \pm 0.040$	$0.942 \pm 0.039$	$0.942 \pm 0.039$
$100$	$10$	$0.998 \pm 0.004$	$0.934 \pm 0.028$	$0.892 \pm 0.029$	$0.933 \pm 0.027$	$0.934 \pm 0.028$	$0.934 \pm 0.028$
$100$	$20$	$0.999 \pm 0.003$	$0.937 \pm 0.030$	$0.059 \pm 0.012$	$0.933 \pm 0.031$	$0.937 \pm 0.030$	$0.937 \pm 0.030$
$500$	$10$	$0.998 \pm 0.002$	$0.940 \pm 0.011$	$0.898 \pm 0.015$	$0.939 \pm 0.010$	$0.940 \pm 0.011$	$0.940 \pm 0.011$
$500$	$20$	$1.000 \pm 0.001$	$0.949 \pm 0.008$	$0.097 \pm 0.016$	$0.948 \pm 0.008$	$0.949 \pm 0.008$	$0.949 \pm 0.008$

Table 1 shows the coverage rates of all methods, that is, the rate at which the ground truth $𝜃_{i}$ value was included in the confidence interval $[𝜃_{i}^{−}, 𝜃_{i}^{+}]$ [3]. For $α = 0.05$ , the coverage rate should be as close as possible to $0.95$ . We observe that Wald’s method selects too large confidence intervals, yielding rates close to $100 %$ . The barrier method selects smaller confidence intervals, yielding rates around $90 %$ for $n = 10$ items. If we choose $n = 20$ items, the barrier method becomes numerically unstable and the coverage rates degrade (smaller $10 %$ ). The likelihood profile (LP) method consistently achieves rates between $92 %$ and $95 %$ and thus is closest to the desired value of $95 %$ . Alternating optimization achieves indistinguishable results to LP at 3 significant digits for $t \geq 2$ (AO(2) and AO(3)).

Figure 3 displays the time it took to compute confidence intervals for all students in every experimental settings in log-log plots. Dotted, gray lines are linear reference functions at .1, 1, 10, and 100ms per student. Using linear fits, we find that Wald’s method is fastest (at about $. 3$ ms per student), followed by AO(1) (about $. 8$ ms per student), barrier (about 4ms per student), AO(2) (about 9ms per student), AO(3) (about 12ms per student), and finally LP (about 37ms per student). Accordingly, AO(1) is roughly 50x faster than LP, and AO(2) is roughly 4x faster.

Plots of runtime versus number of students for 10 items and 20 items — Figure 3: Log-Log plots for the runtime in seconds versus $m$ for $n = 10$ (top) and $n = 20$ (bottom). Dotted, gray lines show linear functions at .1, 1, 10, and 100ms per student.

4. DISCUSSION AND CONCLUSION

In this paper, we introduced a new solver for confidence intervals on item response theory parameters via the likelihood profile method. In particular, we found that our alternating optimization solver was 4-50 times faster than the existing solver while achieving almost indistinguishable results. For anyone who seeks to compute confidence intervals, this should provide massive speedups in practice. More generally, though, we hope that our new solver can help the community to respond to an emerging need in educational data mining: more and more, policy makers and the general public expect machine learning models to provide explanations for their decisions. This is exemplified by recent policy initiatives in the European Union for a right to explanation and a risk-based approach to regulate artificial intelligence. Grading student ability—like in item response theory—will likely be under increasing scrutiny in the years to come. As such, we believe that it is crucial to quantify a model’s uncertainty precisely, which our solver can help to do.

We also provide a key theoretical insight in our paper: Traditionally, confidence intervals express the range of parameter values which is likely to contain the ’true’ value. We showed that the likelihood profile method for confidence intervals can also be interpreted as an optimization problem: We try to find the minimum/maximum ability value for a student which is still consistent with a high likelihood of the data. This interpretation provides a new way to explain an ability estimate: The upper bound of our confidence interval is the highest possible grade we can give to a student while still being consistent with the responses they provided.

Beyond this paper, there remains ample room for future work. Our evaluation has only covered synthetic data to validate the runtime advantage and the closeness to existing methods. Future work could investigate how large confidence intervals tend to be in practical scenarios. Further, our experiments indicated that the size of our confidence is still larger than it would need to be to cover the ’true’ ability value. Future work could try to find new methods to compute confidence intervals which are more precise. In particular, it may be promising to investigate combinations with recently proposed models for variational item response theory [14].

Acknowledgements

Funding by the German Ministry for Education and Research (BMBF) under grant number 21INVI1403 (project KIPerWeb) is greatfully acknowledged.

5. REFERENCES

F. Baker and S.-H. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, Boca Raton, FL, USA, 2 edition, 2004.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2009.
R. P. Chalmers, J. Pek, and Y. Liu. Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5):533–550, 2017.
A. Doebler, P. Doebler, and H. Holling. Optimal and most exact confidence intervals for person parameters in item response theory models. Psychometrika, 78(1):98–115, 2013.
S. Embretson and S. Reise. Item response theory for psychologists. Psychology Press, New York, NY, USA, 2000.
European Commission. Laying down harmonised rules on artificial inteligence and amending certain union legislative acts, 2021.
C. Göpfert, L. Pfannschmidt, J. P. Göpfert, and B. Hammer. Interpretation of linear classifiers by means of feature relevance bounds. Neurocomputing, 298:69–79, 2018.
R. Hambleton and H. Swaminathan. Item response theory: Principles and applications. Springer Science+Business Media, New York, NY, USA, 1985.
M. E. Kaminski. The right to explanation, explained. Berkeley Technology Law Journal, 34:190–218, 2019.
R. P. McDonald. A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2):99–114, 2000.
K. Niemeijer, R. Feskens, G. Krempl, J. Koops, and M. J. S. Brinkhuis. Constructing and predicting school advice for academic achievement: A comparison of item response theory and machine learning techniques. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (LAK), page 462–471, New York, NY, USA, 2020. Association for Computing Machinery.
P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis –a new alternative to knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education (AIED), page 531–538, 2009.
K. Pliakos, S.-H. Joo, J. Y. Park, F. Cornillie, C. Vens, and W. Van den Noortgate. Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. Computers & Education, 137:91–103, 2019.
M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, Proceedings of the 13th International Conference on Educational Data Mining (EDM), pages 257–268, 2020.

[1] F. Baker and S.-H. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, Boca Raton, FL, USA, 2 edition, 2004.

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2009.

[3] R. P. Chalmers, J. Pek, and Y. Liu. Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5):533–550, 2017.

[4] A. Doebler, P. Doebler, and H. Holling. Optimal and most exact confidence intervals for person parameters in item response theory models. Psychometrika, 78(1):98–115, 2013.

[5] S. Embretson and S. Reise. Item response theory for psychologists. Psychology Press, New York, NY, USA, 2000.

[6] European Commission. Laying down harmonised rules on artificial inteligence and amending certain union legislative acts, 2021.

[7] C. Göpfert, L. Pfannschmidt, J. P. Göpfert, and B. Hammer. Interpretation of linear classifiers by means of feature relevance bounds. Neurocomputing, 298:69–79, 2018.

[8] R. Hambleton and H. Swaminathan. Item response theory: Principles and applications. Springer Science+Business Media, New York, NY, USA, 1985.

[9] M. E. Kaminski. The right to explanation, explained. Berkeley Technology Law Journal, 34:190–218, 2019.

[10] R. P. McDonald. A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2):99–114, 2000.

[11] K. Niemeijer, R. Feskens, G. Krempl, J. Koops, and M. J. S. Brinkhuis. Constructing and predicting school advice for academic achievement: A comparison of item response theory and machine learning techniques. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (LAK), page 462–471, New York, NY, USA, 2020. Association for Computing Machinery.

[12] P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis –a new alternative to knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education (AIED), page 531–538, 2009.

[13] K. Pliakos, S.-H. Joo, J. Y. Park, F. Cornillie, C. Vens, and W. Van den Noortgate. Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. Computers & Education, 137:91–103, 2019.

[14] M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, Proceedings of the 13th International Conference on Educational Data Mining (EDM), pages 257–268, 2020.