# ABSTRACT

Item response theory models the probability of correct student responses based on two interacting parameters: student ability and item difficulty. Whenever we estimate student ability, students have a legitimate interest in knowing how certain the estimate is. Confidence intervals are a natural measure of uncertainty. Unfortunately, computing confidence intervals can be computationally demanding. In this paper, we show that confidence intervals can be expressed as the solution to a feature relevance optimization problem. We use this insight to develop a novel solver for confidence intervals and thus achieve speedups by 4-50x while retaining near-indistinguishable results to the state-of-the-art approach.

# 1. INTRODUCTION

Item response theory (IRT) is a well-established method to model the responses of students on a test [1, 5, 8]. The basic version models the probability that student $i$ answers item $j$ correctly as $p_{i,j} = 1/(1 + \exp[b_j - \theta_i])$, where $\theta_i$ is a parameter representing the ability of student $i$, and $b_j$ is a parameter representing the difficulty of item $j$. If a student's ability exceeds the item's difficulty, the success probability is larger than $0.5$, and vice versa. Numerous extensions have been developed over the years, such as parameters for item discrimination (two-parameter IRT), a base success chance due to random guessing (three-parameter IRT) [1, 5], IRT for multiple skills [10], performance factor analysis, which describes the increased success chance for repeated attempts [12], and combinations with machine learning [11, 13].
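In code, the basic model is a single logistic function; a minimal sketch (the function name is ours):

```python
import math

def success_probability(theta, b):
    """Basic (one-parameter) IRT: probability that a student with
    ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(b - theta))
```

When ability and difficulty match, the success probability is exactly $0.5$; a higher ability (or an easier item) raises it.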

Whenever we use IRT to estimate student ability, students have a legitimate interest in making sure that our model does not underestimate them (in line with the European Union's concept of a right to explanation [6, 9]). Accordingly, it is important that we not only estimate each student's ability, but also our uncertainty. We can quantify uncertainty in IRT models via confidence intervals [3, 4]. Roughly speaking, an $\alpha$-confidence interval describes a range of possible parameter values such that the 'true' value is outside the range with probability at most $\alpha$. For example, Wald's method assumes that parameters are Gaussian-distributed and uses the Gaussian's standard deviation to estimate confidence intervals [3]. While this is efficient, the Gaussian approximation assumes symmetry of the distribution and is only valid for large numbers of items in the test [4]. This is often unrealistic, which is why Wald's method tends to provide inaccurate confidence intervals in practice [4]. Doebler et al. have reacted by providing more exact methods under the assumption of constant item parameters, but the resulting confidence intervals are disconnected regions [4], which are challenging to interpret.
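To make Wald's method concrete, the following sketch computes the interval from the Fisher information of the basic model at the ability estimate. The function name is ours, and for simplicity the sketch ignores any prior term; it is a standard Gaussian approximation, not the exact procedure of [3]:

```python
import math
from statistics import NormalDist

def wald_interval(theta_hat, difficulties, alpha=0.05):
    """Wald confidence interval for the basic model: theta_hat plus/minus
    a standard-normal quantile divided by the square root of the
    Fisher information at theta_hat."""
    info = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(b - theta_hat))
        info += p * (1.0 - p)  # item information of the basic model
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z / math.sqrt(info)
    return theta_hat - half_width, theta_hat + half_width
```

Note that the interval is symmetric around the estimate by construction, which is exactly the assumption criticized above: for short tests, the true likelihood is asymmetric.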

More recently, Chalmers et al. [3] suggested the likelihood profile method, which is based on a likelihood ratio test. In particular, we look for a parameter range in which the log-likelihood is at least the optimal log-likelihood minus half the $1-\alpha$-quantile of the $\chi^2$-distribution with one degree of freedom. Because this method does not assume symmetry and works on the 'true' likelihood, it improves the accuracy of confidence intervals considerably. However, the computation requires a nested optimization scheme for every single parameter, which is computationally demanding, especially for large numbers of students.

In this paper, we provide a new perspective: first, we show that likelihood profile confidence intervals are equivalent to feature relevance intervals, which can be found via an optimization problem [7]. Second, we utilize this theoretical insight to develop a novel solver for confidence intervals which is considerably faster (by factors of 4-50x), while the resulting confidence intervals are almost indistinguishable from direct computation. We evaluate our proposed solver on a range of synthetic experimental conditions with 30-500 students and tests with 10 or 20 items. The experimental code can be found at `https://github.com/bpaassen/ability_bounds`.

# 2. METHOD

Our goal is to develop a faster solver for confidence bounds for ability parameters of IRT models via the likelihood profile technique [3]. For an illustration of the technique, consider Fig. 1. The blue curve illustrates the best negative log-likelihood (NLL) $\ell$ we can achieve when fixing the ability parameter $\theta_i$ to the value on the $x$-axis. The likelihood profile technique now asks for the values $\theta_i^{-}$ and $\theta_i^{+}$ such that the NLL is exactly $\ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$, where $\ell^*$ is the overall minimum NLL and $\chi^{2,-1}_{\alpha}$ is the $1-\alpha$ quantile of the $\chi^2$ distribution with one degree of freedom. Our $\alpha$-confidence interval is then given as $[\theta_i^{-}, \theta_i^{+}]$.

More precisely, the NLL of an IRT model on a dataset with $m$ students and $n$ items is given as

$$\ell(\vec{\theta}, \vec{b}) = \sum_{i=1}^{m} \sum_{j=1}^{n} -y_{i,j} \cdot \log[p_{i,j}] - (1 - y_{i,j}) \cdot \log[1 - p_{i,j}] + \frac{1}{2C} \cdot \left( \|\vec{\theta}\|^2 + \|\vec{b}\|^2 \right), \tag{1}$$

where $y_{i,j}$ is $1$ if student $i$ answered item $j$ correctly and $0$ otherwise, $p_{i,j} = 1/(1 + \exp[b_j - \theta_i])$, and $C$ is the variance of a Gaussian prior on the ability parameters $\vec{\theta}$ and item parameters $\vec{b}$ [1] (chapter 7).
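A direct transcription of the NLL (1) might look as follows; the function name and the list-of-lists encoding of the responses are ours:

```python
import math

def nll(thetas, bs, Y, C=1.0):
    """Regularized negative log-likelihood of Eq. (1).
    Y[i][j] is 1 if student i answered item j correctly, else 0."""
    total = 0.0
    for i, theta in enumerate(thetas):
        for j, b in enumerate(bs):
            p = 1.0 / (1.0 + math.exp(b - theta))
            y = Y[i][j]
            total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    # Gaussian prior with variance C on all ability and item parameters
    total += (sum(t * t for t in thetas) + sum(b * b for b in bs)) / (2.0 * C)
    return total
```

For a single student with matching ability and difficulty, each response contributes $\log 2$ to the NLL, and the prior term vanishes at zero.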

Now, let $\vec{\theta}_{\neg i}$ denote the vector $\vec{\theta}$ without its $i$-th element and let $\ell_i(\theta_i) := \min_{\vec{\theta}_{\neg i}, \vec{b}} \ell(\vec{\theta}, \vec{b})$. In other words, $\ell_i(\theta_i)$ is the minimum NLL value we can achieve if we hold $\theta_i$ fixed but optimize all other parameters. Then, the likelihood profile method solves the equation $\ell_i(\theta_i) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$, where $\ell^* = \min_{\vec{\theta}, \vec{b}} \ell(\vec{\theta}, \vec{b})$. Because $\ell$ is convex in $\theta_i$, this equation has exactly two solutions, which correspond to our interval bounds. The drawback of the likelihood profile method is its computational demand: for each student $i$, we need to solve a nested optimization problem, where the outer optimization searches for a solution to the equation $\ell_i(\theta_i) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$, and the inner optimization computes $\ell_i(\theta_i)$ for each value of $\theta_i$ probed by the outer optimization. We look for a way to speed up this computation.
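To illustrate the computation, the following sketch solves the profile equation for a single student with the item difficulties held fixed, so that the inner optimization disappears; with free item parameters, every evaluation of $\ell_i$ would itself be an optimization problem, which is exactly the cost we aim to avoid. All names and the pure-Python root-finding are ours:

```python
import math

CHI2_95_1 = 3.841458820694124  # 0.95-quantile of chi^2 with one degree of freedom

def nll_1d(theta, responses, difficulties, C=1.0):
    """Student-wise NLL with all item difficulties held fixed."""
    total = theta * theta / (2.0 * C)  # Gaussian prior term
    for y, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(b - theta))
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total

def bisect_root(f, a, b, iters=100):
    """Bisection for a root of f on [a, b]; f(a), f(b) must differ in sign."""
    sign_a = f(a) > 0
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if (f(mid) > 0) == sign_a:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)

def profile_interval(responses, difficulties, C=1.0):
    """Solve ell_i(theta) = ell* + chi2/2 on both sides of the minimizer."""
    lo, hi = -20.0, 20.0
    for _ in range(200):  # ternary search: the 1-D NLL is convex
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if nll_1d(m1, responses, difficulties, C) < nll_1d(m2, responses, difficulties, C):
            hi = m2
        else:
            lo = m1
    theta_star = 0.5 * (lo + hi)
    target = nll_1d(theta_star, responses, difficulties, C) + 0.5 * CHI2_95_1
    f = lambda t: nll_1d(t, responses, difficulties, C) - target
    # exactly one crossing on each side of the minimizer (convexity)
    theta_minus = bisect_root(f, -20.0, theta_star)
    theta_plus = bisect_root(f, theta_star, 20.0)
    return theta_minus, theta_plus
```

Both bounds attain the same NLL value (the target level), with the minimizer lying strictly between them.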

Our key inspiration is the concept of feature relevance intervals (FRI) proposed by Göpfert et al. [7]. The FRI for some parameter $\theta$ is defined as the interval between the minimum and maximum value of that parameter that still retains a loss of at most $(1+\delta)$ times the optimal loss, where $\delta$ is a user-defined hyperparameter. More precisely, given some loss function $\ell$ of some parameter vector $\vec{\theta}$ and some specific parameter $\theta_i$, the FRI for $\theta_i$ is computed by solving the following minimization/maximization problem:

$$\min_{\vec{\theta}} / \max_{\vec{\theta}} \quad \theta_i \quad \text{s.t.} \quad \ell(\vec{\theta}) \le \ell^* \cdot (1 + \delta). \tag{2}$$

Note that this concept is more general than confidence intervals and is mostly intended for classifiers in machine learning [7]. However, for IRT, confidence intervals via the likelihood profile method and FRIs happen to be equivalent. In particular, we consider the following variant of problem (2), where the relative slack $\ell^* \cdot \delta$ is replaced by the absolute slack $\frac{1}{2}\chi^{2,-1}_{\alpha}$ from the likelihood profile method:

$$\min_{\vec{\theta}, \vec{b}} / \max_{\vec{\theta}, \vec{b}} \quad \theta_i \quad \text{s.t.} \quad \ell(\vec{\theta}, \vec{b}) \le \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}. \tag{3}$$

Theorem 1. *Let $\ell^* = \min_{\vec{\theta}, \vec{b}} \ell(\vec{\theta}, \vec{b})$ for the NLL (1) and some $C > 0$. Then, for any $\alpha \in (0,1)$ and any $i \in \{1, \dots, m\}$, it holds: Problem (3) is convex with a global optimum $\vec{\theta}^{\pm}, \vec{b}^{\pm}$, such that $\ell(\vec{\theta}^{\pm}, \vec{b}^{\pm}) = \ell_i(\theta_i^{\pm}) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$.*

Proof. *As a notational shorthand, let $L_\alpha := \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$. Note that the objective function of (3) is linear and hence convex. Now, consider the feasible set $\mathcal{X} = \{(\vec{\theta}, \vec{b}) \,|\, \ell(\vec{\theta}, \vec{b}) \le L_\alpha\}$. Let $x, y \in \mathcal{X}$ and let $z_\gamma = \gamma \cdot x + (1-\gamma) \cdot y$ for some $\gamma \in [0,1]$, that is, $z_\gamma$ is on the connecting line between $x$ and $y$. Then, it must hold that $\ell(z_\gamma) \le \gamma \cdot \ell(x) + (1-\gamma) \cdot \ell(y)$, because $\ell$ is convex with respect to $\vec{\theta}$ and $\vec{b}$. Further, because $x, y \in \mathcal{X}$, $\ell(x) \le L_\alpha$ and $\ell(y) \le L_\alpha$. Therefore, $\ell(z_\gamma) \le L_\alpha$, implying that $z_\gamma \in \mathcal{X}$. Hence, $\mathcal{X}$ is convex and (3) is convex.*

*Next, note that, for any $\alpha \in (0,1)$, $\chi^{2,-1}_{\alpha}$ is strictly larger than zero. Therefore, the feasible set $\mathcal{X}$ is not empty (it contains at least the minimizer of $\ell$). In turn, problem (3) must have an optimum $(\vec{\theta}^{\pm}, \vec{b}^{\pm})$ with $\ell(\vec{\theta}^{\pm}, \vec{b}^{\pm}) = L_\alpha$, because the linear objective is strictly monotone in $\theta_i$, so the constraint must be active at the optimum.*

*Finally, we need to show that $\ell_i(\theta_i^{\pm}) = L_\alpha$. Assume this is* not *the case. If $\ell_i(\theta_i^{\pm}) > L_\alpha$, then $\min_{\vec{\theta}_{\neg i}, \vec{b}} \ell(\vec{\theta}, \vec{b}) > \ell(\vec{\theta}^{\pm}, \vec{b}^{\pm})$, which is a contradiction. If $\ell_i(\theta_i^{\pm}) < L_\alpha$, we need to inspect the behavior of $\ell_i$ in more detail. Let $(\vec{\theta}^*, \vec{b}^*)$ be the solution to the unconstrained problem $\min_{\vec{\theta}, \vec{b}} \ell(\vec{\theta}, \vec{b})$. Then, we know that $\ell_i$ attains its minimum at $\theta_i^*$ with $\ell_i(\theta_i^*) = \ell^*$. Further, because $\ell_i$ is defined as the component-wise minimum of a convex function, it is convex itself [2] (p. 87). In turn, we know that $\ell_i$ is increasing for any $\theta_i > \theta_i^*$ and decreasing for any $\theta_i < \theta_i^*$. Because $C > 0$, it is also guaranteed that $\ell_i(\theta_i)$ exceeds $L_\alpha$ for sufficiently large/small $\theta_i$, e.g., $\ell_i(\pm\sqrt{2 \cdot C \cdot L_\alpha}) > \frac{1}{2C}\left(\sqrt{2 \cdot C \cdot L_\alpha}\right)^2 = L_\alpha$. Now, consider the minimization version of (3). In that case, we certainly have $\theta_i^{-} \le \theta_i^*$, otherwise $\theta_i^{-}$ would not be minimal. Now, because $\ell_i(\theta_i^{-}) < L_\alpha$, because $\ell_i$ is decreasing for all values $\theta_i < \theta_i^*$, and because $\ell_i$ is continuous, there must exist some value $\theta_i < \theta_i^{-}$ with $\ell_i(\theta_i^{-}) < \ell_i(\theta_i) \le L_\alpha$. Therefore, $(\vec{\theta}^{-}, \vec{b}^{-})$ is not a solution for the minimization version of (3), which is a contradiction. The argument for the maximization version is analogous.* □

Fig. 1 provides a graphical intuition for the proof: we can find the two solutions to $\ell_i(\theta_i) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$ by starting at the minimum $\theta_i^*$ of $\ell$ and then decreasing/increasing $\theta_i$ as much as possible while maintaining a loss of at most $\ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$. To do so, we automatically need to adjust all other parameters to allow for as much slack as possible to increase $\theta_i$.

The key insight for our solver is that problem (3) becomes much more efficient if we *only* optimize over $\theta_i$ and not over any remaining parameters. In particular, we can re-write the NLL (1) as:

$$\tilde{\ell}_i(\theta_i) = \sum_{j=1}^{n} -y_{i,j} \cdot \log[p_{i,j}] - (1 - y_{i,j}) \cdot \log[1 - p_{i,j}] + \frac{1}{2C} \cdot \theta_i^2 + \ell_{\neg i}, \tag{4}$$

where $\ell_{\neg i}$ is a constant term that does not depend on $\theta_i$. Computing $\tilde{\ell}_i$ only requires $\mathcal{O}(n)$ computations, whereas the full NLL (1) requires $\mathcal{O}(m \cdot n)$. Overall, we obtain the simplified problem:

$$\min_{\theta_i} \quad \pm\theta_i \quad \text{s.t.} \quad \tilde{\ell}_i(\theta_i) \le \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}. \tag{5}$$

By extension of Theorem 1, solving (5) is equivalent to solving the equation $\tilde{\ell}_i(\theta_i) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$, which we can do efficiently via standard nonlinear equation solvers.

Importantly, our solution of (5) will be sub-optimal according to the original problem (3) because ${\stackrel{~}{\ell}}_{i}$ only approximates ${\ell}_{i}$. Therefore, we update ${\stackrel{~}{\ell}}_{i}$ by keeping ${\mathit{\theta}}_{i}$ fixed and minimizing over all other parameters ${\overrightarrow{\mathit{\theta}}}_{\neg i}$ and $\overrightarrow{b}$. Then, we can solve (5) again, and so forth, until convergence. This is our proposed alternating optimization (AO) solver.

Figure 2 shows an illustration of the algorithm. To infer the upper bound ${\mathit{\theta}}_{i}^{\text{+}}$, we start at ${\mathit{\theta}}_{0,i}^{\text{+}}={\mathit{\theta}}_{i}^{\text{\u2217}}$, i.e. the optimal value according to the NLL. We obtain the next estimate ${\mathit{\theta}}_{1,i}^{\text{+}}$ by solving the $-{\mathit{\theta}}_{i}$ version of (5) with respect to the current surrogate NLL ${\stackrel{~}{\ell}}_{0,i}^{\text{+}}$. Then, we update all other parameters ${\overrightarrow{\mathit{\theta}}}_{\neg i}$ and $\overrightarrow{b}$ by minimizing the NLL (1), yielding a new surrogate ${\stackrel{~}{\ell}}_{1,i}^{\text{+}}$. Solving (5) again yields the next estimate ${\mathit{\theta}}_{2,i}^{\text{+}}$. In this example, ${\mathit{\theta}}_{2,i}^{\text{+}}$ would already be indistinguishable from the optimal ${\mathit{\theta}}_{i}^{\text{+}}$ according to (3) because ${\stackrel{~}{\ell}}_{1,i}^{\text{+}}$ and ${\ell}_{i}$ strongly overlap. In general, we can prove that ${\mathit{\theta}}_{t,i}^{\text{+}}$ converges to ${\mathit{\theta}}_{i}^{\text{+}}$.
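The AO solver for the upper bound can be sketched as follows. This is a simplified pure-Python illustration, not the implementation from our repository: the inner minimization uses plain gradient descent, the 1-D problem (5) is solved by bisection, and all function names are ours.

```python
import math

CHI2_95_1 = 3.841458820694124  # 0.95-quantile of chi^2 with one degree of freedom

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def full_nll(thetas, bs, Y, C=1.0):
    """Regularized NLL of Eq. (1)."""
    total = (sum(t * t for t in thetas) + sum(b * b for b in bs)) / (2.0 * C)
    for i, th in enumerate(thetas):
        for j, b in enumerate(bs):
            p = sigma(th - b)
            total += -Y[i][j] * math.log(p) - (1 - Y[i][j]) * math.log(1.0 - p)
    return total

def fit(thetas, bs, Y, C=1.0, frozen=None, lr=0.1, steps=2000):
    """Plain gradient descent on the NLL; `frozen` optionally holds one
    student's ability fixed (the inner minimization of the AO solver)."""
    thetas, bs = list(thetas), list(bs)
    m, n = len(thetas), len(bs)
    for _ in range(steps):
        g_t = [thetas[i] / C for i in range(m)]
        g_b = [bs[j] / C for j in range(n)]
        for i in range(m):
            for j in range(n):
                p = sigma(thetas[i] - bs[j])
                g_t[i] += p - Y[i][j]
                g_b[j] += Y[i][j] - p
        for i in range(m):
            if i != frozen:
                thetas[i] -= lr * g_t[i]
        for j in range(n):
            bs[j] -= lr * g_b[j]
    return thetas, bs

def student_terms(theta, bs, y_row, C=1.0):
    """All terms of the NLL that depend on theta_i (cf. Eq. (4))."""
    total = theta * theta / (2.0 * C)
    for y, b in zip(y_row, bs):
        p = sigma(theta - b)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total

def ao_upper_bound(Y, i, C=1.0, outer_steps=3):
    """Alternating optimization for the upper confidence bound of student i.
    Returns the sequence of iterates theta_{0,i}^+, theta_{1,i}^+, ..."""
    m, n = len(Y), len(Y[0])
    thetas, bs = fit([0.0] * m, [0.0] * n, Y, C)
    L_alpha = full_nll(thetas, bs, Y, C) + 0.5 * CHI2_95_1
    iterates = [thetas[i]]
    for _ in range(outer_steps):
        # surrogate NLL: only student i's terms depend on theta_i
        const = full_nll(thetas, bs, Y, C) - student_terms(thetas[i], bs, Y[i], C)
        # bisection for surrogate(theta) = L_alpha to the right of theta_i
        lo, hi = thetas[i], thetas[i] + 20.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if const + student_terms(mid, bs, Y[i], C) < L_alpha:
                lo = mid
            else:
                hi = mid
        thetas[i] = 0.5 * (lo + hi)
        iterates.append(thetas[i])
        # re-fit all remaining parameters with theta_i held fixed
        thetas, bs = fit(thetas, bs, Y, C, frozen=i)
    return iterates
```

As Theorem 2 below shows, the iterates form a non-decreasing sequence that converges to the upper bound of problem (3).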

Theorem 2. *Let $\theta_{0,i}^{\pm} = \theta_i^*$, where $(\vec{\theta}^*, \vec{b}^*)$ minimizes the NLL $\ell(\vec{\theta}, \vec{b})$. Let $\tilde{\ell}_{t,i}^{\pm}$ be the surrogate NLL (4) for the parameters $\arg\min_{\vec{\theta}_{\neg i}, \vec{b}} \ell(\vec{\theta}, \vec{b})$ for fixed $\theta_i = \theta_{t,i}^{\pm}$, and let $\theta_{t+1,i}^{\pm}$ be the solutions of the $\mp$ versions of (5) for $\tilde{\ell}_i = \tilde{\ell}_{t,i}^{\pm}$. Then, for $C > 0$ and $t \to \infty$, $\theta_{t,i}^{\pm}$ converge to solutions of the minimization/maximization version of (3).*

Proof. *As a notational shorthand, let $L_\alpha := \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$. For simplicity, we only consider $\theta_{t,i}^{+}$ here; the proof for $\theta_{t,i}^{-}$ is analogous. First, observe that, for all $t$, we have $\tilde{\ell}_{t,i}^{+}(\theta_{t,i}^{+}) = \ell_i(\theta_{t,i}^{+})$ by construction, and we have $\tilde{\ell}_{t,i}^{+}(\theta_{t+1,i}^{+}) = L_\alpha$, otherwise $\theta_{t+1,i}^{+}$ would not be a solution to (5). Further, by definition, we have $\tilde{\ell}_{t,i}^{+}(\theta_i) \ge \ell_i(\theta_i)$ for all $t$ and all $\theta_i \in \mathbb{R}$. Therefore, we obtain $\tilde{\ell}_{t,i}^{+}(\theta_{t,i}^{+}) = \ell_i(\theta_{t,i}^{+}) \le \tilde{\ell}_{t-1,i}^{+}(\theta_{t,i}^{+}) = L_\alpha$. Accordingly, whenever we solve (5) in step $t+1$, $\theta_{t,i}^{+}$ is a feasible initial point. Therefore, $\theta_{t+1,i}^{+} \ge \theta_{t,i}^{+}$.*

*Whenever $\theta_{t+1,i}^{+} = \theta_{t,i}^{+}$, we have $\tilde{\ell}_{t+1,i}^{+} = \tilde{\ell}_{t,i}^{+}$ and, thus, $\theta_{t,i}^{+} = \theta_{t+1,i}^{+} = \theta_{t+2,i}^{+} = \dots$, that is, we have a fixed point. Further, we have $\ell_i(\theta_{t+1,i}^{+}) = \tilde{\ell}_{t+1,i}^{+}(\theta_{t+1,i}^{+}) = \tilde{\ell}_{t+1,i}^{+}(\theta_{t,i}^{+}) = L_\alpha$, which is a solution to the maximization version of (3). Before the fixed point, the sequence $\theta_{0,i}^{+}, \theta_{1,i}^{+}, \dots$ is strictly increasing. Since $\ell_i$ is convex with a minimum at $\theta_{0,i}^{+}$, the sequence $\ell_i(\theta_{0,i}^{+}), \ell_i(\theta_{1,i}^{+}), \dots$ is also increasing. For $C > 0$, it would eventually grow beyond bounds due to the regularization term, but since it is upper-bounded by $L_\alpha$, it must converge to $L_\alpha$ and, hence, $\theta_{t,i}^{+}$ must converge to $\theta_i^{+}$.* □

While Theorem 2 only holds in the limit, very few steps $t$ suffice for a close-to-optimal solution in practice (e.g., Figure 2). We investigate this issue in more detail in our experiments.

# 3. EXPERIMENTS

In our experiments, we simulate synthetic data from a ground-truth IRT model with abilities and difficulties sampled from a standard normal distribution. We varied the number of students $m$ in the range $\{30, 50, 100, 500\}$ and the number of items $n$ in the range $\{10, 20\}$ (similar to the protocol of [3]). For each model, we repeated the sampling $10$ times to assess variation. In each repeat, we computed confidence intervals for $\theta_i$ at $\alpha = 0.05$ via Wald's method, the likelihood profile method via solving $\ell_i(\theta_i) = \ell^* + \frac{1}{2}\chi^{2,-1}_{\alpha}$ for each $i$ (LP), a log-barrier solver (barrier) for (3), and our proposed alternating optimization scheme (Theorem 2) for $t = 1$ (AO(1)), $t = 2$ (AO(2)), and $t = 3$ (AO(3)) steps. All experimental code is available at `https://github.com/bpaassen/ability_bounds`.

| $m$ | $n$ | Wald | LP | barrier | AO(1) | AO(2) | AO(3) |
|---|---|---|---|---|---|---|---|
| $30$ | $10$ | $0.983 \pm 0.022$ | $0.943 \pm 0.054$ | $0.910 \pm 0.056$ | $0.943 \pm 0.054$ | $0.943 \pm 0.054$ | $0.943 \pm 0.054$ |
| $30$ | $20$ | $1.000 \pm 0.000$ | $0.943 \pm 0.021$ | $0.083 \pm 0.050$ | $0.940 \pm 0.025$ | $0.943 \pm 0.021$ | $0.943 \pm 0.021$ |
| $50$ | $10$ | $0.998 \pm 0.006$ | $0.926 \pm 0.041$ | $0.884 \pm 0.047$ | $0.920 \pm 0.041$ | $0.926 \pm 0.041$ | $0.926 \pm 0.041$ |
| $50$ | $20$ | $0.998 \pm 0.006$ | $0.942 \pm 0.039$ | $0.098 \pm 0.039$ | $0.938 \pm 0.040$ | $0.942 \pm 0.039$ | $0.942 \pm 0.039$ |
| $100$ | $10$ | $0.998 \pm 0.004$ | $0.934 \pm 0.028$ | $0.892 \pm 0.029$ | $0.933 \pm 0.027$ | $0.934 \pm 0.028$ | $0.934 \pm 0.028$ |
| $100$ | $20$ | $0.999 \pm 0.003$ | $0.937 \pm 0.030$ | $0.059 \pm 0.012$ | $0.933 \pm 0.031$ | $0.937 \pm 0.030$ | $0.937 \pm 0.030$ |
| $500$ | $10$ | $0.998 \pm 0.002$ | $0.940 \pm 0.011$ | $0.898 \pm 0.015$ | $0.939 \pm 0.010$ | $0.940 \pm 0.011$ | $0.940 \pm 0.011$ |
| $500$ | $20$ | $1.000 \pm 0.001$ | $0.949 \pm 0.008$ | $0.097 \pm 0.016$ | $0.948 \pm 0.008$ | $0.949 \pm 0.008$ | $0.949 \pm 0.008$ |

Table 1: Coverage rates (mean $\pm$ standard deviation across $10$ repeats) for all methods and experimental conditions.

Table 1 shows the coverage rates of all methods, that is, the rate at which the ground-truth $\theta_i$ value was included in the confidence interval $[\theta_i^{-}, \theta_i^{+}]$ [3]. For $\alpha = 0.05$, the coverage rate should be as close as possible to $0.95$. We observe that Wald's method selects too large confidence intervals, yielding rates close to $100\%$. The barrier method selects smaller confidence intervals, yielding rates around $90\%$ for $n = 10$ items. If we choose $n = 20$ items, the barrier method becomes numerically unstable and the coverage rates degrade (below $10\%$). The likelihood profile (LP) method consistently achieves rates between $92\%$ and $95\%$ and thus is closest to the desired value of $95\%$. Alternating optimization achieves results indistinguishable from LP at 3 significant digits for $t \ge 2$ (AO(2) and AO(3)).
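The evaluation metric itself is straightforward to compute; a minimal sketch (the function name is ours):

```python
def coverage_rate(true_abilities, intervals):
    """Fraction of students whose ground-truth ability lies inside
    the estimated confidence interval (lower, upper)."""
    hits = sum(
        1 for theta, (lower, upper) in zip(true_abilities, intervals)
        if lower <= theta <= upper
    )
    return hits / len(true_abilities)
```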

Figure 3 displays the time it took to compute confidence intervals for all students in every experimental setting in log-log plots. Dotted gray lines are linear reference functions at $0.1$, $1$, $10$, and $100$ ms per student. Using linear fits, we find that Wald's method is fastest (at about $0.3$ ms per student), followed by AO(1) (about $0.8$ ms per student), barrier (about $4$ ms per student), AO(2) (about $9$ ms per student), AO(3) (about $12$ ms per student), and finally LP (about $37$ ms per student). Accordingly, AO(1) is roughly 50x faster than LP, and AO(2) is roughly 4x faster.

# 4. DISCUSSION AND CONCLUSION

In this paper, we introduced a new solver for confidence intervals on item response theory parameters via the likelihood profile method. In particular, we found that our alternating optimization solver was 4-50 times faster than the existing solver while achieving almost indistinguishable results. For anyone who seeks to compute confidence intervals, this should provide massive speedups in practice. More generally, though, we hope that our new solver can help the community to respond to an emerging need in educational data mining: more and more, policy makers and the general public expect machine learning models to provide explanations for their decisions. This is exemplified by recent policy initiatives in the European Union for a right to explanation and a risk-based approach to regulate artificial intelligence. Grading student ability—like in item response theory—will likely be under increasing scrutiny in the years to come. As such, we believe that it is crucial to quantify a model’s uncertainty precisely, which our solver can help to do.

We also provide a key theoretical insight in our paper: Traditionally, confidence intervals express the range of parameter values which is likely to contain the 'true' value. We showed that the likelihood profile method for confidence intervals can also be interpreted as an optimization problem: We try to find the minimum/maximum ability value for a student which is still consistent with a high likelihood of the data. This interpretation provides a new way to explain an ability estimate: The upper bound of our confidence interval is the highest possible grade we can give to a student while still being consistent with the responses they provided.

Beyond this paper, there remains ample room for future work. Our evaluation has only covered synthetic data to validate the runtime advantage and the closeness to existing methods. Future work could investigate how large confidence intervals tend to be in practical scenarios. Further, our experiments indicated that the size of our confidence intervals is still larger than it would need to be to cover the 'true' ability value. Future work could try to find new methods to compute confidence intervals which are more precise. In particular, it may be promising to investigate combinations with recently proposed models for variational item response theory [14].

# Acknowledgements

Funding by the German Federal Ministry of Education and Research (BMBF) under grant number 21INVI1403 (project KIPerWeb) is gratefully acknowledged.

# 5. REFERENCES

1. F. Baker and S.-H. Kim. *Item Response Theory: Parameter Estimation Techniques*. CRC Press, Boca Raton, FL, USA, 2nd edition, 2004.
2. S. Boyd and L. Vandenberghe. *Convex Optimization*. Cambridge University Press, Cambridge, UK, 2009.
3. R. P. Chalmers, J. Pek, and Y. Liu. Profile-likelihood confidence intervals in item response theory models. *Multivariate Behavioral Research*, 52(5):533–550, 2017.
4. A. Doebler, P. Doebler, and H. Holling. Optimal and most exact confidence intervals for person parameters in item response theory models. *Psychometrika*, 78(1):98–115, 2013.
5. S. Embretson and S. Reise. *Item response theory for psychologists*. Psychology Press, New York, NY, USA, 2000.
6. European Commission. Laying down harmonised rules on artificial intelligence and amending certain union legislative acts, 2021.
7. C. Göpfert, L. Pfannschmidt, J. P. Göpfert, and B. Hammer. Interpretation of linear classifiers by means of feature relevance bounds. *Neurocomputing*, 298:69–79, 2018.
8. R. Hambleton and H. Swaminathan. *Item response theory: Principles and applications*. Springer Science+Business Media, New York, NY, USA, 1985.
9. M. E. Kaminski. The right to explanation, explained. *Berkeley Technology Law Journal*, 34:190–218, 2019.
10. R. P. McDonald. A basis for multidimensional item response theory. *Applied Psychological Measurement*, 24(2):99–114, 2000.
11. K. Niemeijer, R. Feskens, G. Krempl, J. Koops, and M. J. S. Brinkhuis. Constructing and predicting school advice for academic achievement: A comparison of item response theory and machine learning techniques. In *Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (LAK)*, pages 462–471, New York, NY, USA, 2020. Association for Computing Machinery.
12. P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis – a new alternative to knowledge tracing. In *Proceedings of the International Conference on Artificial Intelligence in Education (AIED)*, pages 531–538, 2009.
13. K. Pliakos, S.-H. Joo, J. Y. Park, F. Cornillie, C. Vens, and W. Van den Noortgate. Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. *Computers & Education*, 137:91–103, 2019.
14. M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, *Proceedings of the 13th International Conference on Educational Data Mining (EDM)*, pages 257–268, 2020.

© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.