ABSTRACT
Item response theory models the probability of correct student responses based on two interacting parameters: student ability and item difficulty. Whenever we estimate student ability, students have a legitimate interest in knowing how certain the estimate is. Confidence intervals are a natural measure of uncertainty. Unfortunately, computing confidence intervals can be computationally demanding. In this paper, we show that confidence intervals can be expressed as the solution to a feature relevance optimization problem. We use this insight to develop a novel solver for confidence intervals and thus achieve speedups by 4-50x while retaining near-indistinguishable results to the state-of-the-art approach.
Keywords
1. INTRODUCTION
Item response theory (IRT) is a well-established method to model the responses of students on a test [1, 5, 8]. The basic version models the probability of student to answer correctly on item as , where is a parameter representing the ability of student , and is a parameter representing the difficulty of item . If a student’s ability exceeds the item’s difficulty, the success probability is larger than , and vice versa. Numerous extensions have been developed over the years, such as parameters for item discrimination (two-parameter IRT), base success chance due to random guessing (three-parameter IRT) [1, 5], IRT for multiple skills [10], performance factor analysis, which describes the increased success chance for repeated attempts [12], or combinations with machine learning [11, 13].
Whenever we use IRT to estimate student ability, students have a legitimate interest in making sure that our model does not underestimate them (in line with the European Union’s concept of a right to explanation [6, 9]). Accordingly, it is important that we not only estimate each student’s ability, but also our uncertainty. We can quantify uncertainty in IRT models via confidence intervals [3, 4]. Roughly speaking, an -confidence interval describes a range of possible parameter values such that the ‘true’ value is outside the range with probability at most . For example, Wald’s method assumes that parameters are Gaussian-distributed and uses the Gaussian’s standard deviation to estimate confidence intervals [3]. While this is efficient, the Gaussian approximation assumes symmetry of the distribution and is only valid for large numbers of items in the test[4]. This is often unrealistic, which is why Wald’s method tends to provide inaccurate confidence intervals in practice [4]. Doebler et al. have reacted by providing more exact methods under the assumption of constant item parameters but the resulting confidence intervals are disconnected regions [4], which are challenging to interpret.
More recently, Chalmers et al. [3] suggested the likelihood profile method, which is based on a likelihood ratio test. In particular, we look for a parameter range in which the log-likelihood is at least the optimal log-likelihood minus half the -quantile of the -distribution with one degree of freedom. Because this method does not assume symmetry and works on the ’true’ likelihood, it improves the accuracy of confidence intervals considerably. However, the computation requires a nested optimization scheme for every single parameter, which is computationally demanding, especially for large numbers of students.
In this paper, we provide a new perspective by showing that
likelihood profile confidence intervals are equivalent to feature
relevance intervals, which can be found via an optimization
problem [7]. Second, we utilize this theoretical insight to develop
a novel solver for confidence intervals which is considerably
faster (by factors of 4-50x), while the resulting confidence
intervals are almost indistinguishable from direct computation.
We evaluate our proposed solver on a range of synthetic
experimental conditions from 30-500 students and tests with
10 or 20 items. The experimental code can be found at
https://github.com/bpaassen/ability_bounds
.
2. METHOD
Our goal is to develop a faster solver for confidence bounds for ability parameters of IRT models via the likelihood profile technique [3]. For an illustration of the technique, consider Fig. 1. The blue curve illustrates the best negative log-likelihood (NLL) we can achieve when fixing the ability parameter to the value on the axis. The likelihood profile technique now asks for the values and , such that the NLL is exactly , where is the overall minimum NLL and is the quantile of the distribution with one degree of freedom. Our confidence interval is, then, given as .
More precisely, the NLL of an IRT model on a dataset with students and items is given as
where is if student answered correctly on item and is , otherwise, , and is the variance of a Gaussian prior on the ability parameters and item parameters [1] (chapter 7).
Now, let denote the vector without its th element and let . In other words, is the minimum NLL value we can achieve if we hold fixed but optimize all other parameters. Then, the likelihood profile method solves the equation , where . Because is convex in , this equation has exactly two solutions, which correspond to our interval bounds. The drawback of the likelihood profile method is its computational demand. For each student , we need to solve a nested optimization problem, where the outer optimization searches for a solution to the equation , and the inner optimization computes for eachvalue of probed by the outer optimization. We look for a way to speed up this computation.
Our key inspiration is the concept of feature relevance intervals (FRI) proposed by Göpfert et al. [7]. The FRI for some parameter is defined as the interval between the minimum and maximum value that still retains a loss of at most times the optimal loss, where is a user-defined hyperparameter. More precisely, given some loss function of some parameter vector and some specific parameter , the FRI for is computed by solving the following minimization/maximization problem:
(2)Note that this concept is more general than confidence intervals and is mostly intended for classifiers in machine learning [7]. However, for IRT, confidence intervals via the likelihood profile method and FRIs happen to be equivalent.
Theorem 1. Let for the NLL (1) and some . Then, for any and any , it holds: Problem (3) is convex with a global optimum , such that .
(3)Proof. As a notational shorthand, let . Note that the objective function of (3) is linear and hence convex. Now, consider the feasible set . Let and let for some , that is, is on the connecting line between and . Then, it must hold: , because is convex with respect to and . Further, because , and . Therefore, , implying that . Hence, is convex and (3) is convex.
Next, note that, for any , is strictly larger zero. Therefore, the feasible set is not empty (it contains at least the minimizer of ). In turn, problem (3) must have a optimum with because the objective function is strictly increasing.
Finally, we need to show that . Assume this is not the case. If , then , which is a contradiction. If , we need to inspect the behavior of in more detail. Let be the solution to the unconstrained problem . Then, we know that attains its minimum at with . Further, because is defined as the component-wise minimum of a convex function, it is convex itself [2] (p. 87). In turn, we know that for any , is increasing and for any , is decreasing. Because , it is also guaranteed that exceeds for sufficiently large/small , e.g., . Now, consider the minimization version of (3). In that case, we certainly have , otherwise would not be minimal. Now, because ,because is decreasing for all values , and because is continuous, there must exist some value with . Therefore, is not a solution for the minimization version of (3), which is a contradiction. The argument for the maximization version is analogous. __
Fig. 1 provides a graphical intuition for the proof: We can find the two solutions to by starting at the minimum of and then decreasing/increasing as much as possible while maintaining a loss that is at most . To do so, we automatically need to adjust all other parameters to allow for as much slack as possible to increase .
The key insight for our solver is that problem (3) becomes much more efficient if we only optimize over and not over any remaining parameters. In particular, we can re-write the NLL (1) as:
where is a constant term that does not depend on . Computing only requires computations, whereas the full NLL (1) requires . Overall, we obtain the simplified problem:
(5)By extension of Theorem 1, solving (5) is equivalent to solving the equation , which we can do efficiently via standard nonlinear equation solvers.
Importantly, our solution of (5) will be sub-optimal according to the original problem (3) because only approximates . Therefore, we update by keeping fixed and minimizing over all other parameters and . Then, we can solve (5) again, and so forth, until convergence. This is our proposed alternating optimization (AO) solver.
Figure 2 shows an illustration of the algorithm. To infer the upper bound , we start at , i.e. the optimal value according to the NLL. We obtain the next estimate by solving the version of (5) with respect to the current surrogate NLL . Then, we update all other parameters and by minimizing the NLL (1), yielding a new surrogate . Solving (5) again yields the next estimate . In this example, would already be indistinguishable from the optimal according to (3) because and strongly overlap. In general, we can prove that converges to .
Theorem 2. Let ,
where
minimizes the NLL .
Let
be the surrogate NLL (4) for parameters
for fixed ,
and let
be the solutions of the
versions of (5) for .
Then, for
and ,
converge to solutions of the minimization/maximization version
of (3).
Proof. As a notational shorthand, let .
For simplicity, we only consider
here; the proof for
is analogous. First, observe that, for all ,
we have
by construction, and we have ,
otherwise
would not be a solution to (5). Further, by definition, we
have
for all
and all .
Therefore, we obtain .
Accordingly, whenever we solve (5) in step ,
is a feasible initial point. Therefore, .
Whenever , we have and, thus, , that is, we have a fixed point. Further, we have , which is a solution to the maximization version of (3). Before the fixed point, the sequence is strictly increasing. Since is convex with a minimum at , the sequence is also increasing. For , it would eventually grow beyond bounds due to the regularization term, but since it is upper-bounded by , it must converge to and, hence, must converge to . __
While Theorem 2 only holds in the limit, very few steps suffice for a close-to-optimal solution in practice (e.g., Figure 2). We investigate this issue in more detail in our experiments.
3. EXPERIMENTS
In our experiments, we simulate synthetic data from a
ground-truth IRT model with ability and difficulties sampled from
a standard normal distribution. We varied the number of students
in the range
and the number
of items in
the range
(similar to the protocol of [3]). For each model, we repeated the
sampling
times to asses variation. In each repeat, we computed confidence
intervals for
at
via Wald’s method, the likelihood profile-method via solving
for
each
(LP), a log-barrier solver (barrier) for (3), and our
proposed alternating optimization scheme (Theorem 2) for
(AO(1)),
(AO(2)),
and
(AO(3)) steps. All experimental code is available at
https://github.com/bpaassen/ability_bounds
.
Wald | LP | barrier | AO(1) | AO(2) | AO(3) | ||
---|---|---|---|---|---|---|---|
Table 1 shows the coverage rates of all methods, that is, the rate at which the ground truth value was included in the confidence interval [3]. For , the coverage rate should be as close as possible to . We observe that Wald’s method selects too large confidence intervals, yielding rates close to . The barrier method selects smaller confidence intervals, yielding rates around for items. If we choose items, the barrier method becomes numerically unstable and the coverage rates degrade (smaller ). The likelihood profile (LP) method consistently achieves rates between and and thus is closest to the desired value of . Alternating optimization achieves indistinguishable results to LP at 3 significant digits for (AO(2) and AO(3)).
Figure 3 displays the time it took to compute confidence intervals for all students in every experimental settings in log-log plots. Dotted, gray lines are linear reference functions at .1, 1, 10, and 100ms per student. Using linear fits, we find that Wald’s method is fastest (at about ms per student), followed by AO(1) (about ms per student), barrier (about 4ms per student), AO(2) (about 9ms per student), AO(3) (about 12ms per student), and finally LP (about 37ms per student). Accordingly, AO(1) is roughly 50x faster than LP, and AO(2) is roughly 4x faster.
4. DISCUSSION AND CONCLUSION
In this paper, we introduced a new solver for confidence intervals on item response theory parameters via the likelihood profile method. In particular, we found that our alternating optimization solver was 4-50 times faster than the existing solver while achieving almost indistinguishable results. For anyone who seeks to compute confidence intervals, this should provide massive speedups in practice. More generally, though, we hope that our new solver can help the community to respond to an emerging need in educational data mining: more and more, policy makers and the general public expect machine learning models to provide explanations for their decisions. This is exemplified by recent policy initiatives in the European Union for a right to explanation and a risk-based approach to regulate artificial intelligence. Grading student ability—like in item response theory—will likely be under increasing scrutiny in the years to come. As such, we believe that it is crucial to quantify a model’s uncertainty precisely, which our solver can help to do.
We also provide a key theoretical insight in our paper: Traditionally, confidence intervals express the range of parameter values which is likely to contain the ’true’ value. We showed that the likelihood profile method for confidence intervals can also be interpreted as an optimization problem: We try to find the minimum/maximum ability value for a student which is still consistent with a high likelihood of the data. This interpretation provides a new way to explain an ability estimate: The upper bound of our confidence interval is the highest possible grade we can give to a student while still being consistent with the responses they provided.
Beyond this paper, there remains ample room for future work. Our evaluation has only covered synthetic data to validate the runtime advantage and the closeness to existing methods. Future work could investigate how large confidence intervals tend to be in practical scenarios. Further, our experiments indicated that the size of our confidence is still larger than it would need to be to cover the ’true’ ability value. Future work could try to find new methods to compute confidence intervals which are more precise. In particular, it may be promising to investigate combinations with recently proposed models for variational item response theory [14].
Acknowledgements
Funding by the German Ministry for Education and Research (BMBF) under grant number 21INVI1403 (project KIPerWeb) is greatfully acknowledged.
5. REFERENCES
- F. Baker and S.-H. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, Boca Raton, FL, USA, 2 edition, 2004.
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2009.
- R. P. Chalmers, J. Pek, and Y. Liu. Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5):533–550, 2017.
- A. Doebler, P. Doebler, and H. Holling. Optimal and most exact confidence intervals for person parameters in item response theory models. Psychometrika, 78(1):98–115, 2013.
- S. Embretson and S. Reise. Item response theory for psychologists. Psychology Press, New York, NY, USA, 2000.
- European Commission. Laying down harmonised rules on artificial inteligence and amending certain union legislative acts, 2021.
- C. Göpfert, L. Pfannschmidt, J. P. Göpfert, and B. Hammer. Interpretation of linear classifiers by means of feature relevance bounds. Neurocomputing, 298:69–79, 2018.
- R. Hambleton and H. Swaminathan. Item response theory: Principles and applications. Springer Science+Business Media, New York, NY, USA, 1985.
- M. E. Kaminski. The right to explanation, explained. Berkeley Technology Law Journal, 34:190–218, 2019.
- R. P. McDonald. A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2):99–114, 2000.
- K. Niemeijer, R. Feskens, G. Krempl, J. Koops, and M. J. S. Brinkhuis. Constructing and predicting school advice for academic achievement: A comparison of item response theory and machine learning techniques. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (LAK), page 462–471, New York, NY, USA, 2020. Association for Computing Machinery.
- P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis –a new alternative to knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education (AIED), page 531–538, 2009.
- K. Pliakos, S.-H. Joo, J. Y. Park, F. Cornillie, C. Vens, and W. Van den Noortgate. Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. Computers & Education, 137:91–103, 2019.
- M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, Proceedings of the 13th International Conference on Educational Data Mining (EDM), pages 257–268, 2020.
© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.