Faster Confidence Intervals for Item Response Theory via an Approximate Likelihood Profile
Benjamin Paaßen
German Research Center for
Artificial Intelligence
Berlin, Germany
benjamin.paassen@dfki.de
Christina Göpfert
Faculty of Technology
Bielefeld University
Bielefeld, Germany
Niels Pinkwart
Insitute of Computer Science
Humboldt-University of Berlin
Berlin, Germany

ABSTRACT

Item response theory models the probability of correct student responses based on two interacting parameters: student ability and item difficulty. Whenever we estimate student ability, students have a legitimate interest in knowing how certain the estimate is. Confidence intervals are a natural measure of uncertainty. Unfortunately, computing confidence intervals can be computationally demanding. In this paper, we show that confidence intervals can be expressed as the solution to a feature relevance optimization problem. We use this insight to develop a novel solver for confidence intervals and thus achieve speedups by 4-50x while retaining near-indistinguishable results to the state-of-the-art approach.

Keywords

item response theory, confidence intervals, relevance intervals, approximation

1. INTRODUCTION

Item response theory (IRT) is a well-established method to model the responses of students on a test [158]. The basic version models the probability of student i to answer correctly on item j as pi,j = 1(1 + exp [bj 𝜃i]), where 𝜃i is a parameter representing the ability of student i, and bj is a parameter representing the difficulty of item j. If a student’s ability exceeds the item’s difficulty, the success probability is larger than 0.5, and vice versa. Numerous extensions have been developed over the years, such as parameters for item discrimination (two-parameter IRT), base success chance due to random guessing (three-parameter IRT) [15], IRT for multiple skills [10], performance factor analysis, which describes the increased success chance for repeated attempts [12], or combinations with machine learning [1113].

Whenever we use IRT to estimate student ability, students have a legitimate interest in making sure that our model does not underestimate them (in line with the European Union’s concept of a right to explanation [69]). Accordingly, it is important that we not only estimate each student’s ability, but also our uncertainty. We can quantify uncertainty in IRT models via confidence intervals [34]. Roughly speaking, an α-confidence interval describes a range of possible parameter values such that the ‘true’ value is outside the range with probability at most α. For example, Wald’s method assumes that parameters are Gaussian-distributed and uses the Gaussian’s standard deviation to estimate confidence intervals [3]. While this is efficient, the Gaussian approximation assumes symmetry of the distribution and is only valid for large numbers of items in the test[4]. This is often unrealistic, which is why Wald’s method tends to provide inaccurate confidence intervals in practice [4]. Doebler et al. have reacted by providing more exact methods under the assumption of constant item parameters but the resulting confidence intervals are disconnected regions [4], which are challenging to interpret.

More recently, Chalmers et al. [3] suggested the likelihood profile method, which is based on a likelihood ratio test. In particular, we look for a parameter range in which the log-likelihood is at least the optimal log-likelihood minus half the 1 α-quantile of the χ2-distribution with one degree of freedom. Because this method does not assume symmetry and works on the ’true’ likelihood, it improves the accuracy of confidence intervals considerably. However, the computation requires a nested optimization scheme for every single parameter, which is computationally demanding, especially for large numbers of students.

In this paper, we provide a new perspective by showing that likelihood profile confidence intervals are equivalent to feature relevance intervals, which can be found via an optimization problem [7]. Second, we utilize this theoretical insight to develop a novel solver for confidence intervals which is considerably faster (by factors of 4-50x), while the resulting confidence intervals are almost indistinguishable from direct computation. We evaluate our proposed solver on a range of synthetic experimental conditions from 30-500 students and tests with 10 or 20 items. The experimental code can be found at https://github.com/bpaassen/ability_bounds.

2. METHOD

Our goal is to develop a faster solver for confidence bounds for ability parameters of IRT models via the likelihood profile technique [3]. For an illustration of the technique, consider Fig. 1. The blue curve illustrates the best negative log-likelihood (NLL) we can achieve when fixing the ability parameter 𝜃i to the value on the x axis. The likelihood profile technique now asks for the values 𝜃i and 𝜃i+, such that the NLL is exactly + 1 2χα2,1, where is the overall minimum NLL and χα2,1 is the 1 α quantile of the χ2 distribution with one degree of freedom. Our α confidence interval is, then, given as [𝜃i,𝜃 i+].

A line plot of minimum NLL versus theta. Lower and upper bounds fo
Figure 1: Illustration of the optimal NLL we can achieve for a certain value of 𝜃i in a single-student, single-item setup with C = 1 and y1,1 = 1. The dashed line shows the loss bound for α = .05, the solid lines mark the corresponding confidence interval

More precisely, the NLL of an IRT model on a dataset with m students and n items is given as

(𝜃,b) = i=1m j=1n y i,j log [pi,j] (1 yi,j) log [1 pi,j] + 1 2C (𝜃2 + b2), (1) 

where yi,j is 1 if student i answered correctly on item j and is 0, otherwise, pi,j = 1(1 + exp [bj 𝜃i]), and C is the variance of a Gaussian prior on the ability parameters 𝜃 and item parameters b [1] (chapter 7).

Now, let 𝜃¬i denote the vector 𝜃 without its ith element and let i(𝜃i) := min 𝜃¬i,b(𝜃,b). In other words, i(𝜃i) is the minimum NLL value we can achieve if we hold 𝜃i fixed but optimize all other parameters. Then, the likelihood profile method solves the equation i(𝜃i) = + 1 2χα2,1, where = min 𝜃,b(𝜃,b). Because is convex in 𝜃i, this equation has exactly two solutions, which correspond to our interval bounds. The drawback of the likelihood profile method is its computational demand. For each student i, we need to solve a nested optimization problem, where the outer optimization searches for a solution to the equation i(𝜃i) = + 1 2χα2,1, and the inner optimization computes i(𝜃i) for eachvalue of 𝜃i probed by the outer optimization. We look for a way to speed up this computation.

Our key inspiration is the concept of feature relevance intervals (FRI) proposed by Göpfert et al. [7]. The FRI for some parameter 𝜃 is defined as the interval between the minimum and maximum value that still retains a loss of at most (1 δ) times the optimal loss, where δ is a user-defined hyperparameter. More precisely, given some loss function of some parameter vector 𝜃 and some specific parameter 𝜃i, the FRI for 𝜃i is computed by solving the following minimization/maximization problem:

min 𝜃 max 𝜃𝜃is.t.(𝜃) (1 + δ). (2)

Note that this concept is more general than confidence intervals and is mostly intended for classifiers in machine learning [7]. However, for IRT, confidence intervals via the likelihood profile method and FRIs happen to be equivalent.

Theorem 1. Let = min 𝜃,b(𝜃,b) for the NLL (1) and some C > 0. Then, for any α (0, 1) and any i {1,,m}, it holds: Problem (3) is convex with a global optimum 𝜃±,b±, such that (𝜃±,b±) = i(𝜃i±) = + 1 2χα2,1.

min 𝜃,b max 𝜃,b𝜃is.t.(𝜃,b) + 1 2χα2,1. (3)
Proof. As a notational shorthand, let Lα := + 1 2χα2,1. Note that the objective function of (3) is linear and hence convex. Now, consider the feasible set 𝒳 = {(𝜃,b)|(𝜃,b) Lα}. Let x,y 𝒳 and let zγ = γ x + (1 γ) y for some γ [0, 1], that is, zγ is on the connecting line between x and y. Then, it must hold: (zγ) γ (x) + (1 γ) (y), because is convex with respect to 𝜃 and b. Further, because x,y 𝒳, (x) Lα and (y) Lα. Therefore, (zγ) Lα, implying that zγ 𝒳. Hence, 𝒳 is convex and (3) is convex.

Next, note that, for any α (0, 1), χα2,1 is strictly larger zero. Therefore, the feasible set 𝒳 is not empty (it contains at least the minimizer of ). In turn, problem (3) must have a optimum (𝜃±,b±) with (𝜃±,b±) = L α because the objective function is strictly increasing.

Finally, we need to show that i(𝜃i±) = L α. Assume this is not the case. If i(𝜃i±) > L α, then min 𝜃¬i,b(𝜃,b) > (𝜃±,b±), which is a contradiction. If i(𝜃i±) < L α, we need to inspect the behavior of i in more detail. Let (𝜃,b) be the solution to the unconstrained problem min 𝜃,b(𝜃,b). Then, we know that i attains its minimum at 𝜃i with i(𝜃i) = . Further, because i is defined as the component-wise minimum of a convex function, it is convex itself [2] (p. 87). In turn, we know that for any 𝜃i > 𝜃i, i is increasing and for any 𝜃i < 𝜃i, i is decreasing. Because C > 0, it is also guaranteed that i(𝜃i) exceeds Lα for sufficiently large/small 𝜃i, e.g., i(±2 C Lα) > 1 2C2 C Lα2 = L α. Now, consider the minimization version of (3). In that case, we certainly have 𝜃i𝜃 i, otherwise 𝜃i would not be minimal. Now, because i(𝜃i) < L α,because i is decreasing for all values 𝜃i < 𝜃i, and because i is continuous, there must exist some value 𝜃i < 𝜃i with i(𝜃i) < i(𝜃i) Lα. Therefore, (𝜃,b) is not a solution for the minimization version of (3), which is a contradiction. The argument for the maximization version is analogous. __

Fig. 1 provides a graphical intuition for the proof: We can find the two solutions to i(𝜃i) = + 1 2χα2,1 by starting at the minimum 𝜃i of and then decreasing/increasing 𝜃i as much as possible while maintaining a loss that is at most + 1 2χα2,1. To do so, we automatically need to adjust all other parameters to allow for as much slack as possible to increase 𝜃i.

The key insight for our solver is that problem (3) becomes much more efficient if we only optimize over 𝜃i and not over any remaining parameters. In particular, we can re-write the NLL (1) as:

~i(𝜃i) = j=1n y i,j log [pi,j] (1 yi,j) log [1 pi,j] + 1 2C 𝜃i2 + ¬i, (4)

where ¬i is a constant term that does not depend on 𝜃i. Computing ~i only requires 𝒪(n) computations, whereas the full NLL (1) requires 𝒪(m n). Overall, we obtain the simplified problem:

min 𝜃i ±𝜃is.t.~i(𝜃i) + 1 2χα2,1. (5)

By extension of Theorem 1, solving (5) is equivalent to solving the equation ~i(𝜃i) = + 1 2χα2,1, which we can do efficiently via standard nonlinear equation solvers.

Importantly, our solution of (5) will be sub-optimal according to the original problem (3) because ~i only approximates i. Therefore, we update ~i by keeping 𝜃i fixed and minimizing over all other parameters 𝜃¬i and b. Then, we can solve (5) again, and so forth, until convergence. This is our proposed alternating optimization (AO) solver.

Figure 2 shows an illustration of the algorithm. To infer the upper bound 𝜃i+, we start at 𝜃0,i+ = 𝜃 i, i.e. the optimal value according to the NLL. We obtain the next estimate 𝜃1,i+ by solving the 𝜃i version of (5) with respect to the current surrogate NLL ~0,i+. Then, we update all other parameters 𝜃¬i and b by minimizing the NLL (1), yielding a new surrogate ~1,i+. Solving (5) again yields the next estimate 𝜃2,i+. In this example, 𝜃2,i+ would already be indistinguishable from the optimal 𝜃i+ according to (3) because ~1,i+ and i strongly overlap. In general, we can prove that 𝜃t,i+ converges to 𝜃i+.

A line plot of surrogate NLL and actual NLL versus theta for successive iteration
Figure 2: Illustration of the alternating optimization algorithm. We start with the optimum value for 𝜃0,i+ = 𝜃 1, then maximize 𝜃i via (5), yielding 𝜃1,i+, then update ~i, then maximize 𝜃i via (5) again, and so forth.

Theorem 2. Let 𝜃0,i± = 𝜃 i, where (𝜃,b) minimizes the NLL (𝜃,b). Let ~t,i± be the surrogate NLL (4) for parameters min 𝜃¬i,b(𝜃,b) for fixed 𝜃i = 𝜃t,i±, and let 𝜃t+1,i± be the solutions of the versions of (5) for ~i = ~t,i±. Then, for C > 0 and t , 𝜃t,i± converge to solutions of the minimization/maximization version of (3).

Proof. As a notational shorthand, let Lα := + 1 2χα2,1. For simplicity, we only consider 𝜃t,i+ here; the proof for 𝜃t,i is analogous. First, observe that, for all t, we have ~t,i+(𝜃 t,i+) = i(𝜃t,i+) by construction, and we have ~t,i+(𝜃 t+1,i+) = L α, otherwise 𝜃t+1,i+ would not be a solution to (5). Further, by definition, we have ~t,i+(𝜃 i) i(𝜃i) for all t and all 𝜃i . Therefore, we obtain ~t,i+(𝜃 t,i+) = i(𝜃t,i+) ~ t1,i+(𝜃 t,i+) = L α. Accordingly, whenever we solve (5) in step t + 1, 𝜃t,i+ is a feasible initial point. Therefore, 𝜃t+1,i+ 𝜃 t,i+.

Whenever 𝜃t+1,i+ = 𝜃 t,i+, we have ~t+1,i+ = ~ t,i+ and, thus, 𝜃t,i+ = 𝜃 t+1,i+ = 𝜃 t+2,i+ = , that is, we have a fixed point. Further, we have i(𝜃t+1,i+) = ~ t+1,i(𝜃t+1,i+) = ~ t+1,i(𝜃t,i+) = L α, which is a solution to the maximization version of (3). Before the fixed point, the sequence 𝜃0,i+,𝜃 1,i+, is strictly increasing. Since i is convex with a minimum at 𝜃0,i+, the sequence i(𝜃0,i+), i(𝜃1,i+), is also increasing. For C > 0, it would eventually grow beyond bounds due to the regularization term, but since it is upper-bounded by Lα, it must converge to Lα and, hence, 𝜃t,i+ must converge to 𝜃i+. __

While Theorem 2 only holds in the limit, very few steps t suffice for a close-to-optimal solution in practice (e.g., Figure 2). We investigate this issue in more detail in our experiments.

3. EXPERIMENTS

In our experiments, we simulate synthetic data from a ground-truth IRT model with ability and difficulties sampled from a standard normal distribution. We varied the number of students m in the range {30, 50, 100, 500} and the number of items n in the range {10, 20} (similar to the protocol of [3]). For each model, we repeated the sampling 10 times to asses variation. In each repeat, we computed confidence intervals for 𝜃i at α = 0.05 via Wald’s method, the likelihood profile-method via solving i(𝜃i) = + 1 2χα2,1 for each i (LP), a log-barrier solver (barrier) for (3), and our proposed alternating optimization scheme (Theorem 2) for t = 1 (AO(1)), t = 2 (AO(2)), and t = 3 (AO(3)) steps. All experimental code is available at https://github.com/bpaassen/ability_bounds.

Table 1: Coverage rates of all methods across experimental conditions
mn Wald LP barrier AO(1) AO(2) AO(3)
30100.983 ± 0.0220.943 ± 0.0540.910 ± 0.0560.943 ± 0.0540.943 ± 0.0540.943 ± 0.054
30201.000 ± 0.0000.943 ± 0.0210.083 ± 0.0500.940 ± 0.0250.943 ± 0.0210.943 ± 0.021
50100.998 ± 0.0060.926 ± 0.0410.884 ± 0.0470.920 ± 0.0410.926 ± 0.0410.926 ± 0.041
50200.998 ± 0.0060.942 ± 0.0390.098 ± 0.0390.938 ± 0.0400.942 ± 0.0390.942 ± 0.039
100100.998 ± 0.0040.934 ± 0.0280.892 ± 0.0290.933 ± 0.0270.934 ± 0.0280.934 ± 0.028
100200.999 ± 0.0030.937 ± 0.0300.059 ± 0.0120.933 ± 0.0310.937 ± 0.0300.937 ± 0.030
500100.998 ± 0.0020.940 ± 0.0110.898 ± 0.0150.939 ± 0.0100.940 ± 0.0110.940 ± 0.011
500201.000 ± 0.0010.949 ± 0.0080.097 ± 0.0160.948 ± 0.0080.949 ± 0.0080.949 ± 0.008

Table 1 shows the coverage rates of all methods, that is, the rate at which the ground truth 𝜃i value was included in the confidence interval [𝜃i,𝜃 i+] [3]. For α = 0.05, the coverage rate should be as close as possible to 0.95. We observe that Wald’s method selects too large confidence intervals, yielding rates close to 100%. The barrier method selects smaller confidence intervals, yielding rates around 90% for n = 10 items. If we choose n = 20 items, the barrier method becomes numerically unstable and the coverage rates degrade (smaller 10%). The likelihood profile (LP) method consistently achieves rates between 92% and 95% and thus is closest to the desired value of 95%. Alternating optimization achieves indistinguishable results to LP at 3 significant digits for t 2 (AO(2) and AO(3)).

Figure 3 displays the time it took to compute confidence intervals for all students in every experimental settings in log-log plots. Dotted, gray lines are linear reference functions at .1, 1, 10, and 100ms per student. Using linear fits, we find that Wald’s method is fastest (at about .3ms per student), followed by AO(1) (about .8ms per student), barrier (about 4ms per student), AO(2) (about 9ms per student), AO(3) (about 12ms per student), and finally LP (about 37ms per student). Accordingly, AO(1) is roughly 50x faster than LP, and AO(2) is roughly 4x faster.

Plots of runtime versus number of students for 10 items and 20 items
Figure 3: Log-Log plots for the runtime in seconds versus m for n = 10 (top) and n = 20 (bottom). Dotted, gray lines show linear functions at .1, 1, 10, and 100ms per student.

4. DISCUSSION AND CONCLUSION

In this paper, we introduced a new solver for confidence intervals on item response theory parameters via the likelihood profile method. In particular, we found that our alternating optimization solver was 4-50 times faster than the existing solver while achieving almost indistinguishable results. For anyone who seeks to compute confidence intervals, this should provide massive speedups in practice. More generally, though, we hope that our new solver can help the community to respond to an emerging need in educational data mining: more and more, policy makers and the general public expect machine learning models to provide explanations for their decisions. This is exemplified by recent policy initiatives in the European Union for a right to explanation and a risk-based approach to regulate artificial intelligence. Grading student ability—like in item response theory—will likely be under increasing scrutiny in the years to come. As such, we believe that it is crucial to quantify a model’s uncertainty precisely, which our solver can help to do.

We also provide a key theoretical insight in our paper: Traditionally, confidence intervals express the range of parameter values which is likely to contain the ’true’ value. We showed that the likelihood profile method for confidence intervals can also be interpreted as an optimization problem: We try to find the minimum/maximum ability value for a student which is still consistent with a high likelihood of the data. This interpretation provides a new way to explain an ability estimate: The upper bound of our confidence interval is the highest possible grade we can give to a student while still being consistent with the responses they provided.

Beyond this paper, there remains ample room for future work. Our evaluation has only covered synthetic data to validate the runtime advantage and the closeness to existing methods. Future work could investigate how large confidence intervals tend to be in practical scenarios. Further, our experiments indicated that the size of our confidence is still larger than it would need to be to cover the ’true’ ability value. Future work could try to find new methods to compute confidence intervals which are more precise. In particular, it may be promising to investigate combinations with recently proposed models for variational item response theory [14].

Acknowledgements

Funding by the German Ministry for Education and Research (BMBF) under grant number 21INVI1403 (project KIPerWeb) is greatfully acknowledged.

5. REFERENCES

  1. F. Baker and S.-H. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, Boca Raton, FL, USA, 2 edition, 2004.
  2. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2009.
  3. R. P. Chalmers, J. Pek, and Y. Liu. Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5):533–550, 2017.
  4. A. Doebler, P. Doebler, and H. Holling. Optimal and most exact confidence intervals for person parameters in item response theory models. Psychometrika, 78(1):98–115, 2013.
  5. S. Embretson and S. Reise. Item response theory for psychologists. Psychology Press, New York, NY, USA, 2000.
  6. European Commission. Laying down harmonised rules on artificial inteligence and amending certain union legislative acts, 2021.
  7. C. Göpfert, L. Pfannschmidt, J. P. Göpfert, and B. Hammer. Interpretation of linear classifiers by means of feature relevance bounds. Neurocomputing, 298:69–79, 2018.
  8. R. Hambleton and H. Swaminathan. Item response theory: Principles and applications. Springer Science+Business Media, New York, NY, USA, 1985.
  9. M. E. Kaminski. The right to explanation, explained. Berkeley Technology Law Journal, 34:190–218, 2019.
  10. R. P. McDonald. A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2):99–114, 2000.
  11. K. Niemeijer, R. Feskens, G. Krempl, J. Koops, and M. J. S. Brinkhuis. Constructing and predicting school advice for academic achievement: A comparison of item response theory and machine learning techniques. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (LAK), page 462–471, New York, NY, USA, 2020. Association for Computing Machinery.
  12. P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis –a new alternative to knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education (AIED), page 531–538, 2009.
  13. K. Pliakos, S.-H. Joo, J. Y. Park, F. Cornillie, C. Vens, and W. Van den Noortgate. Integrating machine learning into item response theory for addressing the cold start problem in adaptive learning systems. Computers & Education, 137:91–103, 2019.
  14. M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, Proceedings of the 13th International Conference on Educational Data Mining (EDM), pages 257–268, 2020.


© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.