Item response theory (IRT) is a popular method to infer student abilities and item difficulties from observed test responses. However, IRT struggles with two challenges: How to map items to skills if multiple skills are present? And how to infer the ability of new students who were not part of the training data? Inspired by recent advances in variational autoencoders for IRT, we propose a novel method to tackle both challenges: the Sparse Factor Autoencoder (SparFAE). SparFAE maps from test responses to abilities via a linear operator and from abilities to test responses via an IRT model. All parameters of the model offer an interpretation and can be learned efficiently. In experiments on synthetic and real data, we show that SparFAE is similar in accuracy to other autoencoder approaches while being faster to learn and more accurate in recovering item-skill associations.
A foundational problem in educational data mining is to automatically infer students’ ability from their observed responses in a test. Item response theory (IRT) addresses this problem by fitting a logistic model that describes how student ability and item difficulty interact to generate an observed response. However, IRT faces at least two challenges. First, whenever a test involves multiple skills, we need to model the relation between skills and items, which standard IRT does not do. Second, an IRT model contains student-specific parameters which are fitted to a specific population. For any new student, we need to fit at least one new parameter.
The former challenge can be addressed via automatic methods for item-skill association learning, such as the Q-matrix method of Barnes, the alternating least squares method, or sparse factor analysis. The second challenge requires a student-independent parametrization of the model, which is offered by variants like performance factor analysis or variational autoencoders. In the present paper, we propose to address both challenges at once by combining sparse factor analysis with autoencoders, yielding a new method which we call the Sparse Factor Autoencoder (SparFAE).
In more detail, our contributions are: We introduce SparFAE, a sparse factor autoencoding method for IRT. We provide an interpretation for all parameters in the SparFAE model, as well as an efficient learning scheme. Further, we empirically compare SparFAE to sparse factor analysis as well as variational autoencoders on synthetic and real data and show that SparFAE is similar in accuracy to other autoencoders but is much faster to learn and more accurate in recovering item-skill associations. Finally, we use SparFAE to analyze an expert-designed math test and verify the identified Q-matrix against the expert design. The source code for all experiments can be found at
2. BACKGROUND AND RELATED WORK
IRT models the responses of $n$ students on a test with $m$ items. In particular, let $x_{i,j}$ be a random variable, which is $1$ if student $i$ answered item $j$ correctly and $0$, otherwise. We assume that $x_{i,j}$ is Bernoulli-distributed, where the success probability is given as $p_{i,j} = \sigma(\theta_i - b_j)$, where $\sigma(z) = 1/(1 + \exp(-z))$ is the logistic link function, $\theta_i$ is an ability parameter for student $i$, and $b_j$ is a difficulty parameter for item $j$. The parameters $\theta_i$ and $b_j$ need to be fitted to observed training data, for example, via likelihood maximization or Bayesian parameter estimation. In particular, the negative log likelihood of the data (also known as crossentropy loss) is expressed by the formula

$$\ell = -\sum_{i=1}^n \sum_{j=1}^m x_{i,j} \cdot \log(p_{i,j}) + (1 - x_{i,j}) \cdot \log(1 - p_{i,j}) \quad (1)$$
This loss is convex in the parameters $\theta_i$ and $b_j$, meaning that an optimal model can be found efficiently via nonlinear optimization algorithms.
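As a concrete illustration, the crossentropy loss (1) can be computed directly from a response matrix. This is a minimal sketch in our own naming, not the authors' code; the small epsilon that guards against log(0) is our addition:

```python
import numpy as np

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))

def irt_nll(theta, b, X):
    """Negative log likelihood (crossentropy) of a one-dimensional IRT model.

    theta: (n,) abilities, b: (m,) difficulties,
    X: (n, m) binary response matrix (1 = correct answer)."""
    P = sigmoid(theta[:, None] - b[None, :])   # success probabilities p_ij
    eps = 1e-12                                # guard against log(0)
    return -np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps))
```

For example, with all abilities equal to all difficulties, every response has probability one half, so the loss is log(2) per response.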
Over the decades, numerous extensions of this basic scheme have been proposed, such as a discrimination parameter for each item (two-parameter model), a minimum probability of correct answers for each item (three-parameter model), partial credit, and hierarchical models [1, 4, 5]. In this paper, we care particularly about the extension to multiple underlying skills, sometimes called multidimensional IRT. In such a model, we represent a student’s ability by a $K$-dimensional vector $\vec\theta_i$, where $\theta_{i,k}$ models the ability of student $i$ for skill $k$. A consequence of including multiple skills is that we need to model the relationship between skills and items. In this paper, we assume a linear relationship that is captured by an $m \times K$ matrix $Q$, where $q_{j,k}$ models how important skill $k$ is to answer item $j$ correctly. Overall, our model is described by the two equations:

$$\vec z_i = Q \cdot \vec\theta_i - \vec b, \qquad p_{i,j} = \sigma(z_{i,j}) \quad (2)$$

where $\vec z_i$ is the vector of response logits for student $i$, and $\vec b$ is the vector of all item difficulties.
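The decoder equations (2) amount to a single affine map followed by the logistic function. A minimal sketch (function and variable names are our own):

```python
import numpy as np

def mirt_logits(Theta, Q, b):
    """Response logits of a multidimensional IRT model, cf. equation (2).

    Theta: (n, K) abilities, Q: (m, K) item-skill matrix, b: (m,) difficulties.
    Returns the (n, m) matrix of logits z_i = Q @ theta_i - b for all students."""
    return Theta @ Q.T - b[None, :]
```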
Our setup begs the question: how do we learn the matrix $Q$? Such coupling matrices between items and skills have been popularized by Tatsuoka, who imposed $q_{j,k} = 1$ if skill $k$ is required for item $j$ and $q_{j,k} = 0$, otherwise. Traditionally, such $Q$-matrices have been hand-designed by domain experts, but recently, automatic methods to learn $Q$ have emerged, such as the method of Barnes or the alternating recursive method. Crucially, finding an optimal binary $Q$-matrix is challenging due to the discrete search space. To simplify the search, Lan et al. have relaxed the problem by assuming continuous, non-negative entries of $Q$ and applying methods from sparse coding, resulting in a method called Sparse Factor Analysis (SPARFA).
SPARFA applies an alternating optimization scheme. First, we initialize student abilities randomly, for example with Gaussian noise. Second, for each item $j$, we adapt the $j$th row $\vec q_j$ of $Q$ and the difficulty $b_j$ by solving the following optimization problem:

$$\min_{\vec q_j \geq 0,\, b_j} \ \ell + \lambda \cdot \|\vec q_j\|_1 + \frac{\mu}{2} \cdot \|\vec q_j\|^2 \quad (3)$$

where $\ell$ is the crossentropy loss (1), $\|\vec q_j\|_1$ is the $1$-norm of $\vec q_j$, $\|\vec q_j\|^2$ is the squared Euclidean norm of $\vec q_j$, and $\lambda$ as well as $\mu$ are hyperparameters of the method. The squared Euclidean norm is intended to regularize the model parameters with a Gaussian prior, as usual in IRT (chapter 7). The $1$-norm is motivated by sparse coding and encourages sparsity in $\vec q_j$, meaning that the optimization process tends to find solutions where many of the entries in $\vec q_j$ are zero. In other words, the model is encouraged to connect any item only to a few skills instead of all skills. This is reminiscent of traditional $Q$-matrices, where $q_{j,k}$ is only nonzero if skill $k$ is required to answer item $j$ correctly. Finally, SPARFA enforces that no entry $q_{j,k}$ can become negative, because a negative $q_{j,k}$ would imply that a higher ability in skill $k$ reduces the chance to answer item $j$ correctly, which does not make sense. Note that problem (3) is convex, such that it can be solved efficiently with nonlinear optimizers.
The third step of SPARFA is to optimize the ability parameters $\vec\theta_i$ for each student $i$. This is done by minimizing the crossentropy (1) plus a regularization term $\frac{\mu}{2} \cdot \|\vec\theta_i\|^2$. We now iterate steps two and three of the SPARFA algorithm until the parameters converge.
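The per-item step of SPARFA's alternating scheme can be sketched as a small bound-constrained optimization. This is an illustrative re-implementation, not the authors' code; the hyperparameter defaults and the inclusion of the difficulty in the Gaussian term are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparfa_item_step(Theta, x_j, lam=0.1, mu=0.1):
    """One SPARFA item update: fit the row q_j >= 0 and difficulty b_j for a
    single item j, holding the abilities Theta (n, K) fixed.

    x_j: (n,) binary responses to item j. Returns (q_j, b_j)."""
    n, K = Theta.shape

    def objective(params):
        q, b = params[:K], params[K]
        p = sigmoid(Theta @ q - b)
        eps = 1e-12
        ce = -np.sum(x_j * np.log(p + eps) + (1 - x_j) * np.log(1 - p + eps))
        # L1 term (q >= 0, so the plain sum equals the 1-norm) plus Gaussian prior
        return ce + lam * np.sum(q) + 0.5 * mu * (np.sum(q ** 2) + b ** 2)

    # non-negativity constraint on q_j; b_j remains unconstrained
    bounds = [(0, None)] * K + [(None, None)]
    res = minimize(objective, np.zeros(K + 1), bounds=bounds, method="L-BFGS-B")
    return res.x[:K], res.x[K]
```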
Just as in standard IRT, a challenge of SPARFA is that we cannot immediately apply a learned model to new students. For every new student $i$, we need to fit new parameters $\vec\theta_i$. Many extensions of IRT have circumvented this problem by removing ability parameters altogether and only using item parameters. For example, performance factor analysis replaces the ability parameter by a weighted count of correct and wrong responses on past items for the same skill. More recently, Converse et al. proposed a variational autoencoder model to simplify the application of IRT models to new students.
A variational autoencoder (VAE) views the student abilities $\vec\theta_i$ as a compressed representation of the student’s response vector $\vec x_i$. More precisely, a VAE tries to learn an encoder function which compresses $\vec x_i$ to abilities $\vec\theta_i$, and a decoder function which de-compresses $\vec\theta_i$ back into estimated responses $\hat x_i$, such that $\vec x_i$ and $\hat x_i$ are close and such that $\vec\theta_i$ is standard normally distributed. As decoder, we use a multi-dimensional IRT model (2), whereas the encoder could be a multi-layer artificial neural network. In contrast to traditional IRT models, a VAE model is typically non-convex and multi-layered, and thus needs to be optimized with deep learning methods [3, 6]. Wu et al. have further extended the VAE version of IRT by analyzing the theory more closely and including the difficulty parameters as an additional input to the encoder. Fig. 1 illustrates the approach for a single-layer encoder. The encoder is given as $\vec\theta_i = V \cdot \vec x_i + \vec a$ for some bias $\vec a$ (Fig. 1, left, in orange), whereas the decoder is a multi-dimensional IRT model like in (2) (Fig. 1, right, in blue). Note that we obtain all models in this section as special cases of this diagram. If we set the difficulty-to-encoder connections to zero, we obtain the IRT-VAE of Converse et al. If we, further, remove the $V$ connections and treat $\vec\theta_i$ as parameters, we obtain SPARFA. Finally, if we set $K = 1$ and $q_{j,1} = 1$ for all $j$, we obtain a classic IRT model.
Interestingly, the state-of-the-art VAE approaches do not apply a sparsity penalty to facilitate interpretability of $Q$. Further, deep learning can be quite slow. To address these limitations, we propose an autoencoder model based on the SPARFA loss, which we describe in the next section.
3. METHOD

Our proposed model is a single-layer autoencoder as illustrated in Fig. 1. More formally, our model can be concisely expressed in the following equations:

$$\vec\theta_i = V \cdot \vec x_i, \qquad \vec z_i = Q \cdot \vec\theta_i - \vec b, \qquad p_{i,j} = \sigma(z_{i,j}) \quad (4)$$

where the first equation expresses the encoder and the second and third equation the decoder.
Our interpretation of the parameters is as follows. $V$ maps from responses to student abilities, with $v_{k,j}$ modeling the amount of ability in skill $k$ that is expressed by answering item $j$ correctly. Conversely, $Q$ maps from abilities to responses, with $q_{j,k}$ modeling how much skill $k$ helps to answer item $j$ correctly. $b_j$ models the difficulty of item $j$, as before. Note that our model requires no student-specific parameters, such that it can be directly applied to new students.
Note that we do not include “backward” connections or encoder bias parameters in our model because they do not contribute to the model’s expressive power in the single-layer case. Consider a “full” model with $\vec\theta_i = V \cdot \vec x_i + W \cdot \vec b + \vec a$. If we plug this expression into our equation for $\vec z_i$, we obtain

$$\vec z_i = Q \cdot V \cdot \vec x_i + Q \cdot (W \cdot \vec b + \vec a) - \vec b.$$

We now absorb $W$ and $\vec a$ into $\vec b$ by re-defining $\vec b$ as $\vec b - Q \cdot (W \cdot \vec b + \vec a)$, yielding Equations (4). Accordingly, our model requires only $2 \cdot K + 1$ parameters per item.
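The full SparFAE forward pass is thus two matrix products and a sigmoid. A minimal sketch, using the $V$, $Q$, $\vec b$ notation from the text (function names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparfae_forward(X, V, Q, b):
    """Forward pass of the SparFAE autoencoder, cf. equations (4).

    X: (n, m) responses, V: (K, m) encoder, Q: (m, K) decoder, b: (m,) difficulties.
    Returns abilities Theta (n, K) and predicted success probabilities P (n, m)."""
    Theta = X @ V.T             # encoder: abilities from responses
    Z = Theta @ Q.T - b[None]   # decoder: multidimensional IRT logits
    return Theta, sigmoid(Z)
```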
We can train the parameters of our model by solving the following minimization problem, inspired by SPARFA:

$$\min_{V \geq 0,\, Q \geq 0,\, \vec b} \ \ell + \lambda \cdot \big(\|V\|_1 + \|Q\|_1\big) + \frac{\mu}{2} \cdot \big(\|V\|_F^2 + \|Q\|_F^2 + \|\vec b\|^2\big) \quad (5)$$

where $\|\cdot\|_1$ denotes the entry-wise $1$-norm, and where $\|\cdot\|_F^2$ denotes the squared Frobenius norm. Since the resulting model is an autoencoder-variant of Sparse Factor Analysis, we call it the Sparse Factor Autoencoder (SparFAE). We denote the objective function as $E$. As in SPARFA, the Frobenius norm applies a Gaussian prior on the parameters, whereas the $1$-norm encourages sparsity. We also apply the same non-negativity constraints as in SPARFA to ensure a meaningful interpretation of $V$ and $Q$. Additionally, the non-negativity constraints are likely to further enhance sparsity, as indicated by non-negative matrix factorization.
In contrast to SPARFA, we cannot decompose this problem into independent problems for each item because there are item-to-item dependencies: Manipulating the encoder weights of item $j$ also influences the abilities $\vec\theta_i$, which in turn influence the probability $p_{i,j'}$ for any item $j'$ with nonzero $\vec q_{j'}$. Accordingly, we need to perform a joint optimization of all parameters. However, we do not need to resort to deep learning. Instead, we propose a standard L-BFGS solver, as implemented in the minimize method of scipy. This is facilitated by the surprisingly simple expression for the gradients:

$$\nabla_V E = Q^T \cdot (P - X)^T \cdot X + \lambda \cdot \mathbf{1} + \mu \cdot V$$
$$\nabla_Q E = (P - X)^T \cdot \Theta + \lambda \cdot \mathbf{1} + \mu \cdot Q$$
$$\nabla_{\vec b}\, E = (X - P)^T \cdot \vec 1 + \mu \cdot \vec b \quad (6)$$

where $X$ is the $n \times m$ matrix of all responses, where $\Theta = X \cdot V^T$ is the matrix of all abilities, where $P$ is the matrix with entries $p_{i,j}$, where $\mathbf{1}$ is the matrix of only ones, and where $\vec 1$ is an $n$-dimensional vector of ones. Regarding computational complexity, notice that the matrix products in (6) require $\mathcal{O}(n \cdot m \cdot K)$ operations, such that each optimization step is in $\mathcal{O}(n \cdot m)$ for constant $K$. We can simplify our optimization further by inspecting the relationship between $V$ and $Q$.
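The joint optimization can be sketched with scipy's L-BFGS-B solver, which handles the non-negativity constraints via box bounds. This is an illustrative re-implementation, not the authors' code; for brevity it relies on scipy's numerical gradients instead of closed-form expressions, and the hyperparameter defaults are placeholders:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparfae(X, K, lam=0.01, mu=0.01):
    """Jointly fit V (K, m), Q (m, K), and b (m,) for a two-matrix SparFAE
    model on the (n, m) response matrix X via L-BFGS-B."""
    n, m = X.shape

    def unpack(w):
        V = w[:K * m].reshape(K, m)
        Q = w[K * m:2 * K * m].reshape(m, K)
        return V, Q, w[2 * K * m:]

    def objective(w):
        V, Q, b = unpack(w)
        P = sigmoid((X @ V.T) @ Q.T - b[None])
        eps = 1e-12
        ce = -np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps))
        # L1 terms (entries are non-negative, so plain sums equal the 1-norms)
        # plus squared-norm (Gaussian prior) terms
        reg = lam * (np.sum(V) + np.sum(Q)) \
            + 0.5 * mu * (np.sum(V ** 2) + np.sum(Q ** 2) + np.sum(b ** 2))
        return ce + reg

    # non-negativity bounds for V and Q; b is unconstrained
    bounds = [(0, None)] * (2 * K * m) + [(None, None)] * m
    rng = np.random.default_rng(0)
    w0 = np.concatenate([rng.uniform(0, 0.1, 2 * K * m), np.zeros(m)])
    res = minimize(objective, w0, method="L-BFGS-B", bounds=bounds)
    return unpack(res.x)
```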
3.1 Single Matrix Variant
Note that the matrices $V$ and $Q$ have related interpretations. Intuitively, if skill $k$ helps more with item $j$ (high $q_{j,k}$), we would also expect that answering item $j$ correctly is a stronger indicator for skill $k$ (high $v_{k,j}$). Accordingly, it stands to reason that $V = Q^T$.
We can also motivate this setting mathematically. In particular, $V = Q^T$ is optimal if $Q$ is orthogonal, meaning $Q^T \cdot Q$ equals the identity matrix. In that case, $Q \cdot Q^T \cdot \vec x_i$ is the orthogonal projection of $\vec x_i$ onto the hyperplane spanned by the columns of $Q$. In other words, $Q \cdot Q^T \cdot \vec x_i$ is the most similar point to $\vec x_i$ we can achieve with the decoder $Q$.
However, is it plausible that $Q$ is orthogonal? Indeed, $Q^T \cdot Q$ becomes a diagonal matrix (orthogonal up to scaling) if every item tests exactly one skill. Let $J_k$ be the set of items which test skill $k$. Then, we obtain $[Q^T \cdot Q]_{k,k} = \sum_{j \in J_k} q_{j,k}^2$ along the diagonal and zero off the diagonal. In other words, the sparser $Q$ becomes, the closer $V = Q^T$ is to optimal.
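A quick numerical check of this argument: when each row of $Q$ has a single nonzero entry (one skill per item), the Gram matrix $Q^T Q$ is diagonal. The example values below are arbitrary:

```python
import numpy as np

# Each row of Q has exactly one nonzero entry, i.e. each item tests one skill.
Q = np.array([[0.8, 0.0],   # items 1-2 test skill 1
              [0.5, 0.0],
              [0.0, 0.7],   # items 3-4 test skill 2
              [0.0, 0.9]])
gram = Q.T @ Q
off_diagonal = gram - np.diag(np.diag(gram))
# The off-diagonal entries vanish, because the columns of Q have disjoint support.
print(np.allclose(off_diagonal, 0.0))  # prints True
```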
When we plug $V = Q^T$ into problem (5), we obtain:

$$\min_{Q \geq 0,\, \vec b} \ \ell + 2 \cdot \lambda \cdot \|Q\|_1 + \mu \cdot \|Q\|_F^2 + \frac{\mu}{2} \cdot \|\vec b\|^2 \quad (7)$$

where $\Theta = X \cdot Q$. The gradient becomes:

$$\nabla_Q E = (P - X)^T \cdot \Theta + X^T \cdot (P - X) \cdot Q + 2 \cdot \lambda \cdot \mathbf{1} + 2 \cdot \mu \cdot Q.$$
This concludes our description of the proposed method.
4. EXPERIMENTS

In our experiments, we evaluate our proposed approach, the Sparse Factor Autoencoder (SparFAE), on both synthetic and real-world data. We compare Sparse Factor Analysis (SPARFA), variational item response theory with a novel lower bound (VIBO), the two-matrix version of SparFAE (SparFAE2), as well as the single-matrix version (SparFAE1). As optimizers, we used L-BFGS for SPARFA and both SparFAE versions, and an Adam optimizer for VIBO (these settings are as similar as possible to the original work of Wu et al.). The experimental source code with all details is available at
4.1 Synthetic Experiments
First, we consider synthetic data, which we sample from a multivariate IRT model with $K$ skills, standard normally distributed abilities $\theta_{i,k}$, and standard normally distributed difficulties $b_j$. We introduce two different sampling conditions for $Q$: A) We sample a unique skill $k$ for each item $j$ and set $q_{j,k} = 1$, whereas all other entries of $\vec q_j$ remain zero. B) We first sample a number of skills $K_j$ for each item from a fixed distribution. Then, we draw $K_j$ skills without replacement and with uniform probability for item $j$ and set each corresponding entry of $\vec q_j$ to a uniform random number from a fixed range.
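Sampling condition A can be sketched as follows (the random seed and the exact generator calls are our own choices; only the overall recipe follows the description above):

```python
import numpy as np

def sample_condition_a(n, m, K, seed=0):
    """Sample synthetic responses under condition A: each item tests exactly
    one (uniformly random) skill, with standard normal abilities and
    difficulties. Returns responses X, ground truth Q, Theta, and b."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((m, K))
    Q[np.arange(m), rng.integers(0, K, size=m)] = 1.0   # one skill per item
    Theta = rng.standard_normal((n, K))                  # abilities
    b = rng.standard_normal(m)                           # difficulties
    P = 1.0 / (1.0 + np.exp(-(Theta @ Q.T - b[None])))   # success probabilities
    X = (rng.random((n, m)) < P).astype(float)           # Bernoulli responses
    return X, Q, Theta, b
```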
As evaluation measures, we use the area under the receiver-operator-curve in predicting correct responses (AUC), the correlation between the learned difficulties and the actual difficulties, the correlation between the learned abilities and the actual abilities, the fraction of agreeing nonzero entries between the learned $Q$-matrix and the actual $Q$-matrix, the time needed for training, and the time needed for prediction on new students. Since the ordering of skills is undefined, we allow arbitrary permutations of the skills in the learned $Q$-matrix. In practice, we re-order the columns of the learned matrix via the linear_sum_assignment function of scipy, using the ground-truth matrix $Q$ as reference. We evaluate all measures on a separate sample of new students. We repeat all experiments several times for each of the hyperparameter settings in Table 1.
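The column re-ordering with linear_sum_assignment can be sketched as follows; using column-to-column overlap as the matching score is one plausible choice, not necessarily the exact criterion used here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_skills(Q_learned, Q_true):
    """Re-order the columns of a learned Q-matrix to best match the ground
    truth, since the ordering of skills is arbitrary. Uses the Hungarian
    algorithm on the column-to-column overlap matrix."""
    overlap = Q_true.T @ Q_learned                 # (K, K) column similarities
    rows, cols = linear_sum_assignment(-overlap)   # negate to maximize overlap
    return Q_learned[:, cols]
```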
First, we inspect the effect of the hyperparameters for a fixed number of students and items. Fig. 2 shows, from left to right, how AUC, the difficulty correlation, the ability correlation, and the $Q$-matrix agreement change for higher regularization in conditions A (top) and B (bottom). For AUC, we observe a slight degradation of all methods for higher regularization, with a notable decline for VIBO at the last setting. The difficulty correlation generally rises for higher regularization, with the exception of SparFAE1, which stays relatively stable. The ability correlation appears stable across regularization and improves only for SPARFA. The $Q$-matrix agreement improves for all methods with higher regularization in condition A (top), and remains roughly stable in condition B (bottom). For the remaining synthetic experiments, we report the results using hyperparameter setting 6 for SparFAE2 and SparFAE1, and hyperparameter setting 5 for SPARFA and VIBO. These settings maximize the difficulty correlation, the ability correlation, and the $Q$-matrix agreement while retaining a high AUC.
Fig. 3 displays the performance measures for varying numbers of students. We observe that AUC, the difficulty correlation, and the ability correlation tend to slightly increase for more students across methods and conditions, with only slight deviations for small numbers of students. The most striking impact is on the $Q$-matrix agreement, which increases for SparFAE1 and VIBO, but decreases for SPARFA and SparFAE2.
Fig. 4 displays the performance measures for varying numbers of items. Across methods, AUC decreases, whereas the difficulty correlation and the ability correlation increase and the $Q$-matrix agreement remains roughly stable for higher numbers of items. The decrease in AUC is likely explained by the fact that the models need to compress the information of more items into the same number of skills, which is bound to decrease performance. Conversely, it becomes easier to tease apart the difficulty of each single item for a higher number of items per skill (hence the improvement in the difficulty correlation). Further, the more items we have in a test, the more accurately we can estimate the underlying ability, which is reflected in a better ability correlation.
Finally, Fig. 5 summarizes the effect of hyperparameter setting, number of students, and number of items on training time in logarithmic plots. We observe that stronger regularization reduces the training time for both SparFAE variants, whereas it stays roughly constant for VIBO and SPARFA. This is likely because training time for SPARFA and VIBO is driven by the repeated optimization steps over students, whereas the training time for SparFAE is dominated by a single optimization process. Hence, SparFAE benefits more from the simpler loss surface offered by higher regularization. As one would expect, all methods scale roughly linearly with the number of students: SPARFA with roughly 18 ms per student, VIBO with roughly 6 ms per student, and both SparFAE variants with roughly 1.5 ms per student (refer to the gray dashed reference lines). For the number of items, the runtime for SPARFA and VIBO remains roughly constant, whereas it increases almost quadratically for both SparFAE variants. This is because the optimization for SPARFA and VIBO is dominated by iterations over students. By contrast, for SparFAE, every single gradient computation already depends linearly on the number of items, and more items may also increase the number of required gradient computations until convergence, thus yielding the super-linear behavior.
4.2 NeurIPS 2020 education data
Next, we consider the NeurIPS 2020 education challenge data by Wang et al. The data set consists of multiple choice questions to assess mathematics knowledge. Items are grouped into different quizzes. We restricted the data set to the items from one task of the challenge and to quizzes with a minimum number of students1. To estimate the number of skills $K$, the first author analyzed all items and assigned them to skills, the most common ones being fractions (190 items), basic algebra (140 items), and algebra with variables (127 items). For each quiz, we set $K$ to the first-author estimate, but upper-bounded $K$ to be at most half the number of items in the quiz.
Based on the pre-defined $Q$-matrix by the first author, we included two more baselines: a VIBO model, where we fixed the decoder matrix to the pre-defined $Q$-matrix, and a SparFAE1 model, where we fixed the $Q$-matrix and only trained the difficulty parameter for each item using logistic regression. We denote these methods as VIBO and SparFAE, respectively.
To perform hyperparameter optimization, we randomly set aside a subset of quizzes and evaluated the AUC of all methods for all hyperparameter settings in Table 1 in a crossvalidation over students, that is, in each fold we used one part of the students as training data and the rest as test data. The hyperparameter settings which maximized AUC were 2 for SPARFA and SparFAE2, 3 for VIBO, and 4 for SparFAE1.
Table 2: AUC, sparsity, training time [s], and prediction time [ms] for each method.
Next, we performed a crossvalidation over students for the remaining quizzes. Note that we cannot evaluate the difficulty correlation, the ability correlation, or the $Q$-matrix agreement, because we have no access to a ground truth for $\vec b$, $\vec\theta_i$, and $Q$. However, we can evaluate the sparsity of the learned $Q$-matrix, that is, the fraction of zero entries. Sparsity is a rough proxy for the plausibility of a learned $Q$-matrix because high sparsity indicates that the matrix assigns items to distinct skills.
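The sparsity measure is simply the fraction of (near-)zero entries; the numerical tolerance below is our own choice:

```python
import numpy as np

def sparsity(Q, tol=1e-8):
    """Fraction of entries of Q that are (numerically) zero."""
    return float(np.mean(np.abs(Q) < tol))
```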
Table 2 reports the average performance measures across quizzes. Regarding AUC, Wilcoxon signed-rank tests revealed that SPARFA had the highest AUC, followed by SparFAE2, VIBO, SparFAE1, SparFAE, and finally VIBO (significant for all tests after Bonferroni correction). That being said, the AUC of all methods except SPARFA is very close, with only small differences between means. In terms of sparsity, SparFAE1 clearly outperforms SPARFA, VIBO, and SparFAE2. Note that VIBO does not achieve any sparsity, as it does not encourage sparsity during training. The sparsity of the pre-defined $Q$-matrix was very high, as it assigned each item to only one skill.
With respect to training time, SparFAE1 is considerably faster than SPARFA (ca. 15x), VIBO (ca. 4x), and SparFAE2 (ca. 8x). In terms of prediction time, SparFAE1, SparFAE2, and VIBO perform similarly, as their prediction scheme is almost the same (although SparFAE1 is still significantly faster). Only SPARFA is much slower (ca. 300x) because it needs to fit new ability parameters to new students for each prediction.
Finally, we analyzed the relation of AUC to the numbers of students, items, and skills, as well as the amount of missing data in quizzes. Fig. 6 displays scatter plots, where each dot represents one quiz and lines show linear fits. Interestingly, the behavior is very similar for all methods: AUC correlates with the number of students, with the number of items, and (most strongly) with the number of skills, whereas the correlation with the amount of missing data is around zero and insignificant. This is in line with our results on synthetic data. The strong correlation with the number of skills is explained by the fact that the methods have more parameters to fit the data when we increase $K$.
4.3 Math assessment data
In a final experiment, we evaluated the ability of SparFAE1 to identify a fitting $Q$-matrix in comparison to an expert-designed $Q$-matrix on real data. To that end, we used data from students (ages 16-19) on a math assessment test for vocational education in chemistry2. The test covered five topics, namely basic algebra, fractions, equation solving for a single variable, text tasks with two variables, and (linear) functions. Fig. 7 (top) shows the assignment of items (x-axis) to these five topics (y-axis) as provided by the test designers.
We applied a slightly adapted variant of SparFAE1, where we added a regularization term that punishes deviations of the column sums of $Q$ from a target value, thereby encouraging orthogonality in $Q$. We performed several repeats of SparFAE1 and then selected the $Q$-matrix which maximized accuracy in a leave-one-out crossvalidation over students.
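A sketch of such a column-sum regularizer; the target value and the strength below are placeholders, since the exact values are not reproduced here:

```python
import numpy as np

def column_sum_penalty(Q, target=1.0, strength=1.0):
    """Squared deviation of each column sum of Q from a target value,
    encouraging columns of comparable (and bounded) total weight."""
    return strength * np.sum((Q.sum(axis=0) - target) ** 2)
```

In training, this term would simply be added to the SparFAE1 objective before handing it to the optimizer.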
The learned $Q$-matrix is shown in Fig. 7 (bottom). We observe that the matrix assigns every item to only one skill, in line with the expert prediction. We further observe that, in line with the experts, the learned $Q$-matrix tends to group items for the basic topics (basic algebra and fractions) together and tends to avoid grouping items for basic topics with items for advanced topics. However, there are also notable differences to the expert $Q$-matrix. In particular, SparFAE1 merges basic algebra and fractions into one skill (except for item 8, which is in skill 4), and additionally includes items 13 and 14. Overall, skill 1 accumulates relatively easy tasks without text and function components. All other skills contain items which require text comprehension and/or an understanding of functions, but the correspondence to expert-defined skills is less obvious.
To gain deeper insight into the learned $Q$-matrix, we inspected the item-to-item correlations, which are shown in Fig. 8. We observe that items 1-7, 9, and 13-14 exhibit relatively high pairwise correlation, explaining why SparFAE1 grouped them together in skill 1.
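One plausible way to obtain such an item-to-item correlation matrix is Pearson correlation over the observed responses (whether this is the exact computation used for Fig. 8 is an assumption):

```python
import numpy as np

def item_correlations(X):
    """Pairwise Pearson correlations between items.

    X: (n, m) response matrix with students as rows.
    Returns an (m, m) correlation matrix, items as variables."""
    return np.corrcoef(X, rowvar=False)
```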
Skill 2 groups items 16 and 18, which are both text problems on solving for variables, but it also includes item 21, which is a question on functions. Inspecting the correlation matrix, we observe that item 21 generally exhibits low correlation, except for items 16 and 18, which explains the grouping.
Skill 3 groups items without obvious mathematical connection. Item 10 is a fraction problem, item 12 is a variable algebra problem, and item 20 is a function problem. Further, these items exhibit only moderate pairwise correlation. However, the only items with higher correlations are already contained in skill 1 and are thus unavailable for skill 3, which indirectly explains the grouping.
Skill 4 contains a variable algebra item (8), an equation solving problem (17), and a function problem (19). General variable algebra capacity (8) plausibly enhances equation solving (17), but the function question (19) seems less connected. The correlation matrix reveals that item 19 has generally low correlations, except for items 3, 7, 14, and 17, explaining its grouping with item 17.
Skill 5 contains two equation problems, one symbolic (11) and one text-based (15). Further, item 15 has very low correlations with any other item, except for items 9, 11, and 20, which explains the grouping with item 11.
Overall, we observe that the learned -matrix tended to group more basic items together and more advanced items together, in line with expert opinion. Sometimes, the learned matrix groups items which do not have an obvious connection, content-wise. In such cases, we could explain the grouping by inspecting the item-to-item correlation matrix.
5. DISCUSSION AND CONCLUSION
We proposed a novel method for factor analysis which extends Sparse Factor Analysis (SPARFA) to an autoencoder approach. Hence, we call our proposed method the Sparse Factor Autoencoder (SparFAE). More specifically, our approach encodes student responses to abilities via a linear map $V$ and decodes them again to predicted responses via a multi-dimensional item response theory model with a linear skill-to-item map $Q$. Like SPARFA, our approach encourages sparsity in the $Q$-matrix via non-negativity constraints and L1 regularization. In contrast to SPARFA, we do not need to fit new ability parameters for new students. Instead, we can simply apply $V$, which automatically yields the desired ability parameters. We investigated two versions of SparFAE: one with separate matrices $V$ and $Q$ for encoding and decoding (SparFAE2), and one where we set $V = Q^T$, that is, we use the $Q$-matrix for both encoding and decoding (SparFAE1).
In experiments on synthetic as well as real data, we showed that SparFAE1 is considerably faster than SPARFA, variational autoencoding, and SparFAE2. SparFAE1 also achieves higher sparsity and higher correlation with ground-truth $Q$-matrices and student abilities. This comes at the price of a slightly lower AUC and less accuracy in recovering ground-truth difficulties. We also observed that AUC differences between autoencoder variants were quite small, whereas SPARFA achieved noticeably higher AUC, indicating that student-specific ability parameters allow for a better fit of the data than autoencoding. We also compared the $Q$-matrix learned via SparFAE1 with an expert matrix on a math assessment test, revealing some overlap but also meaningful differences which could be explained by item-to-item correlations.
Overall, our results indicate that SparFAE1 is a promising method for fast factor analysis, especially when each item in a test only refers to a single skill. As such, we believe that it can be an interesting tool for test designers who wish to analyze the factor structure of their test on a sample of students. While the learned $Q$-matrix should still be interpreted with care, it can uncover latent item relationships (as we saw on the math assessment data). Our results also motivate the use of $Q$-matrices for both decoding and encoding, which can serve as a starting point for future research.
Limitations of SparFAE1 lie in the slightly lower AUC compared to other autoencoders, the lower accuracy in recovering ground-truth difficulty parameters, and the superlinear scaling with respect to the number of items. Future work could address each of these shortcomings. Further, our experimental evaluation is limited to multiple choice math assessment questions. Future work should include further data sets from other educational domains to ensure that SparFAE1 generalizes. Finally, just like any autoencoder, SparFAE1 assumes that abilities do not change during a test. Future work may consider more dynamic settings, e.g. by incorporating concepts from performance factor analysis or knowledge tracing models.
Funding by the German Ministry for Education and Research (BMBF) under grant number 21INVI1403 (project KIPerWeb) is gratefully acknowledged.
- F. Baker and S.-H. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, Boca Raton, FL, USA, 2 edition, 2004.
- T. Barnes. The Q-matrix method: Mining student response data for knowledge. In Proceedings of the AAAI 2005 Educational Data Mining Workshop, pages 1–8, 2005.
- G. Converse, M. Curi, and S. Oliveira. Autoencoders for educational assessment. In S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin, editors, Proceedings of the International Conference on Artificial Intelligence in Education (AIED), pages 41–45, 2019.
- S. Embretson and S. Reise. Item response theory for psychologists. Psychology Press, New York, NY, USA, 2000.
- R. Hambleton and H. Swaminathan. Item response theory: Principles and applications. Springer Science+Business Media, New York, NY, USA, 1985.
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv, 1312.6114, 2014.
- A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk. Sparse factor analysis for learning and content analytics. Journal of Machine Learning Research, 15(57):1959–2008, 2014.
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- R. Liu, A. C. Huggins-Manley, and L. Bradshaw. The impact of Q-matrix designs on diagnostic classification accuracy in the presence of attribute hierarchies. Educational and Psychological Measurement, 77(2):220–240, 2017.
- R. P. McDonald. A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2):99–114, 2000.
- P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis – a new alternative to knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education (AIED), pages 531–538, 2009.
- Y. Sun, S. Ye, S. Inoue, and Y. Sun. Alternating recursive method for Q-matrix learning. In Proceedings of the 7th International Conference on Educational Data Mining (EDM), pages 14–20, 2017.
- K. K. Tatsuoka. Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4):345–354, 1983.
- P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261–272, 2020.
- Z. Wang, A. Lamb, E. Saveliev, P. Cameron, Y. Zaykov, J. M. Hernández-Lobato, R. E. Turner, R. G. Baraniuk, C. Barton, S. P. Jones, S. Woodhead, and C. Zhang. Diagnostic questions: The NeurIPS 2020 education challenge. arXiv, 2007.12061, 2020.
- M. Wu, R. L. Davis, B. W. Domingue, C. Piech, and N. Goodman. Variational item response theory: Fast, accurate, and expressive. In A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, editors, Proceedings of the 13th International Conference on Educational Data Mining (EDM), pages 257–268, 2020.
- H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
1We also excluded one outlier quiz with a very large number of items and a lot of missing data.
© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.