ABSTRACT
Academic grades in assessments are predicted to determine if a student is at risk of failing a course. Sequential models and graph neural networks that have been employed for grade prediction do not consider relationships between course descriptions. We propose the use of text mining to extract semantic, syntactic, and frequency-based features from course content. In addition, we classify intended learning outcomes according to their higher- or lower-order thinking skills. A learnable parameter is then formulated to model the impact of these cognitive levels (that are expected for each course) on student performance. These features are then embedded and represented as graphs. Past academic achievements are then fused with the above features for grade prediction. We validate the performance of the above approach via datasets corresponding to three engineering departments collected from a university. Results obtained highlight that the proposed technique generates meaningful feature representations and outperforms existing methods for grade prediction.
1. INTRODUCTION
Detecting students at risk of failing university courses based on predicted grades is essential for administering early intervention strategies. From a regression problem perspective, grades obtained from courses taken in previous semesters are used to predict grades for pilot courses registered for the upcoming semester.
1.1 Related Models for Grade Prediction
Existing techniques for grade prediction using past academic records include conventional regression models such as random forest, support vector machine, and K-nearest neighbor [10, 15, 1] as well as the factorization machine in a collaborative filtering setting [33]. In addition to the use of past examination results, information derived from online clickstream data on learning management systems has been used to augment the prediction capability of a model [26, 25]. More recently, sequential models such as the long short-term memory (LSTM) network have been developed to capture the temporal dynamics of past academic performance [12]. While such deep learning models have achieved reasonable success in grade prediction, existing temporal-based approaches do not take the relationships among courses and among students into account. Consideration of these relationships is essential since information pertaining to courses with similar content and students with similar cognitive levels would aid in grade prediction. In addition, the performance trend of an academically-inclined student or a well-performing course in the current semester may continue for the upcoming semesters [23].
Notwithstanding the above, graph neural networks have recently been employed to generate meaningful feature representations which model the transitions of grade distributions between courses across semesters [11]. Similar to social multi-relational networks [14] with nodes representing either students or courses, three graphs, namely student-course, student-student, and course-course graphs, consisting of edge links computed via grade distribution similarities or correlations have been constructed [23, 21]. Modeling the student-course relations has also been achieved via knowledge graphs to extract course and student embeddings as well as to encode temporal student behavioral data [17]. Pre- or co-requisites between courses have also been considered for grade prediction [27].
Despite adopting multi-dimensional approaches toward analyzing prior course grades to predict student performance [35], existing models assume that the relationship among courses depends solely on the grade distribution; these models do not consider the topics covered and the intended learning outcomes defined by the course instructors. These aspects are important since the process of knowledge acquisition often involves assimilating and discerning information from myriad sources [29], i.e., academic performance has been shown to depend on prior experience and how the student has understood certain concepts. Moreover, course content that overlaps or is highly interdependent may influence how well the student can achieve the intended learning outcomes in the upcoming semesters [38]. While the course syllabus has recently been used to extract frequency-based features for the determination of course similarities [16], this approach does not analyze the intended learning outcomes nor capture the relationship between courses holistically. It is also not surprising that students who are less academically inclined often struggle in courses that require higher-order thinking skills. Information pertaining to the thinking skills required for prior courses will, therefore, allow the grade-prediction model to better represent grades achieved in previous semesters.
1.2 Grade Prediction From Curriculum Development Perspective
From a curriculum development perspective, course descriptions comprise the topics to be covered and the intended learning outcomes for each course designed by the course instructor [34]. The importance of identifying suitable topics is motivated by an earlier study in which first-year university students who had been exposed to fundamental concepts in high school were shown to perform better than those who had not studied similar content before [13]. In today’s context, this highlights the intrinsic (and often intimate) relationships including prerequisites, recommended literature, and course content that define dependencies between courses. Coupled with the fact that course instructors often adopt the constructivist approach in curriculum design [6], analysis of course content is important for grade prediction.
Apart from course content, outcome-based teaching and learning requires course instructors to identify suitable intended learning outcomes and assessments that measure those learning outcomes [4, 30]. In this regard, learning activities with various cognitive complexity levels should be designed and aligned with the learning outcomes constructively throughout the course [32, 2, 5, 9]. Alignment of learning activities can be achieved via the revised Bloom’s Taxonomy, with the recollection of information being associated with the lowest-order thinking skill and the generation of creative outcomes with the highest-order thinking skill [20]. Given that less academically-inclined students often face challenges with higher-order thinking skills [39], it is important to consider the influence of learning outcomes on student performance for grade prediction.
1.3 Contribution of This Work
In this work, we propose a course description-based grade prediction (CODE-GP) model that employs text mining techniques for extracting features associated with (i) course content similarities and (ii) higher- or lower-order thinking skills required for each course. With regard to the first dimension highlighted in Table 1, we propose three types of course similarities extracted from topic outlines and learning outcomes found within course descriptions. These similarities include semantic [22], syntactic [3], and frequency-based features [36]. The use of these features is in contrast to the use of grade distributions as edge weights for generating similarities [11]. The basis for our proposed architecture is motivated by the need to consider both course outlines and learning outcomes, since both the intended learning outcomes and syllabus are important for the development and implementation of teaching programs [28]. In addition, we also consider the past performance of each student from the perspective of thinking skills required for each course. In particular, the proposed model employs a document classification approach that tags each course with higher- or lower-order thinking skills according to the revised Bloom’s Taxonomy. A learnable parameter is then used to aggregate the respective grades achieved for both lower- and higher-order thinking skill courses. This allows the proposed model to establish the relationship between the complexity of courses and academic performance.
As shown in Figure 1, we adopt graph neural networks to generate representations of the above text mining features. These features are represented as course- and student-similarity graphs with nodes corresponding to courses and students, respectively. The edge weights for the former are computed based on the proposed three text features. For the latter, past academic grades are aggregated, and the similarity related to the Jensen-Shannon Divergence (JSD) is then computed among the grade distributions [23]. These graphs are subsequently embedded and trained using a graph convolutional network (GCN) layer.
In addition and similar to [12], we incorporate temporal information extracted from past examination records for each student across semesters. Grade embeddings, the corresponding student vector, and prior course vectors acquired from the GCN for each semester are then concatenated as a representation vector. This temporal representation serves as the input to the LSTM, which exploits the sequential relationships and predicts the grade for a course to be taken in the coming semester.
2. THE PROPOSED CODE-GP MODEL
The task of grade prediction involves predicting the grade for student ${s}_{i}$ who has registered for a pilot course. Given ${N}_{C}$ prior courses and ${N}_{S}$ students, the set of prior courses is defined as $\mathbb{C}=\{{c}_{1},{c}_{2},\dots ,{c}_{{N}_{C}}\}$ and the set of students as $\mathbb{S}=\{{s}_{1},{s}_{2},\dots ,{s}_{{N}_{S}}\}$. We define ${\hat{g}}_{{s}_{i}}$ as the predicted grade of a given pilot course for the student ${s}_{i}$.
2.1 Construction of Course Similarities Graph Based on Course Descriptions
The CODE-GP model incorporates semantic, syntactic, and frequency-based features extracted from course descriptions that comprise topic outlines and intended learning outcomes. These features are subsequently used for constructing the course-similarity graph. We first preprocess the text by removing symbols, diagrams, equations, numbers, punctuation marks, and stopwords (e.g., "and", "or"). All remaining characters are set to lower case [32].
Semantic similarity based on word embeddings has been employed to assess student capability for the recommendation of similar courses [24]. In the context of CODE-GP, we first define the topic outline ${q}_{i}$ corresponding to course ${c}_{i}$. A topic outline vector ${v}_{{q}_{i}}$ is then generated from ${q}_{i}$ based on the bidirectional encoder representations from transformers (BERT) embeddings [7, 8]. The cosine similarity between two course outlines ${q}_{i}$ and ${q}_{j}$ is then computed via
$$\mathrm{cos}(\theta({c}_{i},{c}_{j}))=\frac{{v}_{{q}_{i}}\cdot {v}_{{q}_{j}}}{\|{v}_{{q}_{i}}\|\,\|{v}_{{q}_{j}}\|}.$$(1)With $0\le \mathrm{cos}(\theta({c}_{i},{c}_{j}))\le 1$, a value close to 1 implies a semantically similar pair of courses ${c}_{i}$ and ${c}_{j}$.
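As a minimal sketch of (1), with toy low-dimensional vectors standing in for the 768-dimensional BERT outline embeddings used in Section 3.1 (the vectors are illustrative, not from the datasets):

```python
import numpy as np

def cosine_similarity(v_qi, v_qj):
    """Cosine similarity between two topic-outline embedding vectors, as in (1)."""
    return float(np.dot(v_qi, v_qj) / (np.linalg.norm(v_qi) * np.linalg.norm(v_qj)))

# Toy 4-dimensional stand-ins for BERT outline embeddings of courses c_i and c_j.
v_qi = np.array([0.2, 0.8, 0.1, 0.4])
v_qj = np.array([0.25, 0.7, 0.15, 0.5])
sim = cosine_similarity(v_qi, v_qj)
```

In practice each outline vector would be produced by a pretrained BERT encoder before this similarity step.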
Syntactic features for CODE-GP comprise phrase types (i.e., regular expressions (regexes)) that are extracted from statements associated with the intended learning outcomes. In this context, we first extract noun- and verb-phrases from the intended learning outcome document ${l}_{i}$ corresponding to course ${c}_{i}$. These (multiple) phrases are then associated with their parts-of-speech tags, resulting in the set of regexes ${R}_{{l}_{i}}$ [31]. Overlaps between the regex sets are then computed via the Jaccard similarity given by
$$\Phi ({c}_{i},{c}_{j})=\frac{|{R}_{{l}_{i}}\cap {R}_{{l}_{j}}|}{|{R}_{{l}_{i}}\cup {R}_{{l}_{j}}|},$$(2)where $0\le \Phi ({c}_{i},{c}_{j})\le 1$. The number of common regexes is denoted by $|{R}_{{l}_{i}}\cap {R}_{{l}_{j}}|$ while $|{R}_{{l}_{i}}\cup {R}_{{l}_{j}}|$ refers to the total number of distinct regexes. A high Jaccard similarity, therefore, implies a high proportion of similar phrase types occurring between a course pair regardless of whether the topics covered are identical.
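A minimal sketch of (2) over hypothetical parts-of-speech patterns (the patterns shown are illustrative, not drawn from the datasets):

```python
def jaccard_similarity(regexes_i, regexes_j):
    """Jaccard overlap between the phrase-pattern (regex) sets of two
    intended-learning-outcome documents, as in (2)."""
    if not regexes_i and not regexes_j:
        return 0.0
    return len(regexes_i & regexes_j) / len(regexes_i | regexes_j)

# Hypothetical POS-tag patterns extracted from two courses' learning outcomes.
R_li = {"VB NN", "VB DT NN", "VB JJ NN"}
R_lj = {"VB NN", "VB DT NN", "NN IN NN"}
sim = jaccard_similarity(R_li, R_lj)  # 2 shared patterns out of 4 distinct
```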
The term frequency-inverse document frequency (TF-IDF) determines the uniqueness of a word within a set of documents [37]. To account for word appearance similarity, we include TF-IDF weighting on both the topic outline ${q}_{i}$ and the intended learning outcomes ${l}_{i}$ for course ${c}_{i}$. These features are extracted from each concatenated course document ${d}_{i}={q}_{i}^{\frown}{l}_{i}$, where $\frown$ denotes the concatenation of two texts. We then compute the cosine similarity $\Omega({c}_{i},{c}_{j})\propto {v}_{{d}_{i}}\cdot {v}_{{d}_{j}}$ similar to (1) but between bag-of-words (BoW) vectors ${v}_{{d}_{i}}$ and ${v}_{{d}_{j}}$ corresponding to each course document. Here, ${v}_{{d}_{i}}=[\alpha ({w}_{1},{d}_{i}),\alpha ({w}_{2},{d}_{i}),\dots ,\alpha ({w}_{{N}_{W}},{d}_{i})]$ with ${w}_{k}$ denoting the $k$th word in document ${d}_{i}$. The BoW vector length is based on the word vocabulary size ${N}_{W}$ across the entire corpus. The value of each element corresponding to the TF-IDF weight for word ${w}_{k}$ is given by [37]
$$\alpha ({w}_{k},{d}_{i})=\frac{{N}_{{w}_{k},{d}_{i}}}{L({d}_{i})}\times \log \left(\frac{{N}_{D}}{{N}_{{w}_{k}}+1}\right),$$(3)where ${N}_{{w}_{k},{d}_{i}}$ is the number of times ${w}_{k}$ occurs in ${d}_{i}$, $L({d}_{i})$ denotes the length of that document, ${N}_{D}$ the total number of documents, and ${N}_{{w}_{k}}$ the number of documents in which ${w}_{k}$ occurs. The obtained TF-IDF values are subsequently normalized to prevent bias in the term frequency variable due to the document length $L({d}_{i})$.
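As an illustrative sketch (not the authors' code), the TF-IDF weighting of (3) can be computed over a toy corpus of concatenated course documents; the three documents below are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF vectors per (3): term frequency normalized by
    document length, multiplied by log(N_D / (document frequency + 1))."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts, length = Counter(toks), len(toks)
        vectors.append([(counts[w] / length) * math.log(n_docs / (doc_freq[w] + 1))
                        for w in vocab])
    return vocab, vectors

# Hypothetical concatenated course documents d_i (outline + learning outcomes).
docs = ["signal processing fourier transform",
        "digital signal processing systems",
        "programming data structures"]
vocab, vecs = tfidf_vectors(docs)
```

Note that with the $+1$ smoothing in the denominator of (3), words appearing in most documents receive near-zero (or negative) weight, which suppresses common terms when the cosine similarity $\Omega$ is computed.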
With nodes of the course-similarity graph denoted by each course ${c}_{i}\in \mathbb{C}$, the edge weights are determined via
$${a}_{ij}^{\mathcal{C}}={\beta}_{\mathit{semantic}}\,\mathrm{cos}(\theta({c}_{i},{c}_{j}))+{\beta}_{\mathit{syntactic}}\,\Phi({c}_{i},{c}_{j})+{\beta}_{\mathit{frequency}}\,\Omega({c}_{i},{c}_{j}),$$(4)where ${\beta}_{\mathit{semantic}}$, ${\beta}_{\mathit{syntactic}}$, and ${\beta}_{\mathit{frequency}}$ are the trainable weights. Each variable ${a}_{ij}^{\mathcal{C}}$ is an entry of the adjacency matrix
$${\mathbf{A}}_{\mathcal{C}}=\begin{bmatrix}{a}_{11}^{\mathcal{C}} & \cdots & {a}_{1{N}_{C}}^{\mathcal{C}}\\ \vdots & \ddots & \vdots \\ {a}_{{N}_{C}1}^{\mathcal{C}} & \cdots & {a}_{{N}_{C}{N}_{C}}^{\mathcal{C}}\end{bmatrix}$$(5)corresponding to the course-similarity graph.
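A sketch of how the three similarity matrices combine into the adjacency matrix of (5); the matrices and the fixed weight values are illustrative stand-ins (in the model, the betas are trained):

```python
import numpy as np

def course_adjacency(S_semantic, S_syntactic, S_frequency, betas):
    """Weighted combination of the three course-similarity matrices into
    the adjacency matrix A_C of (5). betas stand in for the trainable
    weights beta_semantic, beta_syntactic, and beta_frequency."""
    b_sem, b_syn, b_freq = betas
    return b_sem * S_semantic + b_syn * S_syntactic + b_freq * S_frequency

# Toy 3-course similarity matrices (symmetric, self-similarity of 1).
S_sem = np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]])
S_syn = np.array([[1.0, 0.5, 0.4], [0.5, 1.0, 0.1], [0.4, 0.1, 1.0]])
S_frq = np.array([[1.0, 0.3, 0.6], [0.3, 1.0, 0.2], [0.6, 0.2, 1.0]])
A_C = course_adjacency(S_sem, S_syn, S_frq, (0.5, 0.3, 0.2))
```

Choosing betas that sum to one (as here) keeps each edge weight in [0, 1]; whether the trained weights are so constrained is an assumption of this sketch.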
2.2 Temporal Grade Information
Before the pilot course is attempted in the current semester, we assume, for each student ${s}_{i}$, the availability of prior course grades in $\mathbb{C}$ across semesters $t\in \{1,\dots ,{N}_{T}\}$, where ${N}_{T}$ is the total number of semesters and ${g}_{{s}_{i},{c}_{i}}^{t}$ denotes the grade that the student achieves for ${c}_{i}$ in semester $t$. Hence, the grade vector for student ${s}_{i}$ in semester $t$ is given by
$${\mathbf{g}}_{{s}_{i}}^{t}=[{g}_{{s}_{i},{c}_{1}}^{t},\cdots ,{g}_{{s}_{i},{c}_{{N}_{C}}}^{t}],$$(6)where ${N}_{C}$ is the total number of prior courses across all ${N}_{T}$ semesters. It is important to note that, for a given semester, only a subset of these ${N}_{C}$ prior courses is attempted, i.e., ${\mathbf{g}}_{{s}_{i}}^{t}$ is not fully populated and null elements are assigned for courses not attempted during that semester. Across all previous ${N}_{T}$ semesters, we acquire the temporal grade information for each student, as shown in Figure 1. This temporal grade information is used in two ways: (i) it is aggregated according to the thinking skills required for each course to generate student similarities as described in Section 2.3, and (ii) it is concatenated with the course and student embeddings as input for the LSTM.
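A minimal sketch of this sparse temporal grade structure of (6), with hypothetical normalized grades and NaN marking courses not attempted in a semester:

```python
import numpy as np

N_T, N_C = 3, 5  # semesters of history and prior courses, per (6)
# One row per semester; unattempted courses remain NaN (the null elements).
G = np.full((N_T, N_C), np.nan)
G[0, [0, 2]] = [0.75, 0.62]  # hypothetical grades for c_1, c_3 in semester 1
G[1, [1, 3]] = [0.58, 0.80]
G[2, [4]] = [0.66]
attempted = ~np.isnan(G)     # mask of courses attempted per semester
```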
2.3 Construction of Student Similarities Graph Based on Cognitive Levels
Construction of the student-similarity graph is based on cognitive levels associated with each course according to Table 1. Each of the prior courses is first categorized as one that requires higher-order thinking skills $\mathcal{H}$ or lower-order thinking skills $\mathcal{L}$. This is achieved by first classifying each intended learning outcome statement of the course via the document classification approach described in [31], with classes being defined according to Bloom’s Taxonomy. Each course is then tagged as $\mathcal{H}$ (or $\mathcal{L}$) if more statements are classified under labels associated with higher-order (or lower-order) thinking skills.
For each student, we compute the frequency distributions ${p}_{{s}_{i}}^{\mathcal{L}}$ and ${p}_{{s}_{i}}^{\mathcal{H}}$ corresponding to courses that require lower- and higher-order thinking skills. This is achieved by first dividing the grade range (1-100) into five bins of twenty-point intervals before determining the number of courses (in each of the $\mathcal{H}$ and $\mathcal{L}$ categories) that fall under each bin. Contributions of these two distributions are then learned via
$${\mathbf{p}}_{{s}_{i}}={\beta}_{g}\times {p}_{{s}_{i}}^{\mathcal{L}}+(1-{\beta}_{g})\times {p}_{{s}_{i}}^{\mathcal{H}},$$(7)where ${\beta}_{g}$ is a learnable weight for ${\mathbf{p}}_{{s}_{i}}$. With the above, student similarities are obtained via the JSD between the grade distributions for each pair of students, i.e.,
$${a}_{ij}^{\mathcal{S}}=1-\mathrm{JSD}({p}_{{s}_{i}}\,\|\,{p}_{{s}_{j}}).$$(8)Therefore, a higher ${a}_{ij}^{\mathcal{S}}$ implies that the two students possess similar higher- or lower-order skills (measured by how they perform in the prior courses). With the student-similarity graph shown in Figure 1 comprising students as nodes, the corresponding adjacency matrix ${A}_{\mathcal{S}}$ is generated based on ${a}_{ij}^{\mathcal{S}}$ in a manner similar to (5).
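A sketch of the cognitive-level distributions and the JSD-based student similarity of (7) and (8), using base-2 logarithms so that the JSD (and hence the similarity) lies in [0, 1]; the grades and the fixed beta_g value are illustrative only (beta_g is learnable in the model):

```python
import math

def grade_distribution(grades):
    """Five-bin frequency distribution over the 1-100 grade range (Section 2.3)."""
    bins = [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
    counts = [sum(1 for g in grades if lo <= g <= hi) for lo, hi in bins]
    total = sum(counts) or 1
    return [c / total for c in counts]

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def student_similarity(p_i, p_j):
    """Edge weight a_ij^S = 1 - JSD(p_i || p_j), as in (8)."""
    return 1.0 - jsd(p_i, p_j)

# Aggregate lower- and higher-order distributions as in (7).
beta_g = 0.6
p_L = grade_distribution([85, 90, 72])  # grades in lower-order-skill courses
p_H = grade_distribution([55, 61])      # grades in higher-order-skill courses
p_s1 = [beta_g * l + (1 - beta_g) * h for l, h in zip(p_L, p_H)]
```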
Table 4: Mean Squared Error (MSE)

Methods           | Department 1           | Department 2           | Department 3           | Average
                  | c_1     c_2     Ave.   | c_3     c_4     Ave.   | c_5     c_6     Ave.   |
LR                | 0.0360  0.0199  0.0280 | 0.0262  0.0247  0.0255 | 0.0264  0.0576  0.0420 | 0.0318
LSTM [12]         | 0.0309  0.0210  0.0260 | 0.0191  0.0259  0.0252 | 0.0164  0.0377  0.0270 | 0.0252
GCN [19]          | 0.0356  0.0214  0.0285 | 0.0259  0.0251  0.0245 | 0.0224  0.0276  0.0250 | 0.0263
Proposed CODE-GP  | 0.0296  0.0203  0.0250 | 0.0159  0.0184  0.0172 | 0.0188  0.0299  0.0244 | 0.0222
2.4 GCN and Embeddings
After constructing the course- and student-similarity graphs, we employ a two-layer GCN to embed each graph. Both course and student nodes are encoded as one-hot vectors to obtain encoded matrices ${X}_{\mathcal{C}}$ and ${X}_{\mathcal{S}}$. The embedding matrix ${E}_{\mathcal{C}}$ for the course-similarity graph is generated via
$${\mathbf{E}}_{\mathcal{C}}={W}_{\mathcal{C}}{X}_{\mathcal{C}}$$(9)such that the one-hot vectors are represented as dense vectors of lower dimension. Here, ${W}_{\mathcal{C}}$ is the weight matrix. With ${E}_{\mathcal{S}}$ being generated similarly, and with ${A}_{\mathcal{C}}$ and ${A}_{\mathcal{S}}$ derived from Sections 2.1 and 2.3, two GCN layers [18] are then applied to obtain latent representations of all nodes in the course-similarity graph $\mathcal{C}$ and the student-similarity graph $\mathcal{S}$. In particular, the $(\mathcal{G}+1)$th layer for $\mathcal{C}$ is computed via
$${\mathbf{Z}}_{\mathcal{C}}^{(\mathcal{G}+1)}=\sigma \left({D}_{\mathcal{C}}^{-\frac{1}{2}}{A}_{\mathcal{C}}{D}_{\mathcal{C}}^{-\frac{1}{2}}{Z}_{\mathcal{C}}^{(\mathcal{G})}{W}_{\mathcal{C}}^{(\mathcal{G})}\right),$$(10)where ${D}_{\mathcal{C}}$ is the diagonal degree matrix with entries ${[{D}_{\mathcal{C}}]}_{ii}={\sum}_{j}{[{A}_{\mathcal{C}}]}_{ij}$, ${Z}_{\mathcal{C}}^{(0)}={E}_{\mathcal{C}}$, and ${W}_{\mathcal{C}}^{(\mathcal{G})}$ is the weight matrix. The output of the GCN for the course-similarity graph is denoted as matrix ${R}_{\mathcal{C}}={Z}_{\mathcal{C}}^{(2)}$ with each row vector ${\mathbf{r}}_{{c}_{i}}$ being associated with course ${c}_{i}$. The above computation is also applied on the student-similarity graph $\mathcal{S}$ to obtain the graph embedding matrix ${R}_{\mathcal{S}}$ with each row vector being defined as ${\mathbf{r}}_{{s}_{j}}$ for student ${s}_{j}$.
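A numpy sketch of the propagation rule in (10) with ReLU as the nonlinearity $\sigma$; the toy adjacency and dimensions are illustrative. (Common GCN implementations additionally add self-loops before normalization; here the self-similarity entries of 1 on the diagonal already play that role.)

```python
import numpy as np

def gcn_layer(A, Z, W):
    """One propagation step of (10): sigma(D^{-1/2} A D^{-1/2} Z W),
    with ReLU as sigma and D the diagonal degree matrix of A."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ Z @ W, 0.0)

rng = np.random.default_rng(0)
A_C = np.array([[1.0, 0.6], [0.6, 1.0]])            # toy 2-course adjacency
Z0 = rng.random((2, 8))                             # initial embeddings E_C
W0, W1 = rng.random((8, 4)), rng.random((4, 4))
R_C = gcn_layer(A_C, gcn_layer(A_C, Z0, W0), W1)    # two layers, as in the model
```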
To generate representations of the prior grades achieved, embedding is applied to each grade. With a one-hot vector representing a unique value of the prior grade ${g}_{{s}_{i},{c}_{j}}^{t}$, the embedding vector for a student's prior grade is learned via
$${\mathbf{e}}_{{s}_{i},{c}_{j}}^{t}={W}_{G}\,\text{One-hot}\left({g}_{{s}_{i},{c}_{j}}^{t}\right),$$(11)where ${W}_{G}$ is the weight matrix. The embedding vectors from the course-similarity graph, the student-similarity graph, and the temporal grade information are then concatenated for each semester to form a $({l}_{a}+{N}_{C}\times ({l}_{b}+{l}_{c}))\times 1$ vector
$${\mathbf{e}}_{{s}_{i}}^{t}={[{\mathbf{r}}_{{s}_{i}},{\mathbf{r}}_{{c}_{1}},{\mathbf{e}}_{{s}_{i},{c}_{1}}^{t},\dots ,{\mathbf{r}}_{{c}_{{N}_{C}}},{\mathbf{e}}_{{s}_{i},{c}_{{N}_{C}}}^{t}]}^{T},$$(12)where ${l}_{a}$, ${l}_{b}$, and ${l}_{c}$ denote the embedding lengths of ${\mathbf{r}}_{{s}_{i}}$, ${\mathbf{r}}_{{c}_{j}}$, and ${\mathbf{e}}_{{s}_{i},{c}_{j}}^{t}$, respectively, and $T$ denotes transpose. Each of these vectors is then concatenated to form a feature matrix
$${\mathbf{\text{E}}}_{{s}_{i}}=[{\mathbf{\text{e}}}_{{s}_{i}}^{1},\dots ,{\mathbf{\text{e}}}_{{s}_{i}}^{{N}_{T}}]$$(13)of each student ${s}_{i}$ for the subsequent prediction model.
2.5 Grade Prediction Using Long Short-Term Memory Network
The LSTM models time-series representations and is used to predict the pilot-course grade based on the sequential matrix ${E}_{{s}_{i}}$ for each student. Through the use of its input, output, and forget gates, the LSTM aggregates important representations and prunes less significant ones to achieve prediction of pilot grades in semester ${N}_{T}+1$. The LSTM is employed for grade prediction via the hidden state
$${\mathbf{h}}_{{s}_{i}}^{t}=\text{LSTM}({\mathbf{e}}_{{s}_{i}}^{t},{\mathbf{h}}_{{s}_{i}}^{t-1}),$$(14)where ${\mathbf{h}}_{{s}_{i}}^{t}$ denotes the hidden state for semester $t$. The predicted grade ${\hat{g}}_{{s}_{i}}$ for student ${s}_{i}$ obtained from the last hidden state is then given by
$${\hat{g}}_{{s}_{i}}={\mathbf{w}}_{L}\cdot {\mathbf{h}}_{{s}_{i}}^{{N}_{T}}+b,$$(15)where ${\mathbf{w}}_{L}$ and $b$ are defined, respectively, as the weight vector and bias scalar of the predictor.
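A from-scratch numpy sketch of (14) and (15) (the actual model would use a standard deep learning library's LSTM); all dimensions and parameter shapes here are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(e_t, h_prev, c_prev, params):
    """One LSTM step of (14): forget, input, and output gates computed from
    the semester representation e_t and the previous hidden state h_prev."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    x = np.concatenate([e_t, h_prev])
    f = sigmoid(Wf @ x + bf)                   # forget gate
    i = sigmoid(Wi @ x + bi)                   # input gate
    o = sigmoid(Wo @ x + bo)                   # output gate
    c = f * c_prev + i * np.tanh(Wc @ x + bc)  # updated cell state
    return o * np.tanh(c), c

def predict_grade(E, params, w_L, b):
    """Run the LSTM over the N_T semester vectors in E, then apply the
    linear readout of (15) to the last hidden state."""
    d_h = w_L.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    for e_t in E:
        h, c = lstm_cell(e_t, h, c, params)
    return float(w_L @ h + b)
```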
3. RESULTS AND DISCUSSION
3.1 Datasets and Implementation Details
Open-source datasets employed for grade prediction do not include course descriptions. We therefore collected data that include both academic records and course descriptions (comprising both course outlines and intended learning outcomes). These are obtained from three engineering departments in a university to evaluate the models. Each dataset is obtained with the student names and identities hashed by another office (authorized to handle such data) to protect privacy. Table 2 summarizes details for each dataset used. In particular, ${N}_{T}$ for each dataset is determined by the maximum number of semesters the students within the cohort take to complete all courses under consideration. The prior course list for each pilot course consists of the core courses corresponding to the department’s curriculum. In addition, ${N}_{C}$ and ${N}_{S}$ are distinct for each dataset. In our experiments, the training, validation, and testing ratio is set to 6:2:2.
We employed the mean squared error (MSE)
$$\mathit{MSE}=\frac{1}{{N}_{S}}\sum _{i=1}^{{N}_{S}}{({\hat{g}}_{{s}_{i}}-{g}_{{s}_{i}})}^{2}$$(16)for performance evaluation, where ${g}_{{s}_{i}}$ denotes the actual grade obtained by student ${s}_{i}$ for a given pilot course. In terms of hyperparameter selection, course description document embeddings are trained using BERT with a dimension of 768. During GCN training, the dropout rate was set to 0.5, while the Adam optimizer with a learning rate of 0.001 was used. A weight decay parameter of $5\times 1{0}^{-4}$ was used to prevent overfitting.
3.2 Performance Analysis
We take pilot course ${c}_{5}$ from Department 3 as an example to illustrate the impact of considering the semantic, syntactic, and frequency aspects of words used in course outlines and intended learning outcomes. Three heatmaps with colors depicting the similarity values described in Section 2.1 are provided, while details pertaining to prior course information are shown in Table 3.
Figure 2(a) illustrates the semantic cosine similarity where high similarities in terms of the closeness of course content are indicated by the dark shades. It can be seen that the mathematics-based prior course EC180 exhibits high semantic similarity with other prior courses EC280, EA206, EC181, and EA304, which have high mathematical content. On the other hand, the computing course CS108 exhibits lower semantic similarity with most of the other (non-programming) courses. Figure 2(b) highlights how (dis)similar phrase types are between the course outlines and the intended learning outcomes of two prior courses. We note that EC181 exhibits higher Jaccard similarity with courses that require fundamental scientific and mathematical knowledge such as EC180, IC102, and EC280. TF-IDF weighting, on the other hand, indicates the choice and uniqueness of words being used in the course outlines and intended learning outcomes. Figure 2(c) highlights the high variability in words used between the courses being considered: only a few pairs of course outlines and intended learning outcomes exhibit high TF-IDF similarity. In addition, we also note that frequency-based similarity does not necessarily reflect content similarity. This can be observed from the fact that even though EC180 and EC181 are both mathematics-related, their frequency-based TF-IDF similarity is relatively low.
We next compare the performance of the proposed CODE-GP model with the LSTM-based grade-prediction model [12], GCN [19], and the conventional logistic regression (LR) model. While LR and LSTM focus on temporal information and GCN exploits the interrelationship between courses and students, the proposed model considers both aspects. We note from Table 4 that the proposed CODE-GP model achieves better grade prediction than LR, LSTM, and GCN. While the proposed model incurs higher complexity than these three baseline models, CODE-GP achieves the lowest mean MSE of 0.0222 (an 11.9% improvement over LSTM) across the three departments, as seen in Table 4. These results highlight the importance of course descriptions when constructing student- and course-similarity graphs alongside time-series information. Features extracted from course descriptions enhance the grade prediction capability compared to using only a single modality.
We further performed an ablation test by excluding each input graph and the temporal representation in turn. Table 5 summarizes the MSE and mean absolute error (MAE) across all three departments. We note that the use of all three aspects in CODE-GP is vital to providing a holistic perspective for grade prediction. It is interesting to note that grade prediction performance is more sensitive to the course-similarity graph (compared to the student-similarity graph). This suggests that information derived from course descriptions can assist in grade prediction since performance is closely related to achieving the set of intended learning outcomes depicted in course descriptions. These results also highlight that temporal information and graphs provide complementary features that contribute jointly to the success of grade prediction.
4. CONCLUSIONS
We propose a grade prediction model that considers course descriptions and prior academic results. Text mining techniques determine the edge weights of the course- and student-similarity graphs. A three-pronged model that constitutes semantic, syntactic, and frequency-based feature extraction methods is formulated for course similarities. Student performance in terms of achievements in courses associated with lower- or higher-order thinking skills has also been incorporated to construct the student-similarity graph. The LSTM synthesizes these aspects before performing prediction.
An accurate and just-in-time prediction of performance enables course instructors to administer early interventions. Once the predicted results indicate that a student is likely to fail a course, student support staff can respond and plan a personalized intervention strategy for each student. Moreover, early detection of at-risk students can potentially reduce the dropout rate. Future work may include techniques that incorporate other data modalities such as student demographics or online learning behavior while protecting student privacy.
5. REFERENCES
 M. Adnan, A. Habib, J. Ashraf, S. Mussadiq, A. A. Raza, M. Abid, M. Bashir, and S. U. Khan. Predicting at-risk students at different percentages of course length for early intervention using machine learning models. IEEE Access, 9:7519–7539, 2020.
 T. Andre. Does answering higher-level questions while reading facilitate productive learning? Review Edu. Research, 49:280–318, 1979.
 X. Bai, P. Liu, and Y. Zhang. Investigating typed syntactic dependencies for targeted sentiment classification using graph attention neural network. IEEE/ACM Trans. Audio Speech Lang. Proc., 29:503–514, 2021.
 D. Boud and N. Falchikov. Aligning assessment with longterm learning. Assessment, Evaluation Higher Edu., 31:399–413, 2006.
 S. G. Bull. The role of questions in maintaining attention to textual material. Review Edu. Research, 43:83–88, 1973.
 H. Bydžovská. A comparative analysis of techniques for predicting student performance. In Proc. Int. Conf. Edu. Data Mining (EDM), pages 306–311, 2016.
 J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Human Lang. Tech.: Annual Conf. North American Chap. (NAACL-HLT), pages 4171–4186, 2019.
 M. Fateen and T. Mine. Predicting student performance using teacher observation reports. In Proc. Int. Conf. Edu. Data Mining (EDM), pages 481–486, 2021.
 R. M. Felder and R. Brent. Designing and teaching courses to satisfy the ABET engineering criteria. J. Eng. Edu., 92:7–25, 2003.
 Q. Hu and H. Rangwala. Course-specific Markovian models for grade prediction. In Proc. Int. Pacific-Asia Conf. Knowledge Discovery Data Mining, pages 29–41. Springer, 2018.
 Q. Hu and H. Rangwala. Academic performance estimation with attention-based graph convolutional networks. In Proc. Int. Conf. Educational Data Mining, pages 69–78, 2019.
 Q. Hu and H. Rangwala. Reliable deep grade prediction with uncertainty estimation. In Proc. Int. Conf. Learn. Anal. & Knowl., pages 76–85, 2019.
 T. Hunt. Overlapping in high school and college again. J. Edu. Research, 13(3):197–207, 1926.
 V. N. Ioannidis, A. G. Marques, and G. B. Giannakis. A recurrent graph neural network for multi-relational data. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pages 8157–8161, 2019.
 Z. Iqbal, J. Qadir, A. N. Mian, and F. Kamiran. Machine learning based student grade prediction: A case study. arXiv, pages 1–22, 2017.
 W. Jiang and Z. A. Pardos. Evaluating sources of course information and models of representation on a variety of institutional prediction tasks. In Proc. Int. Conf. Edu. Data Mining (EDM), pages 115–125, 2020.
 H. Karimi, T. Derr, J. Huang, and J. Tang. Online academic course performance prediction using relational graph convolutional neural network. In Proc. Int. Conf. Edu. Data Mining (EDM), pages 444–450, 2020.
 T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learn. Representations (ICLR), pages 1–14, 2017.
 T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learn. Representations, pages 76–85, 2017.
 D. Krathwohl. A revision of Bloom’s Taxonomy: An overview. Theory into Practice, 41:212–218, 2002.
 D. D. Leeds, T. Zhang, and G. M. Weiss. Mining course groupings using academic performance. In Proc. Int. Conf. Edu. Data Mining (EDM), pages 1–5, 2021.
 X. Liu, X. You, X. Zhang, J. Wu, and P. Lv. Tensor graph convolutional networks for text classification. In Proc. AAAI Conf. Artificial Intell., pages 8409–8416, 2020.
 X. Lu, Y. Zhu, Y. Xu, and J. Yu. Learning from multiple dynamic graphs of student and course interactions for student grade predictions. Neurocomputing, 431:23–33, 2021.
 H. Ma, X. Wang, J. Hou, and Y. Lu. Course recommendation based on semantic similarity analysis. In Proc. IEEE Int. Conf. Control Sci. Syst. Engg., pages 638–641, 2017.
 K. H. R. Ng, S. Tatinati, and A. W. H. Khong. Online education evaluation for signal processing course through student learning pathways. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., pages 6458–6462, 2018.
 K. H. R. Ng, S. Tatinati, and A. W. H. Khong. Grade prediction from multi-valued clickstream traces via Bayesian-regularized deep neural networks. IEEE Trans. Signal Process., 69:1477–1491, 2021.
 Z. Ren, X. Ning, A. S. Lan, and H. Rangwala. Grade prediction based on cumulative knowledge and cotaken courses. In Proc. Int. Conf. Educational Data Mining, pages 158–167, 2019.
 J. C. Richards. Curriculum approaches in language teaching: Forward, central, and backward design. RELC J., 44(1):5–33, 2013.
 S. H. Seyyedrezaie and G. Barani. Constructivism and curriculum development. J. Humanities Insights, 1(3):119–124, 2017.
 S. Supraja, K. Hartman, S. Tatinati, and A. W. H. Khong. Toward the automatic labeling of course questions for ensuring their alignment with learning outcomes. In Proc. 10th Int. Conf. Educational Data Mining (EDM), pages 56–63, 2017.
 S. Supraja, A. W. H. Khong, and S. Tatinati. Regularized phrasebased topic model for automatic question classification with domainagnostic class labels. IEEE/ACM Trans. Audio Speech Lang. Proc., 29:3604–3616, 2021.
 S. Supraja, S. Tatinati, K. Hartman, and A. W. H. Khong. Automatically linking digital signal processing assessment questions to key engineering learning outcomes. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pages 6996–7000, 2018.
 M. Sweeney, J. Lester, and H. Rangwala. Next-term student grade prediction. In Proc. IEEE Int. Conf. Big Data, pages 970–975, 2015.
 R. Tang and W. SaeLim. Data science programs in U.S. higher education: An exploratory content analysis of program description, curriculum structure, and course focus. Edu. Info., 32(3):269–290, 2016.
 J. Valenchon and M. Coates. Multiple-graph recurrent graph convolutional neural network architectures for predicting disease outcomes. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pages 3157–3161, 2019.
 P. Wei, J. Zhao, and W. Mao. A graph-to-sequence learning framework for summarizing opinionated texts. IEEE/ACM Trans. Audio Speech Lang. Proc., 29:1650–1660, 2021.
 A. A. Yahya, A. Osman, A. Taleb, and A. A. Alattab. Analyzing the cognitive level of classroom questions using machine learning techniques. In Proc. 9th Int. Conf. Cognitive Sci., pages 587–595, 2013.
 Y. Zhang, R. An, S. Liu, J. Cui, and X. Shang. Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Transactions on Big Data, 2021.
 A. Zohar and Y. J. Dori. Higher order thinking skills and low-achieving students: Are they mutually exclusive? J. Learn. Sci., 12(2):145–181, 2003.
© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.