MOOC-Rec: Instructional Video Clip Recommendation for MOOC Forum Questions

Zhu, Peide; Hauff, Claudia; Yang, Jie

doi:10.5281/zenodo.6853055

Peide Zhu

Delft University of Technology

p.zhu-1@tudelft.nl

Jie Yang

Delft University of Technology

j.yang-3@tudelft.nl

Claudia Hauff

Delft University of Technology

c.hauff@tudelft.nl

ABSTRACT

In this work, we address the information overload that learners in Massive Open Online Courses (MOOCs) face when attempting to close their knowledge gaps via MOOC discussion forums. To this end, we investigate the recommendation of one-minute-resolution video clips based on the textual similarity between video captions and MOOC discussions. We first create a large-scale dataset from Khan Academy video transcripts and their forum discussions. We then investigate the effectiveness of applying pre-trained transformer-based neural retrieval models to produce a ranked list of video clips for a forum discussion. The retrieval models are trained with supervised learning and distant supervision to effectively leverage the unlabeled data (over 80% of all data). Experimental results demonstrate that the proposed method is effective for this task, outperforming the BM25 baseline by 0.208 in terms of precision at rank 1 (P@1). To the best of our knowledge, this is the first systematic research applying pre-trained transformer-based ranking models to MOOC clip recommendation.

Keywords

MOOC, Discussion Forum, Video Clip Transcripts, Clip Recommendation

1. INTRODUCTION

Massive Open Online Courses (MOOCs) provide the public with open access to world-class courses, which greatly broadens opportunities for online learning. The discussion forum is a major component of a MOOC, as it is the primary communication platform among learners and instructors [1] and compensates for the lack of physical access in MOOCs. It can help learners build a sense of belonging and learn from peers, and help instructors monitor learner affect and academic progress [2]. However, questions targeting the same video content are scattered among discussion threads, and without supporting navigation facilities learners cannot effectively retrieve valuable discussions for a particular piece of content. In addition, learners’ posts seeking help may be drowned out by the many competing posts, making it hard for learners to get attention from instructors and peers. Unstructured, unorganized forums with a large number of discussions (leading to information overload [19]) prevent instructors and learners from reaping these benefits, inhibit community interaction, reduce responsiveness in forums, and in the end lead to low MOOC retention rates [20, 13].

Existing work on the information overload issue in MOOC forums has proposed more effective navigation tools that identify instructional video content and recommend a ranked list of video clips. For example, [2] classifies posts that need help and employs bag-of-words based retrieval techniques to map them to minute-resolution course video clips; the clip recommendation algorithm is evaluated on posts from one course. [17] built a recommender system that uses a deep neural network to generate a ranked list of video clips given a student’s question; they evaluate the system with 50 questions. However, prior work on video clip recommendation suffers from a lack of training data and reports evaluations on small-scale data. It remains a challenge to develop and evaluate a system that can scale to thousands of MOOCs across different domains.

In this work, we first address the lack of training data by creating MOOC-CLIP, a novel large-scale dataset from Khan Academy ¹ that includes video captions and forum posts (both questions and answers), using raw data available from LearningQ [3], an open-source tool and dataset for educational question generation. Second, we propose MOOC-Rec, a dense-retrieval-based instructional video clip recommendation system for MOOC forum questions. For each content-related thread, MOOC-Rec recommends a ranked list of video clips that are likely to be relevant and helpful for answering the question. Although dense retrievers such as DPR [6] and ColBERT [7] have been applied in various retrieval tasks, it is unknown whether dense retrieval is an effective approach for MOOC video clip recommendation. Furthermore, only 11.57% of all discussions are labeled with a target video clip, which poses the challenge of training MOOC-Rec with limited labeled data and abundant unlabeled data. In this paper, we first investigate the effectiveness of MOOC-Rec and then address the scarcity of labeled data by using distant supervision and in-batch negatives to train the ranker.

To the best of our knowledge, this is the first work that systematically investigates applying state-of-the-art pre-trained transformer-based neural ranking models to the MOOC clip recommendation problem. Comprehensive experiments on our large-scale dataset show that our systems significantly improve clip recommendation performance.

2. THE MOOC-CLIP DATASET

To address the lack of research data, we create a large-scale dataset using raw data crawled with LearningQ² from Khan Academy, a MOOC platform which allows learners to ask and answer questions about the learning materials during learning. We keep video transcripts, forum questions and answers of MOOCs which have both transcripts and discussions available.

Learners use discussion forums in different ways. Besides asking questions related to the course materials, they may also post off-topic content [14], such as socializing, posting spam, or expressing appreciation for the video. Some questions posted by learners also lack proper context or are too generic. Therefore, it is necessary to remove the relatively low-quality questions. Following LearningQ, we consider a user-generated question to be useful for learning when all of the following conditions hold: (i) the question is concept-relevant, i.e., it seeks information on knowledge concepts taught in lecture videos or articles; (ii) the question is context-complete, i.e., sufficient context information is provided to enable other learners to answer the question; and (iii) the question is not generic (e.g., a question asking for learning advice). We labeled 13,290 questions over 8 topics in total, of which 60.9% are labeled as useful and 39.1% as non-useful. We keep all items belonging to 3 topics (2344 in total) as an unknown-topic set for Cross-Topics evaluation. We split the questions on the other 5 topics into 8766 for training and the remaining 2186 as a known-topic test set. We train BERT

Table 1: Useful question classifier results.
	Same Topic			Cross-Topics
Method	Acc	Rc	F1	Acc	Rc	F1
Q	89.40	96.68	92.90	77.20	74.49	75.82
Q+C	89.75	96.54	93.02	73.30	82.68	77.71

During preprocessing, we first remove noisy discussions that contain only meaningless tokens, as well as videos that have no discussions. Then we apply the useful question classifier to all items and retain only the items classified as useful. In the end, we retain 273,887 discussions from 7,349 videos and forums across 1991 MOOC courses. We use regular expressions to retrieve discussions in which learners post exact timestamps in questions or answers. We split video transcripts into snippets with a length of 1 minute. A discussion and the snippet that covers its timestamp are labeled as a positive item. Other discussions are treated as unlabeled. Table 2 shows the labeled and unlabeled data statistics. In summary, there are 31,680 positive labeled items and 240,551 unlabeled items, i.e. 11.57% of all discussions are labeled.
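The timestamp-based labeling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the pattern `TIMESTAMP_RE`, the helper names, and the assumption that learners write timestamps as `m:ss` or `h:mm:ss` are all hypothetical.

```python
import re

# Hypothetical pattern: matches m:ss, mm:ss, or h:mm:ss (e.g. "2:35", "1:02:03").
TIMESTAMP_RE = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):([0-5]\d)\b")

SNIPPET_SECONDS = 60  # one-minute snippets, as in the paper


def extract_timestamps(text):
    """Return all timestamps mentioned in `text`, converted to seconds."""
    seconds = []
    for hours, minutes, secs in TIMESTAMP_RE.findall(text):
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(secs)
        seconds.append(total)
    return seconds


def snippet_index(timestamp_seconds, snippet_seconds=SNIPPET_SECONDS):
    """Zero-based index of the snippet whose time span covers the timestamp."""
    return timestamp_seconds // snippet_seconds


def label_discussion(text):
    """Return the sorted snippet indices a post refers to, or [] if it is unlabeled."""
    return sorted({snippet_index(t) for t in extract_timestamps(text)})
```

A post for which `label_discussion` returns an empty list would fall into the unlabeled portion of the data.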

Table 2: Dataset overview, in terms of the number of courses (#C), the number of videos (#V), snippets per video (#S/V), words per snippet (#W/S), words per question (#W/Q), and words per answer (#W/A).
Split	#C	#V	#S/V	#W/S	#W/Q	#W/A
Train	1714	4590	7.91	198.51	39.96	80.89
Dev	641	895	8.37	199.04	40.02	79.26
Test	774	1126	8.14	198.64	39.67	81.92
Unlabeled	1985	7283	7.70	197.96	38.46	78.58

Figure 2: Dataset overview regarding the number of labeled and unlabeled questions in each topic. The distribution of questions across topics is unbalanced.

This dataset also covers a range of educational topics including math, science, careers, humanities, etc. We conduct an exploratory analysis along the topic dimension, shown in Fig 2. We observe a topic imbalance, e.g. discussions under math and science topics account for 78.88% of labeled items and 76.82% of all items. The labeled data is then split into 80% and 20% for training and test sets respectively, based on the number of discussions in each set.

3. METHODOLOGY

The problem of MOOC video clip recommendation studied in this paper can be described as follows. Given a forum discussion question, the system should produce a ranked list of the most relevant video clips, represented by their captions. We assume that the questions kept by the content-relevant question classifier are relevant to the course materials, and that the most relevant video clips are instructional for learners. Assume a MOOC video $𝒱$ lasts for $T$ seconds; we split it into $s$ clips of $t$ seconds each, where $s = ⌈ \frac{T}{t} ⌉$ . We choose to split each video into one-minute-resolution clips, i.e. we set $t = 60$ . The video $𝒱$ then contains clips $c_{1}, c_{2}, \dots, c_{s}$ . Each clip $c_{i}$ is represented by its caption, which can be viewed as a sequence of tokens $w_{1}^{i}, w_{2}^{i}, \dots, w_{| c_{i} |}^{i}$ . A discussion is denoted $d_{i} = [q_{i}, {a_{i}}]$ , where $q_{i}$ is the question and ${a_{i}}$ denotes the answers to question $i$ ; ${a_{i}}$ may be empty, since questions that have not been answered yet are common in MOOC forums. The task is then to produce a ranked list of clips $c_{i, 1}, c_{i, 2}, \dots, c_{i, s}$ for each discussion $d_{i}$ . Notice that the corpus covers courses from different domains, and answers are not always available. As a result, the video clip recommender needs to work effectively for MOOCs in different domains and for unanswered questions. Formally, the recommender $ℛ : (d, C) \to C_{ℛ}$ is a function that takes a discussion $d$ and a video clip list $C$ as input and returns a ranked list of clips $C_{ℛ}$ . We can also choose to return only the top- $K$ most relevant clips.
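To make the clip construction concrete, the following sketch groups timed caption cues into $s = ⌈ T / t ⌉$ clips. The input format (a list of `(start_second, text)` pairs) and the function name `split_into_clips` are assumptions made for illustration; the paper does not specify how its captions are stored.

```python
import math


def split_into_clips(cues, video_seconds, clip_seconds=60):
    """Group timed caption cues into fixed-length clips.

    `cues` is a list of (start_second, text) pairs (an assumed format) and
    `video_seconds` must be positive. Returns s = ceil(T / t) caption
    strings, one per clip; a clip with no cues yields an empty string.
    """
    num_clips = math.ceil(video_seconds / clip_seconds)
    clips = [[] for _ in range(num_clips)]
    for start, text in cues:
        # Clamp so a cue stamped at the very end still lands in the last clip.
        index = min(int(start // clip_seconds), num_clips - 1)
        clips[index].append(text)
    return [" ".join(parts) for parts in clips]
```

For instance, a video of $T = 475$ seconds with $t = 60$ yields $s = 8$ clips, the last of which covers only the final 55 seconds.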

3.1 Dual-Encoder

We use the neural IR architecture of [6] for the ranker. It uses a dense encoder $E_{C} (\cdot)$ that maps each video clip transcript to an $m$ -dimensional real-valued vector. At run-time, MOOC-Rec maps the input discussion $d = [q, a]$ to another $m$ -dimensional vector using the query encoder $E_{Q} (\cdot)$ , and retrieves the top- $k$ closest video clip vectors. We define the similarity between the clip and the discussion using the following function of the two vectors:

$$sim (d, c) = E_{Q} (d)^{T} E_{C} (c) \qquad (1)$$

In this paper, cosine similarity is used for modeling the similarity between the discussion and the clip vectors.

The goal of training is to learn better embedding functions for both clips and discussions, which map relevant pairs of discussions and clips to vectors with smaller distance, i.e. higher similarity, so that the similarity function $sim (d, c)$ becomes a good ranking function for MOOC video clip recommendation. This is essentially a metric learning problem [9, 11, 6]. Let $ℳ = \{⟨ d_{i}, c_{i}^{+}, c_{i, 1}^{-}, \dots, c_{i, n}^{-} ⟩\}_{i = 1}^{m}$ be the training MOOC discussion corpus that contains $m$ instances. Each instance has one discussion $d_{i} = [q_{i}, a_{i}]$ , one relevant (positive) video clip transcript $c_{i}^{+}$ , and $n$ irrelevant (negative) clips $c_{i, j}^{-}$ . We train the retrieval model by optimizing the negative log-likelihood of the positive clip:

$$L (d_{i}, c_{i}^{+}, c_{i, 1}^{-}, \dots, c_{i, n}^{-}) = - \log \frac{e^{sim (d_{i}, c_{i}^{+})}}{e^{sim (d_{i}, c_{i}^{+})} + \sum_{j = 1}^{n} e^{sim (d_{i}, c_{i, j}^{-})}}$$

For labeled discussions, positive and negative examples are explicit. We use the video clip whose time span contains the timestamp of the discussion as the positive example. All other clips from the same video could be treated as negatives; however, MOOC videos vary in the number of clips, so to boost model training and balance the number of positive and negative examples, we select $n$ of them as the training negatives. The selection of negative clips is essential for training a high-quality ranker. We additionally apply in-batch negatives [5, 6] for training, i.e. the positive clips of the other questions in the same batch are also treated as negatives.
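The negative log-likelihood with in-batch negatives can be illustrated with a small, framework-free sketch. It assumes a precomputed similarity matrix in which entry `sim[i][j]` is the similarity between discussion $i$ and the positive clip of discussion $j$; the function name `in_batch_nll` is hypothetical, and the sketch covers only the in-batch part, not the $n$ negatives sampled from the same video.

```python
import math


def in_batch_nll(sim):
    """Mean negative log-likelihood with in-batch negatives.

    `sim[i][j]` is the similarity between discussion i and the positive clip
    of discussion j, so the diagonal holds the positive pairs and every
    off-diagonal entry in row i acts as a negative for discussion i.
    """
    losses = []
    for i, row in enumerate(sim):
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(row)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in row))
        # -log(exp(pos) / sum(exp(all))) = log(sum(exp(all))) - pos
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)
```

With a batch size of 32, each discussion in this scheme would see 31 additional negatives at no extra encoding cost, which is the usual motivation for in-batch negatives.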

As we show in Table 2, over 80% of all discussions are unlabeled. Annotating them manually would be labor-intensive and expensive. We adopt “distant supervision” [10] to effectively utilize the rich unlabeled data and train a better model. This process involves training the model with noisy, “weakly” labeled data. After training on the labeled set, MOOC-Rec is able to achieve over 50% precision on the top-1 prediction and over 70% in top-3 with Recall@3 over 80%. Therefore, we use the ranker trained on the labeled training set as the scorer, take the top-1 clip with the highest $sim (d, c)$ as the positive, and the clips with the lowest $sim (d, c)$ (outside the top-3) as negatives. The weakly labeled data are then used to train the ranker.
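The weak-labeling rule can be sketched as below. The function name `weak_labels` and the parameter `num_negatives` are assumptions for illustration (the paper does not state how many negatives are drawn per discussion); `scores` stands for the $sim (d, c)$ values produced by the already-trained ranker for the clips of one video.

```python
def weak_labels(scores, num_negatives=2, protected_top=3):
    """Derive weak labels for one unlabeled discussion.

    `scores[j]` is sim(d, c_j) from the trained ranker for clip j of the
    discussion's video. Returns (positive_index, negative_indices).
    """
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    positive = ranked[0]
    # Negatives are never drawn from the top-`protected_top` clips.
    candidates = ranked[protected_top:]
    negatives = candidates[::-1][:num_negatives]  # lowest-scoring first
    return positive, negatives
```

Excluding the top-3 clips from the negative pool reflects the observation that the true clip is usually among them, so treating them as negatives would inject label noise.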

At inference time, we pre-compute all clip embeddings $v_{c}$ by applying the clip encoder $E_{C}$ to all MOOC video clips offline. Given a discussion $d = [q, a]$ at run-time, we concatenate the question and answers if $a$ is available and compute the discussion embedding $v_{d} = E_{Q} (d)$ . The top- $k$ clips with the highest $sim (d, c)$ scores are retrieved.
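The offline/online split at inference time can be sketched without any particular encoder. In the snippet below the embeddings are plain lists of floats standing in for the outputs of $E_{C}$ and $E_{Q}$; the helper names `build_index` and `retrieve` are hypothetical, and normalizing the vectors so that the dot product equals cosine similarity is one possible way to realize the cosine similarity mentioned above.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def build_index(clip_embeddings):
    """Offline step: normalize and store every clip embedding v_c."""
    return [normalize(v) for v in clip_embeddings]


def retrieve(discussion_embedding, index, k=3):
    """Online step: indices of the k clips with the highest cosine similarity."""
    query = normalize(discussion_embedding)
    scores = [sum(q * c for q, c in zip(query, clip)) for clip in index]
    ranked = sorted(range(len(index)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

Only `retrieve` runs per query, which is why the dual-encoder design keeps the run-time cost to one discussion encoding plus a set of dot products.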

Although encoders can be implemented in many different ways [10], in this work we use two independent BERT [4] variant models as encoders, and the mean of all token embeddings is used as the final representation. We tokenize clip transcripts and truncate the token list to a maximum length of 512 (starting with the [CLS] token and ending with the [SEP] token). In this research, the discussion encoder plays the role of the ‘query’ encoder in typical neural IR systems. Instead of using separate encoders for the questions and answers of a discussion, in our design both share the same encoder. In this way, we train a better query encoder for questions by taking advantage of answer information.
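The truncation and mean-pooling steps can be sketched as follows. The token ids `cls_id` and `sep_id` and the use of an attention mask to skip padding are assumptions for illustration; the paper only states that the mean of all token embeddings is used.

```python
def truncate(token_ids, cls_id, sep_id, max_length=512):
    """Keep at most `max_length` tokens, including the [CLS] and [SEP] markers."""
    body = token_ids[: max_length - 2]
    return [cls_id] + body + [sep_id]


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
```

Since one-minute snippets average roughly 200 words (Table 2), most clip transcripts would be expected to fit within the 512-token limit, although this depends on the tokenizer.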

3.2 Cross-Encoder

Cross-encoders and dual-encoders are two common approaches for matching sentence pairs. In contrast to the dual-encoder, which produces embedding vectors for clips and discussions independently, the cross-encoder treats clip recommendation for discussions as a sequence classification task and performs full self-attention over the whole sequence. We concatenate the video clip transcript and the discussion (question and answers) with the [SEP] token as the input to the transformer network. The [CLS] token embedding is then passed to a binary classifier to predict the relevance between them. To find the most relevant MOOC clip for a question, the encoder has to compute the relevance score for every question-clip pair, which leads to massive computational overhead.
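The source of the overhead is visible when the cross-encoder inputs are written out. The sketch below builds one input string per clip for a single discussion; the function name and the plain string concatenation are illustrative assumptions (in practice a tokenizer would typically insert the separator itself).

```python
def cross_encoder_inputs(question, answers, clip_transcripts, sep=" [SEP] "):
    """Build one input string per (clip, discussion) pair for a single discussion."""
    discussion = " ".join([question] + list(answers))
    return [transcript + sep + discussion for transcript in clip_transcripts]
```

Each returned string requires its own transformer forward pass, so scoring one discussion against a video with eight clips costs eight passes, whereas the dual-encoder needs a single pass for the discussion once the clip embeddings are stored.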

4. EXPERIMENTS AND RESULTS

4.1 Experiment Settings

We implement the dual encoders using pre-trained weights of BERT variants, MPNet [16] (embedding size: 768) and MiniLM [18] (embedding size: 384), provided by the Sentence-Transformers library ³ [15]. Both models are pre-trained on a large and diverse dataset of over 1 billion query-paragraph training pairs for the semantic search task. The Adam optimizer [8] with warm-up and a cosine schedule is used for training; we set the maximum learning rate to $lr = 2 \times 10^{- 5}$ , $𝜖 = 1 \times 10^{- 8}$ , and the number of warm-up steps to 1000. For the cross-encoder baseline, we follow previous research [10, 12] and use the same base models as for the dual encoder, i.e. the 6-layer MiniLM and MPNet. The BM25 baseline is based on the Okapi BM25 implementation of the rank_bm25 library ⁴. We train our models using 8 GTX-1080 GPUs for 10 iterations with a batch size of 32. As Figure 3 shows, after one iteration both clip recommendation systems outperform the BM25 baseline. Over the following iterations, model performance first improves gradually and then becomes steady, which shows the effectiveness of the training procedure and of the proposed models.
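A warm-up plus cosine learning-rate schedule of the kind described above can be sketched as follows. The linear shape of the warm-up, the decay to zero, and the `total_steps` argument are assumptions; the paper only gives the maximum learning rate and the number of warm-up steps.

```python
import math


def learning_rate(step, total_steps, max_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to `max_lr`, then cosine decay to zero at `total_steps`."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Under these assumptions the rate rises from zero to $2 \times 10^{- 5}$ over the first 1000 steps and then falls smoothly for the remainder of training.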

Table 3: Performance of the proposed `MOOC-Rec` ranker and baselines on the test set in terms of rank-aware metrics. MLM/MPNet $_{dual}$ represents the `MiniLM` or `MPNet` based dual encoder and MLM/MPNet $_{cross}$ represents the `MiniLM` or `MPNet` based cross encoder. “PT” represents ranker performance using pre-trained encoders without fine-tuning. “FT” means fine-tuned model performance. “WL” means the model performance after training with weakly labeled data.
Method		P@1	MRR	MRR@3	nDCG	nDCG@3
	BM25	0.417	0.600	0.550	0.696	0.593
PT	MLM $_{cross}$	0.132	0.346	0.254	0.497	0.297
	MLM $_{dual}$	0.422	0.614	0.568	0.707	0.617
	MPNet $_{cross}$	0.135	0.344	0.248	0.495	0.288
	MPNet $_{dual}$	0.386	0.583	0.529	0.683	0.576
FT	MLM $_{cross}$	0.511	0.677	0.641	0.755	0.683
	MLM $_{dual}$	0.529	0.692	0.658	0.767	0.700
	MPNet $_{cross}$	0.613	0.745	0.716	0.807	0.750
	MPNet $_{dual}$	0.570	0.720	0.690	0.788	0.730
WL	MLM $_{cross}$	0.540	0.696	0.661	0.770	0.700
	MLM $_{dual}$	0.520	0.683	0.646	0.760	0.687
	MPNet $_{cross}$	0.625	0.751	0.722	0.812	0.754
	MPNet $_{dual}$	0.557	0.711	0.680	0.782	0.720

4.2 Effectiveness of Dense Retrieval

Table 3 summarizes model performance on the test set. We use BM25 as the baseline. Sparse vector-space retrieval methods such as TF-IDF and BM25 have been widely used in instructional clip recommendation systems. The performance of BM25 in terms of Precision@1 (P@1) and MRR is 0.417 and 0.600 respectively, which shows that queries share more lexical overlap with their related MOOC clips than with others and that BM25 is a strong baseline for this application. First, we find that without fine-tuning, a pre-trained dual encoder can achieve similar (MPNet) or even better (MiniLM-L6) performance compared with the BM25 baseline, while the cross-encoders cannot make useful clip recommendations for discussions without training. Second, we observe significant gains ( $p = 1.95 \times 10^{- 7}$ ) when using the MOOC-Rec neural ranker after it has been trained on the data, with gains of over 0.15 in P@1 and over 0.19 in nDCG scores compared to the BM25 baseline on the test set. Thus, dense retrieval is an effective approach to instructional MOOC clip recommendation for forum discussions, as it can model the relevance between discussions and clip transcripts.
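For readers unfamiliar with the rank-aware metrics in Table 3, the sketch below gives their standard per-query definitions under binary relevance. The paper does not spell out its exact formulas, so these are assumed conventional forms; averaging each value over all test discussions gives the corresponding table entry (e.g. MRR is the mean reciprocal rank).

```python
import math


def precision_at_1(ranked, relevant):
    """1 if the top-ranked clip is relevant, else 0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked, relevant, k=None):
    """1 / rank of the first relevant clip, or 0 if none appears in the top k."""
    cutoff = ranked if k is None else ranked[:k]
    for position, clip in enumerate(cutoff, start=1):
        if clip in relevant:
            return 1.0 / position
    return 0.0


def ndcg(ranked, relevant, k=None):
    """Binary-relevance nDCG with a log2 position discount."""
    cutoff = ranked if k is None else ranked[:k]
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, clip in enumerate(cutoff, start=1) if clip in relevant)
    ideal_hits = min(len(relevant), len(cutoff))
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single relevant clip per discussion, a ranker that places the correct clip second scores 0 on P@1, 0.5 on reciprocal rank, and $1 / \log_{2} 3 \approx 0.63$ on nDCG.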

To compare the impact of model size, we use one distilled transformer model, MiniLM, which contains $22 M$ parameters, and one BERT-sized model, MPNet, which contains $109 M$ parameters. As Table 3 shows, in both the cross-encoder and dual-encoder settings, the larger model MPNet achieves better performance after training, which suggests that a transformer model with more parameters has better potential to model the relevance between clips and discussions.

Both cross-encoders and dual-encoders are commonly used for sentence pair matching problems. In Table 3, we observe that after fine-tuning, with the small transformer model the dual-encoder outperforms the cross-encoder by around 0.01, while with the large pre-trained model the cross-encoder outperforms the dual-encoder by 0.043 on P@1 and by around 0.02 on the other metrics. Despite the performance advantage of the cross-encoder with the large model, we observe massive computational overhead with the cross-encoder. The dual-encoder is more time-efficient because the MOOC course clips can be encoded in advance: each time the clip recommender makes a prediction, it only needs to encode the discussion and compute the cosine similarity between the discussion vector and the stored MOOC clip embeddings.

In the “WL” section of Table 3, we summarize the models’ performance after distant training with weakly labeled data. Cross-encoders perform better (+0.029 for MiniLM and +0.012 for MPNet in terms of P@1), but dual-encoders perform worse (-0.009 for MiniLM and -0.013 for MPNet in terms of P@1). One explanation is that although MOOC-Rec achieves good performance after initial training, the weakly labeled data created with it still contain considerable noise. The dual-encoders are trained by optimizing a listwise loss and are more likely to be affected by the noisy data.

5. CONCLUSION AND FUTURE RESEARCH

We studied the task of video clip recommendation in the context of MOOC forums, with the eventual goal of reducing learners’ information overload. We created a novel dataset including video transcripts and discussions, systematically investigated how to incorporate state-of-the-art pre-trained neural IR models for MOOC clip recommendation, and proposed a framework comprising data preparation, useful question classification, a clip ranker, and weak supervision training for this task. We conducted experiments with both cross-encoders and dual-encoders. The results on our dataset show the effectiveness of the proposed method. In future work, on the one hand, we will further analyze the intentions of forum discussions and their relevance to course videos. On the other hand, we will investigate factors that affect MOOC-Rec performance, such as the clip duration and the methods of creating weak labels.

6. ACKNOWLEDGMENTS

This research is partly supported by the China Scholarships Council (CSC).

References

[1] P. Adamopoulos. What makes a great MOOC? An interdisciplinary analysis of student retention in online courses. 2013.
[2] A. Agrawal, J. Venkatraman, S. Leonard, and A. Paepcke. YouEDU: Addressing confusion in MOOC discussion forums by recommending instructional video clips. 2015.
[3] G. Chen, J. Yang, C. Hauff, and G.-J. Houben. LearningQ: A large-scale dataset for educational question generation. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12, 2018.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017.
[6] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[7] O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] B. Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
[10] J. Lin, R. Nogueira, and A. Yates. Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467, 2020.
[11] B. Mitra, N. Craswell, et al. An introduction to neural information retrieval. Now Foundations and Trends, 2018.
[12] R. Nogueira and K. Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.
[13] A. Ntourmas, N. Avouris, S. Daskalaki, and Y. Dimitriadis. Evaluation of a massive online course forum: Design issues and their impact on learners’ support. In IFIP Conference on Human-Computer Interaction, pages 197–206. Springer, 2019.
[14] A. Ntourmas, S. Daskalaki, Y. Dimitriadis, and N. Avouris. Classifying MOOC forum posts using corpora semantic similarities: A study on transferability across different courses. Neural Computing and Applications, pages 1–15, 2021.
[15] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[16] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
[17] P. Trirat, S. Noree, and M. Y. Yi. IntelliMOOC: Intelligent online learning framework for MOOC platforms. In Proceedings of The 13th International Conference on Educational Data Mining (EDM 2020), pages 682–685, 2020.
[18] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
[19] D. A. Wiley and E. K. Edwards. Online self-organizing social systems: The decentralized future of online learning. Quarterly Review of Distance Education, 3(1):33–46, 2002.
[20] D. Yang, M. Wen, I. Howley, R. Kraut, and C. Rose. Exploring the effect of confusion in discussion forums of massive open online courses. In Proceedings of the Second (2015) ACM Conference on Learning@Scale, pages 121–130, 2015.

APPENDIX

Figure 3: System performance along each training iteration.

Figure 4: Performance along different topics.

¹https://www.khanacademy.org/

²https://github.com/AngusGLChen/LearningQ

³https://github.com/UKPLab/sentence-transformers

⁴https://github.com/dorianbrown/rank_bm25
