MOOC-Rec: Instructional Video Clip Recommendation for MOOC Forum Questions
Peide Zhu
Delft University of Technology
Jie Yang
Delft University of Technology
Claudia Hauff
Delft University of Technology


In this work, we address the information overload that learners in Massive Open Online Courses (MOOCs) face when attempting to close their knowledge gaps via MOOC discussion forums. To this end, we investigate the recommendation of one-minute-resolution video clips based on the textual similarity between clip captions and MOOC discussions. We first create a large-scale dataset from Khan Academy video transcripts and their forum discussions. We then investigate the effectiveness of pre-trained transformer-based neural retrieval models for producing a ranked list of video clips for a forum discussion. The retrieval models are trained with supervised learning and distant supervision to leverage the unlabeled data (over 80% of all data). Experimental results demonstrate that the proposed method is effective for this task, outperforming the BM25 baseline by 0.208 in terms of Precision@1. To the best of our knowledge, this is the first systematic study applying pre-trained transformer-based ranking models to MOOC clip recommendation.


MOOC, Discussion Forum, Video Clip Transcripts, Clip Recommendation


Massive Open Online Courses (MOOCs) give the public open access to world-class courses, which greatly improves opportunities for online learning. The discussion forum is a major component of a MOOC, as it is the primary communication platform among learners and instructors [1] and compensates for the lack of physical access in MOOCs. It can help learners build a sense of belonging and learn from peers, and it can help instructors monitor learner affect and academic progress [2]. However, since questions targeting the same video content are scattered across discussion threads, and forums lack supporting navigation facilities, learners cannot effectively retrieve valuable discussions about a particular piece of content. In addition, learners' posts seeking help may be drowned out by the many other competing posts, making it hard for learners to get attention from instructors and peers. Unstructured, unorganized forums with a large number of discussions (leading to information overload [19]) prevent instructors and learners from reaping these benefits, inhibit community interaction, reduce responsiveness in forums, and in the end lead to low MOOC retention rates [20, 13].

Figure 1: Overview of MOOC-Rec.

Existing work on the information overload issue in MOOC forums has proposed more effective navigation tools that identify instructional video content and recommend a ranked list of video clips. For example, [2] classifies posts that need help and employs bag-of-words based retrieval techniques to map them to minute-resolution course video clips. The clip recommendation algorithm is evaluated on posts from one course. [17] built a recommender system that uses a deep neural network to generate a ranked list of video clips given a student's question; they evaluate the system with 50 questions. However, prior work on video clip recommendation suffers from a lack of training data and reports evaluations on small-scale data. It remains a challenge to develop and evaluate a system that can scale to thousands of MOOCs across different domains.

In this work, we first address the lack of training data by creating MOOC-CLIP, a novel large-scale dataset from Khan Academy that includes video captions and forum posts (both questions and answers), using raw data available from LearningQ [3], an open source tool and dataset for educational question generation. Second, we propose MOOC-Rec, a dense retrieval based instructional video clip recommendation system for MOOC forum questions. For each content-related thread, MOOC-Rec recommends a ranked list of video clips that are likely to be relevant and helpful for answering the question. Although dense retrievers such as DPR [6] and ColBERT [7] have been applied to various retrieval tasks, it is unknown whether dense retrieval is an effective approach for MOOC video clip recommendation. Furthermore, only 11.57% of all discussions are labeled with a target video clip, which makes it challenging to train MOOC-Rec with limited labeled data and abundant unlabeled resources. In this paper, we first investigate the effectiveness of MOOC-Rec and then address the scarcity of labeled data by using distant supervision and in-batch negatives to train the ranker.

To the best of our knowledge, this is the first work that systematically investigates applying state-of-the-art pre-trained transformer-based neural ranking models to the MOOC clip recommendation problem. Comprehensive experiments on our large-scale dataset show that our systems significantly improve clip recommendation performance.


To address the lack of research data, we create a large-scale dataset using raw data crawled with LearningQ from Khan Academy, a MOOC platform that allows learners to ask and answer questions about the learning materials while they learn. We keep the video transcripts, forum questions and answers of MOOCs for which both transcripts and discussions are available.

Learners use discussion forums in different ways. Besides asking questions related to the course materials, they may also post off-topic content [14], such as socializing, spam, or expressions of appreciation for the video. Some questions posted by learners also lack proper context or are too generic. Therefore, it is necessary to remove the relatively low-quality questions. Following LearningQ, we consider a user-generated question to be useful for learning when all of the following conditions hold: (i) the question is concept-relevant, i.e., it seeks information on knowledge concepts taught in lecture videos or articles; (ii) the question is context-complete, which means sufficient context information is provided to enable other learners to answer the question; and (iii) the question is not generic (e.g., a question asking for learning advice). In total, we labeled 13,290 questions across 8 topics. We found that 60.9% of them are labeled as useful and 39.1% as non-useful. We keep all items belonging to 3 topics (2,344 in total) as the unknown set for Cross-Topics evaluation. We use 8,766 questions on the other 5 topics for training, and the remaining 2,186 as the known-topic test set. We train BERT

Table 1: Useful question classifier results.
Method  Same Topic (Acc / Rc / F1)  Cross Topics (Acc / Rc / F1)
Q       89.40 / 96.68 / 92.90       77.20 / 74.49 / 75.82
Q+C     89.75 / 96.54 / 93.02       73.30 / 82.68 / 77.71

During preprocessing, we first remove noisy discussions that contain only meaningless tokens and videos that have no discussions. We then apply the useful question classifier to all items and retain only the items classified as useful. In the end, we retain 273,887 discussions from 7,349 videos and forums across 1,991 MOOC courses. We use regular expressions to retrieve discussions in which learners post exact video timestamps in questions or answers. We split video transcripts into snippets of one minute each. A discussion and the snippet that covers its timestamp form a positive item. Other discussions are treated as unlabeled. Table 2 shows the labeled and unlabeled data statistics. In summary, there are 31,680 positive labeled items and 240,551 unlabeled items, i.e. 11.57% of all discussions are labeled.
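The timestamp-based labeling step can be illustrated with a short sketch. The paper does not give its regular expressions, so the pattern below (timestamps written as `M:SS`, `MM:SS` or `H:MM:SS`) and the helper names `extract_timestamps` and `snippet_index` are assumptions made only for illustration.

```python
import re

# Assumed pattern: M:SS, MM:SS or H:MM:SS, not embedded in a longer digit/colon run.
TIMESTAMP_RE = re.compile(r"(?<![\d:])(?:(\d{1,2}):)?(\d{1,2}):([0-5]\d)(?![\d:])")


def extract_timestamps(text):
    """Return every timestamp found in `text`, converted to seconds."""
    seconds = []
    for hours, minutes, secs in TIMESTAMP_RE.findall(text):
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(secs)
        seconds.append(total)
    return seconds


def snippet_index(timestamp_seconds, snippet_length=60):
    """0-based index of the fixed-length snippet that covers the timestamp."""
    return timestamp_seconds // snippet_length


post = "At 3:45 why does he divide both sides by two?"
for ts in extract_timestamps(post):
    print(ts, snippet_index(ts))
```

A discussion whose post yields a timestamp would be paired with the snippet at the returned index as a positive item; posts with no match stay unlabeled. A pattern this simple can also match clock times or ratios, so a real pipeline would likely need additional filtering.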

Table 2: Dataset overview, in terms of the number of courses (#C), the number of videos (#V), snippets per video (#S/V), words per snippet (#W/S), words per question (#W/Q) and words per answer (#W/A).
Split #C #V #S/V #W/S #W/Q #W/A
Train 1714 4590 7.91 198.51 39.96 80.89
Dev 641 895 8.37 199.04 40.02 79.26
Test 774 1126 8.14 198.64 39.67 81.92
Unlabeled 1985 7283 7.70 197.96 38.46 78.58
Figure 2: Dataset overview regarding the number of labeled and unlabeled questions in each topic. The distribution of questions across topics is unbalanced.

The dataset covers a range of educational topics, including math, science, careers, humanities, etc. We conduct an exploratory analysis along the topic dimension, which is shown in Fig 2. We observe a topic imbalance, e.g. discussions under the math and science topics account for 78.88% of labeled items and 76.82% of all items. The labeled data is then split into 80% and 20% for training and test sets respectively, based on the number of discussions in each set.


The problem of MOOC video clip recommendation studied in this paper can be described as follows. Given a forum discussion question, the system should produce a ranked list of the most relevant video clips, each represented by its captions. We assume that the questions filtered by the content-relevant question classifier are relevant to the course materials, and that the most relevant video clips are instructional for learners. Assume a MOOC video $\mathcal{V}$ lasts for $T$ seconds; we split it into $s$ clips of $t$ seconds each, where $s = T/t$. We choose one-minute-resolution clips, i.e. we set $t = 60$. The video $\mathcal{V}$ then contains clips $c_1, c_2, \ldots, c_s$. Each clip $c_i$ is represented by its caption, which can be viewed as a sequence of tokens $w^i_1, w^i_2, \ldots, w^i_{|c_i|}$. A discussion is denoted $d_i = [q_i, \{a_i\}]$, where $\{a_i\}$ is the set of answers to question $q_i$; this set may be empty, since questions that have not been answered yet are common in MOOC forums. The task is to produce a ranked list of clips $c_{i,1}, c_{i,2}, \ldots, c_{i,s}$ for each discussion $d_i$. Notice that the corpus covers courses from different domains, and answers are not always available. As a result, the video clip recommender needs to work effectively for MOOCs in different domains and for unanswered questions. Formally, the recommender is a function $(d, C) \mapsto C'$ that takes a discussion $d$ and a video clip list $C$ as input and returns $C'$, a ranked list of the clips in $C$. We can also choose to return only the top-$K$ most relevant clips.
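As a minimal sketch of the clip construction described above, the following groups timed caption cues into fixed-length clips. The cue format (`(start_second, text)` pairs), the function name `split_into_clips`, and the use of a ceiling so that a trailing partial minute still forms a clip are assumptions, not details given in the paper.

```python
import math


def split_into_clips(cues, video_length, clip_length=60):
    """Group caption cues into fixed-length clips.

    cues: list of (start_second, text) pairs (hypothetical caption format).
    Returns ceil(video_length / clip_length) caption strings, one per clip.
    """
    num_clips = max(1, math.ceil(video_length / clip_length))
    clips = [[] for _ in range(num_clips)]
    for start, text in cues:
        # Clamp so a cue starting exactly at the end lands in the last clip.
        index = min(int(start // clip_length), num_clips - 1)
        clips[index].append(text)
    return [" ".join(parts) for parts in clips]


cues = [(0, "welcome back"), (59.5, "today we look at slopes"),
        (60, "a slope is rise over run"), (130, "let us try an example")]
print(split_into_clips(cues, video_length=150))
```

Each returned string plays the role of one clip caption $c_i$ that is later fed to the clip encoder.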

3.1 Dual-Encoder

We use the neural IR architecture of [6] for the ranker. It uses a dense encoder $E_C(\cdot)$ that maps video clip transcripts to $m$-dimensional real-valued vectors. At run-time, MOOC-Rec maps the input discussion $d = [q, a]$ to another $m$-dimensional vector using the query encoder $E_Q(\cdot)$, and retrieves the top-$k$ closest video clip vectors. We define the similarity between the clip and the discussion using the following function of the two vectors:

$$\mathrm{sim}(d, c) = E_Q(d)^{\top} E_C(c) \qquad (1)$$

In this paper, cosine similarity is used for modeling the similarity between the discussion and the clip vectors.
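To make the scoring step concrete, here is a small sketch of ranking clips by cosine similarity. It assumes the embeddings have already been computed by the two encoders and are available as plain lists of floats; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_clips(discussion_vec, clip_vecs, k=None):
    """Return clip indices sorted by decreasing similarity to the discussion."""
    scored = [(cosine_similarity(discussion_vec, c), i)
              for i, c in enumerate(clip_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    ranked = [i for _, i in scored]
    return ranked if k is None else ranked[:k]


print(rank_clips([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], k=2))
```

If both vectors are L2-normalized beforehand, the cosine similarity coincides with the inner product of Eq. (1).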

The goal of training is to learn better embedding functions for both clips and discussions, which map relevant pairs of discussions and clips to vectors with smaller distance, i.e. higher similarity, so that the similarity function $\mathrm{sim}(d, c)$ becomes a good ranking function for MOOC video clip recommendation. This is essentially a metric learning problem [9, 11, 6]. Let $\mathcal{D} = \{\langle d_i, c_i^+, c_{i,1}^-, \ldots, c_{i,n}^- \rangle\}_{i=1}^{m}$ be the training MOOC discussion corpus, which contains $m$ instances. Each instance has one discussion $d_i = [q_i, a_i]$, one relevant (positive) video clip transcript $c_i^+$, and $n$ irrelevant (negative) clips $c_{i,j}^-$. We train the retrieval model by optimizing the negative log likelihood of the positive clip:

$$L(d_i, c_i^+, c_{i,1}^-, \ldots, c_{i,n}^-) = -\log \frac{e^{\mathrm{sim}(d_i, c_i^+)}}{e^{\mathrm{sim}(d_i, c_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(d_i, c_{i,j}^-)}}$$

For labeled discussions, positive and negative video clips are explicit. We use the video clip whose time span contains the timestamp of the discussion as the positive example. All other clips from the same video could be treated as negatives; however, MOOC videos vary in the number of clips, and we want to speed up training and balance the number of positive and negative examples, so we select $n$ of them as training negatives. The selection of negative clips is essential for training a high-quality ranker. We also apply in-batch negatives [5, 6] for training: the positive clips of the other questions in the batch are treated as negatives as well.
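The negative log likelihood with in-batch negatives can be sketched as follows. The sketch assumes a square matrix of similarity scores in which row $i$ holds the scores of discussion $d_i$ against the positive clips of every discussion in the batch, so the diagonal entries are the positive pairs. It covers only the in-batch negatives (not the $n$ sampled same-video negatives), and averaging over the batch is an assumption.

```python
import math


def in_batch_nll(sim):
    """Mean negative log likelihood of the diagonal (positive) entries.

    sim[i][j] is sim(d_i, c_j^+); for row i every j != i acts as a negative.
    """
    losses = []
    for i, row in enumerate(sim):
        # log-sum-exp with the row maximum subtracted for numerical stability
        row_max = max(row)
        log_denominator = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)


# Uninformative scores give log(batch size); well-separated scores give a loss near 0.
print(in_batch_nll([[0.0, 0.0], [0.0, 0.0]]))
print(in_batch_nll([[10.0, 0.0], [0.0, 10.0]]))
```

In a real training loop the matrix would come from multiplying the batch of discussion embeddings with the batch of clip embeddings, and the loss would be computed with an automatic differentiation library instead of plain Python.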

As we show in Table 2, over 80% of all discussions are unlabeled. Annotating them manually would be labor-intensive and expensive. We therefore adopt distant supervision [10] to make use of the rich unlabeled data and train a better model with it. This process involves training the model with noisy, weakly labeled data. After training on the labeled set, MOOC-Rec is able to achieve over 50% precision on the top-1 prediction and over 70% in top-3 with Recall@3 over 80%. Therefore, we use the ranker trained on the labeled training set as the scorer: for each unlabeled discussion, we take the top-1 clip with the highest $\mathrm{sim}(d, c)$ as the positive, and the clips with the lowest $\mathrm{sim}(d, c)$ (excluding the top-3) as negatives. The weakly labeled data are then used to train the ranker.
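The weak labeling rule can be sketched as below. The function name `weak_label` and the parameter `num_negatives` are hypothetical; the paper does not state how many low-scoring clips are kept as negatives.

```python
def weak_label(scores, num_negatives=2, protect_top=3):
    """Derive weak labels for one unlabeled discussion.

    scores: sim(d, c) for every clip of the video, indexed by clip position.
    Returns (positive_index, negative_indices): the top-1 clip as the positive
    and the lowest-scoring clips outside the top `protect_top` as negatives.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    positive = order[0]
    candidates = order[protect_top:]  # clips ranked below the protected top-3
    negatives = candidates[-num_negatives:] if num_negatives > 0 else []
    return positive, list(reversed(negatives))  # lowest score first


print(weak_label([0.1, 0.9, 0.5, 0.3, 0.7, 0.2]))
```

Videos with three or fewer clips yield no negatives under this rule, so such discussions would contribute only in-batch negatives, or would need to be skipped.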

At inference time, we pre-compute all clip embeddings $v_c$ by applying the clip encoder $E_C$ to all MOOC video clips offline. Given a discussion $d = [q, a]$ at run-time, we concatenate the question and the answers if $a$ is available and compute the discussion embedding $v_d = E_Q(d)$. The top-$k$ clips with the highest $\mathrm{sim}(d, c)$ scores are retrieved.

Although encoders can be implemented in many different ways [10], in this work, we use two independent BERT [4] variant models as encoders and the mean value of all token embeddings is used as the final representation. We tokenize clip transcripts and truncate the token list to max length 512 (starts with [CLS] and ends with [SEP] token). In this research, the discussion encoder works as the ‘query’ encoder in typical neural IR systems. Instead of using separate encoders for questions and answers of the discussion, in our design both of them share the same encoder. In this way, we train a better query encoder for questions by taking advantage of answer information.
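Mean pooling over token embeddings can be sketched as follows. The sketch takes token vectors as plain lists and, as an assumption not stated in the paper, skips padding positions indicated by an attention mask.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of the tokens whose mask entry is non-zero."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if count == 0:
        return totals  # all positions masked: return the zero vector
    return [t / count for t in totals]


# Two real tokens and one padding token.
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
```

The pooled vector serves as the fixed-size representation of a clip or a discussion, regardless of how many tokens the input contains.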

3.2 Cross-Encoder

Cross-encoders and dual-encoders are two common approaches for matching sentence pairs. In contrast to the dual-encoder, which produces embedding vectors for clips and discussions independently, the cross-encoder treats clip recommendation for discussions as a sequence classification task and performs full self-attention over the whole input sequence. We concatenate the video clip transcript and the discussion (question and answers) with the [SEP] token as the input to the transformer network. The [CLS] token embedding is then passed to a binary classifier to predict the relevance between them. To find the most relevant MOOC clip for a question, the cross-encoder must compute the relevance score for every question-clip pair, which leads to massive computational overhead.


4.1 Experiment Settings

We implement the dual encoders using pre-trained weights of BERT variants: MPNet [16] (embedding size: 768) and MiniLM [18] (embedding size: 384), provided by the Sentence-Transformers library [15]. Both models are pre-trained on a large and diverse dataset of over 1 billion query-paragraph training pairs for the semantic search task. The Adam optimizer [8] with warm-up and a cosine schedule is used for training; we set the maximum learning rate to lr = 2e-5, 𝜖 = 1e-8, and the number of warm-up steps to 1000. For the cross-encoder baseline, we follow previous research [10, 12]. We use the same base models as for the dual encoder, i.e. the 6-layer MiniLM and MPNet. The BM25 baseline is based on the Okapi BM25 implementation of the rank_bm25 library. We train our models using 8 GTX-1080 GPUs for 10 iterations with a batch size of 32. As Figure 3 shows, after one iteration, both clip recommendation systems outperform the BM25 baseline. Over the following iterations, model performance first improves gradually and then becomes steady, which shows the effectiveness of the training procedure and of the proposed models.
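A learning-rate schedule with warm-up followed by cosine decay can be sketched as follows. The maximum learning rate and the number of warm-up steps are taken from the text; the linear shape of the warm-up, the decay to zero, and the `total_steps` argument are assumptions.

```python
import math


def learning_rate(step, total_steps, max_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to max_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 5500, 10000):
    print(step, learning_rate(step, total_steps=10000))
```

The value returned for each step would be passed to the optimizer before the corresponding parameter update.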

Table 3: Performance of the proposed MOOC-Rec ranker and baselines on the test set in terms of rank-aware metrics. "MLM dual" and "MPNet dual" denote the MiniLM and MPNet based dual encoders; "MLM cross" and "MPNet cross" denote the MiniLM and MPNet based cross encoders. "PT" represents ranker performance using pre-trained encoders without fine-tuning. "FT" means fine-tuned model performance. "WL" means the model performance after training with weakly labeled data.
BM25 0.417 0.600 0.550 0.696 0.593
PT
MLM cross 0.132 0.346 0.254 0.497 0.297
MLM dual 0.422 0.614 0.568 0.707 0.617
MPNet cross 0.135 0.344 0.248 0.495 0.288
MPNet dual 0.386 0.583 0.529 0.683 0.576
FT
MLM cross 0.511 0.677 0.641 0.755 0.683
MLM dual 0.529 0.692 0.658 0.767 0.700
MPNet cross 0.613 0.745 0.716 0.807 0.750
MPNet dual 0.570 0.720 0.690 0.788 0.730
WL
MLM cross 0.540 0.696 0.661 0.770 0.700
MLM dual 0.520 0.683 0.646 0.760 0.687
MPNet cross 0.625 0.751 0.722 0.812 0.754
MPNet dual 0.557 0.711 0.680 0.782 0.720
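For readers unfamiliar with rank-aware metrics, the following sketch computes Precision@1, MRR and nDCG@k under the assumption that each discussion has exactly one relevant clip with binary relevance (as produced by timestamp labeling). The cutoff `k` and the function names are illustrative; the exact metric definitions used for the table are not restated here.

```python
import math


def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked clip is the relevant one, else 0.0."""
    return 1.0 if ranked and ranked[0] == relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the relevant clip, or 0.0 if it is not in the list."""
    for position, clip in enumerate(ranked, start=1):
        if clip == relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with one relevant clip: the ideal DCG is 1, so nDCG = DCG."""
    for position, clip in enumerate(ranked[:k], start=1):
        if clip == relevant:
            return 1.0 / math.log2(position + 1)
    return 0.0


def evaluate(runs, k=3):
    """runs: list of (ranked_clip_ids, relevant_clip_id) pairs."""
    n = len(runs)
    return {
        "P@1": sum(precision_at_1(r, rel) for r, rel in runs) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "nDCG@k": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


print(evaluate([([2, 0, 1], 2), ([0, 1, 2], 1)]))
```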

4.2 Effectiveness of Dense Retrieval

Table 3 summarizes model performance on the test set. We use BM25 as the baseline, since sparse vector-space retrieval methods such as TF-IDF and BM25 have been widely used in instructional clip recommendation systems. Its performance in terms of Precision@1 (P@1) and MRR is 0.417 and 0.600 respectively, which shows that queries share more vocabulary with their related MOOC clips than with other clips, and that BM25 is a strong baseline for this application. First, we find that without fine-tuning, the pre-trained dual encoders achieve similar (MPNet) or even better (MiniLM-L6) performance compared with the BM25 baseline, while the cross-encoders cannot make clip recommendations for discussions without training. Second, we observe significant gains (p = 1.95e-7) when using the MOOC-Rec neural ranker after it has been trained on the data, with gains of over 0.15 in P@1 and over 0.19 in nDCG scores compared to the BM25 baseline on the test set. Thus, dense retrieval is an effective approach to instructional MOOC clip recommendation for forum discussions, as it can model the relevance between discussions and clip transcripts.
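To illustrate what the lexical baseline measures, here is a self-contained BM25 scoring sketch. It is not the rank_bm25 implementation used in the experiments: the IDF variant (a smoothed, non-negative form), the parameter values `k1` and `b`, and whitespace tokenization are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # number of documents containing each term
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


clips = ["the slope of a line", "cell division in biology", "the derivative is a slope"]
question = "why does cell division stop"
print(bm25_scores(question.split(), [c.split() for c in clips]))
```

Only clips that share terms with the question receive a non-zero score, which is why paraphrased questions can be missed by a purely lexical ranker.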

To compare the impact of model size, we use one distilled transformer model, MiniLM, which contains 22M parameters, and one BERT-sized model, MPNet, which contains 109M parameters. As Table 3 shows, in both the cross-encoder and the dual-encoder setting, the larger model MPNet achieves better performance after training, which suggests that a transformer model with more parameters has more capacity to model the relevance between clips and discussions.

Both cross-encoders and dual-encoders are commonly used for sentence pair matching problems. In Table 3, we observe that with the small transformer model the dual-encoder outperforms the cross-encoder by around 0.01, while with the large pre-trained model the cross-encoder outperforms the dual-encoder by 0.043 on P@1 and by around 0.02 on the other metrics. Despite the performance advantage of the cross-encoder with the large model, we observe massive computational overhead with the cross-encoder. The dual-encoder is more time-efficient because the MOOC course clips can be encoded in advance; each time the clip recommender makes a prediction, it only needs to encode the discussion and compute the cosine similarity between the discussion vector and the stored MOOC clip embeddings.

In the "WL" section of Table 3, we summarize the performance of the different models after distant training with weakly labeled data. Cross-encoders perform better (+0.029 for MiniLM and +0.012 for MPNet in terms of P@1), but dual-encoders perform worse (-0.009 for MiniLM and -0.013 for MPNet in terms of P@1). One explanation is that, although MOOC-Rec achieves good performance after initial training, the weakly labeled data created with it still contain considerable noise. The dual-encoders are trained by optimizing a listwise loss and are more likely to be affected by the noisy data.


We studied the task of video clip recommendation in the context of MOOC forums, with the eventual goal of reducing learners' information overload. We created a novel dataset including video transcripts and discussions, systematically investigated how to incorporate state-of-the-art pre-trained neural IR models for MOOC clip recommendation, and proposed a framework including data preparation, useful question classification, clip ranking and weak supervision training for this task. We conducted experiments with both cross-encoders and dual-encoders. The results on our dataset show the effectiveness of the proposed method. In future work, on the one hand, we will further analyze the intentions behind forum discussions and their relevance to course videos. On the other hand, we will investigate factors that affect MOOC-Rec performance, such as the clip duration and the methods for creating weak labels.


This research is partly supported by the China Scholarship Council (CSC).


  1. P. Adamopoulos. What makes a great mooc? an interdisciplinary analysis of student retention in online courses. 2013.
  2. A. Agrawal, J. Venkatraman, S. Leonard, and A. Paepcke. Youedu: addressing confusion in mooc discussion forums by recommending instructional video clips. 2015.
  3. G. Chen, J. Yang, C. Hauff, and G.-J. Houben. Learningq: a large-scale dataset for educational question generation. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12, 2018.
  4. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  5. M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017.
  6. V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
  7. O. Khattab and M. Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020.
  8. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  9. B. Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
  10. J. Lin, R. Nogueira, and A. Yates. Pretrained transformers for text ranking: Bert and beyond. arXiv preprint arXiv:2010.06467, 2020.
  11. B. Mitra, N. Craswell, et al. An introduction to neural information retrieval. Now Foundations and Trends, 2018.
  12. R. Nogueira and K. Cho. Passage re-ranking with bert. arXiv preprint arXiv:1901.04085, 2019.
  13. A. Ntourmas, N. Avouris, S. Daskalaki, and Y. Dimitriadis. Evaluation of a massive online course forum: design issues and their impact on learners’ support. In IFIP conference on human-computer interaction, pages 197–206. Springer, 2019.
  14. A. Ntourmas, S. Daskalaki, Y. Dimitriadis, and N. Avouris. Classifying mooc forum posts using corpora semantic similarities: a study on transferability across different courses. Neural Computing and Applications, pages 1–15, 2021.
  15. N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
  16. K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
  17. P. Trirat, S. Noree, and M. Y. Yi. Intellimooc: Intelligent online learning framework for mooc platforms. In Proceedings of The 13th International Conference on Educational Data Mining (EDM 2020), pages 682–685, 2020.
  18. W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
  19. D. A. Wiley and E. K. Edwards. Online self-organizing social systems: The decentralized future of online learning. Quarterly review of distance education, 3(1):33–46, 2002.
  20. D. Yang, M. Wen, I. Howley, R. Kraut, and C. Rose. Exploring the effect of confusion in discussion forums of massive open online courses. In Proceedings of the second (2015) ACM conference on learning@ scale, pages 121–130, 2015.


Figure 3: System performance along each training iteration (axes: iteration, MRR).
Figure 4: Performance along different topics.





© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.