In this work, we address the information overload that learners in Massive Open Online Courses (MOOCs) face when they try to close knowledge gaps through MOOC discussion forums. To this end, we investigate recommending one-minute-resolution video clips based on the textual similarity between captions and MOOC discussions. We first create a large-scale dataset from Khan Academy video transcripts and their forum discussions. We then investigate the effectiveness of pre-trained transformer-based neural retrieval models for producing a ranked list of video clips for a forum discussion. The retrieval models are trained with supervised learning and distant supervision to leverage the unlabeled data (over 80% of all data). Experimental results demonstrate that the proposed method is effective for this task, outperforming the baseline by 0.208 in terms of precision. To the best of our knowledge, this is the first systematic study applying pre-trained transformer-based ranking models to MOOC clip recommendation.
Massive Open Online Courses (MOOCs) provide the public with open access to world-class courses, which greatly improves opportunities for online learning. The discussion forum is a major component of a MOOC, as it is the primary communication platform among learners and instructors and compensates for the lack of physical access in MOOCs. It can help learners build a sense of belonging and learn from peers, and help instructors monitor learner affect and academic progress. However, questions targeting the same video content are scattered among discussion threads, so without supporting navigation facilities learners cannot effectively retrieve valuable discussions about a particular piece of content. In addition, learners' posts seeking help may be drowned out by the many competing posts, making it hard for learners to get attention from instructors and peers. Unstructured, unorganized forums with a large number of discussions (leading to information overload) keep instructors and learners from these benefits, inhibit community interaction, reduce responsiveness in forums, and in the end lead to low MOOC retention rates [20, 13].
Existing work on the information overload issue in MOOC forums has proposed more effective navigation tools that identify instructional video content and recommend a ranked list of video clips. For example, Agrawal et al. [2] classify posts that need help and employ bag-of-words based retrieval techniques to map them to minute-resolution course video clips. Their clip recommendation algorithm is evaluated on posts from one course. Trirat et al. [17] built a recommender system that uses a deep neural network to generate a ranked list of video clips given a student's question; they evaluate the system with 50 questions. However, prior work on video clip recommendation suffers from a lack of training data and reports evaluations on small-scale data. It remains a challenge to develop and evaluate a system that can scale to thousands of MOOCs across different domains.
In this work, we first address the lack of training data by creating MOOC-CLIP, a novel large-scale dataset from Khan Academy that includes video captions and forum posts (both questions and answers), using raw data available from LearningQ [3], an open source tool and dataset for educational question generation. Second, we propose MOOC-Rec, a dense retrieval based instructional video clip recommendation system for MOOC forum questions. For each content-related thread, MOOC-Rec recommends a ranked list of video clips that are likely to be relevant and helpful for answering the question. Although dense retrievers such as DPR [6] and ColBERT [7] have been applied to various retrieval tasks, it is unknown whether dense retrieval is an effective approach for MOOC video clip recommendation. Furthermore, only 11.57% of all discussions are labeled with a target video clip, which poses the challenge of training with limited labeled data and abundant unlabeled resources. In this paper, we first investigate the effectiveness of dense retrieval for this task, and then we address the scarcity of labeled data by using distant supervision and in-batch negatives to train the ranker.
To the best of our knowledge, this is the first work that systematically investigates applying state-of-the-art pre-trained transformer-based neural ranking models to the MOOC clip recommendation problem. Comprehensive experiments on our large-scale dataset show that our systems significantly improve clip recommendation performance.
2. THE MOOC-CLIP DATASET
To address the lack of research data, we create a large-scale dataset using raw data crawled with LearningQ [3] from Khan Academy, a MOOC platform which allows learners to ask and answer questions about the learning materials while they learn. We keep the video transcripts, forum questions and answers of MOOCs which have both transcripts and discussions.
Learners use discussion forums in different ways. Besides asking questions related to the course materials, they may also discuss irrelevant topics, such as socializing, posting spam, or expressing appreciation for the video. Some questions posted by learners also lack proper context or are too generic. Therefore, it is necessary to remove the relatively low-quality questions. Following LearningQ, we consider a user-generated question to be useful for learning when all of the following conditions hold: (i) the question is concept-relevant, i.e., it seeks information on knowledge concepts taught in lecture videos or articles; (ii) the question is context-complete, which means sufficient context information is provided to enable other learners to answer the question; and (iii) the question is not generic (e.g., a question asking for learning advice). In total, we labeled 13,290 questions over 8 topics. We found that 60.9% of them are labeled as useful and 39.1% as non-useful. We hold out all items belonging to 3 topics (2344 in total) as the unknown-topic set for Cross-Topics evaluation. We use 8766 questions on the other 5 topics for training, and the remaining 2186 as the known-topic test set. We train a BERT-based classifier on this data to identify useful questions.
During preprocessing, we first remove noisy discussions which contain only meaningless tokens, and videos which have no discussions. Then we apply the useful question classifier to all items and retain only the items classified as useful. In the end, we retain 273,887 discussions from 7,349 videos and forums across 1991 MOOC courses. We use regular expressions to retrieve discussions in which learners cite exact timestamps in questions or answers. We split video transcripts into snippets with a length of 1 minute. The discussions and the snippets which cover the cited timestamp are labeled as positive items. Other discussions are treated as unlabeled. Table 2 shows the labeled and unlabeled data statistics. In summary, there are 31,680 positive labeled items and 240,551 unlabeled items, i.e. 11.57% of all discussions are labeled.
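The labeling step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the regular expression, the function names, and the `(start_second, text)` caption format are all hypothetical.

```python
import re

# Hypothetical pattern (an assumption): matches "m:ss" or "h:mm:ss" timestamps.
TIMESTAMP_RE = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):([0-5]\d)\b")


def find_timestamps(text):
    """Return every timestamp mentioned in `text`, converted to seconds."""
    seconds = []
    for hours, minutes, secs in TIMESTAMP_RE.findall(text):
        seconds.append(int(hours or 0) * 3600 + int(minutes) * 60 + int(secs))
    return seconds


def split_into_snippets(captions, clip_length=60):
    """Group (start_second, text) caption lines into fixed-length snippets."""
    snippets = {}
    for start, text in captions:
        snippets.setdefault(int(start) // clip_length, []).append(text)
    return {idx: " ".join(parts) for idx, parts in sorted(snippets.items())}


def label_discussion(text, video_duration, clip_length=60):
    """Return the indices of the clips covering the cited timestamps.

    An empty list means the discussion is treated as unlabeled.
    """
    positives = {
        t // clip_length for t in find_timestamps(text) if t < video_duration
    }
    return sorted(positives)
```

A pattern this simple will also match ratios such as "3:45" in a math question, so a real pipeline would likely need extra filtering; the sketch only shows how a cited timestamp maps to a one-minute snippet index.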
This dataset also covers a range of educational topics including math, science, careers, humanities, etc. We conduct an exploratory analysis along the topic dimension, which is shown in Fig 2. We observe a topic imbalance, e.g. discussions under the math and science topics account for 78.88% of labeled items and 76.82% of all items. The labeled data is then split into training and test sets containing 80% and 20% of the discussions respectively.
The problem of MOOC video clip recommendation studied in this paper can be described as follows. Given a forum discussion question, the system should return a ranked list of the most relevant video clips, each represented by its captions. We assume that the questions kept by the content-relevant question classifier are relevant to the course materials, and that the most relevant video clips are instructional for learners. Assume a MOOC video $v$ lasts for $T$ seconds; we split it into $t$-second clips, where $t \le T$. We choose to split each video into one-minute-resolution clips, i.e. we set $t = 60$. The video then contains $N = \lceil T/t \rceil$ clips $C = \{c_1, \dots, c_N\}$. Each clip $c_i$ is represented by its caption, which can be viewed as a sequence of tokens $w_1, \dots, w_{|c_i|}$. A discussion is given as $d = (q, A)$, where $A$ denotes the answers to the question $q$; in some cases the question has not been answered yet ($A = \emptyset$), which is common in MOOC forums. The task is then to produce a ranked list of clips for each discussion $d$. Notice that the corpus covers courses from different domains, and answers are not always available. As a result, the video clip recommender needs to work effectively for MOOCs in different domains and for unanswered questions. Formally, the recommender is a function $f(d, C)$ that takes a discussion $d$ and the video clip list $C$ as input and returns a ranked list of the clips in $C$. We can also choose to return only the top-$k$ most relevant clips.
We use the neural IR architecture for the ranker. It uses a dense clip encoder $E_C(\cdot)$ which encodes the video clip transcripts into $h$-dimensional real-valued vectors. At run-time, MOOC-Rec maps the input discussion $d$ to an $h$-dimensional vector using the query encoder $E_Q(\cdot)$, and retrieves the top-$k$ closest video clip vectors. We define the similarity between the clip and the discussion using the following function of their vectors:

$$\mathrm{sim}(d, c) = \cos\big(E_Q(d), E_C(c)\big) = \frac{E_Q(d)^{\top} E_C(c)}{\lVert E_Q(d) \rVert \, \lVert E_C(c) \rVert}$$

That is, in this paper cosine similarity is used to model the similarity between the discussion and the clip vectors.
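Ranking clips by cosine similarity can be illustrated with a small, dependency-free sketch. The function names are illustrative assumptions, the vectors are assumed to be non-zero, and the toy two-dimensional vectors in the usage note stand in for real encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_clips(discussion_vec, clip_vecs, k=3):
    """Return (clip_index, score) pairs for the k most similar clips."""
    scored = [
        (idx, cosine_similarity(discussion_vec, vec))
        for idx, vec in enumerate(clip_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with clip vectors `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` and a discussion vector `[1.0, 0.1]`, `top_k_clips(..., k=2)` ranks clip 0 first and clip 2 second. At realistic corpus sizes the same ranking would normally be computed with a vectorized matrix product or an approximate nearest neighbor index rather than a Python loop.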
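The goal of training is to learn better embedding functions for both the clips and the discussions, which map relevant pairs of discussions and clips to vectors with smaller distance, i.e. higher similarity, so that the similarity function becomes a good ranking function for MOOC video clip recommendation. This is essentially a metric learning problem [9, 11, 6]. Let $\mathcal{D} = \{\langle d_i, c_i^{+}, c_{i,1}^{-}, \dots, c_{i,n}^{-} \rangle\}_{i=1}^{m}$ be the training MOOC discussion corpus that contains $m$ instances. Each example has one discussion $d_i$, one relevant (positive) video clip transcript $c_i^{+}$, and $n$ irrelevant (negative) clips $c_{i,j}^{-}$. We train the retrieval model by optimizing the negative log likelihood of the positive clip:

$$L(d_i, c_i^{+}, c_{i,1}^{-}, \dots, c_{i,n}^{-}) = -\log \frac{e^{\mathrm{sim}(d_i, c_i^{+})}}{e^{\mathrm{sim}(d_i, c_i^{+})} + \sum_{j=1}^{n} e^{\mathrm{sim}(d_i, c_{i,j}^{-})}}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the similarity function defined above.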
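To make the objective concrete, the negative log likelihood of the positive clip can be computed for a single training example as in the sketch below. It assumes a softmax over the positive and negative similarity scores, as in dense passage retrieval [6]; the function name is illustrative, and no temperature or scaling factor is applied.

```python
import math


def nll_of_positive(pos_score, neg_scores):
    """Negative log likelihood of the positive clip under a softmax.

    pos_score:  similarity between the discussion and its positive clip.
    neg_scores: similarities between the discussion and its negative clips.
    """
    scores = [pos_score] + list(neg_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores)
    log_denominator = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_denominator - pos_score
```

The loss shrinks as the positive score rises above the negative scores: `nll_of_positive(0.9, [0.1, 0.2])` is smaller than `nll_of_positive(0.1, [0.9, 0.2])`, and with one equally scored negative the loss equals `log 2`.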
For labeled discussions, positive and negative video examples are explicit. We use the video clip whose time span contains the timestamp of the discussion as the positive example. All other video clips from the same video could be treated as negatives. However, MOOC videos vary in their number of clips, and we also want to speed up model training and balance the number of positive and negative examples, so we select only a fixed number of them as training negatives. As in other retrieval tasks, the selection of negative clips is essential for training a high-quality ranker. We apply in-batch negatives [5, 6] for training, so the positive clips of the other questions in a batch are also treated as negatives.
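In-batch negatives can be sketched with a square similarity matrix in which row i holds the scores of discussion i against the positive clip of every discussion in the batch. The diagonal entries are then the positives and the off-diagonal entries act as negatives. This is an illustrative sketch under that assumption, not the training code used in our experiments.

```python
import math


def in_batch_negative_loss(sim_matrix):
    """Mean softmax loss over a batch using in-batch negatives.

    sim_matrix[i][j] is the similarity between discussion i and the
    positive clip of discussion j, so sim_matrix[i][i] is the positive pair.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        max_score = max(row)  # stabilizes the exponentials
        log_denominator = max_score + math.log(
            sum(math.exp(s - max_score) for s in row)
        )
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)
```

With a batch size of B, each discussion is contrasted with B - 1 additional clips at no extra encoding cost. One caveat of this scheme is that two discussions in the same batch may share a positive clip, in which case that clip is wrongly counted as a negative.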
As we show in Table 2, over 80% of all discussions are unlabeled. It would be labor-intensive and expensive to obtain human annotations for them. We adopt "distant supervision" to make effective use of the rich unlabeled data and train a better model with it. This process involves training the model with noisy, "weakly" labeled data. MOOC-Rec is able to achieve over 50% precision on the top-1 prediction and over 70% in the top-3, with Recall@3 over 80%. Therefore, we use the ranker trained on the labeled training set as the scorer, use the top-1 clips with the highest similarity scores as positives, and the clips with the lowest similarity scores (excluding the top-3) as negatives. The weakly labeled data are then used to train the ranker.
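The weak labeling rule can be sketched as below. The function name and the `num_negatives` parameter are hypothetical, since the number of negatives drawn per discussion is not specified here; `protected_top=3` mirrors the rule that negatives are never taken from the top-3 clips.

```python
def make_weak_labels(scores, num_negatives=2, protected_top=3):
    """Derive a weak positive and weak negatives from ranker scores.

    scores: one similarity score per clip of the video, produced by the
            ranker trained on the labeled set.
    Returns (positive_index, negative_indices).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positive = ranked[0]
    # Clips ranked in the top-3 are never used as negatives.
    candidates = ranked[protected_top:]
    negatives = candidates[-num_negatives:] if num_negatives > 0 else []
    return positive, negatives
```

For scores `[0.9, 0.1, 0.5, 0.3, 0.7, 0.2]`, clip 0 becomes the weak positive and clips 5 and 1, the two lowest-scoring clips outside the top-3, become the weak negatives. Videos with three or fewer clips yield no negatives under this rule.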
At inference time, we pre-compute all clip embeddings offline by applying the clip encoder to all MOOC video clips. Given a discussion at run-time, we concatenate the question with its answers, if any are available, and compute the discussion embedding. The top-$k$ clips with the highest similarity scores are then retrieved.
Although encoders can be implemented in many different ways, in this work we use two independent BERT [4] variant models as encoders, and the mean value of all token embeddings is used as the final representation. We tokenize clip transcripts and truncate the token list to a maximum length of 512 (starting with the [CLS] token and ending with the [SEP] token). In this research, the discussion encoder works as the 'query' encoder in typical neural IR systems. Instead of using separate encoders for the questions and answers of a discussion, in our design both of them share the same encoder. In this way, we train a better query encoder for questions by taking advantage of the answers.
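The truncation and mean pooling steps can be sketched without any deep learning library. The sketch assumes that padding positions are excluded from the mean through an attention mask, which is a common convention but is our assumption here; token strings and function names are illustrative.

```python
def truncate_with_special_tokens(tokens, max_length=512,
                                 cls="[CLS]", sep="[SEP]"):
    """Truncate a token list so that it fits max_length including
    the leading [CLS] and trailing [SEP] tokens."""
    return [cls] + tokens[: max_length - 2] + [sep]


def mean_pool(token_embeddings, attention_mask):
    """Average the token embeddings whose mask entry is 1.

    token_embeddings: list of equal-length vectors, one per token.
    attention_mask:   list of 0/1 flags; 0 marks padding (an assumption).
    At least one token must be unmasked.
    """
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    return [total / count for total in totals]
```

For instance, pooling `[[1, 2], [3, 4], [100, 100]]` with mask `[1, 1, 0]` gives `[2.0, 3.0]`, since the padded third position is ignored.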
Cross-encoders and dual-encoders are two common approaches for matching sentence pairs. In contrast to the dual-encoder, which produces sentence embedding vectors for clips and discussions independently, the cross-encoder treats clip recommendation for discussions as a sequence classification task and performs full self-attention over the whole sequence. We concatenate the video clip transcripts and the discussions (question and answers) with the [SEP] token as the input to the transformer network. The [CLS] token embedding is then passed to a binary classifier to predict the binary relevance between them. To find the most relevant MOOC clip for a question, the encoder has to compute the relevance score for all question-clip pairs, which leads to massive computational overhead.
4. EXPERIMENTS AND RESULTS
4.1 Experiment Settings
We implement dual encoders using pre-trained weights of BERT variants: MPNet [16] (embedding size: 768) and MiniLM [18] (embedding size: 384), provided by the Sentence-Transformers library. Both models are pre-trained on a large and diverse dataset of over 1 billion query-paragraph training pairs for the semantic search task. The Adam optimizer [8] with warm-up and a cosine schedule is used for training; we fix a maximum learning rate and set the warmup steps to 1000. For the cross-encoder baseline, we follow previous research [10, 12]. We use the same base models as for the dual encoder, i.e. the 6-layer MiniLM and MPNet. The BM25 baseline is based on an existing Okapi BM25 implementation.
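A warm-up followed by a cosine schedule can be sketched as a function of the training step. The linear shape of the warm-up, the `total_steps` value, and the `max_lr` argument are assumptions for illustration; only the 1000 warmup steps come from our settings.

```python
import math


def learning_rate(step, max_lr, warmup_steps=1000, total_steps=10000):
    """Linear warm-up to max_lr, then cosine decay to zero.

    total_steps is a hypothetical training length used for illustration.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `max_lr=1.0`, the rate is 0.5 halfway through the warm-up, peaks at 1.0 at step 1000, falls to 0.5 at the midpoint of the decay phase, and reaches zero at `total_steps`.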
We train our model using 8 GTX-1080 GPUs for 10 iterations with a batch size of 32. As Figure 3 shows, after one iteration both clip recommendation systems outperform the BM25 baseline. Over the following iterations, model performance first improves gradually and then becomes steady, which shows the effectiveness of the training procedure.
4.2 Effectiveness of Dense Retrieval
Table 3 summarizes model performance on the test set. We use BM25 as the baseline, since popular sparse vector-space retrieval methods such as TF-IDF and BM25 have been widely used in instructional clip recommendation systems. Its performance in terms of Precision@1 (P@1) and MRR is 0.417 and 0.60 respectively, which shows that queries share more lexical similarity with related MOOC clips than with others, and that BM25 is a strong baseline for this application. First, we find that without fine-tuning, the pre-trained dual encoders achieve similar (MPNet) or even better (MiniLM-L6) performance compared with the BM25 baseline, while the cross-encoders cannot recommend clips for discussions without training. Second, we observe significant gains when using the MOOC-Rec neural ranker after it has been trained on the data, with gains of over 0.15 in P@1 and over 0.19 in nDCG compared to the BM25 baseline on the test set. Thus, dense retrieval is an effective approach to instructional MOOC clip recommendation for forum discussions, as it can model the relevance between discussions and clips.
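For reference, the ranking metrics reported here can be computed per discussion as in the following sketch and then averaged over the test set. It assumes binary relevance (a clip is either the labeled target or not); the function names and the nDCG cutoff `k` are illustrative assumptions.

```python
import math


def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked clip is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant clip, or 0.0 if none is retrieved."""
    for position, clip in enumerate(ranked, start=1):
        if clip in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, clip in enumerate(ranked[:k], start=1)
        if clip in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(p + 1) for p in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR is the mean of `reciprocal_rank` over all test discussions. If the labeled clip is ranked second, for example, P@1 is 0, the reciprocal rank is 0.5, and nDCG is `1 / log2(3)`.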
To compare the impact of model size, we use one distilled transformer model, MiniLM, and one BERT-size model, MPNet. As Table 3 shows, in both the cross-encoder and the dual-encoder setting, the larger model MPNet achieves better performance after training, which suggests that a transformer model with more parameters may have better potential to model the relevance between clips and discussions.
Both cross-encoders and dual-encoders are commonly used for sentence pair matching problems. In Table 3, we observe that with the small transformer model the dual-encoder outperforms the cross-encoder by around 0.01, while with the large pre-trained model the cross-encoder outperforms the dual-encoder by 0.043 on P@1 and by around 0.02 on the other metrics. Despite the performance advantage of the cross-encoder with the large model, we observe massive computational overhead with the cross-encoder. The dual-encoder is more time-efficient because the MOOC course clips can be encoded in advance: each time the clip recommender makes a prediction, it only needs to encode the discussion and compute the cosine similarity between the discussion vector and the stored MOOC clip embeddings.
In the "WL" section of Table 3, we summarize the performance of the different models after distant training with weakly labeled data. Cross-encoders perform better (+0.029 for MPNet in terms of P@1), but both dual-encoders, MiniLM and MPNet, perform worse (-0.013 for MPNet in terms of P@1). One explanation is that, although MOOC-Rec achieves good performance after the initial training, the weakly labeled data created with it still contain considerable noise. The dual-encoders are trained by optimizing a listwise loss and are more likely to be affected by the noisy data.
5. CONCLUSION AND FUTURE RESEARCH
We studied the task of video clip recommendation in the context of MOOC forums, with the eventual goal of reducing learners' information overload. We created a novel dataset including video transcripts and discussions, systematically investigated how to incorporate state-of-the-art pre-trained neural IR models for MOOC clip recommendation, and proposed a framework including data preparation, useful question classification, a clip ranker, and weak supervision training for this task. We conducted experiments with both cross-encoders and dual-encoders. The results on our dataset show the effectiveness of the proposed method. In future work, on the one hand, we will further analyze the intentions of forum discussions and their relevance to course videos. On the other hand, we will investigate factors that affect recommendation performance, such as the clip duration and the methods of creating training data.
This research is partly supported by the China Scholarship Council (CSC).
[1] P. Adamopoulos. What makes a great MOOC? An interdisciplinary analysis of student retention in online courses. 2013.
[2] A. Agrawal, J. Venkatraman, S. Leonard, and A. Paepcke. YouEDU: Addressing confusion in MOOC discussion forums by recommending instructional video clips. 2015.
[3] G. Chen, J. Yang, C. Hauff, and G.-J. Houben. LearningQ: A large-scale dataset for educational question generation. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12, 2018.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017.
[6] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[7] O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] B. Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
[10] J. Lin, R. Nogueira, and A. Yates. Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467, 2020.
[11] B. Mitra, N. Craswell, et al. An introduction to neural information retrieval. Now Foundations and Trends, 2018.
[12] R. Nogueira and K. Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.
[13] A. Ntourmas, N. Avouris, S. Daskalaki, and Y. Dimitriadis. Evaluation of a massive online course forum: Design issues and their impact on learners' support. In IFIP Conference on Human-Computer Interaction, pages 197–206. Springer, 2019.
[14] A. Ntourmas, S. Daskalaki, Y. Dimitriadis, and N. Avouris. Classifying MOOC forum posts using corpora semantic similarities: A study on transferability across different courses. Neural Computing and Applications, pages 1–15, 2021.
[15] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[16] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
[17] P. Trirat, S. Noree, and M. Y. Yi. IntelliMOOC: Intelligent online learning framework for MOOC platforms. In Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020), pages 682–685, 2020.
[18] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
[19] D. A. Wiley and E. K. Edwards. Online self-organizing social systems: The decentralized future of online learning. Quarterly Review of Distance Education, 3(1):33–46, 2002.
[20] D. Yang, M. Wen, I. Howley, R. Kraut, and C. Rose. Exploring the effect of confusion in discussion forums of massive open online courses. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, pages 121–130, 2015.
© 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.