To Scale or Not to Scale: Comparing Popular Sentiment Analysis Dictionaries on Educational Twitter Data
Abstract: Extracting sentiment from text requires many methodological decisions before researchers can draw valid conclusions about learners' engagement, mood, and opinion in informal learning contexts. To evaluate popular sentiment measures, this paper compares four sentiment analysis tools (SentiStrength, LIWC, tidytext, VADER) on N = 1,382,493 short informal posts on Twitter (i.e., tweets) in the context of the Next Generation Science Standards reform (N = 546,267) and U.S. State Educational Twitter Hashtags (N = 836,226). The accuracy of the automated sentiment classifications was validated against a sample of N = 300 hand-coded tweets. Additionally, we developed a discrepancy measure that quantifies the consistency between sentiment scales and used it to identify tweet features that drive disagreement. Our validation analyses indicated that binary sentiment classifications (positive/neutral vs. negative) were more accurate than trinary classifications (positive, neutral, negative). Of the tested tools, combined tidytext dictionaries and VADER outperformed SentiStrength and LIWC for negative sentiment, which was difficult to classify reliably across all tools; positive sentiment, by contrast, was classified with high accuracy throughout. This study suggests that researchers should critically reflect on their use of sentiment analysis methods to investigate negativity in data and (a) consider employing overall sentiment scales or a positive/neutral-to-negative ratio based on binary classification to characterize their sample, (b) aggregate multiple dictionaries or create their own domain-specific sentiment dictionaries, and (c) be aware of the current limitations of detecting negative sentiment.
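To make the binary classification and between-tool comparison concrete, the sketch below shows one way such an analysis might look in Python. It is an illustration, not the paper's actual pipeline: it assumes the vaderSentiment package and its conventional compound-score cutoff of ±0.05, and the example tweets, the second tool's labels, and the simple agreement rate are all hypothetical stand-ins for the study's hand-coded validation sample and discrepancy measure.

```python
# Minimal sketch: binary (positive/neutral vs. negative) sentiment labels
# with VADER, a positive/neutral-to-negative ratio, and a simple pairwise
# agreement rate between two tools. Assumes `pip install vaderSentiment`.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def binary_label(text: str, neg_threshold: float = -0.05) -> str:
    """Collapse VADER's trinary output into positive/neutral vs. negative.

    The -0.05 cutoff follows the conventional VADER compound-score
    threshold for negative sentiment; it is not the paper's calibration.
    """
    compound = analyzer.polarity_scores(text)["compound"]
    return "negative" if compound <= neg_threshold else "positive/neutral"

def agreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of tweets on which two tools assign the same binary label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical tweets standing in for the educational Twitter data.
tweets = [
    "Loving the new NGSS-aligned lesson plans!",        # clearly positive
    "Grading policy changes announced for next term.",  # neutral report
    "This standardized testing schedule is a mess.",    # clearly negative
]
vader_labels = [binary_label(t) for t in tweets]

# Positive/neutral-to-negative ratio as in recommendation (a).
ratio = vader_labels.count("positive/neutral") / max(
    vader_labels.count("negative"), 1
)

# Hypothetical labels from a second, dictionary-based tool.
other_labels = ["positive/neutral", "positive/neutral", "positive/neutral"]

print(vader_labels)
print(f"pos/neutral-to-negative ratio: {ratio:.2f}")
print(f"between-tool agreement: {agreement_rate(vader_labels, other_labels):.2f}")
```

Collapsing to a binary scale, as here, sidesteps the hard positive-vs.-neutral boundary and mirrors the abstract's finding that binary classifications were more accurate than trinary ones; the agreement rate is only a generic proxy for the paper's more specific discrepancy measure.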