Over the past year, "fake news" has become a topic of particular interest for politicians, news media, social media companies, and... data scientists. As this type of news clutter becomes more prevalent, individuals and organizations are working to leverage computing power to help social media users discern the "fake" from the legitimate. In this article, we take a look at some basic natural language processing (NLP) ideas to better understand how algorithms can help make this distinction.
Natural Language Processing: A Brief Introduction
Text Preprocessing: Arguably the most important step in text mining is preparing the data for analysis. In NLP, this involves steps such as tokenizing words, normalizing upper and lower case, stemming (reducing words to their roots), and removing stop words (common words that carry little meaning on their own, such as the, and, and is). An example of tokenization and stemming is shown below in Figure 1.
Bag of Words: This model is useful for finding topics in text by focusing on word frequency. Bag of words can be supplemented with word vectors, which add meaning to NLP representations by capturing relationships between words. A short bag-of-words sketch appears after Figure 1.
Text as a Graph: Graph-based approaches treat words as nodes and the associations between them as edges, drawing more complex, contextually rich meaning from text data. A small co-occurrence graph sketch follows Figure 1.
Named Entity Recognition (NER): This method extracts entities such as people, organizations, and locations from text. Many NER libraries are publicly available; a brief NER sketch appears after Figure 1.
Sentiment Analysis: Otherwise known as "opinion mining," this technique gauges both the direction and the strength of an author's feeling toward a subject (see the sentiment sketch after Figure 1). Do fake news outlets produce more opinionated articles?
# Tokenization and Stemming Example
library(tokenizers)

headline <- "The Onion Reports: Harry Potter Books Spark Rise in Satanism Among Children"
tokenize_word_stems(headline)

## [[1]]
## [1] "the"    "onion"  "report" "harri"  "potter" "book"
## [7] "spark"  "rise"   "in"     "satan"  "among"  "children"
Figure 1. Tokenization and Stemming Example
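To make the bag-of-words idea concrete, here is a minimal sketch that tallies word frequencies across a couple of invented headlines, reusing the tokenizers package from Figure 1. Each headline is reduced to an unordered collection of word counts, which is exactly the "bag" in bag of words.

# Bag-of-words sketch: tally word frequencies across a small set of example headlines
library(tokenizers)

headlines <- c(
  "Harry Potter Books Spark Rise in Satanism Among Children",
  "Area Children Report Rise in Potter-Related Reading"
)

# Tokenize each headline, pool the tokens, and count how often each word appears
tokens <- unlist(tokenize_words(headlines))
word_counts <- sort(table(tokens), decreasing = TRUE)
word_counts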
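As a rough illustration of the graph view, the sketch below links consecutive words in each sentence as edges of a co-occurrence graph. It assumes the igraph package is available; the sentences are invented for the example.

# Text-as-graph sketch: words become nodes, adjacent words become edges
library(tokenizers)
library(igraph)

sentences <- c("fake news spreads quickly online",
               "legitimate news spreads more slowly")
tokens <- tokenize_words(sentences)

# Build an edge list from consecutive word pairs in each sentence
edges <- do.call(rbind, lapply(tokens, function(x) cbind(head(x, -1), tail(x, -1))))
word_graph <- graph_from_edgelist(edges, directed = FALSE)

# Shared words such as "news" and "spreads" become hubs connecting the two sentences
degree(word_graph)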
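For named entity recognition from R, one option is the spacyr package, which wraps Python's spaCy (listed in the resources at the end of this article). This sketch assumes spaCy and its English model are already installed on the system; the sentence is made up.

# NER sketch via spacyr (requires a working spaCy installation underneath)
library(spacyr)
spacy_initialize()

text <- "The Onion reported that Harry Potter sparked a rise in Satanism among children."
parsed <- spacy_parse(text, entity = TRUE)

# Pull out the named entities spaCy found (e.g., people, organizations)
entity_extract(parsed)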
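One simple way to gauge sentiment is a lexicon lookup. The sketch below uses the Bing lexicon bundled with the tidytext package to count positive and negative words; the two example sentences are invented to contrast an inflammatory headline with a neutral one.

# Sentiment sketch: count positive and negative words using the Bing lexicon
library(tidytext)
library(dplyr)

articles <- data.frame(
  outlet = c("questionable", "legitimate"),
  text   = c("You won't believe this outrageous, shameful cover-up",
             "The committee released its quarterly budget report"),
  stringsAsFactors = FALSE
)

articles %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(outlet, sentiment)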
How Are Data Scientists Framing the Problem?
While popular browser extensions use crowdsourcing to classify sites that publish fabrications, researchers are reframing the problem of fake news. To fit a model, it helps to understand the most influential features that distinguish fake news from legitimate news. Regardless of whether fake news comes from provocateurs, bots, or satire, it tends to have a few things in common: a questionable source, content out of line with legitimate reporting, and an inflammatory tone. Current research takes advantage of these traits and applies approaches ranging from naive Bayes classifiers to random forest models; a toy sketch of this classification framing appears below.

Researchers at Stanford are investigating the importance of stance, a potential red-flag trait of misleading articles. Stance detection assesses the degree of agreement between two texts, in this case the headline and the body of the article. Another popular approach is the use of fact-checking pipelines that compare an article's content against known facts or the results of an online search on the subject. As fake news adapts to modern modes of media consumption, research in this space will expand. Image classification is a likely next step, albeit one that poses a major scalability challenge.
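To illustrate the classification framing (and not any particular research group's model), here is a toy sketch that fits a naive Bayes classifier from the e1071 package to a handful of invented, hand-labeled headlines represented as bag-of-words presence/absence features. A real classifier would need far more data and richer features.

# Toy fake-vs-legitimate classifier: bag-of-words features + naive Bayes
library(tokenizers)
library(e1071)

headlines <- c("Miracle Cure Doctors Don't Want You To Know About",
               "Senate Passes Budget Resolution After Long Debate",
               "Shocking Proof The Moon Landing Was Staged",
               "Local Council Approves Funding For New Park")
labels <- factor(c("fake", "legitimate", "fake", "legitimate"))

# Represent each headline as presence/absence of every vocabulary word
tokens <- tokenize_words(headlines)
vocab  <- sort(unique(unlist(tokens)))
features <- as.data.frame(t(sapply(tokens, function(x) as.integer(vocab %in% x))))
names(features) <- vocab
features[] <- lapply(features, factor, levels = c(0, 1))

# Fit the model with Laplace smoothing and check its predictions on the training set
model <- naiveBayes(features, labels, laplace = 1)
predict(model, features)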
Interested in learning more or building your own fake news classifier? Check out these resources:
Python's Natural Language Toolkit (NLTK)
R's NLP Package
Python's spaCy for NER
Our analysts at CANA Advisors are always interested in hearing from you. If you have an interesting “data” dilemma, contact Lucia Darrow. [EMAIL]