Fake News Detection with Python

This article is a tutorial on a machine learning project that detects fake news using Python.

What is 'Fake News'?

“Fake news” is a term that has come to mean different things to different people. At its core, we are defining “fake news” as those news stories that are false: the story itself is fabricated, with no verifiable facts, sources, or quotes. Sometimes these stories may be propaganda that is intentionally designed to mislead the reader or may be designed as “click bait” written for economic incentives (the writer profits from the number of people who click on the story). In recent years, fake news stories have proliferated via social media, in part because they are so easily and quickly shared online.

Before moving ahead with this machine learning project, we need to be aware of the terms TF-IDF Vectorizer and Passive-Aggressive Classifier.

What is TF-IDF Vectorizer?

TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means a term appears more often than others, and so, the document is a good match when the term is part of the search terms.

IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many others, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.

The TF-IDF Vectorizer converts a collection of raw documents into a matrix of TF-IDF features.
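To make the two definitions above concrete, here is a minimal sketch that computes TF-IDF weights by hand with only the standard library. The three toy documents are hypothetical, and the formula used (relative term frequency times ln(N / df)) is a simplified textbook variant; scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example documents, not taken from news.csv)
docs = [
    "the president signed the bill",
    "the senate rejected the bill",
    "aliens signed a secret treaty",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one document, with idf = ln(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]

# "the" occurs twice in the first document but also appears in 2 of 3
# documents, while "president" appears in only 1, so "president" scores higher.
print(weights[0]["the"] < weights[0]["president"])  # True
```

The comparison at the end shows the intended effect: a frequent but widespread word ends up with a lower weight than a rarer, more distinctive one.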

What is a Passive-Aggressive classifier?

Passive-Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when a sample is classified correctly, and turns aggressive in the event of a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.
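The passive and aggressive behaviour can be sketched for a single binary update as follows. This is a simplified illustration written from the usual textbook description (hinge loss, step size capped by a constant C), not scikit-learn's implementation; the function name pa_update and the toy vectors are assumptions made for the example.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive-Aggressive style update for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)       # hinge loss
    if loss == 0.0:
        return weights                  # passive: the margin is already satisfied
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return weights                  # an all-zero sample carries no information
    tau = min(C, loss / norm_sq)        # aggressive: just enough to fix the loss, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)        # loss is 1, so the weights move to [1.0, 0.0]
w_same = pa_update(w, [2.0, 0.0], +1)   # margin is 2, so nothing changes
print(w, w_same)  # [1.0, 0.0] [1.0, 0.0]
```

The step size tau is the smallest one that removes the loss on the current sample, which is why the weight vector changes as little as possible.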

Project Objective

To build a model to accurately classify a piece of news as REAL or FAKE.

This advanced Python project classifies news articles as fake or real. Using scikit-learn, we build a TF-IDF Vectorizer on our dataset. Then, we initialize a Passive-Aggressive Classifier and fit the model. In the end, the accuracy score and the confusion matrix tell us how well our model fares.

The Dataset

We'll call the dataset for this Python project news.csv. It has a shape of 7796×4. The first column identifies the news, the second and third are the title and text, and the fourth column has labels denoting whether the news is REAL or FAKE. The dataset takes up 29.2MB of space and you can download it here.

Project Prerequisites

For this project, I used Google Colab. Google Colab functions just like Jupyter Notebook and allows you to write and execute Python code in your browser.

In case you're working on your local machine, you'll need to install the following libraries with pip:

pip install numpy pandas scikit-learn

You'll also need Jupyter Lab to run your code. Once it is installed, open your command prompt and run the following command:

jupyter lab

You’ll see a new browser window open up; create a new console and use it to run your code. To run multiple lines of code at once, press Shift+Enter.

Follow these steps

Follow the steps below to detect fake news and complete your first advanced Python project.

Make necessary imports:

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

Now, let's read the data into a DataFrame, and get the shape of the data and the first 5 records. If you're working with Google Colab, you'll need to mount your drive first, then replace the file path below with the path to your own copy of the file.
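Mounting the drive in Colab might look like the sketch below. It uses Colab's google.colab.drive module when it is available and falls back to a local file otherwise; the local path 'news.csv' and the variable name data_path are assumptions for illustration.

```python
# Mount Google Drive when running inside Colab; skip the step elsewhere
try:
    from google.colab import drive  # only importable in the Colab runtime
except ImportError:
    drive = None

if drive is not None:
    drive.mount('/content/drive')   # asks for authorization in the notebook
    data_path = '/content/drive/MyDrive/news.csv'
else:
    data_path = 'news.csv'          # assumed location of a local copy

print(data_path)
```

Whichever branch runs, data_path holds the value to pass to pd.read_csv.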

#Read the data
df=pd.read_csv('/content/drive/MyDrive/news.csv')

#Get shape and head (print the shape so both results show in one cell)
print(df.shape)
df.head()

Next, get the labels from the DataFrame.

#Get the labels
labels=df.label
labels.head()

Split the dataset into training and testing sets.

#Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
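Conceptually, the split above shuffles the rows with a fixed seed and sets aside 20% of them for testing. The following standard-library sketch illustrates that idea on ten hypothetical articles; simple_split is an illustrative helper, not scikit-learn's algorithm, and it will not reproduce the same row order as train_test_split.

```python
import math
import random

def simple_split(samples, labels, test_size=0.2, seed=7):
    """Shuffle indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # same seed, same order
    n_test = math.ceil(test_size * len(samples))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

texts = [f"article {i}" for i in range(10)]       # hypothetical stand-ins
tags = ["REAL" if i % 2 else "FAKE" for i in range(10)]
x_tr, x_te, y_tr, y_te = simple_split(texts, tags)
print(len(x_tr), len(x_te))  # 8 2
```

Fixing the seed, as random_state=7 does in the real call, makes the split repeatable, so results can be compared between runs.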

Let’s initialize a TF-IDF Vectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are to be filtered out before processing the natural language data. A TF-IDF Vectorizer turns a collection of raw documents into a matrix of TF-IDF features.

Now, fit the vectorizer on the train set and transform it, then use the fitted vectorizer to transform the test set.

#Initialize a TF-IDF Vectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
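The effect of max_df=0.7 can be illustrated with a small standard-library sketch: any term that appears in more than 70% of the documents is dropped from the vocabulary. The four toy documents are hypothetical, and the code only mimics the filtering rule, not the vectorizer itself.

```python
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the president signed the bill",
    "the senate rejected the bill",
    "the aliens signed a treaty",
    "the markets fell sharply",
]

# Count in how many documents each term occurs (not how often overall)
tokenized = [set(doc.split()) for doc in docs]
doc_freq = Counter(term for tokens in tokenized for term in tokens)

max_df = 0.7
kept = sorted(t for t, df in doc_freq.items() if df / len(docs) <= max_df)
dropped = sorted(t for t, df in doc_freq.items() if df / len(docs) > max_df)

print(dropped)  # ['the']
```

Here "the" occurs in all four documents and is discarded, while "bill" and "signed" (two of four documents each) stay in the vocabulary.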

Next, we'll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train.

Then, we'll predict the labels for the vectorized test set (tfidf_test) and calculate the accuracy with accuracy_score() from sklearn.metrics.

#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

We got an accuracy of 92.82% with this model. Finally, let’s print out a confusion matrix to gain insight into the number of false and true negatives and positives.

#Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
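To read the result, it helps to know the layout: in scikit-learn's confusion matrix, rows correspond to the true labels and columns to the predicted labels, in the order given by labels. The standard-library sketch below builds the same kind of matrix for five hypothetical predictions; the confusion helper is illustrative only.

```python
def confusion(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Hypothetical labels and predictions
y_true = ['FAKE', 'FAKE', 'REAL', 'REAL', 'REAL']
y_hat = ['FAKE', 'REAL', 'REAL', 'REAL', 'FAKE']

print(confusion(y_true, y_hat, labels=['FAKE', 'REAL']))  # [[1, 1], [1, 2]]
```

The diagonal holds the correct predictions for each class, and the off-diagonal cells hold the two kinds of error.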

So with this model, we have 589 true positives, 587 true negatives, 42 false positives, and 49 false negatives.
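As a quick check, the four counts above can be turned back into the accuracy reported earlier. The short calculation below uses only the numbers quoted in this article; the variable names are chosen for the example.

```python
# Counts quoted above
tp, tn, fp, fn = 589, 587, 42, 49

total = tp + tn + fp + fn        # size of the test set
accuracy = (tp + tn) / total     # share of correct predictions
precision = tp / (tp + fp)       # how many predicted positives were right
recall = tp / (tp + fn)          # how many actual positives were found

print(total)                                     # 1267
print(f'Accuracy: {round(accuracy * 100, 2)}%')  # Accuracy: 92.82%
```

The correct predictions (589 + 587 = 1176) out of 1267 test articles give the same 92.82% that accuracy_score() reported.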