Snowplough 🏂

🖋 Authors: Eshwaran Venkat, under the guidance of Jennifer Zhu

Supervised Learning meets Media Analysis: Simple Topic Classification to Explore Bias in News Coverage. Final project for UC Berkeley MIDS 266 (Natural Language Processing with Deep Learning). See Course Repository

About 📰

Our study introduces an approach to analyzing news content, centered on a topic classifier built on the extensive All The News v2 dataset. Our methodology progresses from baseline classifiers to more advanced models, culminating in a fine-tuned BERT classifier that categorizes news articles into distinct topics such as 'Sports' and 'Finance', based on textual features and news metadata.

This classifier is augmented with sentiment analysis and other indicators for a supplemental exploration of media bias, aiming to delineate its various manifestations. The core of our research lies in robust topic classification, with media bias analysis providing additional insights. We’ve made the code, notebooks, models, and newly generated (topic-classified) dataset publicly available. The new dataset is listed as All The News v2.1 on Kaggle, and a fine-tuned BERT classifier for it is also available online.

Project Report: Download PDF
Presentation: Download PDF

GitHub: cricksmaidiene/snowplough
Kaggle: Coming Soon
Hugging Face: Coming Soon

Data 📇

AllTheNews

AllTheNews is a popular dataset of news articles that has two versions: Version 1 and Version 2.

Version 2.0 has 2.7 million articles from a number of sources.
It is a published dataset that is readily downloadable.
The date range of articles is from January 1, 2016 to April 2, 2020.
The only metadata available is the article title, publication, section, author, date, and content. We use a subset of these as labels for our classifiers.
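
As a minimal sketch of how these metadata fields might be turned into training pairs, the snippet below assumes each article is a dictionary keyed by the six field names listed above (the actual column names in the published files may differ). It uses the title as input text and the section as a provisional label. The `to_training_pairs` helper and the sample records are hypothetical and are not taken from the project notebooks.

```python
from datetime import date

# Hypothetical in-memory records. The keys mirror the metadata listed above,
# not necessarily the column names used in the published dataset files.
ARTICLES = [
    {"title": "Local team wins final", "publication": "Example Daily",
     "section": "Sports", "author": "A. Writer", "date": date(2016, 5, 1),
     "content": "..."},
    {"title": "Markets rally", "publication": "Example Daily",
     "section": None, "author": "B. Writer", "date": date(2019, 3, 9),
     "content": "..."},
]

# Date range of Version 2.0 as stated above.
START, END = date(2016, 1, 1), date(2020, 4, 2)


def to_training_pairs(articles):
    """Return (title, section) pairs for in-range articles that have both fields."""
    pairs = []
    for article in articles:
        # Skip articles that cannot serve as a labeled example.
        if not article.get("title") or not article.get("section"):
            continue
        # Skip anything outside the documented date range.
        if not (START <= article["date"] <= END):
            continue
        pairs.append((article["title"], article["section"].strip().lower()))
    return pairs
```

Articles without a section are dropped here because they carry no label. Whether the project discards or imputes such rows is not stated above, so treat that choice as an assumption.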

Notebooks 📙

NB Order Number	Notebook	Section	Description
01	Ingest Dataset	Ingestion	Ingests the All The News v2 dataset into a Delta Lake table.
02	Exploratory Data Analysis	Analysis	Performs exploratory data analysis on the All The News v2 dataset for Summary Statistics.
03	Word Counts & Sentiments Processor	Engineering	Transformation layer that adds word count fields and sentiment score fields per article.
04	Sentiment Analysis	Analysis	Looks at descriptive statistics on sentiment scores across articles, publications and authors to find signals for bias.
05	News Section Analysis	Analysis	Explores newspaper sections for topic-level coalescing and assignment
06	Topic Processor	Engineering	Transformation layer that adds topic fields per article using a topic lexicon, and performs additional processing
07	Topic & Author Analysis	Analysis	Explores the newly labeled and created topics, and how they interact with author distribution and slants
08	Standard Classification Models	Machine Learning	Comprehensive set of non-neural network models for Topic & Optional Author classification - Random Forests, Logistic Regression, & Naive Bayes
09	Neural Network Classifiers	Machine Learning	Bi-Directional LSTM and CNN networks are trained for classification of news topics from news titles
10	BERT Simple Classifier	Machine Learning	A model that minimally fine-tunes a pre-trained BERT Model to classify news topics
11	BERT Complex Classifier	Machine Learning	A model that adds LSTM and CNN layers on top of a pre-trained BERT model to train the classifier
12	Bias Analysis	Analysis	Systematically performs a simple bias analysis on newly labeled topics and sentiments on the news data
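
The two engineering steps in the table (word counts in notebook 03 and lexicon-based topic assignment in notebook 06) can be illustrated with a short, dependency-free sketch. It assumes the lexicon maps each topic to a set of keywords matched against the article's section name. The lexicon contents, the `other` fallback, and the function names are all hypothetical; the notebooks may implement these steps differently.

```python
import re

# Hypothetical lexicon: topic -> keywords that may appear in a section name.
TOPIC_LEXICON = {
    "sports": {"sports", "football", "soccer", "nba"},
    "finance": {"finance", "business", "markets", "economy"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_count(text):
    """Number of tokens in the text (a simple stand-in for a word count field)."""
    return len(tokenize(text))


def assign_topic(section, lexicon=TOPIC_LEXICON, default="other"):
    """Pick the topic whose keywords overlap most with the section name."""
    tokens = set(tokenize(section or ""))
    best_topic, best_hits = default, 0
    for topic, keywords in lexicon.items():
        hits = len(tokens & keywords)
        # Strict comparison keeps the earlier topic when two topics tie.
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic
```

For example, `assign_topic("Business News")` yields `"finance"`, and a missing section falls back to `"other"`. Coalescing many raw sections into a few topics this way produces the kind of label set the classifiers in notebooks 08 to 11 could train on.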