topic_processor (Python)

Topics Processor 💬

This notebook cleans up news section labels and assigns them to better-structured topics.

Notebook Properties

  • Upstream Notebook: src.engineering.word_counts_and_sentiments
  • Compute Resources: 64 GB RAM, 4 CPUs
  • Last Updated: Dec 10 2023

Data

Name         | Type  | Location Type | Description                            | Location
all_the_news | input | Delta         | Read full delta dataset of AllTheNews | catalog/text_eda/all_the_news.delta

2023-12-11 12:32:39.333040: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[notice] A new release of pip available: 22.2.2 -> 23.3.1
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

Read Data

(1776605, 18)

5438

Year 2016 has 114 single-article sections that will be nullified.
Year 2017 has 226 single-article sections that will be nullified.
Year 2018 has 174 single-article sections that will be nullified.
Year 2019 has 121 single-article sections that will be nullified.
Year 2020 has 74 single-article sections that will be nullified.

(1314500, 18)
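
The log above reports, per year, how many sections contain a single article and are therefore nullified. The notebook's code is not shown, so the following is only a minimal sketch of one way that step could work. It uses plain Python to stay dependency-free; the function name and the sample rows are hypothetical, and only the `year` and `section` column names come from the schema printed later in this notebook.

```python
from collections import Counter


def nullify_single_article_sections(rows):
    """Set `section` to None where a (year, section) pair occurs exactly once."""
    # Count articles per (year, section), ignoring rows that already lack a section.
    counts = Counter(
        (row["year"], row["section"]) for row in rows if row["section"] is not None
    )
    cleaned = []
    for row in rows:
        key = (row["year"], row["section"])
        is_single = row["section"] is not None and counts[key] == 1
        # Copy the row so the input list is left untouched.
        cleaned.append({**row, "section": None if is_single else row["section"]})
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
rows = [
    {"year": 2016, "section": "Politics"},
    {"year": 2016, "section": "Politics"},
    {"year": 2016, "section": "Obscure Blog"},
    {"year": 2017, "section": "Politics"},
]
cleaned = nullify_single_article_sections(rows)
```

Note that counting is done per year, so a section that is common in one year can still be nullified in another year where it appears only once. In pandas, the same per-group count could be obtained with `groupby(["year", "section"])["section"].transform("size")`.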

Assign Sections by Geography

(1754, 3)

False    86.08894
True     13.91106
Name: is_geo, dtype: float64

(('Sports News', None), ('Sports', None), ('Design', None), ('India Top News', ['India']), ('Tech By Vice', None))

False    0.886976
True     0.113024
Name: is_geo, dtype: float64

Politics      87622
Financials    57845
Bonds News    39672
Opinion       38277
Sports        35132
Name: section, dtype: int64

(1314500, 20)
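
The outputs in this section pair each section name with the places found in it (for example `('India Top News', ['India'])`) and summarize an `is_geo` flag. The notebook loads `en_core_web_sm`, so it most likely uses spaCy entity recognition for this, but the code is not shown. As a rough, self-contained illustration, the sketch below swaps in a simple word-boundary lookup against a hypothetical list of place names; `PLACES` and `find_places` are assumptions made for this example.

```python
import re

# Hypothetical gazetteer; the notebook most likely relies on spaCy entities instead.
PLACES = ("India", "China", "Europe", "Middle East", "New York")


def find_places(section, places=PLACES):
    """Return the places mentioned in a section name, or None if there are none."""
    found = [
        place
        for place in places
        # Word boundaries keep "India" from matching inside "Indiana".
        if re.search(rf"\b{re.escape(place)}\b", section, flags=re.IGNORECASE)
    ]
    return found or None


# Section names taken from the sample output above.
sections = ["Sports News", "Sports", "Design", "India Top News", "Tech By Vice"]
tagged = [(section, find_places(section)) for section in sections]

# A section is geographic when at least one place was found in its name.
is_geo = [places is not None for _, places in tagged]
geo_share = 100 * sum(is_geo) / len(is_geo)
```

A fixed lookup list only finds places it already knows, which is the main reason a named-entity model is the more plausible choice for thousands of distinct section names.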

Section Coalescing

Here, after some additional preprocessing, we use a topic lexicon to assign news sections to topics.
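
The lexicon and the preprocessing steps are not shown in this notebook export. As a minimal sketch of the general idea, the example below normalizes a section name into lowercase tokens and assigns the first topic whose keywords overlap those tokens. `TOPIC_LEXICON`, `normalize`, and `assign_topic` are hypothetical names, and the keyword sets are invented for illustration.

```python
import re

# Hypothetical lexicon mapping topic -> keywords; the notebook's real lexicon is not shown.
TOPIC_LEXICON = {
    "politics": {"politics", "election", "congress"},
    "business": {"business", "financials", "bonds", "markets"},
    "sports": {"sports", "football", "tennis"},
}


def normalize(section):
    """Lowercase a section name and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", section.lower())


def assign_topic(section, lexicon=TOPIC_LEXICON):
    """Return the first topic whose keywords overlap the section's tokens, else None."""
    if section is None:
        return None
    tokens = set(normalize(section))
    for topic, keywords in lexicon.items():
        if tokens & keywords:
            return topic
    return None


sections = ["Politics", "Bonds News", "Sports News", "Design", None]
simple_topics = [assign_topic(section) for section in sections]
```

Sections that match no keyword stay unassigned (`None`), which is consistent with the `simple_topic` summaries in this section reporting fewer non-null values than entries.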

(1165930, 18)

<class 'pandas.core.series.Series'>
RangeIndex: 1510 entries, 0 to 1509
Series name: simple_topic
Non-Null Count  Dtype
--------------  -----
715 non-null    object
dtypes: object(1)
memory usage: 11.9+ KB
None

<class 'pandas.core.series.Series'>
Int64Index: 1165930 entries, 348936 to 2638942
Series name: simple_topic
Non-Null Count    Dtype
--------------    -----
1015759 non-null  object
dtypes: object(1)
memory usage: 17.8+ MB

(1015759, 19)

(13, 3)
Standard Deviation of Publication Representation Percentage Points 0.92
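
The notebook does not show how this figure is computed. One plausible reading, stated here purely as an assumption, is that each publication's share of articles is measured before and after topic assignment, and the standard deviation is taken over the per-publication change in percentage points. The sketch below illustrates that reading with invented counts; `share_percent`, `before`, and `after` are hypothetical, and the sample standard deviation is used because it is the pandas default.

```python
from statistics import stdev


def share_percent(counts):
    """Convert article counts per publication into percentage shares."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


# Invented article counts per publication, before and after filtering.
before = {"A": 500, "B": 300, "C": 200}
after = {"A": 400, "B": 250, "C": 150}

before_share = share_percent(before)
after_share = share_percent(after)

# Change in representation, in percentage points, for each publication.
shift = {name: after_share[name] - before_share[name] for name in before}

# Sample standard deviation (n - 1 in the denominator) of those shifts.
spread = stdev(shift.values())
```

Under this reading, a small value means that filtering removed articles roughly in proportion to each publication's original share.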

{'num_added_files': 0, 'num_removed_files': 25, 'num_deleted_rows': None, 'num_copied_rows': None, 'execution_time_ms': 590, 'scan_time_ms': 258, 'rewrite_time_ms': 0}

date: date32[day]
year: int64
month: int64
day: int64
author: string
title: string
article: string
url: string
section: string
publication: string
title_word_count: int64
article_word_count: int64
title_textblob_sentiment: double
article_textblob_sentiment: double
vader_prob_positive_title: double
vader_prob_negative_title: double
vader_prob_neutral_title: double
vader_compound_title: double
simple_topic: string

Index(['date', 'year', 'month', 'day', 'author', 'title', 'article', 'url', 'section', 'publication', 'title_word_count', 'article_word_count', 'title_textblob_sentiment', 'article_textblob_sentiment', 'vader_prob_positive_title', 'vader_prob_negative_title', 'vader_prob_neutral_title', 'vader_compound_title', 'simple_topic'], dtype='object')
