Topics Processor 💬

This notebook cleans up news sections and assigns them better-structured topics.

Notebook Properties

Upstream Notebook: src.engineering.word_counts_and_sentiments
Compute Resources: 64 GB RAM, 4 CPUs
Last Updated: Dec 10 2023

Data

| Name | Type | Location Type | Description | Location |
| --- | --- | --- | --- | --- |
| `all_the_news` | `input` | `Delta` | Read full delta dataset of `AllTheNews` | `catalog/text_eda/all_the_news.delta` |

(1776605, 18)

5438

Year 2016 has 114 single article sections, that will be nullified.
Year 2017 has 226 single article sections, that will be nullified.
Year 2018 has 174 single article sections, that will be nullified.
Year 2019 has 121 single article sections, that will be nullified.
Year 2020 has 74 single article sections, that will be nullified.
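The messages above suggest that a section attached to only one article within a given year is blanked out before further processing. A minimal sketch of that step, assuming pandas and a frame with `year` and `section` columns; the function name `nullify_single_article_sections` and the sample rows are hypothetical and not taken from the notebook:

```python
import pandas as pd


def nullify_single_article_sections(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where `section` is nulled for (year, section)
    pairs that occur on exactly one article. Hypothetical helper."""
    out = df.copy()
    # For every row, count how many articles share its (year, section) pair.
    counts = out.groupby(["year", "section"])["section"].transform("size")
    single = counts == 1
    # Report how many distinct sections per year are affected.
    per_year = out.loc[single].groupby("year")["section"].nunique()
    for year, n in per_year.items():
        print(f"Year {year} has {n} single article sections, that will be nullified.")
    out.loc[single, "section"] = None
    return out


# Illustrative data only (assumption): "Oddities" appears once in 2016.
df = pd.DataFrame(
    {
        "year": [2016, 2016, 2016, 2017, 2017],
        "section": ["Politics", "Politics", "Oddities", "Sports", "Sports"],
    }
)
cleaned = nullify_single_article_sections(df)
```

Counting with `transform("size")` keeps the result aligned to the original rows, so the mask can be applied directly without a merge.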

(1314500, 18)

(1754, 3)

False    86.08894
True     13.91106
Name: is_geo, dtype: float64

(('Sports News', None), ('Sports', None), ('Design', None), ('India Top News', ['India']), ('Tech By Vice', None))

False    0.886976
True     0.113024
Name: is_geo, dtype: float64
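The outputs above pair each section name with the place names found in it (or `None`) and report the share of sections flagged by `is_geo`. How the notebook detects places is not shown here, so the sketch below swaps in a small hypothetical gazetteer lookup as a stand-in; `KNOWN_PLACES` and `find_places` are illustrative names, and only the `is_geo` column name comes from the output:

```python
import re
from typing import List, Optional

import pandas as pd

# Hypothetical stand-in for a real place-name detector (assumption).
KNOWN_PLACES = ("India", "China", "Europe", "New York")


def find_places(section: str) -> Optional[List[str]]:
    """Return the known place names mentioned in a section label, or None."""
    found = [
        place
        for place in KNOWN_PLACES
        if re.search(rf"\b{re.escape(place)}\b", section, flags=re.IGNORECASE)
    ]
    return found or None


sections = pd.DataFrame(
    {"section": ["Sports News", "Sports", "Design", "India Top News", "Tech By Vice"]}
)
sections["places"] = sections["section"].map(find_places)
# A section counts as geographic when at least one place was found.
sections["is_geo"] = sections["places"].notna()
# Share of geographic vs. non-geographic sections, in percent.
geo_share = sections["is_geo"].value_counts(normalize=True) * 100
```

Multiplying the normalized counts by 100 gives percentages; leaving the factor out gives fractions, which is the difference between the two `is_geo` summaries shown above.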

Politics      87622
Financials    57845
Bonds News    39672
Opinion       38277
Sports        35132
Name: section, dtype: int64

(1314500, 20)

(1165930, 18)

<class 'pandas.core.series.Series'>
RangeIndex: 1510 entries, 0 to 1509
Series name: simple_topic
Non-Null Count  Dtype
--------------  -----
715 non-null    object
dtypes: object(1)
memory usage: 11.9+ KB
None

<class 'pandas.core.series.Series'>
Int64Index: 1165930 entries, 348936 to 2638942
Series name: simple_topic
Non-Null Count    Dtype
--------------    -----
1015759 non-null  object
dtypes: object(1)
memory usage: 17.8+ MB

(1015759, 19)

(13, 3)
Standard Deviation of Publication Representation Percentage Points: 0.92
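One plausible reading of this figure is that it measures how much each publication's share of the dataset moved once rows without a topic were dropped. The sketch below works under that assumption only: it compares per-publication percentages before and after a filter and takes the standard deviation of the changes. The function name `representation_shift`, the column names, and the sample data are hypothetical:

```python
import pandas as pd


def representation_shift(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Per-publication share (in percent) before and after filtering,
    plus the change in percentage points. Hypothetical helper."""
    table = pd.DataFrame(
        {
            "pct_before": before.value_counts(normalize=True) * 100,
            "pct_after": after.value_counts(normalize=True) * 100,
        }
    ).fillna(0.0)  # a publication absent after filtering has a 0% share
    table["pct_point_change"] = table["pct_after"] - table["pct_before"]
    return table


# Illustrative publication labels only (assumption).
before = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
after = pd.Series(["A"] * 40 + ["B"] * 30 + ["C"] * 10)

shift = representation_shift(before, after)
# pandas' default is the sample standard deviation (ddof=1).
spread = shift["pct_point_change"].std()
```

A small spread would indicate that the filter removed rows roughly proportionally across publications, while a large one would point to some publications being over- or under-represented afterwards.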

{'num_added_files': 0, 'num_removed_files': 25, 'num_deleted_rows': None, 'num_copied_rows': None, 'execution_time_ms': 590, 'scan_time_ms': 258, 'rewrite_time_ms': 0}

date: date32[day]
year: int64
month: int64
day: int64
author: string
title: string
article: string
url: string
section: string
publication: string
title_word_count: int64
article_word_count: int64
title_textblob_sentiment: double
article_textblob_sentiment: double
vader_prob_positive_title: double
vader_prob_negative_title: double
vader_prob_neutral_title: double
vader_compound_title: double
simple_topic: string

Index(['date', 'year', 'month', 'day', 'author', 'title', 'article', 'url', 'section', 'publication', 'title_word_count', 'article_word_count', 'title_textblob_sentiment', 'article_textblob_sentiment', 'vader_prob_positive_title', 'vader_prob_negative_title', 'vader_prob_neutral_title', 'vader_compound_title', 'simple_topic'], dtype='object')
