Crawling a dataset of news articles from CommonCrawl and clustering them by their topics.
Every day thousands of news articles with different political orientations are released.
The goal of this project is to create a large collection of news articles (>250k) that...
- ...covers the latest five years of the English-speaking news agenda
- ...has a multi-level topic structure to enable exploration of news articles at varying similarity levels,
e.g., from a cluster about the power of national leaders worldwide down to more narrowly related clusters about the leaders of a particular country. Because of the five-year timeframe of this collection, it is possible to investigate how the narrative, agenda, and/or framing changed within the narrow clusters.
This project was part of the course Key Competencies in Computer Science at the University of Wuppertal, with the goal of collecting and aggregating news articles for cross-document coreference resolution at scale. It was supervised by Anastasia Zhukova.
To set up a conda environment and install the requirements:

```
conda env create -f env.yml
conda activate kccs
```

We recommend Python 3.9.4 to run this project.
This project consists of two parts: the crawler and the clustering algorithms. The crawler runs as a regular Python script. The clustering is performed within two Jupyter notebooks to make it easier to adjust hyperparameters and inspect visualisations.
The dataset consists of ~268,000 American news articles from 03/2016 to 07/2021. The websites were chosen based on the POLUSA dataset to ensure a diverse political spectrum.
To crawl a dataset of news articles from CommonCrawl, run:

```
python crawl.py
```
The crawler gathers WARC data from CommonCrawl and processes it into a JSON layout. This JSON data is later used for clustering.
Running the crawler for the first time will produce a `commoncrawl_archives.json` file. This allows the crawler to be stopped and resumed later. If a file with this name exists, the crawler skips the initialization of WARC paths and continues downloading immediately (while skipping already processed data). This can also be used to extend an existing dataset by only changing the number of articles crawled per timeframe.
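A minimal sketch of this resume behaviour (an illustration only, not the actual `crawl.py` code; the CommonCrawl index endpoint call and the target pattern are assumptions):

```python
# Illustration of the resume logic described above; not the actual crawl.py code.
import json
import os

import requests

ARCHIVE_FILE = "commoncrawl_archives.json"
INDEX = "CC-MAIN-2021-25"        # example CommonCrawl index (assumed)
TARGET = "www.website.com/*"     # placeholder target URL pattern

def load_or_init_archives():
    if os.path.exists(ARCHIVE_FILE):
        # Resume: reuse the WARC records collected in a previous run.
        with open(ARCHIVE_FILE, encoding="utf-8") as f:
            return json.load(f)
    # First run: query the CommonCrawl index for matching WARC records.
    resp = requests.get(
        f"https://index.commoncrawl.org/{INDEX}-index",
        params={"url": TARGET, "output": "json"},
        timeout=30,
    )
    records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]
    with open(ARCHIVE_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records
```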
After crawling, you can cluster the dataset on one or multiple levels.
- Latent Dirichlet Allocation (LDA): For the first level, start by running the Jupyter notebook `LDA.ipynb`.
- K-Means & Timed Events: For the second and third level, run the Jupyter notebook `KMeans.ipynb`.
The `LDA.ipynb` notebook takes all JSON files within the directory `./crawl_json` and performs this pipeline on the combined data. Each cluster gets assigned a JSON file representing that cluster.
During preprocessing, multiple filters are applied to the dataset. This makes the overall topic of the articles easier to determine. The following wordclouds give an idea of how preprocessing improves the dataset for our specific use case: the first wordcloud represents the plain maintext of all articles, while the second only represents the words that were not filtered out by the preprocessing.
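As a rough illustration (the exact filter chain in `LDA.ipynb` may differ), the preprocessing could consist of lowercasing and tokenisation, stopword removal, and dropping very short tokens:

```python
# Illustrative preprocessing; the notebook's actual filters may differ.
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

def preprocess(maintext, extra_stopwords=frozenset()):
    tokens = simple_preprocess(maintext, deacc=True)   # lowercase, strip punctuation/accents
    stop = STOPWORDS | set(extra_stopwords)            # built-in list plus stopwords.json
    return [t for t in tokens if t not in stop and len(t) > 3]
```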
To apply K-Means on the text-based dataset, we generate TF-IDF vectors for all news articles to represent each document. The `KMeans.ipynb` notebook takes all JSON files within the directory `./lda_clustered_json` one by one. Each level 1 cluster is then divided into subclusters, which are represented by a folder hierarchy in the output.
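A rough sketch of this step, assuming scikit-learn is used for both the TF-IDF vectors and K-Means (the `min_df`/`max_df` values are the ones quoted further below):

```python
# Sketch of level-2 clustering: TF-IDF vectors per article, then K-Means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_level2(maintexts, n_clusters):
    vectorizer = TfidfVectorizer(min_df=0.05, max_df=0.6, stop_words="english")
    tfidf = vectorizer.fit_transform(maintexts)   # one row per article
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(tfidf)
    return kmeans.labels_, vectorizer
```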
- Crawler: `./crawl_json`
- LDA Clustering: `./LDA_clustered_json`
  - Generates multiple JSON files, each representing a cluster of a different topic
- Three-Level Clustering: `./clustered_json`
  - Generates directories according to the clustering algorithms
The clustering output consists of multiple directories and JSON files, which are named according to the following format:

- LDA: `cluster_X-keyword1_keyword2_keyword3`
- K-Means: `cluster_X-keyword1_keyword2_keyword3_keyword4_keyword5`
- Timeframe: `year-month.json`
Here, the keywords are the most dominant keywords within the cluster (sorted in descending order). All JSON outputs follow the news-please format while adding some new variables. The added variables are:
| Variable | Description |
|---|---|
| `LDA_ID` | ID of the article's corresponding level 1 cluster |
| `LDA_topic_percentage` | Indicator of how well the article fits into its LDA cluster |
| `LDA_topic_keywords` | The most dominant keywords within the LDA cluster |
| `kMeans_ID` | ID of the article's corresponding level 2 cluster |
| `kMeans_topic_keywords` | The most dominant keywords within the K-Means cluster |
| `year-month` | The timeframe this article was released in |
"date_download": "09/07/2021, 01:35:50",
"date_modify": "09/07/2021, 01:35:50",
"date_publish": "2016-05-31T05:12:50Z",
"description": "string",
"language": "en",
"source_domain": "www.website.com",
"maintext": "maintext string",
"url": "http://www.website.com/xyz",
"LDA_ID": 0,
"LDA_topic_percentage": 0.5086100101,
"LDA_topic_keywords": "president, king, trump, stone, heche, government, house, right, degeneres, official",
"kMeans_ID": 0,
"kMeans_topic_keywords": "photo, journal, wall, street, jason, accurate, trump, look, transcript, tour",
"year_month": "2016-05"
To achieve the best results, you may change some parameters in the code. The following parameters have a significant influence on the quality of the produced dataset.
| Parameter | Description |
|---|---|
| `TARGET_WEBSITES` | Websites you want to keep crawled data from |
| `TEST_TARGETS` | URLs to request WARC files from CommonCrawl |
| `INDEXES` | Indexes from CommonCrawl |
| `MAX_ARCHIVE_FILES_PER_URL` | Maximum number of archive files per item of `TEST_TARGETS` |
| `MINIMUM_MAINTEXT_LENGTH` | Shorter articles will be discarded |
| `MAX_CONNECTION_RETRIES` | Maximum number of retries while downloading |
| `START_NUMERATION_AT` | Change this if you want to extend the dataset |
| `DESIRED_LANGUAGE` | Select the desired language, for example `en` |
Define `INDEXES` (which represent the release dates of news articles) by choosing them from the CommonCrawl Index List.
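For orientation, a configuration might look like the following; all values are illustrative assumptions and need to be adjusted in `crawl.py`:

```python
# Illustrative values only; adjust the constants in crawl.py for your own crawl.
TARGET_WEBSITES = ["www.website.com"]               # placeholder outlet
TEST_TARGETS = ["www.website.com/*"]                # placeholder URL patterns for index queries
INDEXES = ["CC-MAIN-2021-25", "CC-MAIN-2021-21"]    # example CommonCrawl indexes
MAX_ARCHIVE_FILES_PER_URL = 5
MINIMUM_MAINTEXT_LENGTH = 500                       # assumed unit: characters
MAX_CONNECTION_RETRIES = 3
START_NUMERATION_AT = 0
DESIRED_LANGUAGE = "en"
```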
These parameters can be adjusted within `LDA_clustering.ipynb` (first level).
| Parameter | Description |
|---|---|
| `topic_amount_start` | Minimum number of clusters |
| `topic_amount_end` | Maximum number of clusters |
| `iteration_interval` | Step size between tested cluster counts; the default interval is 1 |
| `desired_coherence` | The algorithm stops when this coherence value is reached |
The LDA pipeline filters out a predefined list of stopwords, extended by a JSON file. You can add/remove keywords by separating them with commas in the file `stopwords.json`.
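A small sketch of how these extra stopwords could be read in (assuming `stopwords.json` holds either a list of words or one comma-separated string):

```python
# Merge the predefined stopword list with the words from stopwords.json.
import json

from gensim.parsing.preprocessing import STOPWORDS

with open("stopwords.json", encoding="utf-8") as f:
    data = json.load(f)
extra = data if isinstance(data, list) else [w.strip() for w in str(data).split(",")]
stop_words = STOPWORDS | set(extra)
```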
These parameters can be adjusted within `KMeans_clustering.ipynb` (second & third level).
| Parameter | Description |
|---|---|
| `max_clusters` | Maximum possible number of clusters |
| `min_df` | Ignore terms that appear in less than this fraction of articles (percentage) |
| `max_df` | Ignore terms that appear in more than this fraction of articles (percentage) |
The optimal number of clusters is determined by calculating the coherence score for each iteration of the algorithm. The definitive choice of clusters depends on this coherence score.
As the data below shows, the maximum coherence score is reached relatively quickly. This makes LDA a good choice as a level 1 clustering algorithm, as it is not too specific. A sketch of this coherence-based search is shown after the table.
| Number of Clusters | Coherence Score (%) |
|---|---|
| ... | ... |
| 14 | 48.48 |
| 15 | 51.29 |
| 16 (best result) | 58.06 |
| 17 | 50.73 |
| 18 | 54.37 |
| ... | ... |
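The search over topic counts could be implemented with gensim roughly as follows (an approximation of the notebook, not its actual code; `corpus`, `dictionary`, and `texts` have to be built from the preprocessed articles):

```python
# Try increasing numbers of topics and keep the model with the best coherence.
from gensim.models import CoherenceModel, LdaModel

def find_topic_amount(corpus, dictionary, texts, start, end, interval=1, desired=None):
    best_model, best_score = None, -1.0
    for k in range(start, end + 1, interval):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
        score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if score > best_score:
            best_model, best_score = lda, score
        if desired is not None and score >= desired:   # desired_coherence early stop
            break
    return best_model, best_score
```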
The optimal number of clusters is determined by performing K-Means with multiple numbers of clusters. The definitive choice is made by calculating the elbow/knee of the distortion curve. The number of level 2 clusters is calculated independently for every level 1 cluster. We chose `min_df = 0.05` and `max_df = 0.6` for this dataset.
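A sketch of the elbow search for one level-1 cluster; the `kneed` package is used here for knee detection, which is an assumption about the implementation:

```python
# Fit K-Means for k = 2..max_clusters and pick the elbow of the distortion curve.
from kneed import KneeLocator
from sklearn.cluster import KMeans

def pick_cluster_amount(tfidf, max_clusters):
    ks = list(range(2, max_clusters + 1))
    distortions = [KMeans(n_clusters=k, random_state=42).fit(tfidf).inertia_ for k in ks]
    knee = KneeLocator(ks, distortions, curve="convex", direction="decreasing")
    return knee.elbow or max_clusters
```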
We applied K-Means to every main LDA cluster. Here you can see the detected elbow/knee for the distortion curve of `cluster_0-president_king_trump`:

You can find all additional distortion graphs for our dataset in the directory `./repo_images/kMeans_elbow_curves/`. Each detected elbow was used as the number of clusters for the K-Means clustering.
The complete resulting dataset contains ~268,000 clustered news articles from 03/2016 to 07/2021.
- You can download the sample dataset which has been crawled with this project by clicking here.
- You can download the already clustered dataset by clicking here.
- Felix Hamborg. 2020. news-please JSON Format. https://github.com/fhamborg/news-please/blob/master/newsplease/examples/sample.json
- Lukas Gebhard and Felix Hamborg. 2020. The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). Association for Computing Machinery, New York, NY, USA, 467–468. DOI: https://doi.org/10.1145/3383583.3398567