Download all the text comments from a subreddit
Use the script `subreddit_downloader.py` multiple times to download the data, then run the script `dataset_builder.py` to build a single dataset.

🖱 More info on the website and on Medium.
Basic usage to download the submissions and related comments from the subreddits AskReddit and News:

```bash
# Use python 3.8.5
# Install the dependencies
pip install -r requirements.txt

# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments posted after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459201

# Build the dataset, the results will be under the `./dataset/` path
python src/dataset_builder.py
```
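The `--utc-after` and `--utc-before` values are plain Unix timestamps. A quick way to compute them with the Python standard library (a small helper, not part of the repository scripts):

```python
# Compute a Unix timestamp to pass to --utc-after / --utc-before
from datetime import datetime, timezone

# 1 January 2021 at midnight UTC -> 1609459200
utc_after = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp())
print(utc_after)
```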
- The parameters indicated with `<...>` in the previous script are your Reddit API credentials, described in the table below
- Official Reddit guide: https://github.com/reddit-archive/reddit/wiki/OAuth2
- TL;DR: read this Stack Overflow answer
Parameter name | Description | How to get it | Example value |
---|---|---|---|
`reddit_id` | The client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
`reddit_secret` | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
`reddit_username` | The Reddit account name | The name you use to log in | pistoSniffer |
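For illustration, this is roughly how these three credentials are typically passed to a praw client. This is a hedged sketch, not the repository's actual code; replace the placeholder values with your own.

```python
# Illustrative sketch: building a praw client from the three credentials above
import praw

reddit = praw.Reddit(
    client_id="<reddit_id>",            # client ID generated from the apps page
    client_secret="<reddit_secret>",    # secret generated from the apps page
    user_agent="comments-downloader by u/<reddit_username>",  # user_agent built from the username
)
```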
`dataset_builder.py` creates a new folder with two CSV files. The script has some features:

- Removes rows with the same `id`
- Has a `caching_size` parameter so the whole dataset doesn't have to be kept in RAM (the idea is sketched below)
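As a rough illustration of that idea (not the actual `dataset_builder.py` implementation), deduplication with a bounded memory footprint can look like this:

```python
# Sketch: drop duplicate ids and flush to disk every `caching_size` rows,
# so the whole dataset never has to sit in RAM at once.
import csv

def deduplicate_rows(rows, output_path, caching_size=10_000):
    seen_ids = set()
    buffer = []
    writer = None
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        for row in rows:                     # `rows`: iterable of dicts with an "id" key
            if row["id"] in seen_ids:
                continue                     # skip rows with an already-seen id
            seen_ids.add(row["id"])
            buffer.append(row)
            if len(buffer) >= caching_size:  # flush the buffer to keep memory bounded
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=list(buffer[0].keys()))
                    writer.writeheader()
                writer.writerows(buffer)
                buffer.clear()
        if buffer:                           # flush the remaining rows
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(buffer[0].keys()))
                writer.writeheader()
            writer.writerows(buffer)
```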
The two CSV files have the following structure.

Each row of the submissions file is a submission of a specific subreddit; the `id` field is unique across the dataset (primary key).
Column name | Description | Example |
---|---|---|
subreddit | Name of the subreddit | MTB |
id | Unique identifier of the submission | lhr2bo |
created_utc | UTC when submission was created | 1613068060 |
title | Title of the submission | Must ride So... |
selftext | Text of the submission | What are the best trails to ride in... |
full_link | Reddit unique link to the submission | https://www.reddit.com/r/MTB/comments/lhr2bo/must_ride_so_cali_trails/ |
Each row of the comments file is a comment under a submission of a specific subreddit; the `id` field is unique across the dataset (primary key).
Column name | Description | Example |
---|---|---|
subreddit | Name of the subreddit | News |
id | Unique identifier of the comment | gmz45xo |
submission_id | Id of the submission the comment belongs to | lhr2bo |
body | Text of the comment | We're past the point... |
created_utc | UTC when comment was created | 1613072734 |
parent_id | Id of the parent in a tree structure | t3_lhssi4 |
permalink | Reddit unique link to the comment | /r/news/comments/lhssi4/air_force_wants_to_know_if_key_pacific_airfield/gmz45xo/ |
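For example, the two output files can be loaded and joined with pandas. The file names below are assumptions; use whatever `dataset_builder.py` actually writes under `./dataset/`.

```python
# Load the submissions and comments CSVs and attach each comment to its submission
import pandas as pd

submissions = pd.read_csv("dataset/submissions.csv")   # assumed file name
comments = pd.read_csv("dataset/comments.csv")         # assumed file name

# `submission_id` on a comment references the `id` of its submission
merged = comments.merge(
    submissions,
    left_on="submission_id",
    right_on="id",
    suffixes=("_comment", "_submission"),
)
print(merged[["title", "body", "created_utc_comment"]].head())
```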
- subreddit: a section of the Reddit website focused on a particular topic
- submission: a post that appears in a subreddit; when you open a subreddit page, these are the posts you see. Each submission has a tree of _comments_
- comment: text written by a Reddit user under a submission inside a subreddit
- The main goal of this repository is to gather the comments belonging to a subreddit
- Under the hood, the script uses pushshift to gather the submission ids and praw to collect the comments of each submission (a rough sketch of this flow is shown after this list)
- More info about the `subreddit_downloader.py` script is available under the `--help` command (output reported further below)
- Other packages used:
  - psaw: Python Pushshift.io API Wrapper
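A rough sketch of that two-step flow (illustrative only; the repository's scripts add batching, caching and error handling that is omitted here):

```python
# Step 1: pushshift (via psaw) enumerates submission ids;
# Step 2: praw downloads the comments of each submission.
import praw
from psaw import PushshiftAPI

reddit = praw.Reddit(
    client_id="<reddit_id>",
    client_secret="<reddit_secret>",
    user_agent="comments-downloader by u/<reddit_username>",
)
pushshift = PushshiftAPI()

# Step 1: gather submission ids from pushshift
submission_ids = [s.id for s in pushshift.search_submissions(subreddit="AskReddit", limit=10)]

# Step 2: collect the comments of each submission with praw
for sub_id in submission_ids:
    submission = reddit.submission(id=sub_id)
    submission.comments.replace_more(limit=0)  # 0 -> keep only the first page of comments
    for comment in submission.comments.list():
        print(comment.id, comment.body[:80])
```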
- [?] Empty data CSV:
  - Sometimes an empty CSV is produced under `/data/<subreddit>/<timestamp>/comments/xxx.csv`
  - This behaviour happens when a batch of submissions has no comments; you can verify it by opening the equivalent `/data/<subreddit>/<timestamp>/submissions/xxx.csv` file (same `xxx.csv` name) and opening the submission links (see the snippet below)
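A quick way to do that check with pandas (illustrative; the path below is the placeholder from above):

```python
# Print the links of the submissions in the matching batch, so they can be opened
# in a browser and verified to have no comments
import pandas as pd

batch = pd.read_csv("data/<subreddit>/<timestamp>/submissions/xxx.csv")  # placeholder path
for link in batch["full_link"]:
    print(link)
```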
- [?] The program is stuck and doesn't run:
  - Call the program with the `--debug` flag to see on which submission the program is freezing
  - Most probably the program is blocked on a submission with more than 10k comments, and the praw API needs to make a lot of requests to gather all the data (which takes a lot of time)
  - If you don't want to wait, or you want more control over the number of comments fetched per single submission, use the `--comments-cap` parameter:
    - If provided, the system asks the praw API for new comments `comments_cap` times instead of downloading all the comments (see the sketch after this list)
    - The higher the value, the more comments will be downloaded
    - Set it to 0 to download only the comments shown on the first page of the submission
    - Set it to 64 to be reasonably sure that a good amount of data will be downloaded
    - Tune the parameter as you prefer
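As the `--help` text below explains, `comments_cap` is passed to praw's `replace_more` as its `limit`; conceptually the call looks like this (a sketch, reusing the `reddit` client from the earlier example):

```python
# How --comments-cap bounds the number of extra praw requests per submission
comments_cap = 64                              # e.g. the value passed on the command line
submission = reddit.submission(id="lhr2bo")    # `reddit`: praw client as in the earlier sketch
submission.comments.replace_more(limit=comments_cap)  # at most `comments_cap` extra API calls
all_comments = submission.comments.list()
```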
Call the program with the `--help` flag to see all the available options:

```text
python src/subreddit_downloader.py --help
Usage: subreddit_downloader.py [OPTIONS] SUBREDDIT

  Download all the submissions and relative comments from a subreddit.

Arguments:
  SUBREDDIT  The subreddit name  [required]

Options:
  --output-dir TEXT        Optional output directory  [default: ./data/]
  --batch-size INTEGER     Request `batch_size` submission per time  [default: 10]
  --laps INTEGER           How many times request `batch_size` reddit submissions  [default: 3]
  --reddit-id TEXT         Reddit client_id, visit https://github.com/reddit-archive/reddit/wiki/OAuth2  [required]
  --reddit-secret TEXT     Reddit client_secret, visit https://github.com/reddit-archive/reddit/wiki/OAuth2  [required]
  --reddit-username TEXT   Reddit username, used to build the `user_agent` string, visit https://github.com/reddit-archive/reddit/wiki/API  [required]
  --utc-after TEXT         Fetch the submissions after this UTC date
  --utc-before TEXT        Fetch the submissions before this UTC date
  --comments-cap INTEGER   Some submissions have 10k> nested comments and stuck the praw API call. If provided, the system requires new comments `comments_cap` times to the praw API. `comments_cap` under the hood will be passed directly to the `replace_more` function as the `limit` parameter. For more info see the README and visit https://asyncpraw.readthedocs.io/en/latest/code_overview/other/commentforest.html#asyncpraw.models.comment_forest.CommentForest.replace_more.
  --debug / --no-debug     Enable debug logging  [default: False]
  --install-completion     Install completion for the current shell.
  --show-completion        Show completion for the current shell, to copy it or customize the installation.
  --help                   Show this message and exit.
```
Ideas for future improvements:

`dataset_builder.py`:
- store some dataset info (subreddit, max/min UTC and human-readable datetimes, number of lines)

`subreddit_downloader.py`:
- use async functions if possible, to gather more data concurrently
- load the user credentials in `subreddit_downloader.py` from a local config file
- store/log the UTC and human-readable datetime
  - use case: download all data from X datetime until now
- early stopping if no new data is fetched
- refactor `dataset_builder.py:_rows_parser`: find a more efficient approach to check `id` duplicates
  - maybe switch to pandas as the matrix manager
- should we switch to psaw?