Processing Pipeline
RedTEI processes Reddit comments in three stages: filtering, extraction, conversion. It offers two main processing modes that affect how comments are saved.
- Grouped Mode (default): Comments are grouped by Reddit thread. All comments belonging to the same thread are processed and stored together in a single output file (both JSON and XML).
- No-Group Mode (--no-group): When the --no-group flag is set, the pipeline processes each comment individually. Each comment, regardless of its thread, is processed and stored as a separate file (both JSON and XML).
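As a rough illustration of how the mode switch could be exposed on the command line, here is a minimal argparse sketch. Only the --no-group flag comes from this README; the parse_args helper and the help text are assumptions, and the project's actual entry point may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical helper: parse the mode flag described above."""
    parser = argparse.ArgumentParser(description="Illustrative mode switch")
    # store_true makes grouped mode the default (flag absent -> False).
    parser.add_argument(
        "--no-group",
        action="store_true",
        help="process and store each comment individually",
    )
    return parser.parse_args(argv)


args = parse_args(["--no-group"])
mode = "no-group" if args.no_group else "grouped"
```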
Script: trim_username_comments.py
The trim_username_comments.py script is responsible for filtering and modifying JSON objects directly within the .zst archives before further processing. This step is important for cleaning and preparing the comment data.
JSON objects are excluded from further processing if they meet any of the following criteria:
- Deleted/Removed Comments: Comments where the "body" key value is "[removed]", "[deleted]", or "[removed by reddit]" are excluded. These are comments that were removed by moderators or Reddit itself.
- Bot Authors: Comments authored by known bots are excluded. The script uses a list of bot usernames defined in src/config/botlist.txt. If the value of the "author" key matches a bot name in this list, the comment is excluded.
- RemindMeBot Requests: Comments that are identified as requests to the RemindMeBot (e.g., !RemindMe 2 days) are filtered out.
- URL-Only Comments: Comments that consist solely of a plaintext URL, or that, after URL replacement, contain only [URL] placeholders (optionally followed by punctuation or whitespace), are excluded. This filter removes low-content comments that mainly serve to drop links.
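The exclusion criteria above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names, the regular expressions, and the example bot name are hypothetical, and the real bot list is read from src/config/botlist.txt.

```python
import re

REMOVED_BODIES = {"[removed]", "[deleted]", "[removed by reddit]"}
# Assumed patterns; the script's real expressions may differ.
REMINDME_RE = re.compile(r"^\s*!remindme\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")


def is_url_only(body):
    """True if nothing but [URL] placeholders, punctuation, or whitespace remains."""
    replaced = URL_RE.sub("[URL]", body)
    if "[URL]" not in replaced:
        return False
    remainder = replaced.replace("[URL]", "")
    return re.search(r"\w", remainder) is None


def should_exclude(comment, bots):
    """Apply the four exclusion criteria to one comment dict."""
    body = comment.get("body", "")
    if body.strip() in REMOVED_BODIES:
        return True
    if comment.get("author") in bots:
        return True
    if REMINDME_RE.match(body):
        return True
    return is_url_only(body)


bots = {"ExampleBot"}  # hypothetical entry standing in for botlist.txt
keep = not should_exclude({"author": "alice", "body": "Nice write-up!"}, bots)
```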
In addition to exclusion, the script applies the following modifications to the comment text ("body"):
- Quote Removal: Quotations within comments are removed to clean up the text and focus on original content.
- URL Removal and Replacement:
  - Plaintext URLs (e.g., http://example.com) are replaced with the placeholder [URL].
  - Markdown URLs (e.g., [Example](https://example.com)) are processed to keep the link text (e.g., Example), unless the link text itself is also a URL, in which case it is replaced with [URL].
- Inline Formatting Removal:
  - Bold Text: Double asterisks ** surrounding text are removed, preserving the text (e.g., **Text** becomes Text).
  - Italic Text: Single asterisks * surrounding text are removed, preserving the text (e.g., *Text* becomes Text).
  - Strikethrough Text: Double tildes ~~ and the enclosed strikethrough text are completely removed.
- Zero-Width Space Removal: All Zero-Width Space characters (\u200B) are removed to ensure cleaner text.
- Newline Reduction: Multiple consecutive newline characters are reduced to a single newline character to normalize spacing.
- Empty Comment Discarding: After all modifications, comments that are empty or consist only of whitespace are discarded and not processed further.
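The modifications listed above could be chained roughly like this. It is a minimal sketch, not the script's implementation: the clean_body name, the regular expressions, the order of the steps, and the assumption that quotes are lines starting with ">" are all illustrative.

```python
import re

MD_LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
URL_RE = re.compile(r"https?://\S+")


def clean_body(body):
    """Return the cleaned comment text, or None if nothing is left."""
    # Quote removal (assumption: quoted lines start with ">").
    lines = [ln for ln in body.split("\n") if not ln.lstrip().startswith(">")]
    text = "\n".join(lines)

    # Markdown links: keep the link text unless it is itself a URL.
    def replace_link(match):
        label = match.group(1).strip()
        return "[URL]" if URL_RE.fullmatch(label) else label

    text = MD_LINK_RE.sub(replace_link, text)
    # Remaining plaintext URLs become placeholders.
    text = URL_RE.sub("[URL]", text)
    # Strikethrough: markers and enclosed text are dropped.
    text = re.sub(r"~~.*?~~", "", text, flags=re.DOTALL)
    # Bold before italic, so ** is not consumed as two single asterisks.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    # Zero-width spaces, then runs of newlines.
    text = text.replace("\u200b", "")
    text = re.sub(r"\n{2,}", "\n", text)
    # Empty or whitespace-only results are discarded.
    return text if text.strip() else None
```

Running bold removal before italic removal matters in this sketch: with the order reversed, the single-asterisk pattern would match inside the ** markers and leave stray asterisks behind.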
All filtering and modification actions (except inline formatting removals) are logged in a text file named filtered_log_{input_filename}.txt. This log documents the original content of comments that were filtered or modified, providing transparency and traceability to the filtering process. A single comment might appear multiple times in the log file if it was subject to multiple filtering actions (e.g., quote removal and URL removal).
Scripts: comment_tree.py & comment_processing.py
The scripts comment_tree.py and comment_processing.py work together to extract comments from the filtered .zst files and store them as JSON files. The format and structure of these JSON files depend on the chosen processing mode (grouped or no-group).
Grouped Mode (default):
- Thread-Based Grouping: Comments are grouped into threads based on their link_id. All comments sharing the same link_id are part of the same thread.
- Single JSON File per Thread: For each thread, all its comments are stored in a single JSON file.
- Flat List of Comments: Within each JSON file, comments are stored as a flat list of JSON objects. The original tree-like structure of replies and nested comments within the thread is not preserved.
- Filename Convention: JSON files are named using the link_id of the thread, followed by _flat.json. For example, a file for thread ID 10ax890 would be named 10ax890_flat.json.
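A small sketch of the grouping step described above, assuming each comment is a dict whose link_id already has the form shown in the example. The group_by_thread helper is hypothetical and returns the filename-to-content mapping rather than writing files, to keep the example self-contained.

```python
import json
from collections import defaultdict


def group_by_thread(comments):
    """Map each output filename to the flat list of comments in that thread."""
    threads = defaultdict(list)
    for comment in comments:
        threads[comment["link_id"]].append(comment)
    return {f"{link_id}_flat.json": items for link_id, items in threads.items()}


comments = [
    {"link_id": "10ax890", "id": "c1", "body": "first"},
    {"link_id": "10ax890", "id": "c2", "body": "reply"},
]
grouped = group_by_thread(comments)
# Each value serializes to a flat JSON array; reply nesting is not kept.
payload = json.dumps(grouped["10ax890_flat.json"], ensure_ascii=False)
```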
No-Group Mode (--no-group):
- Individual Comment Processing: Each comment is processed and extracted independently, without considering thread groupings.
- Separate JSON File per Comment: Each comment is stored in its own JSON file.
- Single JSON Object per File: Each JSON file contains only one JSON object (= a single comment).
- Filename Convention: JSON files are named using a combination of the link_id and the id of the comment, separated by an underscore and with the .json extension. For example, a comment with link_id 10wugax and id jepcf1r would be named 10wugax_jepcf1r.json.
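For comparison, the no-group naming scheme can be sketched the same way. The no_group_outputs helper is a hypothetical name, and the sketch returns serialized strings instead of writing files.

```python
import json


def no_group_outputs(comments):
    """Map each per-comment filename to a JSON string holding one object."""
    return {
        f"{c['link_id']}_{c['id']}.json": json.dumps(c, ensure_ascii=False)
        for c in comments
    }


outputs = no_group_outputs(
    [{"link_id": "10wugax", "id": "jepcf1r", "body": "hello"}]
)
```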
Script: json2xml.py
This script converts the JSON files into TEI-XML format. The XML structure depends on the processing mode:
Grouped Mode (default):
- Creates one XML file per thread.
- TEI Header: Contains metadata about the thread (title, subreddit, date of last comment, thread URL), extracted from the first comment in the JSON file.
- XML Body:
  - Comments are grouped in a <list> element within a <div type="comments">.
  - Each comment is represented as an <item> element within the <list>.
  - <item> elements contain:
    - Comment text
    - <date> (creation date)
    - <name> (author)
    - source attribute (comment URL)
  - Line breaks in the comment text are encoded as <lb/>.
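The grouped body structure above could be built with the standard library roughly as follows. This is a sketch under assumptions: the dict keys url, date, and author, the order of children inside <item>, and the omission of the TEI namespace and header are illustrative choices, not taken from json2xml.py.

```python
import xml.etree.ElementTree as ET


def add_text_with_breaks(parent, text):
    """Store text as mixed content, encoding each newline as an <lb/> child."""
    lines = text.split("\n")
    parent.text = lines[0]
    for line in lines[1:]:
        lb = ET.SubElement(parent, "lb")
        lb.tail = line  # text following the <lb/> element


def build_comments_div(comments):
    """Build <div type="comments"><list><item>...</item></list></div>."""
    div = ET.Element("div", {"type": "comments"})
    lst = ET.SubElement(div, "list")
    for c in comments:
        item = ET.SubElement(lst, "item", {"source": c["url"]})
        add_text_with_breaks(item, c["body"])
        ET.SubElement(item, "date").text = c["date"]
        ET.SubElement(item, "name").text = c["author"]
    return div


div = build_comments_div([
    {
        "url": "https://example.com/comment/c1",  # hypothetical comment URL
        "body": "line one\nline two",
        "date": "2023-01-01",
        "author": "alice",
    }
])
xml_text = ET.tostring(div, encoding="unicode")
```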
No-Group Mode (--no-group):
- Creates one XML file per comment.
- TEI Header: Contains more detailed metadata about the comment and thread (title of the thread, subreddit, date of the comment, thread and comment URLs).
- XML Body:
  - Each comment is placed as a single <p> element directly in the <body>.
  - The <p> element contains the comment text. The comment body contains no <item>, <list>, <date>, or <name> elements.
  - Line breaks in the comment text are encoded as <lb/>.
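The simpler no-group body can be sketched in the same style. The build_comment_body name is hypothetical, and the TEI namespace and header are again left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_comment_body(text):
    """Build <body><p>...</p></body>, encoding each newline as <lb/>."""
    body = ET.Element("body")
    p = ET.SubElement(body, "p")
    lines = text.split("\n")
    p.text = lines[0]
    for line in lines[1:]:
        ET.SubElement(p, "lb").tail = line
    return body


body = build_comment_body("first line\nsecond line")
```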
The script validate.py is used to check the validity of the XML files.