
Processing Pipeline

Sebastian Göttel edited this page Mar 1, 2025 · 3 revisions

RedTEI processes Reddit comments in three stages: filtering, extraction, and conversion. It offers two processing modes that determine how comments are saved.

Processing Modes

  • Grouped Mode (default): Comments are grouped by Reddit thread. All comments belonging to the same thread are processed together and stored in one JSON file and one XML file per thread.

  • No-Group Mode (--no-group): When the --no-group flag is set, the pipeline processes each comment individually. Each comment, regardless of its thread, is stored in its own JSON file and its own XML file.
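
As a rough illustration, a flag like --no-group could be exposed with Python's argparse as sketched below. Only the flag name comes from this page; the build_parser helper and the help text are assumptions, not RedTEI's actual command-line code.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the --no-group flag name is taken from this page.
    parser = argparse.ArgumentParser(description="Illustrative RedTEI-style options")
    parser.add_argument(
        "--no-group",
        action="store_true",
        help="store each comment in its own file instead of grouping by thread",
    )
    return parser


# Grouped mode is the default; passing the flag switches to no-group mode.
args = build_parser().parse_args(["--no-group"])
mode = "no-group" if args.no_group else "grouped"
print(mode)
```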


1. Filtering JSON Objects

Script: trim_username_comments.py

The trim_username_comments.py script filters and modifies JSON objects directly within the .zst archives, cleaning the comment data before any further processing.

JSON objects are excluded from further processing if they meet any of the following criteria:

  • Deleted/Removed Comments: Comments where the "body" key value is "[removed]", "[deleted]", or "[removed by reddit]" are excluded. These are comments that were deleted by their authors or removed by moderators or by Reddit itself.

  • Bot Authors: Comments authored by known bots are excluded. The script uses a list of bot usernames defined in src/config/botlist.txt. If the value of the "author" key matches a bot name in this list, the comment is excluded.

  • RemindMeBot Requests: Comments that are identified as requests to the RemindMeBot (e.g., !RemindMe 2 days) are filtered out.

  • URL-Only Comments: Comments that consist solely of a plaintext URL, or that contain only [URL] placeholders (optionally followed by punctuation or whitespace) once URLs have been replaced, are excluded. This filter removes low-content comments that do little more than drop a link.
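
The exclusion criteria above can be sketched as a single predicate. This is a minimal illustration, not the code of trim_username_comments.py: the regular expressions, the should_exclude name, and the way the bot list is passed in (as a set of usernames rather than read from src/config/botlist.txt) are all assumptions.

```python
import re

# Body values that mark a deleted or removed comment (taken from the list above).
REMOVED_BODIES = {"[removed]", "[deleted]", "[removed by reddit]"}

# Hypothetical patterns; the script's actual expressions may differ.
REMINDME_RE = re.compile(r"^\s*!remindme\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
URL_ONLY_RE = re.compile(r"^(?:\s*\[URL\][\s.,;:!?]*)+$")


def should_exclude(comment: dict, bot_names: set) -> bool:
    """Return True if the comment meets any of the exclusion criteria."""
    body = comment.get("body", "")
    if body.strip() in REMOVED_BODIES:
        return True
    if comment.get("author") in bot_names:
        return True
    if REMINDME_RE.match(body):
        return True
    # Replace plaintext URLs first, then check whether only placeholders remain.
    if URL_ONLY_RE.match(URL_RE.sub("[URL]", body)):
        return True
    return False
```

A comment such as "See http://example.com for details" passes this check, because text remains once the URL is replaced.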

In addition to exclusion, the script applies the following modifications to the comment text ("body"):

  • Quote Removal: Quotations within comments are removed to clean up the text and focus on original content.

  • URL Removal and Replacement:

    • Plaintext URLs (e.g., http://example.com) are replaced with the placeholder [URL].
    • Markdown URLs (e.g., [Example](https://example.com)) are processed to keep the link text (e.g., Example), unless the link text itself is also a URL, in which case it is replaced with [URL].
  • Inline Formatting Removal:

    • Bold Text: Double asterisks ** surrounding text are removed, preserving the text (e.g., **Text** becomes Text).
    • Italic Text: Single asterisks * surrounding text are removed, preserving the text (e.g., *Text* becomes Text).
    • Strikethrough Text: Tildes ~~ and the enclosed strikethrough text are completely removed.
  • Zero-Width Space Removal: All zero-width space characters (\u200B) are removed to ensure cleaner text.

  • Newline Reduction: Multiple consecutive newline characters are reduced to a single newline character to normalize spacing.

  • Empty Comment Discarding: After all modifications, comments that are empty or consist only of whitespace are discarded and not processed further.
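
The modifications listed above could be chained roughly as follows. This is a sketch under stated assumptions: quotes are taken to be Markdown lines starting with ">", the order of the steps is a guess, and clean_body and all patterns are hypothetical names rather than the script's own.

```python
import re

# Hypothetical patterns; the real script may use different expressions.
QUOTE_RE = re.compile(r"^[ \t]*>.*$", re.MULTILINE)  # assumes Markdown ">" quotes
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
URL_RE = re.compile(r"https?://\S+")
STRIKE_RE = re.compile(r"~~.*?~~", re.DOTALL)
BOLD_RE = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)
ITALIC_RE = re.compile(r"\*(.+?)\*", re.DOTALL)


def _link_text(match):
    # Keep the link text unless the text is itself a URL.
    text = match.group(1)
    return "[URL]" if URL_RE.fullmatch(text.strip()) else text


def clean_body(body):
    """Apply the text modifications; return None if nothing is left."""
    body = QUOTE_RE.sub("", body)            # quote removal
    body = MD_LINK_RE.sub(_link_text, body)  # Markdown links -> link text
    body = URL_RE.sub("[URL]", body)         # plaintext URLs -> placeholder
    body = STRIKE_RE.sub("", body)           # drop strikethrough text entirely
    body = BOLD_RE.sub(r"\1", body)          # **Text** -> Text
    body = ITALIC_RE.sub(r"\1", body)        # *Text* -> Text
    body = body.replace("\u200b", "")        # zero-width spaces
    body = re.sub(r"\n{2,}", "\n", body)     # collapse repeated newlines
    return body if body.strip() else None    # discard empty comments
```

Bold markers are stripped before italic markers so that the double asterisks are not mistaken for two italic delimiters.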

All filtering and modification actions (except inline formatting removals) are logged in a text file named filtered_log_{input_filename}.txt. This log documents the original content of comments that were filtered or modified, providing transparency and traceability to the filtering process. A single comment might appear multiple times in the log file if it was subject to multiple filtering actions (e.g., quote removal and URL removal).


2. Extraction to JSON Files

Scripts: comment_tree.py & comment_processing.py

The scripts comment_tree.py and comment_processing.py work together to extract comments from the filtered .zst files and store them as JSON files. The format and structure of these JSON files depend on the chosen processing mode (grouped or no-group).

Grouped Mode (default):

  • Thread-Based Grouping: Comments are grouped into threads based on their link_id. All comments sharing the same link_id are part of the same thread.
  • Single JSON File per Thread: For each thread, all its comments are stored in a single JSON file.
  • Flat List of Comments: Within each JSON file, comments are stored as a flat list of JSON objects. The original tree-like structure of replies and nested comments within the thread is not preserved.
  • Filename Convention: JSON files are named using the link_id of the thread, followed by _flat.json. For example, a file for thread ID 10ax890 would be named 10ax890_flat.json.
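
A minimal sketch of the grouped output, assuming the comments are already available as Python dicts and that link_id already has the form used in the filenames. The function names are hypothetical and do not reflect the internals of comment_tree.py or comment_processing.py.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_thread(comments):
    """Collect comments into flat lists keyed by their link_id."""
    threads = defaultdict(list)
    for comment in comments:
        threads[comment["link_id"]].append(comment)
    return dict(threads)


def write_grouped(comments, out_dir):
    """Write one <link_id>_flat.json file per thread and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for link_id, thread in group_by_thread(comments).items():
        path = out_dir / f"{link_id}_flat.json"
        # Each file holds a flat list; reply nesting is not reconstructed.
        path.write_text(json.dumps(thread, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```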

No-Group Mode (--no-group):

  • Individual Comment Processing: Each comment is processed and extracted independently, without considering thread groupings.
  • Separate JSON File per Comment: Each comment is stored in its own JSON file.
  • Single JSON Object per File: Each JSON file contains exactly one JSON object, representing a single comment.
  • Filename Convention: JSON files are named using a combination of the link_id and the id of the comment, separated by an underscore and with the .json extension. For example, a comment with link_id 10wugax and id jepcf1r would be named 10wugax_jepcf1r.json.
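
The no-group naming convention can be illustrated with a short sketch. The helper names are hypothetical, and the comments are assumed to be dicts with link_id and id keys in the form shown in the example above.

```python
import json
from pathlib import Path


def comment_filename(comment):
    """Build the <link_id>_<id>.json name for a single comment."""
    return f"{comment['link_id']}_{comment['id']}.json"


def write_ungrouped(comments, out_dir):
    """Write every comment to its own JSON file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for comment in comments:
        path = out_dir / comment_filename(comment)
        # One JSON object per file, not a list.
        path.write_text(json.dumps(comment, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```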

3. Conversion to XML

Script: json2xml.py

This script converts the JSON files into TEI-XML format. The XML structure depends on the processing mode:

Grouped Mode (default):

  • Creates one XML file per thread.
  • TEI Header: Contains metadata about the thread (title, subreddit, date of last comment, thread URL), extracted from the first comment in the JSON file.
  • XML Body:
    • Comments are grouped in a <list> element within a <div type="comments">.
    • Each comment is represented as an <item> element within the <list>.
    • <item> elements contain:
      • Comment text
      • <date> (creation date)
      • <name> (author)
      • source attribute (comment URL)
    • Line breaks in the comment text are encoded as <lb/>.
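
The grouped body structure could be built with Python's xml.etree.ElementTree along these lines. This is a sketch, not json2xml.py itself: the TEI namespace and header are omitted, the order of the children inside <item> is a guess, and the "url" and "date" keys are hypothetical field names.

```python
import xml.etree.ElementTree as ET


def add_text_with_lb(parent, text):
    """Set text on an empty element, encoding each newline as <lb/>."""
    first, *rest = text.split("\n")
    parent.text = first
    for line in rest:
        lb = ET.SubElement(parent, "lb")
        lb.tail = line  # text following the line break


def build_comments_div(comments):
    """Return <div type="comments"> holding one <item> per comment."""
    div = ET.Element("div", {"type": "comments"})
    lst = ET.SubElement(div, "list")
    for c in comments:
        # "url" and "date" are assumed keys, used here for illustration only.
        item = ET.SubElement(lst, "item", {"source": c["url"]})
        add_text_with_lb(item, c["body"])
        ET.SubElement(item, "date").text = c["date"]
        ET.SubElement(item, "name").text = c["author"]
    return div
```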

No-Group Mode (--no-group):

  • Creates one XML file per comment.
  • TEI Header: Contains more detailed metadata about the comment and thread (title of the thread, subreddit, date of the comment, thread and comment URLs).
  • XML Body:
    • Each comment is placed as a single <p> element directly in the <body>.
    • The <p> element contains the comment text. The body contains no <item>, <list>, <date>, or <name> elements.
    • Line breaks in the comment text are encoded as <lb/>.
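
For comparison, the no-group body reduces to a single <p> inside <body>. The sketch below again uses xml.etree.ElementTree and leaves out the TEI namespace and header; build_body is a hypothetical helper, not a function of json2xml.py.

```python
import xml.etree.ElementTree as ET


def build_body(comment_text):
    """Return <body> with one <p>; newlines become <lb/> elements."""
    body = ET.Element("body")
    p = ET.SubElement(body, "p")
    first, *rest = comment_text.split("\n")
    p.text = first
    for line in rest:
        # The text after each line break is stored as the tail of <lb/>.
        ET.SubElement(p, "lb").tail = line
    return body
```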

The script validate.py is used to check the validity of the XML files.
