Compare and explore two large social media datasets using text & topics

theharmonylab/maladaptivedreaming-study

Maladaptive vs Immersive Daydreaming Study

An R-based analysis of text data comparing immersive and maladaptive daydreaming using topic modeling, sentiment analysis, and embedding-based approaches. The project processes Reddit posts from two communities and applies these NLP techniques to characterize the cognitive and emotional patterns in each.

Project Overview

This study analyzes written descriptions from two Reddit communities:

  • r/MaladaptiveDreaming - Maladaptive Daydreaming (MD)
  • r/ImmersiveDaydreaming - Immersive Daydreaming (ID)

The analysis pipeline includes text preprocessing, parallel embedding generation, topic modeling (LDA), sentiment/valence analysis, and visualization of findings across multiple dimensions.

Core R Scripts

Data Processing & Preparation

  • imports.r: Initializes the R environment with all required packages and memory settings. Source this file first to ensure all dependencies are loaded.

  • preprocessing.r: Modular text cleaning functions using spaCy (via spacyr). Implements tokenization, lemmatization, part-of-speech filtering, and text normalization. Functions include initialize_spacy() and clean_text_spacy().

  • data_descriptives.r: Generates descriptive statistics including valence t-tests, word frequencies, and basic dataset characterizations.
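
The lemmatization and part-of-speech filtering described for preprocessing.r can be illustrated with a minimal sketch. This is not the repository's clean_text_spacy(): the helper name collapse_lemmas and the set of kept POS tags are assumptions made for illustration. The input is a data frame with the doc_id, token, lemma, and pos columns that spacyr::spacy_parse() returns; a hand-built table stands in for it here so the example runs without spaCy.

```r
# Hypothetical helper (not from preprocessing.r): keep lemmas whose POS tag is
# in `keep_pos`, lower-case them, and rebuild one cleaned string per document.
collapse_lemmas <- function(parsed, keep_pos = c("NOUN", "VERB", "ADJ")) {
  kept <- parsed[parsed$pos %in% keep_pos, , drop = FALSE]
  # Preserve the original document order, including documents left empty.
  doc_ids <- unique(parsed$doc_id)
  cleaned <- vapply(
    doc_ids,
    function(id) paste(tolower(kept$lemma[kept$doc_id == id]), collapse = " "),
    character(1)
  )
  data.frame(doc_id = doc_ids, clean_text = unname(cleaned),
             stringsAsFactors = FALSE)
}

# Stand-in for the output of spacyr::spacy_parse(texts, lemma = TRUE, pos = TRUE).
parsed <- data.frame(
  doc_id = c("text1", "text1", "text1", "text1", "text2", "text2", "text2"),
  token  = c("I", "imagined", "vivid", "stories", "It", "felt", "overwhelming"),
  lemma  = c("I", "imagine", "vivid", "story", "it", "feel", "overwhelming"),
  pos    = c("PRON", "VERB", "ADJ", "NOUN", "PRON", "VERB", "ADJ"),
  stringsAsFactors = FALSE
)

cleaned <- collapse_lemmas(parsed)
print(cleaned)
```

Working on the parsed table rather than on raw strings keeps the filtering step independent of the spaCy session, so it can be tested without loading a language model.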

Analysis & Modeling

  • parallel_predictions.r: Efficiently processes large text datasets using batch processing and parallel computing. Generates word embeddings and valence predictions across multiple CPU cores. Splits data into configurable batches and exports results to .rds files.

  • topicPlots.r: Implements Latent Dirichlet Allocation (LDA) topic modeling with visualization. Produces topic distributions, prevalence plots, 1D valence distributions, and 2D subreddit-vs-valence scatter plots. Includes custom stopword handling.

  • sentences_dataset.r: Processes sentence-level datasets and manages sentence embeddings.

  • sentences_analysis.r: Analyzes sentence-level data, extracts sentence examples, and builds lookup functions for mapping posts to authors. Generates result tables with representative sentences.

  • lollipop.r: Creates lollipop-style visualizations for topic comparisons with colorblind-friendly palettes. Reorders topics by category (Immersive Daydreaming vs Maladaptive Daydreaming).

  • textTrainExamples.r: Visualizes the most extreme sentence examples from the two datasets.
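
The batch-and-parallelize pattern described for parallel_predictions.r can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: make_batches and score_batch are hypothetical names, and score_batch only counts characters as a placeholder for the embedding and valence prediction step. The worker count, batch size, and output directory are likewise chosen for the demo.

```r
library(parallel)

# Split row indices 1..n into consecutive batches of at most `batch_size`.
make_batches <- function(n, batch_size) {
  split(seq_len(n), ceiling(seq_len(n) / batch_size))
}

# Placeholder for the expensive per-batch step (embedding + valence prediction).
# It only counts characters so that the sketch runs without any model.
score_batch <- function(texts) {
  data.frame(text = texts, n_chars = nchar(texts), stringsAsFactors = FALSE)
}

texts <- sprintf("example post number %d", 1:10)
batches <- make_batches(length(texts), batch_size = 4)

out_dir <- file.path(tempdir(), "batches")
dir.create(out_dir, showWarnings = FALSE)

cl <- makeCluster(2)  # two workers; adjust to the cores available
# Everything the workers need is passed explicitly as an argument.
batch_files <- parLapply(
  cl, seq_along(batches),
  function(i, texts, batches, out_dir, score_batch) {
    result <- score_batch(texts[batches[[i]]])
    path <- file.path(out_dir, sprintf("batch_%03d.rds", i))
    saveRDS(result, path)  # one .rds file per batch
    path
  },
  texts = texts, batches = batches, out_dir = out_dir, score_batch = score_batch
)
stopCluster(cl)

# Read the per-batch files back and combine them in batch order.
results <- do.call(rbind, lapply(batch_files, readRDS))
```

Writing each batch to its own file means an interrupted run loses at most the batches in flight, and completed batches can be skipped on a restart.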

Experiments

  • clustering_experiments.r: Experimental clustering analyses and comparisons.

  • testLDA.r: Test and validation scripts for LDA implementations.

Typical Workflow

  1. Initialize Environment

    source("imports.r")

  2. Preprocess Text Data

    source("preprocessing.r")
    initialize_spacy()
    # Use cleaning functions on raw text

  3. Generate Embeddings and Valence Predictions

    source("parallel_predictions.r")
    # Processes data in parallel batches

  4. Analyze Descriptive Statistics

    source("data_descriptives.r")

  5. Perform Topic Modeling

    source("topicPlots.r")
    # Generates LDA models and visualizations using the 'topics' package

  6. Create Sentence Dataset

    source("sentences_dataset.r")
    # Processes sentence-level data and generates sentence embeddings for downstream analysis

  7. Analyze Sentence-Level Patterns

    source("sentences_analysis.r")
    # Extracts representative sentences and analyzes sentence-level valence

  8. Create Publication-Ready Visualizations

    source("lollipop.r")
    source("topicPlots.r")  # For additional plots

Requirements

  • R: Version 4.0 or higher recommended
  • Python: spaCy with English model (en_core_web_sm)
  • Key R Packages:
    • Topic modeling: topics - For Latent Dirichlet Allocation (LDA) models
    • NLP/Embeddings: text - For word embeddings and text representation learning
    • Text processing: spacyr, stringr, quanteda, stopwords
    • Data manipulation: tidyverse, dplyr, tibble, purrr
    • Visualization: ggplot2, ggforce
    • Parallel computing: parallel, future
    • Utilities: reticulate, crayon

Installation

  1. Clone the repository
  2. Install R packages: See imports.r for the complete list
  3. Install spaCy model:
    python -m spacy download en_core_web_sm
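
Before sourcing imports.r, it can help to check which of the packages named under Requirements are already installed. The snippet below is a small sketch of such a check; missing_packages is a hypothetical helper, the package vector is copied from the Requirements list rather than from imports.r, and the commented-out install line assumes the packages are available from CRAN.

```r
# Package names copied from the Requirements list; imports.r remains the
# authoritative source.
required_pkgs <- c(
  "topics", "text", "spacyr", "stringr", "quanteda", "stopwords",
  "tidyverse", "dplyr", "tibble", "purrr", "ggplot2", "ggforce",
  "parallel", "future", "reticulate", "crayon"
)

# Return the packages that cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_pkgs)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # assumption: all are available from CRAN
} else {
  message("All required packages are installed.")
}
```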

Performance Notes

  • For datasets with 10,000+ records, parallel processing is strongly recommended (see parallel_predictions.r)
  • Batch size should be adjusted based on available RAM (typically 1,000-5,000 records per batch)
  • LDA computation time scales with corpus size and number of topics; start with smaller topic numbers for testing
  • Pre-computed embeddings are cached in data/ to avoid recomputation
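
The caching note above amounts to a compute-once pattern around .rds files. A minimal sketch follows, assuming a hypothetical helper named with_rds_cache and a throwaway path under tempdir() in place of the repository's data/ directory; fake_embed stands in for the real embedding step.

```r
# Hypothetical helper: read `path` if it exists, otherwise run `compute()`,
# save its result as .rds, and return it.
with_rds_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, path)
  result
}

# Stand-in for an expensive embedding step; counts calls to show cache hits.
n_calls <- 0
fake_embed <- function() {
  n_calls <<- n_calls + 1
  matrix(seq_len(6), nrow = 2)
}

cache_path <- file.path(tempdir(), "data", "embeddings_demo.rds")
unlink(cache_path)  # start from a clean state for the demo

first  <- with_rds_cache(cache_path, fake_embed)  # computes and saves
second <- with_rds_cache(cache_path, fake_embed)  # read from disk
```

A cache like this does not detect changes to the input data or model, so a stale file has to be deleted by hand when either changes.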

Contact & Attribution

Part of the Maladaptive Dreaming Study research conducted at The Harmony Lab.
