A comprehensive R-based analysis of text data comparing immersive and maladaptive daydreaming using topic modeling, sentiment analysis, and embedding-based approaches. This project processes Reddit posts from two communities and applies advanced NLP techniques to extract insights about cognitive and emotional patterns.
This study analyzes written descriptions from two Reddit communities:
- r/MaladaptiveDreaming - Maladaptive Dreaming (MD)
- r/ImmersiveDaydreaming - Immersive Daydreaming (ID)
The analysis pipeline includes text preprocessing, parallel embedding generation, topic modeling (LDA), sentiment/valence analysis, and visualization of findings across multiple dimensions.
## Project Structure

- `imports.r`: Initializes the R environment with all required packages and memory settings. Source this file first to ensure all dependencies are loaded.
- `preprocessing.r`: Modular text-cleaning functions using spaCy (via `spacyr`). Implements tokenization, lemmatization, part-of-speech filtering, and text normalization. Functions include `initialize_spacy()` and `clean_text_spacy()`.
- `data_descriptives.r`: Generates descriptive statistics, including valence t-tests, word frequencies, and basic dataset characterizations.
- `parallel_predictions.r`: Efficiently processes large text datasets using batch processing and parallel computing. Generates word embeddings and valence predictions across multiple CPU cores, splitting the data into configurable batches and exporting results to `.rds` files.
- `topicPlots.r`: Implements Latent Dirichlet Allocation (LDA) topic modeling with visualization. Produces topic distributions, prevalence plots, 1D valence distributions, and 2D subreddit-vs-valence scatter plots. Includes custom stopword handling.
- `sentences_dataset.r`: Processes sentence-level datasets and manages sentence embeddings.
- `sentences_analysis.r`: Analyzes sentence-level data, extracts example sentences, and builds lookup functions for mapping posts to authors. Generates result tables with representative sentences.
- `lollipop.r`: Creates lollipop-style visualizations for topic comparisons with colorblind-friendly palettes. Reorders topics by category (Immersive Daydreaming vs. Maladaptive Dreaming).
- `textTrainExamples.r`: Visualizes the most extreme sentence examples from the two datasets.
- `clustering_experiments.r`: Experimental clustering analyses and comparisons.
- `testLDA.r`: Test and validation scripts for the LDA implementations.
## Workflow

**Initialize Environment**

```r
source("imports.r")
```
**Preprocess Text Data**

```r
source("preprocessing.r")
initialize_spacy()
# Apply the cleaning functions to the raw text
```
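The cleaning step might look roughly like the following sketch. This is an assumption about what `clean_text_spacy()` does internally (the actual part-of-speech filters and normalization rules may differ); it uses `spacyr` to lemmatize and keep content words, as the preprocessing script is described to do.

```r
# Hypothetical sketch of spaCy-based cleaning via spacyr; the real
# clean_text_spacy() may use different POS filters and normalization.
library(spacyr)
library(dplyr)

spacy_initialize(model = "en_core_web_sm")

clean_text_sketch <- function(texts) {
  parsed <- spacy_parse(texts, lemma = TRUE, pos = TRUE)
  parsed %>%
    # Keep content words only; drop function words and punctuation
    filter(pos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
    group_by(doc_id) %>%
    summarise(clean = paste(tolower(lemma), collapse = " "),
              .groups = "drop")
}

clean_text_sketch(c("I was daydreaming for hours yesterday"))
```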
**Generate Embeddings and Valence Predictions**

```r
source("parallel_predictions.r")  # processes data in parallel batches
```
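The batch-and-parallelize pattern described above can be sketched as follows. The batch size, worker count, and the `posts$text` column are assumptions for illustration, not the script's actual configuration; the embedding call uses `textEmbed()` from the `text` package listed under requirements.

```r
# Hedged sketch of the batching pattern in parallel_predictions.r
# (batch size, core count, and data column are assumptions).
library(parallel)

batch_size <- 1000
batches <- split(posts$text,
                 ceiling(seq_along(posts$text) / batch_size))

cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(text))
results <- parLapply(cl, batches, function(batch) {
  textEmbed(batch)  # embeddings for one batch
})
stopCluster(cl)

# Cache results so they need not be recomputed
saveRDS(results, "data/embeddings.rds")
```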
**Analyze Descriptive Statistics**

```r
source("data_descriptives.r")
```
**Perform Topic Modeling**

```r
source("topicPlots.r")  # fits LDA models and creates visualizations using the 'topics' package
```
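For orientation, a generic LDA fit over cleaned text can be sketched with `quanteda` and the `topicmodels` package; note the project itself uses the `topics` package, whose API differs, and the column name `posts$clean_text` and `k = 20` are assumptions.

```r
# Generic LDA sketch (the project uses the 'topics' package instead;
# data column and number of topics are assumptions).
library(quanteda)
library(topicmodels)

toks <- tokens(posts$clean_text, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))       # plus any custom stopwords

dtm <- dfm(toks) |>
  dfm_trim(min_termfreq = 5) |>
  convert(to = "topicmodels")

lda <- LDA(dtm, k = 20, control = list(seed = 42))
terms(lda, 10)  # top 10 terms per topic
```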
**Create Sentence Dataset**

```r
source("sentences_dataset.r")  # builds sentence-level data and embeddings for downstream analysis
```
**Analyze Sentence-Level Patterns**

```r
source("sentences_analysis.r")  # extracts representative sentences and analyzes sentence-level valence
```
**Create Publication-Ready Visualizations**

```r
source("lollipop.r")
source("topicPlots.r")  # for additional plots
```
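A minimal lollipop plot of the kind `lollipop.r` produces can be built with `ggplot2` alone; the topic labels, estimates, and palette values below are illustrative assumptions, not the project's actual results.

```r
# Minimal lollipop-plot sketch; topic names, estimates, and colours
# are made-up placeholders for illustration only.
library(ggplot2)

topic_df <- data.frame(
  topic    = c("fantasy worlds", "distress", "music triggers", "time lost"),
  estimate = c(0.8, -0.6, 0.4, -0.9),
  group    = c("ID", "MD", "ID", "MD")
)

ggplot(topic_df,
       aes(x = estimate, y = reorder(topic, estimate), colour = group)) +
  geom_segment(aes(x = 0, xend = estimate,
                   yend = reorder(topic, estimate))) +
  geom_point(size = 3) +
  # Colorblind-friendly (Okabe-Ito) hues
  scale_colour_manual(values = c(ID = "#0072B2", MD = "#D55E00")) +
  labs(x = "Topic estimate", y = NULL, colour = "Community")
```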
## Requirements

- R: version 4.0 or higher recommended
- Python: spaCy with the English model (`en_core_web_sm`)
- Key R packages:
  - Topic modeling: `topics` (Latent Dirichlet Allocation models)
  - NLP/embeddings: `text` (word embeddings and text representation learning)
  - Text processing: `spacyr`, `stringr`, `quanteda`, `stopwords`
  - Data manipulation: `tidyverse`, `dplyr`, `tibble`, `purrr`
  - Visualization: `ggplot2`, `ggforce`
  - Parallel computing: `parallel`, `future`
  - Utilities: `reticulate`, `crayon`
## Installation

1. Clone the repository.
2. Install the R packages; see `imports.r` for the complete list.
3. Install the spaCy model:

   ```sh
   python -m spacy download en_core_web_sm
   ```
## Performance Notes

- For datasets with 10,000+ records, parallel processing is strongly recommended (see `parallel_predictions.r`).
- Adjust the batch size to the available RAM (typically 1,000-5,000 records per batch).
- LDA computation time scales with corpus size and the number of topics; start with a small number of topics for testing.
- Pre-computed embeddings are cached in `data/` to avoid recomputation.
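The caching behaviour mentioned above typically follows a compute-once pattern like this sketch; the exact file path and embedding call are assumptions rather than the project's actual code.

```r
# Hypothetical caching pattern: reuse the .rds file if present,
# otherwise compute and store (path and call are assumptions).
library(text)

emb_path <- "data/embeddings.rds"
if (file.exists(emb_path)) {
  embeddings <- readRDS(emb_path)
} else {
  embeddings <- textEmbed(posts$text)
  saveRDS(embeddings, emb_path)
}
```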
Part of the Maladaptive Dreaming Study research conducted at The Harmony Lab.