ESE 5460: Principles of Deep Learning
This repository contains our implementation of multimodal misinformation detection models that jointly analyze text and images from social media posts. We evaluate unimodal and multimodal baselines on the Fakeddit dataset and explore CLIP-guided multimodal fusion to improve fake news detection performance.
- Ravi Raghavan (rr1133@seas.upenn.edu)
- Dhruv Verma (vdhruv@seas.upenn.edu)
- Raafae Zaki (rzaki2@seas.upenn.edu)
We use the Fakeddit dataset, a large-scale multimodal dataset collected from Reddit, designed for fine-grained fake news and misinformation classification.
- Text: The text associated with each Reddit post
- Images: Images attached to the corresponding Reddit post
- Labels: Fine-grained misinformation categories indicating the credibility and intent of the post
Note to Instructor: Dataset details follow the setup described in r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
Note to Instructor: Due to file size limitations, the dataset is not included in this repository. Instead, we provide the instructions below for accessing the data directly from the original Google Drive source.
Following the instructions from the r/Fakeddit paper, we obtained the dataset from the official Fakeddit GitHub repository.
The repository provides a link to the dataset's Google Drive Folder.
Since our project focuses on multimodal analysis, we use only the multimodal samples, which contain both text and images.
Specifically, we downloaded the following files from the Google Drive link above and stored them locally in a folder named data/:
- multimodal_train.tsv
- multimodal_validate.tsv
- multimodal_test_public.tsv
Local File Structure
data/
├── multimodal_train.tsv
├── multimodal_validate.tsv
└── multimodal_test_public.tsv
Note to Instructor: We considered the different label schemes provided in the Fakeddit dataset:
`2_way_label`, `3_way_label`, and `6_way_label`. Since our primary goal is binary classification, we used `2_way_label` for all experiments. In our experiments, a label of 0 corresponds to fake news, while a label of 1 corresponds to non-fake news. The other label schemes (`3_way_label` and `6_way_label`) provide finer-grained categorizations of misinformation, but they were not used in our models to maintain focus on the binary detection task.
This section describes the preprocessing steps applied to the multimodal Fakeddit dataset prior to modeling and analysis.
First, the training, validation, and test splits were loaded from TSV files into pandas DataFrames. Basic inspections (e.g., previews and summary statistics) were performed to verify successful loading, confirm schema consistency, and identify missing or malformed entries.
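For reference, a minimal loading-and-inspection sketch along these lines (the exact inspection calls in our notebooks may differ):

```python
# Load the three Fakeddit multimodal splits from the data/ folder described above.
import pandas as pd

train_df = pd.read_csv("data/multimodal_train.tsv", sep="\t")
val_df = pd.read_csv("data/multimodal_validate.tsv", sep="\t")
test_df = pd.read_csv("data/multimodal_test_public.tsv", sep="\t")

# Basic inspection: shape, schema, and missing values per split.
for name, df in [("train", train_df), ("validation", val_df), ("test", test_df)]:
    print(name, df.shape)
    print(df.dtypes)
    print(df.isna().sum())
```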
Next, we performed feature selection to retain only columns relevant to multimodal misinformation detection. Identifier fields, redundant metadata, and features unlikely to contribute meaningful predictive signal were removed. We retained textual content, image-related fields, engagement and contextual metadata, and the target labels. The same feature subset was applied consistently across all dataset splits to ensure a uniform schema.
We then applied sanity checks tailored to multimodal learning. Samples missing either textual content or image URLs were removed, ensuring that every example contained both modalities. Entries with missing or invalid labels were also discarded to maintain valid supervision. After filtering, indices were reset to keep the DataFrames clean and contiguous.
We then converted data types to ensure consistency across the dataset and avoid downstream errors. Text-based fields were explicitly cast to strings, while Unix timestamps were converted into datetime objects to enable temporal analysis. Enforcing consistent data types across all splits ensures that subsequent preprocessing, feature engineering, and modeling steps can operate reliably without additional type handling.
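The sketch below illustrates these cleaning steps (feature selection, modality/label sanity checks, and type conversion) on the DataFrames loaded above. The column names (e.g., `clean_title`, `image_url`, `created_utc`, `2_way_label`) follow the Fakeddit schema, but the exact subset kept in our notebook may differ.

```python
import pandas as pd

# Columns assumed for illustration; verify against the TSV header.
KEEP_COLS = ["id", "clean_title", "image_url", "num_comments", "score",
             "upvote_ratio", "subreddit", "created_utc", "2_way_label"]

def clean_split(df: pd.DataFrame) -> pd.DataFrame:
    df = df[[c for c in KEEP_COLS if c in df.columns]].copy()
    # Keep only samples with both modalities and a valid binary label.
    df = df.dropna(subset=["clean_title", "image_url", "2_way_label"])
    df = df[df["2_way_label"].isin([0, 1])]
    # Enforce consistent dtypes: text as str, Unix timestamps as datetimes.
    df["clean_title"] = df["clean_title"].astype(str)
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
    return df.reset_index(drop=True)

train_df, val_df, test_df = map(clean_split, (train_df, val_df, test_df))
```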
Because images in the Fakeddit dataset are provided as URLs rather than raw image files, we crawled these URLs to download the corresponding image for each sample. Since some URLs were no longer accessible or failed to download, we intentionally sampled a larger number of examples per split to ensure sufficient usable data after crawling.
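A hypothetical downloader in this spirit is sketched below; our actual crawler adds retries, timeouts, and logging, so treat this purely as an illustration (the `id` and `image_url` column names and the `<id>.jpg` naming are assumptions).

```python
import os
import requests

def download_images(df, out_dir="train_images", timeout=10):
    # Fetch each image_url and save it as <id>.jpg; return the ids that failed.
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for row in df.itertuples():
        path = os.path.join(out_dir, f"{row.id}.jpg")
        try:
            resp = requests.get(row.image_url, timeout=timeout)
            resp.raise_for_status()
            with open(path, "wb") as f:
                f.write(resp.content)
        except Exception:
            failed.append(row.id)  # dead URL, timeout, or non-image response
    return failed
```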
After cleaning, we performed dataset subsampling to accommodate compute constraints and potential image retrieval failures. We initially sampled a larger subset from each split while preserving the original label distribution. Specifically, we used:
- 50,000 samples for training
- 50,000 samples for validation
- 50,000 samples for testing
The subsampling procedure was stratified by the target label, ensuring that class proportions remained consistent across all splits. This approach ensured that even if a portion of images failed to download, we still retained ample multimodal data for training and evaluation.
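One way to implement this stratified subsampling is sketched below (the sample size, seed, and label column follow the description above; the notebook's exact implementation may differ).

```python
def stratified_sample(df, n, label_col="2_way_label", seed=42):
    # Sample ~n rows while preserving the label proportions of df.
    frac = min(n / len(df), 1.0)
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed))
              .reset_index(drop=True))

train_50k = stratified_sample(train_df, 50_000)
val_50k = stratified_sample(val_df, 50_000)
test_50k = stratified_sample(test_df, 50_000)
```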
Together, these preprocessing steps produce a clean, consistent, and fully multimodal dataset suitable for reproducible experimentation.
Source Code Reference: The code for data preprocessing was implemented in the following Jupyter Notebook.
After crawling the image URLs and removing samples where image retrieval failed, the final dataset sizes for each split were as follows:
- Train: 33,324
- Validation: 33,316
- Test: 33,519
These final splits maintain the original label distributions and ensure that every example contains both textual and visual modalities. The resulting dataset is clean, consistent, and fully multimodal, providing a robust foundation for downstream modeling and experimentation.
To further reduce computational overhead during training, we performed additional downsampling of the validation and test splits. Although the initial subsampled splits contained approximately 33,000 examples each, evaluating the model at frequent checkpoints—every 50–100 weight updates—proved to be prohibitively expensive.
To address this, we randomly downsampled the validation and test sets to 5,000 samples each, while ensuring that the original label distributions were preserved. This stratified downsampling allowed for efficient evaluation without distorting class proportions.
By reducing the size of these splits, we were able to maintain frequent model checkpointing and monitoring of validation performance, while keeping the training loop tractable. This approach balances evaluation efficiency with representativeness during model development.
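For illustration, an equivalent stratified downsampling can be done with scikit-learn (our notebook may use a different utility, but the effect is the same):

```python
from sklearn.model_selection import train_test_split

# Keep 5,000 rows per split, stratified on the binary label.
val_5k, _ = train_test_split(val_df, train_size=5000,
                             stratify=val_df["2_way_label"], random_state=42)
test_5k, _ = train_test_split(test_df, train_size=5000,
                              stratify=test_df["2_way_label"], random_state=42)
```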
Source Code Reference:
After crawling the image URLs, some images could not be opened because they were corrupted. To handle this, we used the notebooks listed below to identify which samples in the dataset were corrupted. In every model we subsequently train, when loading the data, we filter out these corrupted samples to ensure that only valid images are used during training and evaluation.
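An illustrative corruption check using Pillow is shown below; `verify()` raises on truncated or otherwise unreadable files. The image directory layout and `<id>.jpg` naming are assumptions for the sketch, not the notebooks' exact code.

```python
import os
from PIL import Image

def find_corrupted(df, image_dir):
    # Return the filenames and DataFrame indices of unreadable images.
    bad_files, bad_indices = [], []
    for idx, row in df.iterrows():
        path = os.path.join(image_dir, f"{row['id']}.jpg")
        try:
            with Image.open(path) as img:
                img.verify()  # raises if the file is corrupted
        except Exception:
            bad_files.append(os.path.basename(path))
            bad_indices.append(idx)
    return bad_files, bad_indices
```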
Source Code Reference:
Before developing our main CLIP-based models, we began by implementing several baseline models. The purpose of these baselines was to establish reference performance levels and provide a point of comparison for our more complex multimodal models.
- Utilized the pretrained BERT (bert-base-uncased) model to encode only the text portion of each Reddit post.
- Added a classification head on top of the [CLS] token embedding to perform binary classification (a minimal sketch follows this list).
- Compared Pretrained BERT versus fully fine-tuned BERT, allowing us to evaluate the benefits of adapting the language model to our specific fake-news detection task.
- Source Jupyter Notebook:
Milestone #2_Ravi.ipynb
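Below is a minimal sketch of this text-only baseline, assuming the Hugging Face `transformers` API; the training loop and hyperparameters in the actual notebook differ.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextBaseline(nn.Module):
    def __init__(self, freeze_bert=False):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        if freeze_bert:  # "pretrained" variant: keep BERT weights frozen
            for p in self.bert.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.classifier(cls)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["example post title"], padding=True, truncation=True,
                  return_tensors="pt")
logits = TextBaseline()(batch["input_ids"], batch["attention_mask"])
```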
- Utilized the pretrained ResNet-101 model to extract visual features from only the image portion of each Reddit post.
- Features were obtained from the layer immediately before the final fully connected (FC) classification layer, capturing high-level image representations (see the sketch after this list).
- Conducted experiments comparing Pretrained ResNet versus Fine-tuned ResNet, allowing us to assess the benefits of adapting the visual model to our specific fake-news detection task.
- Source Jupyter Notebook:
Milestone2_Dhruv.ipynb
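A minimal sketch of this image-only baseline is shown below (it uses the newer `torchvision` weights API; older versions use `pretrained=True`, and the head dimensions are illustrative).

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageBaseline(nn.Module):
    def __init__(self, freeze_backbone=False):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        feat_dim = backbone.fc.in_features       # 2048 for ResNet-101
        backbone.fc = nn.Identity()              # expose pre-FC features
        if freeze_backbone:                      # "pretrained" variant
            for p in backbone.parameters():
                p.requires_grad = False
        self.backbone = backbone
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, images):                   # images: (B, 3, 224, 224)
        return self.classifier(self.backbone(images))

logits = ImageBaseline()(torch.randn(2, 3, 224, 224))
```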
- Combined text embeddings from BERT and image features from ResNet-101 by simple concatenation, effectively representing both modalities for each Reddit post.
- Added a joint classification head on top of the concatenated features and trained the model end-to-end, allowing gradients to update both the text and image encoders (when fine-tuning).
- This baseline provides a reference for evaluating how much multimodal fusion improves performance compared to unimodal models (a minimal sketch follows this list).
- Source Jupyter Notebook:
Milestone3_Dhruv.ipynb
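A simplified sketch of the concatenation baseline is below; the joint head width (512) is an illustrative choice, not necessarily the one used in the notebook.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class ConcatFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        img_dim = resnet.fc.in_features
        resnet.fc = nn.Identity()
        self.resnet = resnet
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size + img_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 2),
        )

    def forward(self, input_ids, attention_mask, images):
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        img = self.resnet(images)                      # pre-FC ResNet features
        return self.head(torch.cat([txt, img], dim=-1))
```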
- Leveraged CLIP (Zhou et al., 2022) to obtain joint text–image representations for each Reddit post, effectively capturing cross-modal semantic relationships.
- The multimodal fusion was guided by CLIP embeddings, enabling the model to integrate textual and visual information more effectively than simple concatenation.
- Performed an ablation study to evaluate the impact of different attention mechanisms on multimodal fusion:
- QKV (self-attention): applies standard self-attention across all modality embeddings
- Modality-Wise attention: explicitly accounts for the modality of each feature, enabling the model to assign different weights to text and image information during fusion.
- This approach allowed us to evaluate how attention-based fusion strategies impact multimodal misinformation detection performance (a simplified fusion sketch follows this list).
- Source Code (Modality-Wise Attention):
Milestone #4_Ravi.ipynb
- Source Code (QKV Attention):
Milestone_4_Raafae.ipynb
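The sketch below gives a simplified picture of the QKV-attention variant: the CLIP text and image embeddings are treated as a two-token sequence, attended over with standard self-attention, pooled, and classified. The dimensions (512-dim CLIP ViT-B/32 features), mean pooling, and head are illustrative assumptions; in practice the embeddings come from CLIP's `encode_text` / `encode_image`.

```python
import torch
import torch.nn as nn

class ClipAttentionFusion(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, 2)

    def forward(self, text_emb, image_emb):                 # each: (B, embed_dim)
        tokens = torch.stack([text_emb, image_emb], dim=1)  # (B, 2, D)
        fused, attn_weights = self.attn(tokens, tokens, tokens)
        pooled = fused.mean(dim=1)                          # average over modalities
        return self.classifier(pooled), attn_weights

# Stand-ins for CLIP features (ViT-B/32 gives 512-dim text and image embeddings).
text_emb, image_emb = torch.randn(4, 512), torch.randn(4, 512)
logits, weights = ClipAttentionFusion()(text_emb, image_emb)
```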
To better understand the behavior and performance of our CLIP-based models, we performed a series of analyses:
- Embedding Similarity Analysis: Computed and plotted cosine similarities between CLIP text and image embeddings for posts in the training dataset, providing insight into how well the model aligns textual and visual information (a minimal sketch follows this list).
- Training Dynamics: Plotted training and validation loss curves for each CLIP-based model to evaluate convergence behavior, overfitting, and generalization across different attention mechanisms.
- Attention Visualization:
- For the QKV attention model, generated attention heatmaps to examine how the model distributes attention across text and image features.
- For the Modality-Wise attention model, visualized the modality-specific attention weights, highlighting how the model balances the contribution of text versus image information during prediction.
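The similarity analysis amounts to the following kind of computation (a minimal sketch, assuming the paired CLIP features are already available as tensors):

```python
import torch.nn.functional as F
import matplotlib.pyplot as plt

def plot_pairwise_cosine(text_emb, image_emb):
    # text_emb, image_emb: (N, D) CLIP features for the same N posts.
    sims = F.cosine_similarity(text_emb, image_emb, dim=-1)
    plt.hist(sims.cpu().numpy(), bins=50)
    plt.xlabel("Cosine similarity (text vs. image)")
    plt.ylabel("Number of posts")
    plt.title("CLIP text-image alignment on the training set")
    plt.show()
```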
These analyses helped us interpret model behavior, identify strengths and limitations of each attention mechanism, and provide qualitative evidence for the effectiveness of CLIP-guided multimodal fusion in misinformation detection.
- Source Jupyter Notebook (contains all analysis plots for all CLIP-based Models):
CLIP_Models_Analysis.ipynb
Note to Instructor: For every Jupyter Notebook (.ipynb) mentioned above, an equivalent Python script (.py) with the same filename is available in the same directory and in the code folder. These Python files were generated directly from the corresponding notebooks using Google Colab. Furthermore, every Jupyter Notebook also has a corresponding Markdown (.md) file describing the contents of that notebook.
All notebooks and Python scripts must be run using Google Colab. Local execution is not supported due to environment and dependency requirements.
The processed dataset required for training and evaluation is available in the Cleaned Data folder on our Project's Google Drive:
Ensure that this Google Drive folder is accessible from your Colab session.
- Open the desired notebook in Google Colab.
- Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')
- Execute the Colab notebook by running all cells in order
This section explains the overall repository structure and what each directory contains.
ESE_5460_FINAL_PROJECT/
├── cleaned_data/
│   ├── train.csv                              # Training Dataset (Size: 33,324)
│   ├── validation.csv                         # Original Validation Dataset (Size: 33,316)
│   ├── validation_5k.csv                      # Downsampled Validation Dataset (Size: 5,000)
│   ├── test.csv                               # Original Test Dataset (Size: 33,519)
│   ├── test_5k.csv                            # Downsampled Test Dataset (Size: 5,000)
│   ├── train_corrupted_filenames.txt          # Corrupted image files from Train
│   ├── train_corrupted_indices.txt            # DataFrame indices of corrupted image files from Train
│   ├── validation_corrupted_filenames.txt     # Corrupted image files from Validation
│   ├── validation_corrupted_indices.txt       # DataFrame indices of corrupted image files from Validation
│   ├── validation_5k_corrupted_filenames.txt  # Corrupted image files from downsampled Validation
│   ├── validation_5k_corrupted_indices.txt    # DataFrame indices of corrupted image files from downsampled Validation
│   ├── test_corrupted_filenames.txt           # Corrupted image files from Test
│   ├── test_corrupted_indices.txt             # DataFrame indices of corrupted image files from Test
│   ├── test_5k_corrupted_filenames.txt        # Corrupted image files from downsampled Test
│   └── test_5k_corrupted_indices.txt          # DataFrame indices of corrupted image files from downsampled Test
├── code/                       # Python (.py) versions of the Jupyter Notebooks above; filenames correspond to the notebooks
├── data/
│   ├── multimodal_train.tsv        # Original Fakeddit Training Dataset (Multimodal Samples)
│   ├── multimodal_validate.tsv     # Original Fakeddit Validation Dataset (Multimodal Samples)
│   └── multimodal_test_public.tsv  # Original Fakeddit Test Dataset (Multimodal Samples)
├── images/ # Contains all Images used in Report
├── metrics/ # Contains Evaluation metrics from Model Training Runs
├── notebooks/ # Contains Jupyter Notebooks
├── output/ # Generated outputs (predictions, logs, etc.)
├── test/
│ └── test_notebook.ipynb # Notebook for testing/evaluation
├── test_images/ # Test image data
├── train_images/ # Training image data
├── validation_images/ # Validation image data
└── report/                     # Contains PDF of our Report