This project implements an automated fact-checking system using pretrained BERT-based transformers for sequence classification. The system addresses the critical challenge of misinformation by classifying text claims as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on provided evidence. Our approach leverages state-of-the-art transformer architectures with custom optimizations to achieve robust performance on fact-checking tasks.
Keywords: Fact-checking, BERT, Transformer Models, Natural Language Processing, Misinformation Detection, Sequence Classification
- Automated Fact-Checking: Develop a reliable system for classifying factual claims against evidence
- BERT-Based Architecture: Implement and optimize transformer models for natural language understanding
- Performance Optimization: Achieve high accuracy and F1-scores on imbalanced fact-checking datasets
- Reproducible Research: Provide comprehensive evaluation metrics and experimental tracking
- Custom Optimizer: Implementation of ClippyAdagrad with layer-specific learning rates
- Weighted Loss Training: Class-balanced training for handling imbalanced datasets
- Comprehensive Evaluation: Multi-metric assessment including accuracy, F1-score, precision, and recall
- Experiment Tracking: Integration with Weights & Biases for reproducible research
The fact-checking task is formulated as a 3-class sequence classification problem:
- SUPPORTS (0): The evidence supports the claim
- REFUTES (1): The evidence refutes the claim
- NOT ENOUGH INFO (2): Insufficient evidence to determine claim validity
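For reference, a minimal label mapping following the ids above (a convention sketch, not necessarily the project's exact constants):

```python
# Label ids follow the list above; id2label is the reverse mapping.
label2id = {"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2}
id2label = {v: k for k, v in label2id.items()}
```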
- Architecture: 12-layer transformer with 768 hidden dimensions
- Attention Heads: 12 multi-head attention mechanisms
- Vocabulary Size: 30,522 tokens
- Parameters: ~110M trainable parameters
- Format: `[CLS] claim [SEP] evidence [SEP]`
- Max Sequence Length: 512 tokens
- Tokenization: BERT tokenizer with WordPiece subword tokenization
- Padding: Dynamic padding with attention masks
- Output Layer: Linear layer mapping 768 → 3 dimensions
- Activation: Softmax for probability distribution
- Loss Function: Cross-entropy with optional class weighting
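To illustrate the input format and output head described above, here is a minimal sketch using the Hugging Face `transformers` API; the example claim/evidence pair is invented, and the checkpoint name follows the model description:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

claim = "The Eiffel Tower is located in Berlin."
evidence = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris."

# Encodes as [CLS] claim [SEP] evidence [SEP], truncated/padded to at most 512 tokens
inputs = tokenizer(claim, evidence, truncation=True, max_length=512,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, 3)
probs = torch.softmax(logits, dim=-1)      # SUPPORTS / REFUTES / NOT ENOUGH INFO
```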
```python
# Layer-specific learning rates with the custom ClippyAdagrad optimizer
# (implemented in src/utils/clippyadagrad.py)
from src.utils.clippyadagrad import ClippyAdagrad

optimizer = ClippyAdagrad([
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 1e-5},   # lower encoder layers
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 2e-5},   # upper encoder layers
    {'params': model.bert.pooler.parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 3e-5},               # task-specific head
], lr=3e-5)
```

```python
# Class-weighted cross-entropy loss via a custom Trainer subclass
import torch.nn as nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        logits = outputs.logits
        # Keep the class weights on the same device as the logits
        weights = self.class_weights.to(logits.device) if self.class_weights is not None else None
        loss_fct = nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        inputs["labels"].view(-1))
        return (loss, outputs) if return_outputs else loss
```

| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 5e-5 | Standard for BERT fine-tuning |
| Batch Size | 12 | Memory-optimized for GPU training |
| Epochs | 15 | Sufficient for convergence |
| Warmup Steps | 500 | Gradual learning rate increase |
| Weight Decay | 0.01 | Regularization to prevent overfitting |
| Dropout Rate | 0.2 | Reduce overfitting in classification head |
| Gradient Accumulation | 3 | Effective batch size of 36 |
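A sketch of how the hyperparameters above map onto Hugging Face `TrainingArguments`; the `output_dir` is illustrative, and the dropout is set through the model config rather than the training arguments:

```python
from transformers import BertForSequenceClassification, TrainingArguments

# Dropout of the classification head is a model-config option
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3, classifier_dropout=0.2
)

training_args = TrainingArguments(
    output_dir="experiments/experiment_1",   # illustrative path
    learning_rate=5e-5,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=3,           # effective batch size 12 * 3 = 36
    num_train_epochs=15,
    warmup_steps=500,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="wandb",
)
```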
Our experiments demonstrate the following performance on the validation set. Baseline BERT model (unweighted loss):
| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 39.6% | Overall classification correctness |
| F1-Score | 0.39 | Harmonic mean of precision and recall |
| Precision | 0.50 | Fraction of predicted labels that are correct |
| Recall | 0.48 | Fraction of true labels that are recovered |
With class-weighted loss training:

| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 48.7% | Overall classification correctness |
| F1-Score | 0.47 | Harmonic mean of precision and recall |
| Precision | 0.49 | Fraction of predicted labels that are correct |
| Recall | 0.50 | Fraction of true labels that are recovered |
The results show:
- Class-weighted training improved accuracy by ~9 percentage points
- F1-score improvement from 0.39 to 0.47 with weighted loss
- Balanced precision and recall in both experiments
- Room for optimization with more sophisticated architectures
- Training Samples: ~15,000 claim-evidence pairs
- Validation Samples: ~3,000 claim-evidence pairs
- Test Samples: ~3,000 claim-evidence pairs
- Class Distribution: Imbalanced (40% SUPPORTS, 30% each for REFUTES/NEI)
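Given this imbalance, one way the per-class weights for the weighted-loss experiment could be computed is with scikit-learn; the `train_labels` array below is a placeholder for the labels of the processed training split:

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Placeholder labels for illustration; in the project these come from the training split
# (0 = SUPPORTS, 1 = REFUTES, 2 = NOT ENOUGH INFO)
train_labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1, 2]),
                               y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)  # passed to WeightedLossTrainer
```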
Our training process shows consistent improvement across epochs:
Figure 1: Training and validation loss curves showing model convergence over epochs
Figure 2: Accuracy and F1-score progression during training
Figure 3: Confusion matrix for baseline BERT model showing class-wise prediction patterns
Figure 4: Detailed classification report for baseline model with precision, recall, and F1-scores per class
Figure 5: Confusion matrix for class-weighted training showing improved class balance
Figure 6: Classification report for enhanced model demonstrating performance improvements
- Training Stability: Both loss curves show stable convergence without overfitting
- Class Imbalance: Confusion matrices reveal the challenge of imbalanced classes
- Performance Improvement: Enhanced model shows better class-wise performance
- Metric Consistency: F1-scores and accuracy show correlated improvements

Experiment 1 (Baseline):
- Model: BERT-Base-Uncased
- Optimizer: AdamW
- Learning Rate: 3e-5
- Result: Baseline performance establishment

Experiment 2 (Class-Weighted Loss):
- Enhancement: Class-weighted loss function
- Purpose: Address class imbalance
- Result: Improved minority class performance

Experiment 3 (Custom Optimizer):
- Optimizer: ClippyAdagrad with layer-specific learning rates
- Features: Adaptive learning rates for different model components
- Result: Better convergence and stability
```
├── src/
│   ├── data/                      # Data processing utilities
│   │   ├── data_processing.py
│   │   └── __init__.py
│   ├── models/                    # Model training and evaluation
│   │   ├── model_utils.py         # Core training logic
│   │   ├── baseline.py            # Baseline model implementation
│   │   ├── train.py               # Training script
│   │   ├── test.py                # Testing script
│   │   └── __init__.py
│   ├── utils/                     # Utility functions
│   │   ├── clippyadagrad.py       # Custom optimizer
│   │   ├── aggregate_summaries.py
│   │   ├── compare_experiments.py
│   │   └── __init__.py
│   ├── experiments/               # Experiment tracking
│   │   ├── logs/                  # Training logs
│   │   ├── wandb/                 # Weights & Biases runs
│   │   └── __init__.py
│   └── main.py                    # Main entry point
├── data/
│   ├── raw/                       # Raw datasets (gitignored)
│   └── processed/                 # Processed datasets
├── docs/
│   ├── figures/                   # Generated visualizations
│   └── results/                   # Model outputs
├── notebooks/                     # Jupyter notebooks for analysis
├── examples/                      # Example scripts and configurations
└── tests/                         # Unit tests
```
- Python: 3.10 or higher
- CUDA: Compatible GPU (recommended for training)
- Memory: 8GB+ RAM
- Storage: 5GB+ for models and datasets
1. Clone the repository
   ```bash
   git clone https://github.com/yourusername/fact-checking-bert.git
   cd fact-checking-bert
   ```

2. Create virtual environment
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies
   ```bash
   pip install -r requirements.txt
   ```

4. Prepare data
   ```bash
   python src/main.py --compose
   ```

5. Train model
   ```bash
   python src/main.py --train --experiment_name "experiment_1"
   ```

6. Evaluate model
   ```bash
   python src/main.py --evaluate --experiment_name "experiment_1"
   ```
Custom training run:

```bash
python src/main.py --train \
    --experiment_name "custom_experiment" \
    --pretrained_model "bert-base-uncased"
```

Using the Makefile:

```bash
# Complete pipeline
make pipeline

# Individual steps
make data-prepare
make train
make evaluate
```

Training and evaluation produce the following visualizations and reports:

- Loss Progression: Training and validation loss over epochs
- Metrics Evolution: Accuracy and F1-score development
- Learning Rate: Dynamic learning rate scheduling
- Confusion Matrix: Class-wise prediction analysis
- Classification Report: Detailed performance metrics
- ROC Curves: Receiver operating characteristic analysis
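A possible sketch of how the confusion-matrix and classification-report outputs could be produced with scikit-learn and matplotlib (both in the dependency list); the `labels` and `preds` arrays are toy values standing in for validation labels and argmax predictions from the trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

class_names = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

# Toy values for illustration; in practice these come from predictions on the validation set
labels = np.array([0, 1, 2, 0, 1, 2])
preds = np.array([0, 1, 1, 0, 2, 2])

ConfusionMatrixDisplay.from_predictions(labels, preds, display_labels=class_names)
plt.savefig("docs/figures/confusion_matrix.png", bbox_inches="tight")

print(classification_report(labels, preds, target_names=class_names))
```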
Sample training output:

```
Epoch 1/15: 100%|██████████| 1250/1250 [00:45<00:00, 27.8it/s]
eval_loss: 1.0722, eval_accuracy: 0.3962, eval_f1: 0.3923
```
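The `eval_accuracy` and `eval_f1` values in this log come from a metrics callback; a minimal sketch using the `evaluate` library (macro averaging is an assumption) could look like:

```python
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"],
    }
```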
1. Text Preprocessing
   - Claim and evidence concatenation
   - Tokenization with the BERT tokenizer
   - Sequence length management (truncation/padding)

2. Dataset Preparation
   - Custom `FactDataset` class (see the sketch after this list)
   - Dynamic batching with attention masks
   - Class weight computation for imbalanced data

3. Training Loop
   - Gradient accumulation for effective batch size
   - Early stopping with patience mechanism
   - Learning rate scheduling with warmup
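A minimal sketch of what the custom `FactDataset` mentioned above might look like; the field names (`claim`, `evidence`, `label`) are assumptions about the processed data format, and dynamic padding is left to a `DataCollatorWithPadding`:

```python
import torch
from torch.utils.data import Dataset

class FactDataset(Dataset):
    """Wraps claim-evidence records for BERT sequence-pair classification."""

    def __init__(self, records, tokenizer, max_length=512):
        self.records = records          # list of dicts with "claim", "evidence", "label" keys
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records[idx]
        # No padding here: DataCollatorWithPadding pads each batch dynamically
        encoded = self.tokenizer(row["claim"], row["evidence"],
                                 truncation=True, max_length=self.max_length)
        encoded["labels"] = torch.tensor(row["label"])
        return encoded
```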

Layer-specific learning rates (ClippyAdagrad parameter groups):
- Encoder Layers 1-6: 1e-5 (lower rate to preserve pre-trained knowledge)
- Encoder Layers 7-12: 2e-5 (gradual adaptation)
- Pooler Layer: 2e-5 (feature extraction)
- Classifier: 3e-5 (task-specific learning)

Regularization:
- Dropout: 0.2 probability in classification head
- Weight Decay: 0.01 for parameter regularization
- Gradient Clipping: Prevents gradient explosion
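Under the assumptions of the earlier sketches (the names `model`, `training_args`, `train_dataset`, `eval_dataset`, `class_weights`, `optimizer`, `compute_metrics`, and the patience value are illustrative, not the project's exact code), the pieces could be wired together roughly as follows:

```python
from transformers import DataCollatorWithPadding, EarlyStoppingCallback

trainer = WeightedLossTrainer(
    model=model,
    args=training_args,                                 # includes load_best_model_at_end=True
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorWithPadding(tokenizer),   # dynamic padding with attention masks
    compute_metrics=compute_metrics,
    class_weights=class_weights,                        # weighted cross-entropy loss
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    optimizers=(optimizer, None),                       # ClippyAdagrad; Trainer builds the warmup scheduler
)
trainer.train()
```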
- `torch>=2.0.0` - PyTorch deep learning framework
- `transformers>=4.46.0` - Hugging Face transformers library
- `datasets>=2.14.0` - Dataset utilities and processing
- `evaluate>=0.4.0` - Evaluation metrics computation
- `wandb>=0.15.0` - Experiment tracking and visualization
- `pandas>=2.0.0` - Data manipulation and analysis
- `numpy>=1.24.0` - Numerical computing
- `scikit-learn>=1.3.0` - Machine learning utilities
- `matplotlib>=3.7.0` - Plotting and visualization
- `seaborn>=0.12.0` - Statistical visualization
- `plotly>=5.15.0` - Interactive plots
We welcome contributions to improve the fact-checking system. Please see CONTRIBUTING.md for detailed guidelines.
```bash
# Install development dependencies
pip install -e ".[dev]"

# Run tests
make test

# Format code
make format

# Lint code
make lint
```

This project is licensed under the MIT License - see the LICENSE file for details.
Wei-Han Tu
- Course: CSE 256 Natural Language Processing
- Institution: University of California, San Diego
- Email: [your-email@ucsd.edu]
- Research Focus: Transformer-based NLP, Fact-checking Systems
- UCSD CSE Department: Computational resources and academic guidance
- Course Instructors: Technical mentorship and project supervision
- Teaching Assistants: Implementation guidance and code review
- Hugging Face: Transformers library and BERT implementation
- Weights & Biases: Experiment tracking and visualization tools
- PyTorch Team: Deep learning framework and optimization
- BERT Authors: Original transformer architecture
- Fact-checking Researchers: Dataset and evaluation methodologies
- NLP Community: Best practices and implementation insights
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
2. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
3. Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.
4. Thorne, J., et al. (2018). FEVER: A Large-scale Dataset for Fact Extraction and VERification. Proceedings of NAACL-HLT 2018.
5. Hanselowski, A., et al. (2018). UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).
6. Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems.
7. Abadi, M., et al. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467.
⭐ Star this repository if you find it helpful for your research!