A framework for training and evaluating Open Language Models (OLMo) using the DataDecide methodology for efficient data curation and model development.
DataDecider implements the DataDecide approach for training language models, which uses small-scale proxy experiments to predict which data mixtures will perform best at scale. This package provides:
- Complete OLMo model implementation with 14 size variants (4M to 1B parameters)
- DataDecide data curation pipeline with proxy metrics
- Training infrastructure with distributed support
- Evaluation suite for model assessment
- Integration with Weights & Biases for experiment tracking
# Using pip
pip install git+https://github.com/yourusername/DataDecider.git
# Using uv
uv pip install git+https://github.com/yourusername/DataDecider.git
# For local development from another project
pip install -e /path/to/DataDecider
# Clone the repository
git clone https://github.com/yourusername/DataDecider.git
cd DataDecider
# Install in development mode with uv (recommended)
uv pip install -e ".[dev]"
# Or using pip
pip install -e ".[dev]"
The framework expects tokenized datasets in HuggingFace format. You can use the provided scripts to prepare your data:
# Build a dataset from raw files
python -m data_decide.scripts.prepare_training_data \
--input-dir ./raw_data \
--output-dir ./processed_data \
--tokenizer EleutherAI/gpt-neox-20b \
--max-length 2048
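Once prepared, the output can be inspected with the standard HuggingFace datasets API. A minimal sketch, assuming the script above saves a DatasetDict with a train split to ./processed_data:
# Sketch: inspect the prepared data with the HuggingFace datasets API.
# Assumes the script above saved a DatasetDict with a "train" split.
from datasets import load_from_disk

dataset = load_from_disk("./processed_data")
print(dataset)                     # splits, columns, and row counts
print(dataset["train"][0].keys())  # e.g. input_ids, attention_mask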
Configuration files are in YAML format. Example for the 4M model:
# configs/model_configs/olmo_4m.yaml
model_size: "4M"
model_params:
  num_layers: 8
  hidden_size: 64
  num_attention_heads: 8
  vocab_size: 50254
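For illustration, these fields map onto the model classes used in the integration example at the end of this README. A hypothetical sketch, assuming OLMoConfig accepts the model_params fields as keyword arguments:
# Hypothetical sketch: building the 4M model directly from the fields above.
# Assumes OLMoConfig accepts the YAML's model_params as keyword arguments.
from data_decide.olmo.models import OLMoConfig, OLMoForCausalLM

config = OLMoConfig(
    num_layers=8,
    hidden_size=64,
    num_attention_heads=8,
    vocab_size=50254,
)
model = OLMoForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~3.7M parameters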
# Using the main training script
data-decide-train \
--config configs/training_configs/olmo_4m_training.yaml
# Or use the enhanced version with rich UI
python -m data_decide.scripts.train_enhanced \
--config configs/training_configs/olmo_4m_training.yaml
# Real-time monitoring with rich terminal UI
data-decide-monitor --run-name my_training_run
# Or analyze completed runs
data-decide-analyze --wandb-run-path username/project/run_id
The DataDecide approach involves four steps:
1. Proxy Dataset Creation: Generate multiple small datasets with different data mixtures
2. Proxy Metrics: Compute perplexity, diversity, and quality scores without full training (see the sketch after this list)
3. Mixture Selection: Choose the best data mixture based on the proxy results
4. Full Training: Train the model on the selected data mixture
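The proxy-metric step is the core of the method. The snippet below is an illustration only, not the package's implementation: it ranks two toy candidate mixtures by mean perplexity under a small stand-in model (GPT-2); the real pipeline also computes diversity and quality scores.
# Illustration only (not the package's implementation): rank candidate
# mixtures by mean perplexity under a small stand-in model (GPT-2).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def proxy_perplexity(texts):
    with torch.no_grad():
        losses = [
            model(ids, labels=ids).loss.item()
            for ids in (tok(t, return_tensors="pt").input_ids for t in texts)
        ]
    return math.exp(sum(losses) / len(losses))

mixtures = {
    "mostly_code": ["def f(x): return x + 1", "for i in range(10): print(i)"],
    "mostly_prose": ["The quick brown fox jumps over the lazy dog."],
}
best = min(mixtures, key=lambda name: proxy_perplexity(mixtures[name]))
print(f"Selected mixture: {best}")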
Supported OLMo configurations include:
| Model Size | Parameters | Hidden Size | Layers | Heads |
|---|---|---|---|---|
| 4M | 3.7M | 64 | 8 | 8 |
| 20M | 18.6M | 128 | 16 | 16 |
| 38M | 36.9M | 192 | 16 | 16 |
| 70M | 66.8M | 256 | 18 | 16 |
| 160M | 152.2M | 384 | 20 | 16 |
| 410M | 390.2M | 640 | 24 | 16 |
| 1B | 982.3M | 1024 | 28 | 16 |
- Distributed Training: Full support for multi-GPU training via Accelerate
- Mixed Precision: FP16/BF16 training for efficiency (see the sketch after this list)
- Gradient Checkpointing: Memory-efficient training for larger models
- Learning Rate Scheduling: Cosine decay with warmup
- Comprehensive Monitoring: W&B integration with system metrics and a rich terminal UI
- Pre-tokenized Data Pipeline: Efficient training with tokenization decoupled from the training loop
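To make the distributed and mixed-precision pieces concrete, here is the general Accelerate pattern with a toy model. It is a sketch only, not the framework's trainer (which lives under data_decide/olmo/training/), and BF16 requires supported hardware.
# Sketch of the general Accelerate pattern, not the framework's trainer.
# Toy model and synthetic data; "bf16" requires supported hardware.
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator(mixed_precision="bf16")  # or "fp16"
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # handles loss scaling for fp16
    optimizer.step()
Launched with accelerate launch, the same loop runs unchanged across multiple GPUs.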
DataDecider includes a comprehensive monitoring system that provides both local and cloud-based tracking.
The rich terminal UI provides:
- Real-time progress bars for epochs, steps, and evaluation
- Live metrics display (loss, learning rate, GPU usage)
- Colored output with system information
- Time estimates and performance metrics
The Weights & Biases integration provides (illustrated after this list):
- Automatic experiment tracking
- System monitoring (GPU utilization, memory, temperature)
- Model metrics (gradients, learning rates, predictions)
- Checkpoint artifact management
- Hyperparameter tracking and visualization
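The framework issues these calls for you during training; for orientation, the underlying W&B API looks roughly like this (project, metric, and checkpoint names are illustrative):
# Illustrative only: the kind of W&B calls made on your behalf during training.
import wandb

run = wandb.init(project="olmo-datadecide", name="example-run")
wandb.log({"train/loss": 2.31, "train/lr": 1.4e-2, "gpu/utilization": 87.0})
run.log_artifact("checkpoints/step_1000", name="checkpoint-1000", type="model")  # path must exist
run.finish()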
# 1. Add to .env file
WANDB_API_KEY=your_api_key
WANDB_PROJECT=finpile_datadecide
WANDB_ENTITY=your_username
# 2. Run training (monitoring enabled by default)
uv run python examples/train_olmo_pretokenized.py --dataset tiny_100k
See docs/monitoring.md for complete documentation and docs/wandb-quickstart.md for a quick-start guide.
DataDecider/
├── configs/ # Configuration files
│ ├── model_configs/ # Model architecture configs
│ ├── training_configs/ # Training hyperparameters
│ └── data_configs/ # Data processing configs
├── data_decide/ # Main package
│ ├── olmo/ # OLMo implementation
│ │ ├── models/ # Model architecture
│ │ ├── data/ # Data processing
│ │ ├── training/ # Training logic
│ │ ├── evaluation/ # Evaluation metrics
│ │ └── utils/ # Utilities
│ └── scripts/ # Executable scripts
├── tests/ # Unit tests
└── data/ # Data directory (gitignored)
This repository does not include the large training datasets. To obtain the data:
- Sample Data: A small sample dataset is included in tests/test_data/ for testing
- Full Datasets: See data/README.md for instructions on downloading the full arXiv datasets
- Custom Data: Use the data preparation scripts to process your own datasets
Create a .env file in the project root:
# Weights & Biases
WANDB_API_KEY=your_api_key_here
WANDB_PROJECT=olmo-datadecide
WANDB_ENTITY=your_entity
# Training
CUDA_VISIBLE_DEVICES=0,1,2,3
TOKENIZERS_PARALLELISM=false
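How these variables are picked up at startup can be sketched with python-dotenv (an assumption for illustration; the framework's scripts may load .env differently):
# Sketch: loading .env at startup, assuming python-dotenv is installed;
# the framework's scripts may load the file differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print(os.environ.get("WANDB_PROJECT"))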
Example training configuration:
# Training parameters
model_size: "4M"
data_path: "./data/processed/olmo_4m_400M_tokens"
output_dir: "./checkpoints/olmo_4m_datadecide"
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 4
learning_rate: 1.4e-2
warmup_steps: 572
save_steps: 1000
eval_steps: 500
logging_steps: 10
# W&B configuration
report_to: ["wandb"]
wandb_project: "olmo-4m-datadecide"
wandb_name: "olmo-4m-arxiv-400M"
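For reference, a config like this parses into a plain dictionary. A minimal sketch with PyYAML; the framework's own config loader may validate and merge values differently:
# Sketch: reading a training config with PyYAML; the framework's own
# loader may handle validation and defaults differently.
import yaml

with open("configs/training_configs/olmo_4m_training.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["learning_rate"], cfg["warmup_steps"])  # 0.014 572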
# Run all tests
pytest
# Run with coverage
pytest --cov=data_decide
# Run specific test
pytest tests/test_data_curation.py
# Format code
ruff format .
# Check style
ruff check .
To use DataDecider in another project (like FinPileCode):
from data_decide.olmo.models import OLMoForCausalLM, OLMoConfig
from data_decide.olmo.data import DataDecideCurator
# Initialize model
config = OLMoConfig.from_pretrained("olmo-4m")
model = OLMoForCausalLM(config)
# Use DataDecide for data curation
curator = DataDecideCurator()
proxy_datasets = curator.create_proxy_datasets(your_data)
best_mixture = curator.select_best_mixture(proxy_datasets)
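Continuing the snippet above, one way to run full training on the selected mixture is the HuggingFace Trainer. This is a hypothetical continuation that assumes best_mixture is a tokenized Dataset with input_ids/labels columns and that OLMoForCausalLM is Trainer-compatible:
# Hypothetical continuation of the snippet above: full training on the
# selected mixture via the HuggingFace Trainer. Assumes best_mixture is a
# tokenized Dataset with input_ids/labels and the model is Trainer-compatible.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints/olmo_4m",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
trainer = Trainer(model=model, args=args, train_dataset=best_mixture)
trainer.train()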
If you use this framework in your research, please cite:
@software{datadecider,
  title = {DataDecider: OLMo Training with DataDecide Methodology},
  author = {FinPile Team},
  year = {2024},
  url = {https://github.com/yourusername/DataDecider}
}
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- OLMo architecture based on the paper "OLMo: Accelerating the Science of Language Models"
- DataDecide methodology for efficient data curation
- Built with HuggingFace Transformers and Accelerate