Convert HuggingFace datasets to MosaicML Streaming format (MDS) for efficient cloud-based training.
pip install datasets huggingface_hub mosaicml-streaming
Batch convert entire dataset:
python batch_to_mds.py \
--src wikimedia/wikipedia \
--out-hub bgub/wikipedia-mds-test \
--out-local ./mds-local-2/wikipedia \
--procs 10
Convert single config/split:
python hf_to_mds_streaming.py \
--repo-id HuggingFaceFW/fineweb \
--split train \
--out-local /mnt/mds/fineweb \
--out-hub ben-gubler/fineweb-mds \
--procs 16 \
--streaming
batch_to_mds.py - Batch convert all configs/splits:
- --src/--out-hub: Source and destination repos (required)
- --procs: Worker processes (default: 16)
- --compression: e.g., zstd, zstd:11
- --include-config/--exclude-config: Regex filters
- --dry-run: Preview without executing
- --force: Rebuild existing datasets
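To illustrate the regex filters, here is a minimal sketch of how include/exclude patterns could be applied to a list of config names. The function `filter_configs` is hypothetical and is not taken from batch_to_mds.py; in particular, the use of `re.search` (rather than a full match) is an assumption, which is why the example pattern is anchored with `^` and `$`.

```python
import re
from typing import Iterable, List, Optional


def filter_configs(
    configs: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep configs matching `include` (if given) and not matching `exclude`.

    Hypothetical helper: the real script may apply its filters differently.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None

    kept = []
    for name in configs:
        # Drop names that fail the include pattern.
        if include_re is not None and not include_re.search(name):
            continue
        # Drop names that hit the exclude pattern.
        if exclude_re is not None and exclude_re.search(name):
            continue
        kept.append(name)
    return kept


# Illustrative config names in the style of wikimedia/wikipedia.
configs = ["20231101.de", "20231101.en", "20231101.simple"]

only_english = filter_configs(configs, include=r"^20231101\.en$")
without_german = filter_configs(configs, exclude=r"\.de$")
print(only_english)
print(without_german)
```

Anchoring the pattern and escaping the dot keeps the filter from matching other configs that merely contain the same characters.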
hf_to_mds_streaming.py - Single config/split converter (called by batch script)
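For orientation, a single-split conversion can be sketched roughly as follows. This is not the code of hf_to_mds_streaming.py; the helper names `infer_mds_columns` and `convert_split` are hypothetical, and the mapping from Python types to MDS encodings is a simplifying assumption. The sketch relies on `datasets.load_dataset` and `streaming.MDSWriter`, imported lazily so the type-mapping helper can be tried without either package installed.

```python
from typing import Any, Dict


def infer_mds_columns(sample: Dict[str, Any]) -> Dict[str, str]:
    """Guess an MDS encoding per column from one sample (hypothetical helper).

    Assumption: scalar str/int/float/bytes map to the MDS encodings of the
    same kind, and everything else (lists, dicts, bools, None) falls back
    to 'json'.
    """
    columns = {}
    for name, value in sample.items():
        if isinstance(value, bool):
            columns[name] = "json"  # bool is a subclass of int, so check it first
        elif isinstance(value, str):
            columns[name] = "str"
        elif isinstance(value, int):
            columns[name] = "int"
        elif isinstance(value, float):
            columns[name] = "float64"
        elif isinstance(value, bytes):
            columns[name] = "bytes"
        else:
            columns[name] = "json"
    return columns


def convert_split(repo_id: str, split: str, out_dir: str,
                  compression: str = "zstd") -> None:
    """Stream one split from the Hub and write it as MDS shards to out_dir."""
    # Imported here so the rest of the sketch runs without these packages.
    from datasets import load_dataset
    from streaming import MDSWriter

    dataset = load_dataset(repo_id, split=split, streaming=True)
    samples = iter(dataset)
    first = next(samples)
    # Assumes every later sample has the same columns and types as the first.
    columns = infer_mds_columns(first)

    with MDSWriter(out=out_dir, columns=columns, compression=compression) as writer:
        writer.write(first)
        for sample in samples:
            writer.write(sample)


example = {"id": 7, "text": "hello", "score": 0.5, "tags": ["a", "b"]}
print(infer_mds_columns(example))
```

Inferring the schema from the first sample keeps the sketch short, but it breaks down when a column contains missing values or mixed types; a real converter would more likely derive the schema from the dataset's declared features.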
# Convert specific language only
python batch_to_mds.py \
--src wikimedia/wikipedia \
--out-hub your-username/wikipedia-en-mds \
--include-config "^20231101\.en$"
# Preview what would be processed
python batch_to_mds.py \
--src microsoft/orca-math-word-problems-200k \
--out-hub your-username/orca-math-mds \
--dry-run
from streaming import StreamingDataset
from torch.utils.data import DataLoader
dataset = StreamingDataset(remote='hf://your-username/dataset-mds')
dataloader = DataLoader(dataset, batch_size=32)
MDS format provides:
- Elastic Determinism: Sample order is reproducible across different hardware configurations
- Fast Resumption: Resume training mid-run in seconds
- High Throughput: Optimized for streaming from cloud storage
- Effective Shuffling: Maintains shuffle quality while reducing data transfer costs