This folder contains scripts for converting text data from original sources (HuggingFace, JSON) to the Mosaic StreamingDataset format for consumption by our training scripts. StreamingDataset is designed to make training on large datasets from cloud storage as fast, cheap, and scalable as possible. In particular, it is custom-built for multi-node, distributed training of large models while maximizing correctness guarantees, performance, and ease of use.
The following scripts run on CPUs (no GPUs needed). Execute them in an environment with `python` and `llm-foundry` dependencies installed. All scripts should be run from the `./llm-foundry/scripts/data_prep` directory.
In this example, we use the `convert_dataset_hf.py` script to convert the HuggingFace `c4` dataset into a StreamingDataset, using the `EleutherAI/gpt-neox-20b` tokenizer. The resulting directory is saved at `./llm-foundry/scripts/data_prep/my-copy-c4`. The script currently supports `c4` and `The Pile`.
```bash
# Convert C4 dataset to StreamingDataset format
python convert_dataset_hf.py \
  --dataset allenai/c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd
```
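Once the conversion finishes, you can sanity-check the output locally. Below is a minimal sketch, assuming the `streaming` package (an `llm-foundry` dependency) and that `--concat_tokens` samples store their token IDs as raw `int64` bytes under a `tokens` key:

```python
import numpy as np
from streaming import StreamingDataset
from transformers import AutoTokenizer

# Load the local shards written by convert_dataset_hf.py.
dataset = StreamingDataset(local='my-copy-c4', split='val_small', shuffle=False)
print(f'Number of samples: {len(dataset)}')

# Each sample is a dict; with --concat_tokens, the 'tokens' field is assumed
# to hold a fixed-length (2048) sequence of token IDs serialized as raw bytes.
sample = dataset[0]
tokens = np.frombuffer(sample['tokens'], dtype=np.int64)

# Decode a few tokens back to text to eyeball the data.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
print(tokens.shape, tokenizer.decode(tokens[:32]))
```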
Using the `convert_dataset_json.py` script, you can convert a local JSON dataset in the same way:
```bash
# Convert json dataset to StreamingDataset format
python convert_dataset_json.py \
  --path ./example_data/arxiv.jsonl \
  --out_root my-copy-arxiv --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd
```
Here, `--path` can be a single JSON file or a folder containing JSON files, and `--split` denotes the intended split (HuggingFace defaults to `train`).
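For reference, a minimal sketch of producing a compatible input file, assuming each JSON line carries its document under a `text` key (as in the bundled `example_data/arxiv.jsonl`):

```python
import json

# Write a tiny JSONL file in the layout the converter reads:
# one JSON object per line, with the document under a 'text' key (assumed).
docs = [
    {'text': 'The quick brown fox jumps over the lazy dog.'},
    {'text': 'StreamingDataset shards are written as .mds files.'},
]
with open('my_data.jsonl', 'w') as f:
    for doc in docs:
        f.write(json.dumps(doc) + '\n')
```

You could then point the command above at this file with `--path my_data.jsonl`.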
Using the `convert_text_to_mds.py` script, we convert a text file containing the complete works of William Shakespeare.
```bash
# Convert text dataset to StreamingDataset format
mkdir shakespeare && cd shakespeare
curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
cd ..

python convert_text_to_mds.py \
  --output_folder my-copy-shakespeare \
  --input_folder shakespeare \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
  --compression zstd
```
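Because `StreamingDataset` is a standard PyTorch `IterableDataset`, the converted shards plug straight into a `DataLoader`. A minimal sketch (directory name from the command above; the batch size is arbitrary):

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# StreamingDataset handles shard access, shuffling, and resumption; passing
# batch_size here keeps its internal partitioning consistent with the
# DataLoader's batching.
dataset = StreamingDataset(local='my-copy-shakespeare', shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8, num_workers=2)

for batch in loader:
    # With --concat_tokens output, batch['tokens'] is assumed to be a list
    # of raw byte strings, one serialized token sequence per sample.
    break
```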
Using the `convert_finetuning_dataset.py` script, you can run a command such as:

```bash
python convert_finetuning_dataset.py --dataset "Muennighoff/P3" \
  --splits "train" "validation" \
  --preprocessor "llmfoundry.data.finetuning.tasks:p3_preprocessing_function" \
  --out_root "/path/to/your/output_directory"
```
This example assumes:

- `"Muennighoff/P3"` is the dataset you want to convert. Substitute it with the name or path of your dataset.
- `train` and `validation` are the splits of the dataset to convert.
- `llmfoundry.data.finetuning.tasks:p3_preprocessing_function` is the name or import path of the function used to preprocess the dataset. Substitute it with your actual preprocessor (a sketch of one follows this list). See tasks for available functions and examples.
- `"/path/to/your/output_directory"` is the root path where the MDS shards will be stored; it can also be a remote path such as `s3://<bucket>/muennighoff-p3`. Replace it with the actual path to your output directory.
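As a concrete reference, a preprocessor is just a callable that maps one raw example to the prompt/response format the finetuning data pipeline consumes. A minimal sketch of a custom function (the module path, function name, and raw field names here are all hypothetical):

```python
# my_module.py -- hypothetical module; point --preprocessor at it as
# "my_module:my_preprocessing_function".

def my_preprocessing_function(example: dict) -> dict:
    # Map your dataset's raw columns (names assumed here) onto the
    # 'prompt'/'response' keys expected by the finetuning pipeline.
    return {
        'prompt': example['question'].strip() + '\n',
        'response': example['answer'].strip(),
    }
```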
Please note that you need to fill in actual values for `--preprocessor` and `--out_root` in the command above for it to work correctly.
Also, if you want to keep a local copy of the output when `out_root` is remote, you can use the `--local` argument:

```bash
python convert_finetuning_dataset.py --dataset "squad" \
  --splits "train" "validation" \
  --preprocessor "your_preprocessing_function" \
  --out_root "s3://your_bucket/output_directory" \
  --local "/path/to/local/directory"
```
Remember to replace the dataset name/path, preprocessing function, and output directories in these commands with your actual values.
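At training time, the same remote/local pairing works in reverse: point `StreamingDataset` at the remote copy and give it a local cache directory. A minimal sketch (bucket and paths are placeholders):

```python
from streaming import StreamingDataset

# Streams shards from the remote copy on demand, caching them locally so
# repeated epochs don't re-download (all paths below are placeholders).
dataset = StreamingDataset(
    remote='s3://your_bucket/output_directory',
    local='/tmp/streaming_cache',
    split='train',
    shuffle=True,
)
```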