Training with 1T Tokens:

Model | Arc-c | Arc-e | BoolQ | HellaSwag | OBQA | PIQA | Winogrande | SciQ | Avg |
---|---|---|---|---|---|---|---|---|---|
OpenLLaMA-3B-v2 | 0.339 | 0.676 | 0.657 | 0.700 | 0.260 | 0.767 | 0.629 | 0.924 | 0.619 |
StableLM-base-alpha-3B-v2 | 0.324 | 0.673 | 0.646 | 0.686 | 0.264 | 0.760 | 0.621 | 0.921 | 0.612 |
StableLM-3B-4E1T | --- | 0.666 | --- | --- | --- | 0.768 | 0.632 | 0.914 | --- |
YOCO-3B | 0.379 | 0.731 | 0.645 | 0.689 | 0.298 | 0.763 | 0.639 | 0.924 | 0.634 |
Training with 1.6T Tokens:

Model | Arc-c | Arc-e | BoolQ | HellaSwag | OBQA | PIQA | Winogrande | SciQ | Avg |
---|---|---|---|---|---|---|---|---|---|
StableLM-3B-4E1T | --- | 0.688 | --- | --- | --- | 0.762 | 0.627 | 0.913 | --- |
YOCO-3B | 0.396 | 0.733 | 0.644 | 0.698 | 0.300 | 0.764 | 0.631 | 0.921 | 0.636 |
YOCO-3B-1M | 0.413 | 0.747 | 0.638 | 0.705 | 0.300 | 0.773 | 0.651 | 0.932 | 0.645 |

Multi-Needle Retrieval Accuracy (N = number of needles):

Model | Size | N=1 | N=2 | N=4 | N=8 |
---|---|---|---|---|---|
GPT-4-128K | -- | 1.00 | 1.00 | 0.98 | 1.00 |
MiniCPM-128K | 2.4B | 1.00 | 1.00 | 0.54 | 0.56 |
ChatGLM3-128K | 6B | 0.94 | 0.72 | 0.52 | 0.44 |
YaRN-Mistral-128K | 7B | 0.02 | 0.12 | 0.08 | 0.20 |
LWM-1M-text | 7B | 1.00 | 0.90 | 0.76 | 0.62 |
YOCO-3B-1M | 3B | 0.98 | 0.98 | 0.84 | 0.56 |
To install the required packages, use the following command:
```bash
pip install -r requirements.txt
```
Besides the standard packages, Apex and FlashAttention should be installed separately, following their official guides.
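At the time of writing, their official installation recipes look roughly like the following; treat this as a sketch and defer to the upstream READMEs for the current commands:
```bash
# FlashAttention: prebuilt wheels are usually available via pip.
pip install flash-attn --no-build-isolation

# Apex: built from source with the CUDA extensions enabled.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" \
    --config-settings "--build-option=--cuda_ext" ./
```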
To evaluate models with Harness-Eval, use the script in `scripts/eval_task.sh`:
```bash
cd fairseq/
TASK='harness_boolq'
torchrun --master-port=29505 --nproc_per_node=1 validate.py \
    --data-dir ../harness_data/ \
    --criterion harness_eval \
    --task harness_eval \
    --batch-size 4 \
    --eval-data ${TASK} \
    --log-format simple --log-interval 10 \
    --bf16 \
    --tokenizer-pad-to-multiple 8 \
    --arch yoco_3b_new --tiktoken-model cl100k_base \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 4096
```
Our model uses city-number pairs for long-sequence evaluation. To get results at a given maximum length, use the script in `scripts/eval_needle.sh`:
```bash
cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion needle_haystack \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 --interval 1048576
```
To run Multi-Needle experiments, replace `--criterion needle_haystack` with `--criterion multi_needle --needle-num {num}`.
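For example, a four-needle run at the same 1M maximum length differs from the script above only in the criterion:
```bash
cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion multi_needle --needle-num 4 \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 --interval 1048576
```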
To support distributed training, our implementation reads data iteratively with infinibatch. The overall data directory should be organized as follows:
```
Data/
├── json/
│   ├── train.json
│   ├── CC.json
│   ├── StarCoder.json
│   └── ...
├── shard/
│   ├── CC/
│   │   ├── 00000.jsonl
│   │   ├── 00001.jsonl
│   │   └── ...
│   └── StarCoder/
│       ├── 00000.jsonl
│       ├── 00001.jsonl
│       └── ...
```
We recommend that each sharded data file contain no more than 10K lines, with one JSON dict per line. Each jsonl file, such as `Data/shard/CC/00000.jsonl`, should be in the following format:
```json
{"text": "File 1 is here..."}
{"text": "File 2 is here..."}
...
```
Then, for each source, a JSON file records the paths of all its jsonl files. Take `Data/json/CC.json` as an example:
```json
[
    "/path_to_data/Data/shard/CC/00000.jsonl",
    "/path_to_data/Data/shard/CC/00001.jsonl",
    ...
]
```
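A minimal preprocessing sketch that produces this layout, assuming the raw corpus lives in a single hypothetical file `raw/CC.jsonl`: GNU `split` reproduces the zero-padded shard names, and a short Python heredoc writes the per-source index.
```bash
mkdir -p Data/shard/CC Data/json
# Split the corpus into 10K-line shards named 00000.jsonl, 00001.jsonl, ...
split -l 10000 -d -a 5 --additional-suffix=.jsonl raw/CC.jsonl Data/shard/CC/
# Record the absolute shard paths in the per-source index.
python - <<'EOF'
import glob, json, os
paths = sorted(os.path.abspath(p) for p in glob.glob("Data/shard/CC/*.jsonl"))
with open("Data/json/CC.json", "w") as f:
    json.dump(paths, f, indent=2)
EOF
```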
Finally, `train.json` records every source's name and sampling weight:
```json
[
    {
        "name": "CC",
        "weight": 0.5
    },
    {
        "name": "StarCoder",
        "weight": 0.2
    },
    ...
]
```
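Before launching training, it can help to sanity-check the data configuration. A small sketch, assuming the `Data/` layout above (the `root` path is an assumption; point it at wherever your `json/` directory actually lives), that verifies every shard path listed in each source's index exists:
```bash
python - <<'EOF'
import json, os
root = "Data/json"  # assumption: the json/ directory of the layout above
for src in json.load(open(os.path.join(root, "train.json"))):
    paths = json.load(open(os.path.join(root, src["name"] + ".json")))
    missing = [p for p in paths if not os.path.exists(p)]
    print(f"{src['name']}: weight={src['weight']}, "
          f"shards={len(paths)}, missing={len(missing)}")
EOF
```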
The training script is in `scripts/train.sh`:
```bash
cd fairseq/
torchrun --nproc-per-node=1 train.py /path_to_data \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --arch yoco_base \
    --criterion cross_entropy \
    --task gpt \
    --tokens-per-sample 2048 \
    --tokenizer-pad-to-multiple 8 \
    --pad-to-max-len \
    --optimizer adam --adam-betas "(0.9, 0.95)" \
    --adam-eps 1e-06 \
    --clip-norm 2.0 \
    --lr 0.00015 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 50 \
    --weight-decay 0.05 \
    --batch-size 1 \
    --model-parallel-size 1 \
    --update-freq 1 \
    --batch-read-ahead 1000 \
    --total-num-update 300000 \
    --log-format simple --log-interval 10 --disable-validation \
    --tiktoken-model cl100k_base \
    --bf16  # bf16 is encouraged in pre-training
```