You Only Cache Once: Decoder-Decoder Architectures for Large Language Models

Approach

Performance

Harness Eval

Training with 1T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag$^*$ | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OpenLLaMA-3B-v2 | 0.339 | 0.676 | 0.657 | 0.700 | 0.260 | 0.767 | 0.629 | 0.924 | 0.619 |
| StableLM-base-alpha-3B-v2 | 0.324 | 0.673 | 0.646 | 0.686 | 0.264 | 0.760 | 0.621 | 0.921 | 0.612 |
| StableLM-3B-4E1T | --- | 0.666 | --- | --- | --- | 0.768 | 0.632 | 0.914 | --- |
| YOCO-3B | 0.379 | 0.731 | 0.645 | 0.689 | 0.298 | 0.763 | 0.639 | 0.924 | 0.634 |

Training with 1.6T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag$^*$ | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| StableLM-3B-4E1T | --- | 0.688 | --- | --- | --- | 0.762 | 0.627 | 0.913 | --- |
| YOCO-3B | 0.396 | 0.733 | 0.644 | 0.698 | 0.300 | 0.764 | 0.631 | 0.921 | 0.636 |
| YOCO-3B-1M | 0.413 | 0.747 | 0.638 | 0.705 | 0.300 | 0.773 | 0.651 | 0.932 | 0.645 |

Needle In A Haystack

Multi-Needle Eval

| Model | Size | N=1 | N=2 | N=4 | N=8 |
|---|---|---|---|---|---|
| GPT-4-128K | -- | 1.00 | 1.00 | 0.98 | 1.00 |
| MiniCPM-128K | 2.4B | 1.00 | 1.00 | 0.54 | 0.56 |
| ChatGLM3-128K | 6B | 0.94 | 0.72 | 0.52 | 0.44 |
| YaRN-Mistral-128K | 7B | 0.02 | 0.12 | 0.08 | 0.20 |
| LWM-1M-text | 7B | 1.00 | 0.90 | 0.76 | 0.62 |
| YOCO-3B-1M | 3B | 0.98 | 0.98 | 0.84 | 0.56 |

Setup

To install the required packages, use the following command:

pip install -r requirements.txt

In addition to the packages above, Apex and Flash-Attention should be installed separately, following their official installation guides.
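
For reference, a typical installation of these two dependencies looks like the sketch below; the exact commands depend on your CUDA and PyTorch versions, so defer to the upstream READMEs if they differ.

# Flash-Attention (check its README for the supported PyTorch/CUDA matrix)
pip install flash-attn --no-build-isolation

# Apex with CUDA extensions (commands adapted from the Apex README; may change upstream)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./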

Harness Eval

To evaluate models with Harness-Eval, use scripts/eval_task.sh:

cd fairseq/
TASK='harness_boolq'

torchrun --master-port=29505 --nproc_per_node=1 validate.py \
    --data-dir ../harness_data/ \
    --criterion harness_eval \
    --task harness_eval \
    --batch-size 4 \
    --eval-data ${TASK}  \
    --log-format simple  --log-interval 10 \
    --bf16 \
    --tokenizer-pad-to-multiple 8 \
    --arch yoco_3b_new \
    --tiktoken-model cl100k_base \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 4096
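
To sweep several Harness tasks with one checkpoint, the same command can be wrapped in a loop. This is a sketch; the task identifiers are assumed to follow the harness_<name> pattern used above, so confirm the exact names registered by the harness_eval criterion.

cd fairseq/
for TASK in harness_boolq harness_arc_easy harness_piqa; do
    torchrun --master-port=29505 --nproc_per_node=1 validate.py \
        --data-dir ../harness_data/ \
        --criterion harness_eval \
        --task harness_eval \
        --batch-size 4 \
        --eval-data ${TASK} \
        --log-format simple --log-interval 10 \
        --bf16 \
        --tokenizer-pad-to-multiple 8 \
        --arch yoco_3b_new \
        --tiktoken-model cl100k_base \
        --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
        --yoco-model /path_to_ckpt/YOCO-3B-1M \
        --tokens-per-sample 4096
done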

Needle In A Haystack Evaluation

Our model uses city-number pairs for long-sequence evaluation. To obtain results at a given maximum length, use scripts/eval_needle.sh:

cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion needle_haystack \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 \
    --interval 1048576

To run Multi-Needle experiments, replace --criterion needle_haystack with --criterion multi_needle --needle-num {num}.
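
For example, an N=4 run keeps every flag from the script above and only swaps the criterion:

cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion multi_needle --needle-num 4 \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 --interval 1048576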

Pretraining From Scratch

To support distributed training, our implementation uses infinibatch to read data iteratively. The overall data directory should be organized as follows:

Data/
├── json/
│   ├── train.json
│   ├── CC.json
│   ├── StarCoder.json
│   └── ...
├── shard/
│   ├── CC/
│   │   ├── 00000.jsonl
│   │   ├── 00001.jsonl
│   │   └── ...
│   └── StarCoder/
│       ├── 00000.jsonl
│       ├── 00001.jsonl
│       └── ...

We recommend that each sharded data file contain no more than 10K lines, with one JSON dict per line. Each jsonl file, such as Data/shard/CC/00000.jsonl, should follow this format:

{"text": "File 1 is here..."}
{"text": "File 2 is here..."}
...
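
One way to produce shards in this layout from a single large jsonl corpus is GNU split; this is a sketch in which raw/CC_full.jsonl is a placeholder for your own corpus file.

mkdir -p Data/shard/CC
# Write shards of at most 10K lines each, named 00000.jsonl, 00001.jsonl, ...
split -l 10000 -d -a 5 --additional-suffix=.jsonl raw/CC_full.jsonl Data/shard/CC/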

Then, for each source, a JSON file lists the paths of all its jsonl files. Take Data/json/CC.json as an example:

[
    "/path_to_data/Data/shard/CC/00000.jsonl",
    "/path_to_data/Data/shard/CC/00001.jsonl",
    ...
]
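
Such an index can be generated directly from the shard directory, for example with jq (assumed to be installed):

# Collect every shard path into a JSON array.
ls /path_to_data/Data/shard/CC/*.jsonl | jq -R . | jq -s . > Data/json/CC.json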

Finally, train.json records each source's name and sampling weight:

[
    {
        "name": "CC",
        "weight": 0.5
    },
    {
        "name": "StarCoder",
        "weight": 0.2
    },
    ...
]
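
A quick way to confirm the file is well-formed and to inspect the configured sources (again assuming jq is installed):

# Print each source name with its sampling weight.
jq -r '.[] | "\(.name)\t\(.weight)"' Data/json/train.json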

Then launch training with scripts/train.sh:

cd fairseq/
torchrun --nproc-per-node=1 train.py /path_to_data \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --arch yoco_base \
    --criterion cross_entropy \
    --task gpt \
    --tokens-per-sample 2048 \
    --tokenizer-pad-to-multiple 8 \
    --pad-to-max-len \
    --optimizer adam --adam-betas "(0.9, 0.95)" \
    --adam-eps 1e-06 \
    --clip-norm 2.0 \
    --lr 0.00015 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 50 \
    --weight-decay 0.05 \
    --batch-size 1  \
    --model-parallel-size 1 \
    --update-freq 1 \
    --batch-read-ahead 1000 \
    --total-num-update 300000 \
    --log-format simple --log-interval 10 --disable-validation \
    --tiktoken-model cl100k_base \
    --bf16 # bf16 is encouraged in pre-training
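
When scaling this script up, the effective tokens per optimizer step are the product of --nproc-per-node, --batch-size, --update-freq, and --tokens-per-sample. A sketch of the arithmetic, assuming standard fairseq data-parallel semantics where --batch-size counts sequences per GPU:

# Tokens per update for the single-GPU settings above:
# 1 GPU x 1 sequence x update-freq 1 x 2048 tokens per sequence
echo $((1 * 1 * 1 * 2048))   # 2048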