HTHloveYDH/custom_llama3

launch training task:

# [reference]: https://www.youtube.com/watch?v=KaAJtI1T2x4&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj

simple launch on one node:

export CUDA_VISIBLE_DEVICES=0,1  # optional: only use GPU 0 and GPU 1
python train/main.py

DDP (FSDP) launch on one node via torchrun (e.g. 8 GPUs):

torchrun --standalone --nproc_per_node=8 train/main.py

DDP (FSDP) launch on multiple nodes via torchrun (e.g. 2 * 8 GPUs, two nodes):

# on node 0
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=xxx.xxx.xxx.xxx:xxxx train/main.py
# on node 1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=xxx.xxx.xxx.xxx:xxxx train/main.py
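
For reference, here is a minimal sketch of the distributed setup that a torchrun-launched script like train/main.py typically performs (hypothetical: the repo's actual training script may differ; the Linear layer stands in for the llama3 model):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = nn.Linear(4096, 4096).cuda()      # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    x = torch.randn(8, 4096, device="cuda")
    model(x).sum().backward()                 # DDP all-reduces gradients here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()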

launch generation task:

python gen/main.py

launch generation task with tensor parallelism:

torchrun --standalone --nproc_per_node=2 gen/main.py

launch test code:

python test/module_tests.py

llama3 configs:

llama3_8B

dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256,
multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048

llama3_70B

dim=8192, n_layers=80, n_heads=64, n_kv_heads=8, vocab_size=128256,
multiple_of=4096, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048

llama3_405B

dim=16384, n_layers=126, n_heads=128, n_kv_heads=8, vocab_size=128256,
multiple_of=None, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048
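
For reference, these configs map naturally onto a ModelArgs-style dataclass like the one in Meta's llama3 reference code (a sketch: this repo's actual config class may differ):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArgs:
    dim: int = 4096                 # defaults are the llama3_8B values
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab_size: int = 128256
    multiple_of: Optional[int] = 1024
    ffn_dim_multiplier: float = 1.3
    norm_eps: float = 1e-5
    rope_theta: float = 500000.0
    max_batch_size: int = 32
    max_seq_len: int = 2048

llama3_8B = ModelArgs()
llama3_70B = ModelArgs(dim=8192, n_layers=80, n_heads=64, multiple_of=4096)
llama3_405B = ModelArgs(dim=16384, n_layers=126, n_heads=128, multiple_of=None)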

env configuration:

base pytorch env

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install transformers==4.44.0

pip install tiktoken==0.7.0

pip install blobfile==3.0.0

pip install tqdm==4.66.5

faiss env

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install faiss-cpu==1.8.0.post1

pip install transformers==4.44.0

pip install tiktoken==0.7.0

pip install blobfile==3.0.0

pip install tqdm==4.66.5

fairscale env

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install fairscale==0.4.13

pip install fire==0.7.0

pip install transformers==4.44.0

pip install tiktoken==0.7.0

pip install blobfile==3.0.0

pip install tqdm==4.66.5

tensorrt-llm

cd TensorRT-LLM/examples/bloom

pip install torch torchvision torchaudio  # torch 2.4.0, CUDA 12.1

conda install -y mpi4py

conda install openmpi

pip install tensorrt_llm==0.13.0.dev2024081300 --extra-index-url https://pypi.nvidia.com

pip install -r ./requirements.txt

Quant Things

Preparation

Depends on the Docker env

  1. Prepare the source code

    git lfs install   # enable Git Large File Storage
    git clone https://github.com/NVIDIA/TensorRT-LLM.git
    cd TensorRT-LLM
    git checkout v0.8.0  # switch to version 0.8.0
    git submodule update --init --recursive  # update the submodules
    git lfs pull
  2. Build the docker image

    make -C docker release_build

    and you will get a docker image

  3. Run the docker container

    run from scratch:

    make -C docker release_run

    mounting a host path into the container is better:

    make -C docker release_run DOCKER_RUN_ARGS="-v /host/machine/path/:/container/path"

    Tip: once inside the container, cd to the path below, then do some quant things 👇, good luck 😀

cd /app/tensorrt_llm/examples/llama

Intro

Quantization happens at the checkpoint-generation stage. Besides convert_checkpoint.py there is another script, quantize.py; the two scripts cover different quantization methods. Below is a demo of several quantization methods, based on LLaMA 7B.

Int8 weight-only

python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int8
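
A minimal sketch of the idea behind int8 weight-only quantization (per-output-channel symmetric absmax; TensorRT-LLM's actual kernels and weight layouts differ):

import torch

def quantize_int8_weight_only(w: torch.Tensor):
    # one scale per output channel (row), symmetric around zero
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8_weight_only(w)
w_hat = q.float() * scale                      # dequantize
print((w - w_hat).abs().max())                 # per-element quantization error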

Int4 weight-only

python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int4_wo --dtype float16 --use_weight_only --weight_only_precision int4

Int4 AWQ

python ../quantization/quantize.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int4_awq --dtype float16 --qformat int4_awq --awq_block_size 128 --calib_size 32

Int4 GPTQ

This uses a pre-quantized GPTQ checkpoint (./llama-7b-4bit-gs128.safetensors) rather than the full model shards (model-00001-of-00002.safetensors etc.):

python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int4_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group --ammo_quant_ckpt_path ./llama-7b-4bit-gs128.safetensors

Int8 SmoothQuant

This quant method also lives in convert_checkpoint.py; use the --smoothquant parameter to enable it.

You can also pass --per_token and --per_channel. The value 0.5 means migrating 50% of the activation-quantization difficulty to the weights. Consider the corner cases: at 0, no difficulty is migrated, so the activations stay hard to quantize and the weights stay easy; at 1, all of the difficulty moves to the weights, so the activations become easy to quantize but the weights become hard.

python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int8_sq --dtype float16 --smoothquant 0.5 --per_token --per_channel

Tip: if you hit a trust_remote_code problem, just run export HF_DATASETS_TRUST_REMOTE_CODE=1 in your terminal; the calibration step will then download cnn_stories.tgz and dailymail_stories.tgz.
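
A minimal sketch of the SmoothQuant idea, using the paper's per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) (illustrative only; convert_checkpoint.py's actual implementation differs):

import torch

def smoothquant_scales(x_absmax, w_absmax, alpha=0.5):
    # alpha=0: all difficulty stays on activations; alpha=1: all moves to weights
    return x_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)

x = torch.randn(16, 4096) * 10.0      # activations with outliers are hard to quantize
w = torch.randn(11008, 4096)          # weights (out_features x in_channels)
s = smoothquant_scales(x.abs().amax(0), w.abs().amax(0), alpha=0.5)
x_smooth = x / s                      # activations become easier to quantize
w_smooth = w * s                      # weights absorb the difficulty
# (x / s) @ (w * s).T == x @ w.T, so the layer output is mathematically unchanged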

FP8

--calib_size 512 is an optional parameter.

python ../quantization/quantize.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_fp8 --dtype float16 --qformat fp8 --calib_size 512

Int8 KV Cache

python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/  --output_dir ./checkpoint_1gpu_int8_kv --dtype float16 --int8_kv_cache

👆 the steps above go from .safetensors to .safetensors

👇 the steps below go from .safetensors to .engine

Engine Generation

After quantization, convert the safetensors checkpoint to an engine:

trtllm-build --checkpoint_dir ./checkpoint_1gpu_int8_wo --output_dir ./engine_1gpu_int8_wo --gemm_plugin float16 --max_batch_size 16 --max_input_len 2048 --max_output_len 2048

If you hit ValueError: max_workers must be greater than 0, you need to map CUDA (the GPUs) into the container:

sudo make -C docker release_run DOCKER_RUN_ARGS=" --gpus all -v /data_ws/Data_1/tinghao/boxue/llm_weight:/home"

Now the container can see the GPUs, and you can build the engine 😀

Outro

Actually, there are lots of quant methods in TensorRT-LLM/examples/llama at main · NVIDIA/TensorRT-LLM (github.com). You can test them one by one.

LLaMA-like Model: All Steps

~ From beginning to end, from training to quantization ~

Intro

All files live in myllama; the path inside docker is /home/myllama.

First of all, generate and organize the files in the LLaMA layout :)

Including config.json,

model-00001-of-00004.safetensors,

model-00002-of-00004.safetensors,

model-00003-of-00004.safetensors,

model-00004-of-00004.safetensors,

pytorch_model-00001-of-00004.bin,

pytorch_model-00002-of-00004.bin,

pytorch_model-00003-of-00004.bin,

pytorch_model-00004-of-00004.bin,

model.safetensors.index.json

fp16 To int8

python convert_checkpoint.py --model_dir /home/myllama/ --output_dir ./myllama_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int8

skip the model type check function :)

safetensors To engine

trtllm-build --checkpoint_dir ./myllama_1gpu_int8_wo --output_dir ./myllama_1gpu_int8_wo_engine --gemm_plugin float16 --max_batch_size 16 --max_input_len 2048 --max_output_len 2048

Inference

python3 run.py --engine_dir ./llama/myllama_1gpu_int8_wo_engine/ --max_output_len 100 --tokenizer_dir /home/Llama-2-7b-hf/ --input_text "How do I count to nine in French?"

Ref

  1. TensorRT-LLM/examples/llama at main · NVIDIA/TensorRT-LLM (github.com)
  2. Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.16.2 documentation

TODO:

  1. add AWS S3 support so that several nodes can read data from an S3 bucket during multi-node training

some useful links:

quant

https://chatgpt.com/share/31aa8af3-dce2-457f-85db-2b18b3c242ce

torch.distributed

https://pytorch.org/docs/stable/distributed.html The package (torch.distributed) needs to be initialized using the torch.distributed.init_process_group() or torch.distributed.device_mesh.init_device_mesh() function before calling any other methods. Both block until all processes have joined.
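
A minimal sketch of the two initialization paths (torchrun supplies the rendezvous environment variables):

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")   # env:// rendezvous; blocks until all ranks join
# alternatively: mesh = init_device_mesh("cuda", (8,))  # also builds an N-D device mesh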

torch.distributed.tensor.parallel

https://pytorch.org/docs/stable/distributed.tensor.parallel.html
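
A minimal sketch of the parallelize_module API (assumes 2 GPUs launched via torchrun --standalone --nproc_per_node=2; the MLP and the w1/w2 plan names are illustrative, not this repo's):

import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

class MLP(nn.Module):
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (2,))   # also initializes the default process group
# shard w1 by columns and w2 by rows so the forward pass needs only one all-reduce
model = parallelize_module(MLP().cuda(), mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
out = model(torch.randn(8, 1024, device="cuda"))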

Pytorch: Pipeline Parallelism

https://pytorch.org/docs/stable/distributed.pipelining.html https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html
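
A rough sketch following the linked tutorial (PyTorch >= 2.4; run with torchrun --standalone --nproc_per_node=2; the toy model and split point are illustrative):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import ScheduleGPipe, SplitPoint, pipeline

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(256, 256) for _ in range(4)])

    def forward(self, x):
        return self.layers(x)

dist.init_process_group()
rank = dist.get_rank()
device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}")
mb = torch.randn(8, 256)   # one example microbatch, used to trace and split the model
pipe = pipeline(Toy(), mb_args=(mb,), split_spec={"layers.2": SplitPoint.BEGINNING})
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=4)
x = torch.randn(32, 256, device=device)   # full batch = 4 microbatches of 8
out = schedule.step(x) if rank == 0 else schedule.step()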

huggingface safetensors to llama.cpp gguf format

https://huggingface.co/docs/transformers/main/en/gguf
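
Per those docs, transformers can load a GGUF file back into dequantized torch tensors (the model id and filename below are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"    # any GGUF repo works
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)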

useful links about Transformer

https://huggingface.co/docs/transformers/main/model_doc/llama2 https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L255

https://huggingface.co/docs/transformers/main/model_doc/llama3 https://github.com/meta-llama/llama3

https://spaces.ac.cn/archives/9708 https://spaces.ac.cn/archives/9948
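
Those posts discuss RoPE; for reference, a minimal sketch of rotary position embedding with llama3's rope_theta (real implementations fuse this into attention and cache the angles):

import torch

def rope(x: torch.Tensor, theta: float = 500000.0) -> torch.Tensor:
    # x: (seq_len, dim), dim even; rotate each channel pair by a position-dependent angle
    seq_len, dim = x.shape
    freqs = theta ** (-torch.arange(0, dim, 2).float() / dim)        # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * freqs[None, :] # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out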

https://github.com/bojone/rerope?tab=readme-ov-file

https://github.com/hkproj/pytorch-lora
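
A minimal LoRA sketch in the spirit of that repo: y = W x + (alpha / r) * B A x, with the pretrained W frozen and only the low-rank A, B trained:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)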

https://github.com/pytorch/torchtitan/tree/main

