# [reference]: https://www.youtube.com/watch?v=KaAJtI1T2x4&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj
python train/main.py  # optionally export CUDA_VISIBLE_DEVICES=0,1 first to use only GPU 0 and GPU 1
torchrun --standalone --nproc_per_node=8 train/main.py
# on node 0
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=xxx.xxx.xxx.xxx:xxxx train/main.py
# on node 1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=xxx.xxx.xxx.xxx:xxxx train/main.py
python gen/main.py
torchrun --standalone --nproc_per_node=2 gen/main.py
python test/module_tests.py
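For reference, an entry script launched with torchrun is expected to read the rank variables torchrun sets and initialize the process group itself. A minimal sketch of that setup (not necessarily how train/main.py does it):

```python
# Minimal sketch of the distributed setup a torchrun-launched script performs.
# This is an illustration, not a copy of train/main.py.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # NCCL is the usual backend for multi-GPU training; rank/world size
    # are picked up from the environment variables above.
    dist.init_process_group(backend="nccl")
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"rank {dist.get_rank()} / {dist.get_world_size()} on cuda:{local_rank}")
    dist.destroy_process_group()
```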
# 8B configuration
dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256,
multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048
# 70B configuration
dim=8192, n_layers=80, n_heads=64, n_kv_heads=8, vocab_size=128256,
multiple_of=4096, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048
# 405B configuration
dim=16384, n_layers=126, n_heads=128, n_kv_heads=8, vocab_size=128256,
multiple_of=None, ffn_dim_multiplier=1.3, norm_eps=1e-5, rope_theta=500000.0,
max_batch_size=32, max_seq_len=2048
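These parameter sets are typically carried in a ModelArgs-style dataclass as in the llama3 reference code; a sketch with the 8B values as defaults (field names assumed to follow that reference):

```python
# Sketch of a ModelArgs-style container for the configurations above.
# Defaults shown are the 8B values from the first block.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = 8
    vocab_size: int = 128256
    multiple_of: int = 1024          # FFN hidden size is rounded to a multiple of this
    ffn_dim_multiplier: Optional[float] = 1.3
    norm_eps: float = 1e-5
    rope_theta: float = 500000.0
    max_batch_size: int = 32
    max_seq_len: int = 2048
```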
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.44.0
pip install tiktoken==0.7.0
pip install blobfile==3.0.0
pip install tqdm==4.66.5
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install faiss-cpu==1.8.0.post1
pip install transformers==4.44.0
pip install tiktoken==0.7.0
pip install blobfile==3.0.0
pip install tqdm==4.66.5
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install fairscale==0.4.13
pip install fire==0.7.0
pip install transformers==4.44.0
pip install tiktoken==0.7.0
pip install blobfile==3.0.0
pip install tqdm==4.66.5
cd TensorRT-LLM/examples/bloom
pip install torch torchvision torchaudio  # torch 2.4.0, CUDA 12.1
conda install -y mpi4py
conda install openmpi
pip install tensorrt_llm==0.13.0.dev2024081300 --extra-index-url https://pypi.nvidia.com
pip install -r ./requirements.txt
Depends on a Docker environment
- Prepare the source code
git lfs install  # enable Git Large File Storage
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.8.0  # switch to version 0.8.0
git submodule update --init --recursive  # update the submodules
git lfs pull
- Build the Docker image
make -C docker release_build
This produces a Docker image.
- Run the Docker container
To run from scratch:
make -C docker release_run
Mounting a host directory into the container is better:
make -C docker release_run DOCKER_RUN_ARGS="-v /host/machine/path/:/container/path"
Tip: run the following inside the container. cd into the LLaMA examples path, then work through the quantization steps below 👇. Good luck 😀
cd /app/tensorrt_llm/examples/llama
Quantization happens at the checkpoint-generation stage. Besides convert_checkpoint.py, there is a second script, quantize.py; the two scripts cover different quantization methods. The demos below show several quantization methods, based on LLaMA 7B.
python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int8
python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int4
python ../quantization/quantize.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --qformat int4_awq --awq_block_size 128 --calib_size 32
This uses part of the full model, model-00001-of-00002.safetensors:
python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group --ammo_quant_ckpt_path ./llama-7b-4bit-gs128.safetensors
SmoothQuant is also implemented in convert_checkpoint.py; enable it with the --smoothquant parameter, optionally together with --per_token and --per_channel. The value passed to --smoothquant is the migration strength alpha: 0.5 means 50% of the activation quantization difficulty is transferred to the weights. Consider the corner cases: at 0, no difficulty is transferred, so the activations stay hard to quantize while the weights stay easy; at 1, all of it is transferred, so the activations become easy to quantize but the weights become hard.
python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --smoothquant 0.5 --per_token --per_channel
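To make the alpha trade-off concrete, here is a tiny illustration of the SmoothQuant scaling rule (a sketch of the idea, not the TensorRT-LLM code): per input channel j, s_j = max|X_j|^alpha / max|W_j|^(1-alpha); activations are divided by s and weights multiplied by it, so the layer output is unchanged while the quantization difficulty moves between the two.

```python
# Illustration of SmoothQuant's difficulty migration (not TensorRT-LLM code).
import torch

alpha = 0.5                           # corresponds to --smoothquant 0.5
X = torch.randn(8, 16) * 10.0         # activations with large outliers (hard to quantize)
W = torch.randn(16, 4)                # weights (comparatively easy to quantize)

# Per input-channel scale: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)

X_smooth = X / s                      # activation outliers are shrunk ...
W_smooth = W * s.unsqueeze(1)         # ... and absorbed into the weights

# The product is mathematically unchanged, only the value ranges moved.
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-4)
```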
Tip: if you hit a trust_remote_code problem, run export HF_DATASETS_TRUST_REMOTE_CODE=1 in your terminal; the calibration step will then download cnn_stories.tgz and dailymail_stories.tgz.
--calib_size 512 is an optional parameter.
python ../quantization/quantize.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --qformat fp8 --calib_size 512
python convert_checkpoint.py --model_dir /home/Llama-2-7b-hf/ --output_dir ./checkpoint_1gpu_int8_wo --dtype float16 --int8_kv_cache
👆 The steps above go from .safetensors to quantized .safetensors checkpoints.
👇 The step below turns a quantized .safetensors checkpoint into a .engine file.
trtllm-build --checkpoint_dir ./checkpoint_1gpu_int8_wo --output_dir ./engine_1gpu_int8_wo --gemm_plugin float16 --max_batch_size 16 --max_input_len 2048 --max_output_len 2048
If you hit ValueError: max_workers must be greater than 0, you need to map the GPUs into the container:
sudo make -C docker release_run DOCKER_RUN_ARGS=" --gpus all -v /data_ws/Data_1/tinghao/boxue/llm_weight:/home"
Now the GPUs are visible inside the container and the engine build succeeds 😀
There are many more quantization methods in TensorRT-LLM/examples/llama (NVIDIA/TensorRT-LLM on GitHub); you can test them one by one.
All files live in myllama; the path inside Docker is /home/myllama. First, generate and organize the files in the usual LLaMA layout :) which includes:
Include config.json,
model-00001-of-00004.safetensors,
model-00002-of-00004.safetensors,
model-00003-of-00004.safetensors,
model-00004-of-00004.safetensors,
pytorch_model-00001-of-00004.bin,
pytorch_model-00002-of-00004.bin,
pytorch_model-00003-of-00004.bin,
pytorch_model-00004-of-00004.bin,
model.safetensors.index.json
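Before running convert_checkpoint.py, a quick sanity check that the folder loads as a Hugging Face model directory can save a failed conversion. A sketch, assuming the /home/myllama path from above:

```python
# Sanity-check that /home/myllama is a loadable HF model directory
# before handing it to convert_checkpoint.py (sketch only; path assumed).
import json, os
from transformers import AutoConfig

model_dir = "/home/myllama"

config = AutoConfig.from_pretrained(model_dir)
print(config.model_type, config.num_hidden_layers, config.hidden_size)

# Make sure every shard referenced by the safetensors index actually exists.
with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
    index = json.load(f)
missing = {shard for shard in index["weight_map"].values()
           if not os.path.exists(os.path.join(model_dir, shard))}
print("missing shards:", missing or "none")
```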
python convert_checkpoint.py --model_dir /home/myllama/ --output_dir ./myllama_1gpu_int8_wo --dtype float16 --use_weight_only --weight_only_precision int8
If needed, skip the type-check function :)
trtllm-build --checkpoint_dir ./myllama_1gpu_int8_wo --output_dir ./myllama_1gpu_int8_wo_engine --gemm_plugin float16 --max_batch_size 16 --max_input_len 2048 --max_output_len 2048
python3 run.py --engine_dir ./llama/myllama_1gpu_int8_wo_engine/ --max_output_len 100 --tokenizer_dir /home/Llama-2-7b-hf/ --input_text "How do I count to nine in French?"
- TensorRT-LLM/examples/llama at main · NVIDIA/TensorRT-LLM (github.com)
- Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.16.2 documentation
- Add AWS S3 support so that several nodes can read data from an S3 bucket during a multi-node training task (a rough sketch follows below).
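One possible shape for that S3 support, sketched with boto3; the bucket, prefix, and sharding scheme are placeholders and not part of this repo yet:

```python
# Hedged sketch of per-rank data loading from S3 for multi-node training.
# Bucket, prefix and file layout are assumptions, not existing repo code.
import os
import boto3

def list_shards(bucket: str, prefix: str) -> list[str]:
    """Return all object keys under the prefix, sorted for a stable order."""
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return sorted(keys)

def shards_for_this_rank(keys: list[str]) -> list[str]:
    # torchrun sets RANK and WORLD_SIZE; each rank takes a disjoint slice.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return keys[rank::world_size]

def read_shard(bucket: str, key: str) -> bytes:
    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```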
https://chatgpt.com/share/31aa8af3-dce2-457f-85db-2b18b3c242ce
https://pytorch.org/docs/stable/distributed.html The package (torch.distributed) needs to be initialized using the torch.distributed.init_process_group() or torch.distributed.device_mesh.init_device_mesh() function before calling any other methods. Both block until all processes have joined.
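A minimal sketch of the two initialization paths mentioned above (run under torchrun so the rank environment variables exist; the 2x4 mesh assumes 8 processes):

```python
# Sketch: the two torch.distributed initialization entry points.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Option 1: classic process-group init (blocks until all ranks have joined).
dist.init_process_group(backend="nccl")

# Option 2: a 2D device mesh, e.g. 2-way data parallel x 4-way tensor parallel
# on 8 GPUs; it reuses the default process group if one is already initialized.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh["tp"].size())  # 4 ranks along the tensor-parallel dimension
```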
https://pytorch.org/docs/stable/distributed.tensor.parallel.html
https://pytorch.org/docs/stable/distributed.pipelining.html https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html
https://huggingface.co/docs/transformers/main/en/gguf
https://huggingface.co/docs/transformers/main/model_doc/llama2 https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L255
https://huggingface.co/docs/transformers/main/model_doc/llama3 https://github.com/meta-llama/llama3
https://spaces.ac.cn/archives/9708 https://spaces.ac.cn/archives/9948
https://github.com/bojone/rerope?tab=readme-ov-file
https://github.com/hkproj/pytorch-lora
https://github.com/pytorch/torchtitan/tree/main