inference_with_transformers_en
We provide a command-line method for performing inference with the native 🤗Transformers. Below, we demonstrate how to launch it, using the Llama-3-Chinese-Instruct model as an example.
After downloading the full model weights, start the script with the following command.
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_llama3_chinese_instruct_hf_dir \
    --with_prompt \
    --interactive
```
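For reference, the sketch below shows roughly what such chat-template-based generation looks like with the plain 🤗Transformers API. It is a minimal illustration, not the repository's script: the model path is a placeholder, and the example question and sampling settings are arbitrary assumptions.

```python
# Minimal sketch (not the repository's inference_hf.py): single-turn chat
# inference with plain 🤗Transformers. The model path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path_to_llama3_chinese_instruct_hf_dir"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the user input with the Llama-3 chat template
# (conceptually what --with_prompt does).
messages = [{"role": "user", "content": "Why is the sky blue?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings here are illustrative, not the script's defaults.
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```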
You can use vLLM as the backend for LLM inference, which requires installing the vLLM library:

```bash
pip install vllm
```

Simply add the `--use_vllm` parameter to the original command line:

```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_llama3_chinese_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
```
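For reference, the sketch below shows the vLLM backend in isolation through its own Python API (`LLM` and `SamplingParams`). It is an illustration rather than a reproduction of the repository script; the model path, prompt, and sampling settings are placeholders.

```python
# Sketch: offline generation directly through vLLM's Python API.
# Not the repository's script; the model path below is a placeholder.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "path_to_llama3_chinese_instruct_hf_dir"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build a Llama-3 chat-formatted prompt string (what --with_prompt achieves).
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_path)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=512))
print(outputs[0].outputs[0].text)
```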
Parameter description:

- `--base_model {base_model}`: Directory containing the HF-format Llama-3-Chinese-Instruct model weights and configuration files. You can also use a model name from the 🤗Model Hub.
- `--tokenizer_path {tokenizer_path}`: Directory containing the corresponding tokenizer. If this parameter is not provided, it defaults to the same value as `--base_model`.
- `--with_prompt`: Whether to wrap the input with the prompt template. If loading the Llama-3-Chinese-Instruct model, please enable this option!
- `--interactive`: Start in interactive mode for multiple single-round Q&As (not a contextual conversation as in llama.cpp).
- `--data_file {file_name}`: In non-interactive mode, read the contents of `file_name` line by line for prediction (see the batch example after this list).
- `--predictions_file {file_name}`: In non-interactive mode, write the prediction results to `file_name` in JSON format.
- `--only_cpu`: Use only the CPU for inference.
- `--gpus {gpu_ids}`: GPU device IDs to use; the default is 0. To use multiple GPUs, separate the IDs with commas, e.g. `0,1,2`.
- `--load_in_8bit` or `--load_in_4bit`: Load the model in 8-bit or 4-bit mode to reduce memory usage; `--load_in_4bit` is recommended.
- `--use_vllm`: Use vLLM as the backend for LLM inference.
- `--use_flash_attention_2`: Use Flash-Attention 2 for accelerated inference; if not specified, the code defaults to SDPA.
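As an illustration of the non-interactive mode, the sketch below writes a one-prompt-per-line input file, runs the script with the batch-related flags documented above, and reads the predictions back. The file names are placeholders, and the exact JSON schema of the predictions file is an assumption to verify against your own output.

```python
# Sketch: batch (non-interactive) prediction with --data_file / --predictions_file.
# File names are placeholders; the predictions JSON structure is assumed, not documented.
import json
import subprocess

# --data_file is read line by line, so write one prompt per line.
with open("questions.txt", "w", encoding="utf-8") as f:
    f.write("Why is the sky blue?\n")
    f.write("Write a short poem about spring.\n")

subprocess.run([
    "python", "scripts/inference/inference_hf.py",
    "--base_model", "path_to_llama3_chinese_instruct_hf_dir",
    "--with_prompt",
    "--load_in_4bit",
    "--data_file", "questions.txt",
    "--predictions_file", "predictions.json",
], check=True)

# Inspect the JSON output written by the script.
with open("predictions.json", encoding="utf-8") as f:
    predictions = json.load(f)
print(predictions)
```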
This script is intended as a quick and easy way to try the model and is not optimized for inference speed.