
Using Transformers for Inference

We provide a command-line method for performing inference with the native 🤗Transformers library. Below, we demonstrate how to launch it, using the Llama-3-Chinese-Instruct model as an example.

Inference Using the Transformers Library

After downloading the full model weights, launch the script with the following command.

python scripts/inference/inference_hf.py \
    --base_model path_to_llama3_chinese_instruct_hf_dir \
    --with_prompt \
    --interactive
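
For reference, the same kind of inference can be done by calling the 🤗Transformers API directly. Below is a minimal sketch, assuming the model directory passed to --base_model, the model's built-in chat template, and illustrative generation settings; it is not the exact logic of inference_hf.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed path: point this at the same directory used for --base_model.
model_dir = "path_to_llama3_chinese_instruct_hf_dir"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the user input with the chat template, which is roughly
# what --with_prompt does for the Instruct model.
messages = [{"role": "user", "content": "你好，请介绍一下你自己。"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))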

Using vLLM for Accelerated Inference

You can use vLLM as a backend for LLM inference, which requires installing the vLLM library.

pip install vllm

Simply add the --use_vllm parameter to the original command line:

python scripts/inference/inference_hf.py \
    --base_model path_to_llama3_chinese_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
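
The --use_vllm switch replaces the Transformers generation backend with vLLM. If you want to call vLLM on its own, its offline inference API looks roughly like the sketch below; the prompt text and sampling settings are illustrative assumptions, and in practice the chat template should still be applied to the prompt, as --with_prompt does.

from vllm import LLM, SamplingParams

# Assumed path: same directory as --base_model above.
llm = LLM(model="path_to_llama3_chinese_instruct_hf_dir")
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)

outputs = llm.generate(["你好，请介绍一下你自己。"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)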

Parameter Explanation

  • --base_model {base_model}: Directory containing the Llama-3-Chinese-Instruct model weights and configuration files in HF format. You can also use the model's name on the 🤗Model Hub.
  • --tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value will be the same as --base_model.
  • --with_prompt: Whether to merge the input with the prompt template. If loading the Llama-3-Chinese-Instruct model, please enable this option!
  • --interactive: Launches interactive mode, allowing multiple single-turn Q&A exchanges (this is not the context-carrying multi-turn conversation provided by llama.cpp).
  • --data_file {file_name}: In non-interactive mode, reads the content of file_name line by line for prediction (see the example after this list).
  • --predictions_file {file_name}: In non-interactive mode, writes the predicted results to file_name in JSON format.
  • --only_cpu: Uses only the CPU for inference.
  • --gpus {gpu_ids}: Specifies the GPU device numbers to use, default is 0. For using multiple GPUs, separate with commas, like 0,1,2.
  • --load_in_8bit or --load_in_4bit: Loads the model in 8bit or 4bit mode to reduce memory usage, with --load_in_4bit recommended.
  • --use_vllm: Uses vLLM as a backend for LLM inference.
  • --use_flash_attention_2: Uses Flash-Attention 2 for accelerated inference; if not specified, SDPA is used by default.
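
For example, a non-interactive batch run over a file of prompts (one per line), loading the model in 4-bit to save memory, might look like this; the input and output file names are placeholders:

python scripts/inference/inference_hf.py \
    --base_model path_to_llama3_chinese_instruct_hf_dir \
    --with_prompt \
    --data_file prompts.txt \
    --predictions_file predictions.json \
    --load_in_4bit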

Note

This script is intended as a quick way to try the model and is not optimized for inference speed.
