
Failed to test Draft-Target model with Triton server tensorrtllm backend #720

Open
@gloritygithub11

Description


System Info

GPU: 1 * A100 80G
tensorrt 10.6.0
tensorrt_llm 0.15.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Following the instructions at:
https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html#Draft-Target-Model

The draft and target models are Qwen2.5 7B and 32B, respectively, both quantized as w8a16.
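
For completeness, the engines were built roughly as in the sketch below. This is a sketch only: the checkpoint paths, output directories, and flag values are my assumptions, following the Qwen example and the Draft-Target-Model documentation, with w8a16 done as int8 weight-only conversion.

# Sketch only: assumed model paths and flag values (w8a16 = int8 weight-only).
python3 /app/tensorrt_llm/examples/qwen/convert_checkpoint.py \
    --model_dir $BASE_MODEL_PATH/Qwen2.5-32B-Instruct \
    --output_dir $BASE_MODEL_PATH/ckpt_target_w8a16 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

# Target engine: external draft tokens enabled, as the Draft-Target-Model docs describe.
trtllm-build \
    --checkpoint_dir $BASE_MODEL_PATH/ckpt_target_w8a16 \
    --output_dir $TARGET_ENGINE_PATH \
    --gemm_plugin float16 \
    --use_paged_context_fmha enable \
    --speculative_decoding_mode draft_tokens_external \
    --max_draft_len 10

# The draft engine (Qwen2.5 7B) is converted and built the same way,
# without the two speculative-decoding flags.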

The following run.py test succeeds:

export BASE_MODEL_PATH=<path to work dir>

TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"

DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer

python3 /app/tensorrt_llm/examples/run.py \
    --tokenizer_dir $TOKENIZER_PATH \
    --draft_engine_dir $DRAFT_ENGINE_PATH \
    --engine_dir $TARGET_ENGINE_PATH \
    --draft_target_model_config="[4,[0],[0],False]" \
    --max_output_len=256 \
    --kv_cache_enable_block_reuse \
    --kv_cache_free_gpu_memory_fraction=0.1 \
    --input_text="How does Draft-Sampling work?"
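
For context, my reading of the linked Draft-Target-Model documentation is that the four fields of --draft_target_model_config are the draft length, the draft-model device list, the target-model device list, and whether logits (rather than tokens) are used for acceptance; treat the annotation below as an assumption:

# [draft_len, draft_model_device_ids, target_model_device_ids, use_logits_to_accept]
# "[4,[0],[0],False]" -> 4 draft tokens, draft and target both on GPU 0, token-based acceptance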

The following script also starts the Triton server successfully:


export BASE_MODEL_PATH=<some local dir>

DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer

ACCUMULATE_TOKEN="false"
BACKEND="tensorrtllm"
BATCH_SCHEDULER_POLICY="guaranteed_no_evict"
BATCHING_STRATEGY="inflight_fused_batching"
BLS_INSTANCE_COUNT="1"
DECODING_MODE="top_k_top_p"
DECOUPLED_MODE="False"
DRAFT_GPU_DEVICE_IDS="0"
E2E_MODEL_NAME="ensemble"
ENABLE_CHUNKED_CONTEXT="false"  # assumed value; referenced by the fill_template calls below
ENABLE_KV_CACHE_REUSE="true"
ENGINE_PATH=$TARGET_ENGINE_PATH
EXCLUDE_INPUT_IN_OUTPUT="false"
KV_CACHE_FREE_GPU_MEM_FRACTION="0.1"
MAX_ATTENTION_WINDOW_SIZE=""
MAX_BEAM_WIDTH="1"
MAX_QUEUE_DELAY_MICROSECONDS="0"
MAX_TOKENS_IN_KV_CACHE=""
NORMALIZE_LOG_PROBS="true"
POSTPROCESSING_INSTANCE_COUNT="1"
PREPROCESSING_INSTANCE_COUNT="1"
TARGET_GPU_DEVICE_IDS="0"
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
# TOKENIZER_TYPE=llama
TRITON_GRPC_PORT="8001"
TRITON_HTTP_PORT="8000"
TRITON_MAX_BATCH_SIZE="16"
TRITON_METRICS_PORT="8002"
TRITON_REPO="tritonllm_repo"
USE_DRAFT_LOGITS="false"
LOGITS_DATATYPE="TYPE_FP32" # Replace by TYPE_FP16 for FP8 model

BASEDIR=`cd "$(dirname $0)"; pwd`
TRITON_SCRIPTS_DIR=$BASEDIR/configs/triton_trtllm_0.15/scripts
FILL_TEMPLATE=$TRITON_SCRIPTS_DIR/fill_template.py

# Make a copy of triton repo and replace the fields in the configuration files
# cd /app/tensorrtllm_backend/
# apt-get update && apt-get install -y build-essential cmake git-lfs
# pip3 install git-lfs tritonclient grpcio
rm -rf ${TRITON_REPO}
cp -R configs/triton_trtllm_0.15/inflight_batcher_llm ${TRITON_REPO}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:${LOGITS_DATATYPE}

# Make a copy of tensorrt_llm as configurations of draft / target models.
cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt          triton_backend:${BACKEND},engine_dir:${TARGET_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt    triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}

python3 $TRITON_SCRIPTS_DIR/launch_triton_server.py \
    --model_repo=${TRITON_REPO} \
    --tensorrt_llm_model_name "${TENSORRT_LLM_MODEL_NAME},${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --multi-model 
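
A quick way to confirm that the relevant models are actually loaded before running the test, using the standard Triton HTTP readiness endpoints (port matches TRITON_HTTP_PORT above):

curl -sf localhost:${TRITON_HTTP_PORT}/v2/health/ready && echo "server ready"
for m in ensemble tensorrt_llm tensorrt_llm_draft tensorrt_llm_bls; do
    curl -sf "localhost:${TRITON_HTTP_PORT}/v2/models/${m}/ready" && echo "${m} ready"
done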

The test fails with this script:

TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"

python3 /app/tensorrtllm_backend/tools/inflight_batcher_llm/speculative_decoding_test.py \
    --max-input-len 2048 \
    --dataset=input_data.json \
    --url-target=localhost:8001 \
    --url-draft=localhost:8001 \
    --url-control=localhost:8001 \
    --draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \
    --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
    --execute-bls-speculative-decoding \
    --disable-output-comparison \
    --num-draft-tokens=4 \
    --verbose
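
For reference, input_data.json follows the simple dataset layout used by the tensorrtllm_backend test tools: a JSON array of objects with input/instruction/output fields. The entry below is an illustrative assumption reconstructed from the "Prompt:" and "Output len:" lines in the log, not the exact file I used; check tools/dataset in tensorrtllm_backend for the real format.

cat > input_data.json <<'EOF'
[
    {
        "input": "James Best, best known for his ",
        "instruction": "Continue writing the following story:",
        "output": "placeholder reference output"
    }
]
EOF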

Expected behavior

The test above should succeed.

Actual behavior

I get the following error:

flags: Namespace(verbose=True, url_target='localhost:8001', url_draft='localhost:8001', url_control='localhost:8001', max_input_len=2048, preprocessor_model_name='preprocessing', postprocessor_model_name='postprocessing', draft_tensorrt_llm_model_name='tensorrt_llm_draft', target_tensorrt_llm_model_name='tensorrt_llm', bls_speculative_tensorrt_llm_model_name='tensorrt_llm_bls', execute_bls_speculative_decoding=True, beam_width=1, temperature=1.0, repetition_penalty=None, presence_penalty=None, frequency_penalty=None, output_len=100, num_draft_tokens=4, use_draft_logits=False, return_context_logits=False, return_generation_logits=False, end_id=None, pad_id=None, stop_words=[], bad_words=[], dataset='input_data.json', disable_output_comparison=True, return_draft_model_draft_logits=False, return_target_model_accepted_token_logits=False)
Prompt: James Best, best known for his  Continue writing the following story:
Output len: 84
Calling control model
Received an error from server:
in ensemble 'ensemble', Executor failed process requestId 1 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: The embedding bias shape is not as expected. Expected last dimension to be same as vocab size: 152064. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:483)
1       0x5575841f3d06 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f26bf539b51 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x790b51) [0x7f26bf539b51]
3       0x7f26c14ae2ee tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 222
4       0x7f26c1942288 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1432
5       0x7f26c19457c3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3507
6       0x7f26c1981bc8 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 472
7       0x7f26c198758e tensorrt_llm::executor::Executor::Impl::executionLoop() + 1390
8       0x7f26bd52c930 /app/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f26bd52c930]
9       0x7f26b9d37ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f26b9d37ac3]
10      0x7f26b9dc9850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f26b9dc9850]
output_control: 
Calling BLS speculative decoding model
Received an error from server:
Traceback (most recent call last):
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/model.py", line 108, in execute
    for res in res_gen:
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 219, in decode
    for gen_response in self._spec_generate(preproc_response, request):
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 271, in _spec_generate
    draft_response: GenerationResponse = self._draft_generate_non_streaming(
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 307, in _draft_generate_non_streaming
    triton_response = self._exec_triton_request_single(triton_req)
  File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())
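
Since the assertion complains that the embedding bias's last dimension does not match the vocab size 152064, a quick sanity check is to compare the vocab_size recorded for the draft and target engines (sketch; assumes the usual TensorRT-LLM engine-directory layout with a config.json per engine):

grep -o '"vocab_size": *[0-9]*' ${TARGET_ENGINE_PATH}/config.json
grep -o '"vocab_size": *[0-9]*' ${DRAFT_ENGINE_PATH}/config.json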

Additional notes

  1. Note that --url-control is required but is not mentioned in the original documentation; I added it as --url-control=localhost:8001.
  2. When I use Qwen2.5 1.5B as the draft model, I get the same error.
