Description
System Info
GPU: 1 * A100 80G
tensorrt 10.6.0
tensorrt_llm 0.15.0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Following the instructions at:
https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html#Draft-Target-Model
The draft and target models are Qwen2.5 7B and 32B, respectively; both are quantized as W8A16.
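For context, a W8A16 build following the linked guide looks roughly like the sketch below. It is illustrative only: the paths reuse the engine variables defined further down, and flags such as --max_draft_len are assumptions, so it may not match my exact build commands.
# Convert the HF checkpoint to a W8A16 TensorRT-LLM checkpoint (draft shown; target is analogous)
python3 /app/tensorrt_llm/examples/qwen/convert_checkpoint.py \
--model_dir <path to Qwen2.5-7B HF model> \
--output_dir $BASE_MODEL_PATH/ckpt_draft \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
# Build the draft engine (paged context FMHA is needed for KV cache reuse)
trtllm-build \
--checkpoint_dir $BASE_MODEL_PATH/ckpt_draft \
--output_dir $DRAFT_ENGINE_PATH \
--gemm_plugin float16 \
--use_paged_context_fmha enable
# Build the target engine with external draft tokens enabled
trtllm-build \
--checkpoint_dir $BASE_MODEL_PATH/ckpt_target \
--output_dir $TARGET_ENGINE_PATH \
--gemm_plugin float16 \
--use_paged_context_fmha enable \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 10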
The following test with run.py succeeds:
export BASE_MODEL_PATH=<path to work dir>
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer
python3 /app/tensorrt_llm/examples/run.py \
--tokenizer_dir $TOKENIZER_PATH \
--draft_engine_dir $DRAFT_ENGINE_PATH \
--engine_dir $TARGET_ENGINE_PATH \
--draft_target_model_config="[4,[0],[0],False]" \
--max_output_len=256 \
--kv_cache_enable_block_reuse \
--kv_cache_free_gpu_memory_fraction=0.1 \
--input_text="How does Draft-Sampling work?"
The following script also starts the Triton server successfully:
export BASE_MODEL_PATH=<some local dir>
DRAFT_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines_draft
TARGET_ENGINE_PATH=$BASE_MODEL_PATH/llm_engines
TOKENIZER_PATH=$BASE_MODEL_PATH/tokenizer
ACCUMULATE_TOKEN="false"
BACKEND="tensorrtllm"
BATCH_SCHEDULER_POLICY="guaranteed_no_evict"
BATCHING_STRATEGY="inflight_fused_batching"
BLS_INSTANCE_COUNT="1"
DECODING_MODE="top_k_top_p"
DECOUPLED_MODE="False"
DRAFT_GPU_DEVICE_IDS="0"
E2E_MODEL_NAME="ensemble"
ENABLE_KV_CACHE_REUSE="true"
ENGINE_PATH=$TARGET_ENGINE_PATH
EXCLUDE_INPUT_IN_OUTPUT="false"
KV_CACHE_FREE_GPU_MEM_FRACTION="0.1"
MAX_ATTENTION_WINDOW_SIZE=""
MAX_BEAM_WIDTH="1"
MAX_QUEUE_DELAY_MICROSECONDS="0"
MAX_TOKENS_IN_KV_CACHE=""
NORMALIZE_LOG_PROBS="true"
POSTPROCESSING_INSTANCE_COUNT="1"
PREPROCESSING_INSTANCE_COUNT="1"
TARGET_GPU_DEVICE_IDS="0"
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
# TOKENIZER_TYPE=llama
TRITON_GRPC_PORT="8001"
TRITON_HTTP_PORT="8000"
TRITON_MAX_BATCH_SIZE="16"
TRITON_METRICS_PORT="8002"
TRITON_REPO="tritonllm_repo"
USE_DRAFT_LOGITS="false"
LOGITS_DATATYPE="TYPE_FP32" # Replace by TYPE_FP16 for FP8 model
BASEDIR=`cd "$(dirname $0)"; pwd`
TRITON_SCRIPTS_DIR=$BASEDIR/configs/triton_trtllm_0.15/scripts
FILL_TEMPLATE=$TRITON_SCRIPTS_DIR/fill_template.py
# Make a copy of triton repo and replace the fields in the configuration files
# cd /app/tensorrtllm_backend/
# apt-get update && apt-get install -y build-essential cmake git-lfs
# pip3 install git-lfs tritonclient grpcio
rm -rf ${TRITON_REPO}
cp -R configs/triton_trtllm_0.15/inflight_batcher_llm ${TRITON_REPO}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:${LOGITS_DATATYPE}
# Make a copy of tensorrt_llm as configurations of draft / target models.
cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${TARGET_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 $FILL_TEMPLATE -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 $TRITON_SCRIPTS_DIR/launch_triton_server.py \
--model_repo=${TRITON_REPO} \
--tensorrt_llm_model_name "${TENSORRT_LLM_MODEL_NAME},${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
--multi-model
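Before running the test, a quick sanity check confirms that the server and the BLS model are actually serving. This assumes the default HTTP port configured above; the generate request goes through the plain ensemble, not the speculative BLS path.
# Optional readiness check plus a direct request to the ensemble model
curl -sf localhost:${TRITON_HTTP_PORT}/v2/health/ready && echo "server ready"
curl -sf localhost:${TRITON_HTTP_PORT}/v2/models/tensorrt_llm_bls/ready && echo "tensorrt_llm_bls ready"
curl -s -X POST localhost:${TRITON_HTTP_PORT}/v2/models/ensemble/generate \
-d '{"text_input": "How does Draft-Sampling work?", "max_tokens": 32, "bad_words": "", "stop_words": ""}'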
The test then fails with the following script:
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
python3 /app/tensorrtllm_backend/tools/inflight_batcher_llm/speculative_decoding_test.py \
--max-input-len 2048 \
--dataset=input_data.json \
--url-target=localhost:8001 \
--url-draft=localhost:8001 \
--url-control=localhost:8001 \
--draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
--target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \
--bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
--execute-bls-speculative-decoding \
--disable-output-comparison \
--num-draft-tokens=4 \
--verbose
Expected behavior
The test above should succeed.
Actual behavior
I get the following error:
flags: Namespace(verbose=True, url_target='localhost:8001', url_draft='localhost:8001', url_control='localhost:8001', max_input_len=2048, preprocessor_model_name='preprocessing', postprocessor_model_name='postprocessing', draft_tensorrt_llm_model_name='tensorrt_llm_draft', target_tensorrt_llm_model_name='tensorrt_llm', bls_speculative_tensorrt_llm_model_name='tensorrt_llm_bls', execute_bls_speculative_decoding=True, beam_width=1, temperature=1.0, repetition_penalty=None, presence_penalty=None, frequency_penalty=None, output_len=100, num_draft_tokens=4, use_draft_logits=False, return_context_logits=False, return_generation_logits=False, end_id=None, pad_id=None, stop_words=[], bad_words=[], dataset='input_data.json', disable_output_comparison=True, return_draft_model_draft_logits=False, return_target_model_accepted_token_logits=False)
Prompt: James Best, best known for his Continue writing the following story:
Output len: 84
Calling control model
Received an error from server:
in ensemble 'ensemble', Executor failed process requestId 1 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: The embedding bias shape is not as expected. Expected last dimension to be same as vocab size: 152064. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:483)
1 0x5575841f3d06 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f26bf539b51 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x790b51) [0x7f26bf539b51]
3 0x7f26c14ae2ee tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 222
4 0x7f26c1942288 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1432
5 0x7f26c19457c3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3507
6 0x7f26c1981bc8 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 472
7 0x7f26c198758e tensorrt_llm::executor::Executor::Impl::executionLoop() + 1390
8 0x7f26bd52c930 /app/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f26bd52c930]
9 0x7f26b9d37ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f26b9d37ac3]
10 0x7f26b9dc9850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f26b9dc9850]
output_control:
Calling BLS speculative decoding model
Received an error from server:
Traceback (most recent call last):
File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/model.py", line 108, in execute
for res in res_gen:
File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 219, in decode
for gen_response in self._spec_generate(preproc_response, request):
File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/decode.py", line 271, in _spec_generate
draft_response: GenerationResponse = self._draft_generate_non_streaming(
File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 307, in _draft_generate_non_streaming
triton_response = self._exec_triton_request_single(triton_req)
File "/app/llm-infer-service/tritonllm_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
raise pb_utils.TritonModelException(responses.error().message())
Additional notes
- Note that --url-control is required but is not mentioned in the original documentation; I added it as --url-control=localhost:8001.
- When I use Qwen2.5 1.5B as the draft model, I get the same error (see the vocab-size check sketched below).
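Since the assertion says the last dimension of the embedding bias should equal the vocab size 152064, one quick (illustrative) check is to compare the vocab_size that trtllm-build recorded in each engine's config.json:
# Compare the vocab_size of the draft and target engines against the expected 152064
grep -o '"vocab_size": *[0-9]*' ${DRAFT_ENGINE_PATH}/config.json
grep -o '"vocab_size": *[0-9]*' ${TARGET_ENGINE_PATH}/config.json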