Model conversion process failed. Unable to find bin files #2365

joshight · 2024-09-05T20:11:32Z

Description

I see the following error during model conversion when deploying a fine-tuned v1.4_llama3 LLM with TensorRT-LLM.

LLM Inference Container:
763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124

The .bin files exist in the S3 path, but the conversion process cannot find them.

Note that the same model deploys fine with vLLM, but not with TensorRT-LLM:

VLLM properties:

engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=vllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=16
option.enable_streaming=false

TRTLLM properties:

engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=trtllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=128
option.enable_streaming=false
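
For reference, the two property sets above can be compared mechanically. The snippet below is a minimal sketch that parses both blocks as plain key=value text and prints the keys whose values differ. The helper names `parse_properties` and `diff_properties` are hypothetical and not part of DJL Serving.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_properties(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# The two configurations exactly as posted in this issue.
VLLM = """
engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=vllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=16
option.enable_streaming=false
"""

TRTLLM = """
engine=Python
option.model_id=s3_path
option.tensor_parallel_degree=1
option.trust_remote_code=true
option.rolling_batch=trtllm
option.entryPoint=djl_python.huggingface
option.max_model_len=16384
option.max_rolling_batch_size=128
option.enable_streaming=false
"""

if __name__ == "__main__":
    changed = diff_properties(parse_properties(VLLM), parse_properties(TRTLLM))
    for key, (vllm_value, trtllm_value) in changed.items():
        print(f"{key}: vllm={vllm_value} trtllm={trtllm_value}")
```

Only `option.rolling_batch` and `option.max_rolling_batch_size` differ, so the model location and the remaining options are identical in both deployments.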

Expected Behavior

The model conversion process should succeed, just as it does with the vLLM configuration.

Error Message

[INFO ] LmiUtils - convert_py: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/.djl.ai/download/cffe5246b14faa11e217a6f21535dff1719c39ba/pytorch_model-00001-of-00004.bin'
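
To narrow down which shards are missing, one option is to compare the shard names referenced by the checkpoint index against the files actually present in a local copy of the model. The snippet below is a minimal sketch. It assumes the checkpoint ships a standard Hugging Face `pytorch_model.bin.index.json` with a `weight_map` entry, which this issue does not confirm, and the helper name `missing_shards` is hypothetical.

```python
import json
import os


def missing_shards(model_dir, index_name="pytorch_model.bin.index.json"):
    """Return shard files named in the index's weight_map that are absent from model_dir.

    Assumes a Hugging Face style sharded checkpoint index (an assumption,
    not something confirmed in this issue).
    """
    index_path = os.path.join(model_dir, index_name)
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    # weight_map maps each parameter name to the shard file that stores it.
    shards = sorted(set(index["weight_map"].values()))
    return [s for s in shards if not os.path.isfile(os.path.join(model_dir, s))]
```

Running `missing_shards` on a local download of the S3 prefix returns an empty list when every shard referenced by the index is present in that directory.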

How to Reproduce?

Can be reproduced in SageMaker.

Steps to reproduce

Output logs of model deployment process

What have you tried to solve it?

Tried changing the instance size/type.
Validated that the .bin files are in place and that the S3 path is correct.

sindhuvahinis · 2024-10-10T21:50:58Z

Does it perhaps exist under a different S3 object or in another folder?
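
One way to check this, as a minimal sketch: list the object keys under the model prefix and separate the .bin files that sit directly under it from those nested in subfolders. The bucket and prefix are placeholders, the helper names `split_bin_keys` and `list_keys` are hypothetical, and `list_keys` assumes boto3 is installed and AWS credentials are configured.

```python
def split_bin_keys(keys, prefix):
    """Split .bin keys into those directly under prefix and those in subfolders."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    direct, nested = [], []
    for key in keys:
        if not key.startswith(prefix) or not key.endswith(".bin"):
            continue
        rest = key[len(prefix):]
        # A remaining "/" means the file lives in a subfolder of the prefix.
        (nested if "/" in rest else direct).append(key)
    return direct, nested


def list_keys(bucket, prefix):
    """List all object keys under prefix (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so split_bin_keys works without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

For example, `split_bin_keys(list_keys("my-bucket", "models/llama"), "models/llama")` returns the two lists for a hypothetical bucket and prefix. A non-empty second list would mean some shards live in a subfolder rather than at the top level of the prefix.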

joshight added the bug label on Sep 5, 2024
