INFO 07-25 16:06:22 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-25 16:06:27 model_runner.py:680] Starting to load model facebook/opt-125m...
INFO 07-25 16:06:27 weight_utils.py:223] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.32it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.32it/s]
INFO 07-25 16:06:28 model_runner.py:692] Loading model weights took 0.2389 GB
INFO 07-25 16:06:29 gpu_executor.py:102] # GPU blocks: 34638, # CPU blocks: 7281
INFO 07-25 16:06:35 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-25 16:06:35 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-25 16:06:42 model_runner.py:1181] Graph capturing finished in 8 secs.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/tensorize.py", line 228, in <module>
[rank0]: tensorizer.tensorize_vllm_model(engine_args, tensorizer_config)
[rank0]: File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/tensorizer.py", line 472, in tensorize_vllm_model
[rank0]: serialize_vllm_model(
[rank0]: File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/tensorizer.py", line 429, in serialize_vllm_model
[rank0]: with _write_stream(output_file, **tensorizer_args.stream_params) as stream:
[rank0]: NameError: name '_write_stream' is not defined
Your current environment
🐛 Describe the bug
When following https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html#, I get an error. I saved the script as `tensorizer.py` and ran it; the output above ends in the `NameError` traceback.