System freezes and reboots while running vLLM container on Jetson Orin Nano with test.py script #800
Open
Description
When running the test.py script on a Jetson Orin Nano board inside a vLLM container, the system freezes and reboots. The issue occurs consistently regardless of the gpu_memory_utilization value (tested with 0.3, 0.5, and 0.8).
Steps to Reproduce
- Use a Jetson Orin Nano board with vLLM installed inside a Docker container.
- Run the following script (test.py):

#!/usr/bin/env python3
print('testing vLLM...')

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import xgrammar


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
    system_message = "You are a friendly chatbot who always responds in the style of a pirate."
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    sampling_params = SamplingParams(temperature=0, max_tokens=128)
    llm = LLM(model=model_path,
              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              gpu_memory_utilization=0.3)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
    filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
    print(xgrammar)
    print('vLLM OK\n')

- Observe that the system freezes and reboots (see the memory-monitoring sketch after this list).
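Since the Orin Nano's 7.44 GiB is unified memory shared by the CPU and GPU, I suspect the freeze happens when the whole system runs out of RAM rather than the GPU allocation alone failing. Below is a minimal monitoring sketch I would run in a second terminal alongside test.py; the file name memwatch.py and the one-second interval are illustrative, not part of the original report.

# memwatch.py - minimal sketch (assumption: run in a second terminal while test.py executes)
# Prints MemAvailable from /proc/meminfo once per second, so the last line written
# before the freeze shows how much system RAM was left.
import time


def mem_available_kib():
    """Return the MemAvailable value from /proc/meminfo, in kiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return -1


if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')} MemAvailable: {mem_available_kib() / 1024:.0f} MiB",
              flush=True)
        time.sleep(1)

Redirecting its output to a file should leave the last recorded value readable after the reboot.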
Expected Behavior
The script should execute without causing the system to freeze or reboot.
Actual Behavior
The system freezes and reboots during execution, regardless of the gpu_memory_utilization value.
Logs
Here are the logs captured before the system freezes:
testing vLLM...
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
tinyllama-1.1b-chat-v1.0.Q4_0.gguf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 638M/638M [01:17<00:00, 8.19MB/s]
INFO 01-27 12:38:07 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-27 12:38:27 config.py:510] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 01-27 12:38:27 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 01-27 12:38:27 config.py:1051] Possibly too large swap space. 4.00 GiB out of the 7.44 GiB total CPU memory is allocated for the swap space.
INFO 01-27 12:38:29 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<00:00, 2.91MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 10.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.43MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 1.19MB/s]
INFO 01-27 12:38:50 selector.py:120] Using Flash Attention backend.
INFO 01-27 12:38:51 model_runner.py:1094] Starting to load model /data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-27 12:39:09 model_runner.py:1099] Loading model weights took 0.5974 GB
INFO 01-27 12:39:16 worker.py:241] Memory profiling takes 7.06 seconds
INFO 01-27 12:39:16 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.50) = 3.72GiB
INFO 01-27 12:39:16 worker.py:241] model weights take 0.60GiB; non_torch_memory takes 0.78GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 2.05GiB.
INFO 01-27 12:39:17 gpu_executor.py:76] # GPU blocks: 6096, # CPU blocks: 11915
INFO 01-27 12:39:17 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 47.62x
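For context on the log above: at gpu_memory_utilization=0.50 the engine reserves 2.05 GiB for the KV cache on top of the weights and CUDA overhead, and the warning at config.py:1051 notes that the default 4 GiB CPU swap space is more than half of the board's 7.44 GiB, all of which comes out of the same unified memory. A hedged sketch of a lower-pressure configuration follows; swap_space, max_model_len, and enforce_eager are standard arguments of the vLLM LLM constructor, but the specific values are assumptions on my part, not a verified fix for this issue.

# Sketch only: same model load as test.py, with settings intended to reduce memory
# pressure on the 8 GB unified-memory Orin Nano. The chosen values are guesses.
from huggingface_hub import hf_hub_download
from vllm import LLM

model_path = hf_hub_download("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
                             filename="tinyllama-1.1b-chat-v1.0.Q4_0.gguf")

llm = LLM(
    model=model_path,
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.3,
    swap_space=1,        # GiB of CPU swap for the KV cache (default 4, flagged in the warning above)
    max_model_len=1024,  # shorter context -> smaller KV cache
    enforce_eager=True,  # skip CUDA graph capture, which reserves extra memory
)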
Environment
- Board: Jetson Orin Nano
- JetPack: 6.2
- Docker Image: dustynv/vllm:0.6.6.post1-r36.4.0
- Model: TinyLlama-1.1B-Chat-v1.0 (quantized GGUF format)
- Python Version: 3.10
- gpu_memory_utilization: 0.3, 0.5, and 0.8 (all tested)
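If it helps triage, this small snippet (run inside the container) is what I would use to collect exact versions and GPU memory; it only relies on standard torch and vllm attributes, and the formatting is just illustrative.

# Sketch to capture environment details from inside the container.
import platform

import torch
import vllm

print("python :", platform.python_version())
print("torch  :", torch.__version__, "| CUDA:", torch.version.cuda)
print("vllm   :", vllm.__version__)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("gpu    :", props.name, f"({props.total_memory / 2**30:.2f} GiB total)")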