System freezes and reboots while running vLLM container on Jetson Orin Nano with test.py script #800
Open
Description
When running the test.py script on a Jetson Orin Nano board inside a vLLM container, the system freezes and reboots. The issue occurs consistently regardless of the gpu_memory_utilization value (tested with 0.3, 0.5, and 0.8).
Steps to Reproduce
- Use a Jetson Orin Nano board with vLLM installed inside a Docker container.
- Run the following script (test.py):

#!/usr/bin/env python3
print('testing vLLM...')

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import xgrammar


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
    system_message = "You are a friendly chatbot who always responds in the style of a pirate."
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    sampling_params = SamplingParams(temperature=0, max_tokens=128)
    llm = LLM(model=model_path,
              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              gpu_memory_utilization=0.3)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
    filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
    print(xgrammar)
    print('vLLM OK\n')

- Observe that the system freezes and reboots (see the memory-monitoring sketch after this list).
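Since the Orin Nano's 7.44 GiB is unified memory shared by the CPU and GPU, I suspect the freeze happens when the whole system runs out of RAM rather than the GPU allocation alone failing. Below is a minimal monitoring sketch I would run in a second terminal alongside test.py; the file name memwatch.py and the one-second interval are illustrative, not part of the original report.

# memwatch.py - minimal sketch (assumption: run in a second terminal while test.py executes)
# Prints MemAvailable from /proc/meminfo once per second, so the last line written
# before the freeze shows how much system RAM was left.
import time


def mem_available_kib():
    """Return the MemAvailable value from /proc/meminfo, in kiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return -1


if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')} MemAvailable: {mem_available_kib() / 1024:.0f} MiB",
              flush=True)
        time.sleep(1)

Redirecting its output to a file should leave the last recorded value readable after the reboot.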
Expected Behavior
The script should execute without causing the system to freeze or reboot.
Actual Behavior
The system freezes and reboots during execution, regardless of the gpu_memory_utilization value.
Logs
Here are the logs captured before the system freezes:
testing vLLM...
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
tinyllama-1.1b-chat-v1.0.Q4_0.gguf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 638M/638M [01:17<00:00, 8.19MB/s]
INFO 01-27 12:38:07 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-27 12:38:27 config.py:510] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 01-27 12:38:27 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 01-27 12:38:27 config.py:1051] Possibly too large swap space. 4.00 GiB out of the 7.44 GiB total CPU memory is allocated for the swap space.
INFO 01-27 12:38:29 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<00:00, 2.91MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 10.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.43MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 1.19MB/s]
INFO 01-27 12:38:50 selector.py:120] Using Flash Attention backend.
INFO 01-27 12:38:51 model_runner.py:1094] Starting to load model /data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-27 12:39:09 model_runner.py:1099] Loading model weights took 0.5974 GB
INFO 01-27 12:39:16 worker.py:241] Memory profiling takes 7.06 seconds
INFO 01-27 12:39:16 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.50) = 3.72GiB
INFO 01-27 12:39:16 worker.py:241] model weights take 0.60GiB; non_torch_memory takes 0.78GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 2.05GiB.
INFO 01-27 12:39:17 gpu_executor.py:76] # GPU blocks: 6096, # CPU blocks: 11915
INFO 01-27 12:39:17 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 47.62x
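For context on the log above: at gpu_memory_utilization=0.50 the engine reserves 2.05 GiB for the KV cache on top of the weights and CUDA overhead, and the warning at config.py:1051 notes that the default 4 GiB CPU swap space is more than half of the board's 7.44 GiB, all of which comes out of the same unified memory. A hedged sketch of a lower-pressure configuration follows; swap_space, max_model_len, and enforce_eager are standard arguments of the vLLM LLM constructor, but the specific values are assumptions on my part, not a verified fix for this issue.

# Sketch only: same model load as test.py, with settings intended to reduce memory
# pressure on the 8 GB unified-memory Orin Nano. The chosen values are guesses.
from huggingface_hub import hf_hub_download
from vllm import LLM

model_path = hf_hub_download("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
                             filename="tinyllama-1.1b-chat-v1.0.Q4_0.gguf")

llm = LLM(
    model=model_path,
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.3,
    swap_space=1,        # GiB of CPU swap for the KV cache (default 4, flagged in the warning above)
    max_model_len=1024,  # shorter context -> smaller KV cache
    enforce_eager=True,  # skip CUDA graph capture, which reserves extra memory
)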
Environment
- Board: Jetson Orin Nano
- JetPack: 6.2
- Docker Image: dustynv/vllm:0.6.6.post1-r36.4.0
- Model: TinyLlama-1.1B-Chat-v1.0 (quantized GGUF format)
- Python Version: 3.10
- gpu_memory_utilization: 0.3, 0.5, and 0.8 (all tested)
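If it helps triage, this small snippet (run inside the container) is what I would use to collect exact versions and GPU memory; it only relies on standard torch and vllm attributes, and the formatting is just illustrative.

# Sketch to capture environment details from inside the container.
import platform

import torch
import vllm

print("python :", platform.python_version())
print("torch  :", torch.__version__, "| CUDA:", torch.version.cuda)
print("vllm   :", vllm.__version__)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("gpu    :", props.name, f"({props.total_memory / 2**30:.2f} GiB total)")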