System freezes and reboots while running vLLM container on Jetson Orin Nano with test.py script #800

@Ja-efan

Description

When running the test.py script on a Jetson Orin Nano board inside a vLLM container, the system freezes and reboots. The issue occurs consistently regardless of the gpu_memory_utilization value (tested with 0.3, 0.5, and 0.8).


Steps to Reproduce

  1. Use a Jetson Orin Nano board with vLLM installed inside a Docker container (an example container invocation is sketched after this list).

  2. Run the following script (test.py):

    #!/usr/bin/env python3
    print('testing vLLM...')
    
    from huggingface_hub import hf_hub_download
    from vllm import LLM, SamplingParams
    import xgrammar
    
    def run_gguf_inference(model_path):
        PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
        system_message = "You are a friendly chatbot who always responds in the style of a pirate."
        prompts = [
            "How many helicopters can a human eat in one sitting?",
            "What's the future of AI?",
        ]
        prompts = [
            PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
            for prompt in prompts
        ]
        sampling_params = SamplingParams(temperature=0, max_tokens=128)
        llm = LLM(model=model_path,
                  tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                  gpu_memory_utilization=0.3)
        outputs = llm.generate(prompts, sampling_params)
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
    if __name__ == "__main__":
        repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
        filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
        model = hf_hub_download(repo_id, filename=filename)
        run_gguf_inference(model)
        print(xgrammar)
    
    print('vLLM OK\n')
  3. Observe that the system freezes and reboots.
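
For reference, step 1 roughly corresponds to an invocation like the one below. The mount path and flags are assumptions based on the model path visible in the logs and on typical Jetson container usage; the reporter's exact command may differ:

    docker run --runtime nvidia -it --rm \
        --network host \
        -v /data/models/huggingface:/data/models/huggingface \
        dustynv/vllm:0.6.6.post1-r36.4.0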


Expected Behavior

The script should execute without causing the system to freeze or reboot.


Actual Behavior

The system freezes and reboots during execution, regardless of the gpu_memory_utilization value.
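
Because the console output below stops before the freeze, it may help to capture system-level state across the reboot. A sketch of diagnostic commands, assuming persistent journaling is enabled and the standard Jetson tooling is available:

    # Watch memory/CPU/GPU load on the Jetson while the script runs
    tegrastats
    # After the reboot, inspect the previous boot's journal (requires persistent journaling)
    sudo journalctl -b -1 -e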


Logs

Here are the logs captured before the system freezes:

testing vLLM...
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
tinyllama-1.1b-chat-v1.0.Q4_0.gguf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 638M/638M [01:17<00:00, 8.19MB/s]
INFO 01-27 12:38:07 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-27 12:38:27 config.py:510] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 01-27 12:38:27 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 01-27 12:38:27 config.py:1051] Possibly too large swap space. 4.00 GiB out of the 7.44 GiB total CPU memory is allocated for the swap space.
INFO 01-27 12:38:29 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<00:00, 2.91MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 10.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.43MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 1.19MB/s]
INFO 01-27 12:38:50 selector.py:120] Using Flash Attention backend.
INFO 01-27 12:38:51 model_runner.py:1094] Starting to load model /data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
INFO 01-27 12:39:09 model_runner.py:1099] Loading model weights took 0.5974 GB
INFO 01-27 12:39:16 worker.py:241] Memory profiling takes 7.06 seconds
INFO 01-27 12:39:16 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.50) = 3.72GiB
INFO 01-27 12:39:16 worker.py:241] model weights take 0.60GiB; non_torch_memory takes 0.78GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 2.05GiB.
INFO 01-27 12:39:17 gpu_executor.py:76] # GPU blocks: 6096, # CPU blocks: 11915
INFO 01-27 12:39:17 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 47.62x
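
The output ends right after the KV-cache allocation, and the earlier warning notes that 4.00 GiB of the 7.44 GiB of CPU memory is reserved for swap space. As a sketch only (swap_space, enforce_eager, and max_model_len are standard arguments of vLLM's LLM constructor, but the values here are illustrative and this is not a confirmed fix for the freeze), memory pressure could be reduced like this:

    # Hypothetical variant of the LLM construction from test.py above;
    # a sketch for reducing host/GPU memory pressure, not a confirmed fix.
    llm = LLM(model=model_path,
              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              gpu_memory_utilization=0.3,
              swap_space=1,          # GiB of CPU memory for swapped KV cache (default 4)
              enforce_eager=True,    # skip CUDA graph capture to lower peak memory
              max_model_len=1024)    # cap the maximum sequence length (logs show 2048)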

Environment

  • Board: Jetson Orin Nano
  • JetPack: 6.2
  • Docker Image: dustynv/vllm:0.6.6.post1-r36.4.0
  • Model: TinyLlama-1.1B-Chat-v1.0 (quantized GGUF format)
  • Python Version: 3.10
  • gpu_memory_utilization: tested with 0.3, 0.5, and 0.8
