- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/generation/utils.py:1249: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
-   warnings.warn(
- Hello, my dog is cute and I want to give him a good home. I have a
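The warning above comes from a plain Hugging Face Transformers `generate()` call; a minimal sketch of that kind of call, following the warning's own advice and passing `max_new_tokens` explicitly (the model name and token budget here are illustrative assumptions, not taken from this commit):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model for illustration; any causal LM emits the same warning
# when generate() is called without an explicit length limit.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# Passing max_new_tokens avoids the model-agnostic default max_length=20.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```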
- WARNING 03-13 09:37:29 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
- INFO 03-13 09:37:35 config.py:510] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
- INFO 03-13 09:37:35 config.py:1339] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
- INFO 03-13 09:37:35 llm_engine.py:234] Initializing an LLM engine (v0.6.6) with config: model='/models/facebook/opt-125m', speculative_config=None, tokenizer='/models/facebook/opt-125m', skip_tokenizer_init=False,
- INFO 03-13 09:37:39 client.py:113] Model loaded: facebook/opt-125m/rank_0, 8554547c-25d3-4a01-92b6-27d69d91d3b8
- INFO 03-13 09:37:39 torch.py:160] restore state_dict takes 0.00017452239990234375 seconds
- INFO 03-13 09:37:39 client.py:117] confirm_model_loaded: facebook/opt-125m/rank_0, 8554547c-25d3-4a01-92b6-27d69d91d3b8
- INFO 03-13 09:37:39 client.py:125] Model loaded
- INFO 03-13 09:37:39 model_runner.py:1099] Loading model weights took 0.0000 GB
- /app/third_party/vllm/vllm/model_executor/layers/linear.py:140: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at ../aten/src/ATen/Context.cpp:296.)
-   return F.linear(x, layer.weight, bias)
- INFO 03-13 09:37:42 worker.py:253] Memory profiling takes 2.68 seconds
- INFO 03-13 09:37:42 worker.py:253] the current vLLM instance can use total_gpu_memory (23.98GiB) x gpu_memory_utilization (0.90) = 21.59GiB
- INFO 03-13 09:37:42 worker.py:253] model weights take 0.00GiB; non_torch_memory takes 0.62GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 20.50GiB.
- INFO 03-13 09:37:42 gpu_executor.py:76] # GPU blocks: 37326, # CPU blocks: 7281
- INFO 03-13 09:37:42 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 291.61x
- INFO 03-13 09:37:43 model_runner.py:1429] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
- Capturing CUDA graph shapes: 100%|████████████████████████████████████████| 35/35 [00:09<00:00, 3.73it/s]
- INFO 03-13 09:37:52 model_runner.py:1549] Graph capturing finished in 9 secs, took 0.06 GiB
- INFO 03-13 09:37:52 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 12.80 seconds
- Processed prompts: 100%|█| 4/4 [00:00<00:00, 50.16it/s, est. speed input: 326.19 toks/s, output: 802.89 to
+ INFO 06-05 13:02:51 [__init__.py:243] Automatically detected platform rocm.
+ INFO 06-05 13:02:52 [__init__.py:31] Available plugins for group vllm.general_plugins:
+ INFO 06-05 13:02:52 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
+ INFO 06-05 13:02:52 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
+ INFO 06-05 13:03:00 [config.py:793] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
+ INFO 06-05 13:03:00 [arg_utils.py:1594] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
+ INFO 06-05 13:03:04 [config.py:1910] Disabled the custom all-reduce kernel because it is not supported on current platform.
+ INFO 06-05 13:03:17 [worker.py:303] Memory profiling takes 11.68 seconds
+ INFO 06-05 13:03:17 [worker.py:303] the current vLLM instance can use total_gpu_memory (23.98GiB) x gpu_memory_utilization (0.90) = 21.59GiB
+ INFO 06-05 13:03:17 [worker.py:303] model weights take 0.24GiB; non_torch_memory takes 0.53GiB; PyTorch activation peak memory takes 0.49GiB; the rest of the memory reserved for KV Cache is 20.33GiB.
+ INFO 06-05 13:03:17 [executor_base.py:112] # rocm blocks: 37011, # CPU blocks: 7281
+ INFO 06-05 13:03:17 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 289.15x
+ INFO 06-05 13:03:18 [model_runner.py:1526] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
+ Capturing CUDA graph shapes: 100%|█████████████████████████| 35/35 [00:09<00:00, 3.55it/s]
+ INFO 06-05 13:03:28 [model_runner.py:1684] Graph capturing finished in 10 secs, took 0.13 GiB
+ INFO 06-05 13:03:28 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 22.81 seconds
+ Processed prompts: 100%|█| 4/4 [00:00<00:00, 6.71it/s, est. speed input: 43.59 toks/s, out
Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' a worldwide knowledge center. What better place to learn about the history and culture of'
Prompt: 'The future of AI is', Generated text: " here: it's the future of everything\nIf you want to test your minds"
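The `Processed prompts` progress bar and the four prompt/completion pairs above come from vLLM's offline generation API; a minimal sketch of the kind of script that produces this output (the model path, sampling settings, and commented options are assumptions, not taken from this commit):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)

# The captured log suggests enforce_eager=True (or --enforce-eager on the CLI)
# to skip CUDA-graph capture, and lowering gpu_memory_utilization or max_num_seqs
# if capture runs out of memory.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.90)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```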
- [rank0]:[W313 09:37:53.050846849 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
+ [rank0]:[W605 13:03:30.532018298 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
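Both versions of the warning above are PyTorch reporting that the NCCL/RCCL process group was not torn down before the program exited. A sketch of the generic cleanup it asks for, placed at the very end of the script (with vLLM the process group is created internally, so this only applies when the default group is initialized in your process; it is not a fix taken from this commit):

```python
import torch.distributed as dist

# Run after all generations have finished, just before the script exits,
# so ProcessGroupNCCL does not warn about a missing destroy_process_group().
if dist.is_initialized():
    dist.destroy_process_group()
```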
This issue is due to an internal bug in ROCm: after the inference instance has completed, its GPU memory is still occupied and not released. For more information, see this [issue](https://github.com/ROCm/HIP/issues/3580).
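One way to confirm the symptom is to check VRAM usage after the inference process has exited; a small sketch using the ROCm SMI tool (the exact flags may vary between ROCm releases and are an assumption here):

```python
import subprocess

# After the inference process has exited, VRAM should be close to free again.
# If gigabytes remain in use with no process attached, you are likely hitting
# the ROCm issue referenced above.
print(subprocess.run(["rocm-smi", "--showmeminfo", "vram"],
                     capture_output=True, text=True).stdout)
```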
- 2. vLLM v0.5.0.post1 cannot be built with ROCm 6.2.0
-
- This issue is due to the ambiguity of a function call in ROCm 6.2.0. You may change vLLM's source code as in this [commit](https://github.com/vllm-project/vllm/commit/9984605412de1171a72d955cfcb954725edd4d6f).