
ray OOM in tensor parallel #322

Closed
liulfy opened this issue Jun 30, 2023 · 27 comments
Labels
bug Something isn't working

Comments

liulfy commented Jun 30, 2023

In my case, I can deploy the vLLM service on a single GPU, but when I use multiple GPUs I hit a Ray OOM error. Could you please help solve this problem?
My model is yahma/llama-7b-hf.
My transformers version is 4.28.0.
My CUDA version is 11.4.

2023-06-30 09:24:53,455 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-06-30 09:24:53,459 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-06-30 09:24:53,584 INFO worker.py:1636 -- Started a local Ray instance.
INFO 06-30 09:24:54 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 06-30 09:24:54 config.py:131] Possibly too large swap space. 16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space.
/opt/app/yahma-llama-lora
Exception in thread ray_print_logs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 900, in print_logs
    global_worker_stdstream_dispatcher.emit(data)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/ray_logging.py", line 264, in emit
    handle(data)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1788, in print_to_stdstream
    print_worker_logs(batch, sink)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1950, in print_worker_logs
    restore_tqdm()
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1973, in restore_tqdm
    tqdm_ray.instance().unhide_bars()
  File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 344, in instance
    _manager = _BarManager()
  File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 256, in __init__
    self.should_colorize = not ray.widgets.util.in_notebook()
  File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 205, in in_notebook
    shell = _get_ipython_shell_name()
  File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 194, in _get_ipython_shell_name
    import IPython
  File "/usr/local/lib/python3.8/dist-packages/IPython/__init__.py", line 30, in <module>
    raise ImportError(
ImportError:
IPython 8.13+ supports Python 3.9 and above, following NEP 29.
IPython 8.0-8.12 supports Python 3.8 and above, following NEP 29.
When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
Python 3.3 and 3.4 were supported up to IPython 6.x.
Python 3.5 was supported with IPython 7.0 to 7.9.
Python 3.6 was supported with IPython up to 7.16.
Python 3.7 was still supported with the 7.x branch.

See IPython README.rst file for more information:

https://github.com/ipython/ipython/blob/main/README.rst

Traceback (most recent call last):
  File "", line 1, in
  File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.30.192.36, ID: 17400c6c9eee3bc1384c172eecd4e1ecf2992cbc7f50cb27d2dc60d7) where the task (task ID: ffffffffffffffff283e91f20257d747969124a201000000, name=Worker.__init__, pid=26332, memory used=4.54GB) was running was 31.27GB / 32.00GB (0.977298), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.30.192.36`. To see the logs of the worker, use `ray logs worker-cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4*out -ip 10.30.192.36`. Top 10 memory users:
PID	MEM(GB)	COMMAND
26333	4.60	ray::Worker.__init__
26332	4.54	ray::Worker.__init__
26331	4.51	ray::Worker.__init__
26330	4.47	ray::Worker.__init__
25044	0.23	python
25099	0.19	/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
25340	0.06	ray::IDLE
25174	0.06	/usr/bin/python /usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 -...
25310	0.06	/usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
25349	0.05	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

@WoosukKwon (Collaborator)

Hi @liulfy, it's because we allocate 4 GiB of CPU memory per GPU for the swap space by default. Adding swap_space=1 when initializing LLM should solve the problem.

liulfy (Author) commented Jul 3, 2023

@WoosukKwon Thank you for answering my question! When I try swap_space, the problem is still not solved.
My code is here:
from vllm import LLM
model_path = 'yahma/llama-13b-hf'
llama_model = LLM(model=model_path, tensor_parallel_size=4, swap_space=1)

My CPU has 32 GB of memory, and I use 4 A100 40GB GPUs.
The error message is still the same:
2023-07-03 03:27:55,908 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-07-03 03:27:55,911 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-03 03:27:56,045 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 03:27:56 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
/opt/app/yahma-llama-lora
Traceback (most recent call last):
  File "", line 1, in
  File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.30.192.36, ID: 91847a2262e263f96264497d39d4641c385303a97ff78e3fc6f0e721) where the task (task ID: ffffffffffffffff27a08d091fe239dc78e7cd0c01000000, name=Worker.__init__, pid=51664, memory used=4.45GB) was running was 31.21GB / 32.00GB (0.97518), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.30.192.36`. To see the logs of the worker, use `ray logs worker-ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167*out -ip 10.30.192.36`. Top 10 memory users:
PID	MEM(GB)	COMMAND
51660	4.45	ray::Worker.__init__
51664	4.45	ray::Worker.__init__
51658	4.42	ray::Worker.__init__
51662	4.41	ray::Worker.__init__
45071	0.27	python
50443	0.18	/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
50650	0.06	/usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
50694	0.05	ray::IDLE
50681	0.05	ray::IDLE
50688	0.05	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

jibowang commented Jul 3, 2023

I met the same problem.

model:

25G	./llama-13b-lora-hf 

free -h

              total        used        free      shared  buff/cache   available
Mem:           31Gi       2.1Gi        26Gi       0.0Ki       2.3Gi        28Gi
Swap:         8.0Gi       1.2Gi       6.8Gi

Initializing an LLM engine with config:

model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0

This is the error:

2023-07-04 20:26:39,125	INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 192.168.1.240:6379...
2023-07-04 20:26:39,141	INFO worker.py:1636 -- Connected to Ray cluster.
INFO 07-04 20:26:39 llm_engine.py:60] Initializing an LLM engine with config: model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 07-04 20:26:39 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 105, in __init__
    self._init_cache()
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 117, in _init_cache
    num_blocks = self._run_workers(
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 334, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.240, ID: 3d1f1d89602340cb023e506de2a0dd5eb353e2ec29b8800cdf553655) where the task (task ID: ffffffffffffffffaece4988873caddc35d289400c000000, name=Worker.__init__, pid=3290289, memory used=13.74GB) was running was 29.74GB / 31.17GB (0.954021), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.240`. To see the logs of the worker, use `ray logs worker-078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa*out -ip 192.168.1.240. Top 10 memory users:
PID	MEM(GB)	COMMAND
3290289	13.74	ray::Worker.__init__
3290288	13.67	ray::Worker.__init__
3290170	0.21	python api_server.py --model /data/ketadb/text-generation-webui/models/llama-13b-lora-hf/ --tokenize...
3290084	0.11	/home/ubuntu/ketad/agent/subprocess/bin/keta-agent/keta-agent -c keta-agent.yaml
3220981	0.02	ray::IDLE
3219445	0.02	/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_...
3220978	0.02	ray::IDLE
3220980	0.02	ray::IDLE
3220979	0.02	ray::IDLE
3220990	0.02	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

I guess vLLM allocates more memory for the model than its physical size. Is there a formula for calculating the memory size?

CtfGo commented Jul 4, 2023

Me too.
Maybe the Ray memory monitor detects memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray regards it as unavailable according to its error log.

CtfGo commented Jul 4, 2023

Me too. Maybe the Ray memory monitor detects memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray regards it as unavailable according to its error log.

Disabling the Ray memory monitor with export RAY_memory_monitor_refresh_ms=0 works for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor

related issue: ray-project/ray#10895
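
For reference, a minimal sketch of applying the same workaround from Python instead of the shell (the model path and tensor_parallel_size are only example values; the assumption is that the variable is set before vLLM starts its local Ray instance):

import os

# Assumption: this must run before Ray is initialized, i.e. before the LLM is constructed,
# so the local Ray instance started by vLLM inherits it and its OOM killer stays disabled.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

from vllm import LLM

llm = LLM(model="yahma/llama-7b-hf", tensor_parallel_size=4)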

liulfy (Author) commented Jul 4, 2023

Me too. Maybe the Ray memory monitor detects memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray regards it as unavailable according to its error log.

Disabling the Ray memory monitor with export RAY_memory_monitor_refresh_ms=0 works for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor

related issue: ray-project/ray#10895

This does not work for me. I set NCCL_DEBUG=INFO and my log is as follows:
2023-07-04 08:55:35,247 INFO utils.py:573 -- Detected RAY_USE_MULTIPROCESSING_CPU_COUNT=1: Using multiprocessing.cpu_count() to detect the number of CPUs. This may be inconsistent when used inside docker. To correctly detect CPUs, unset the env var: RAY_USE_MULTIPROCESSING_CPU_COUNT.
2023-07-04 08:55:35,252 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.75gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-04 08:55:35,377 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-04 08:55:37 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
/opt/app/yahma-llama-lora
Traceback (most recent call last):
  File "", line 1, in
  File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=26149, ip=10.30.192.153, actor_id=c84d115691be21b19ce79faa01000000, repr=<vllm.worker.worker.Worker object at 0x7f989c09ffd0>)
  File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 40, in __init__
    _init_distributed_environment(parallel_config, rank,
  File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 302, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
    work = default_pg.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found

(Worker pid=26149) 2023-07-04 08:55:43,893 ERROR worker.py:861 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=26149, ip=10.30.192.153, actor_id=c84d115691be21b19ce79faa01000000, repr=<vllm.worker.worker.Worker object at 0x7f989c09ffd0>)
(Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 40, in __init__
(Worker pid=26149) _init_distributed_environment(parallel_config, rank,
(Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 302, in _init_distributed_environment
(Worker pid=26149) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=26149) return func(*args, **kwargs)
(Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=26149) work = default_pg.allreduce([tensor], opts)
(Worker pid=26149) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
(Worker pid=26149) ncclInternalError: Internal check failed.
(Worker pid=26149) Last error:
(Worker pid=26149) Bootstrap : no socket interface found
(Worker pid=26150) RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.

@justusmattern27

Hi, we're having the same issue. Has anyone found a solution for this yet?

@lucasjinreal

Same issue here, but I suspect it has nothing to do with Ray.

zhuohan123 added the bug label on Jul 18, 2023
@Oliver-ss (Contributor)

mark

a1164714 commented Sep 6, 2023

Mark, I have the same problem.

flexwang commented Sep 8, 2023

Same problem here.

@saumya-saran (Contributor)

I'm having the same issue.

@pfldy2850 (Contributor)

Same here, mark.

pfldy2850 (Contributor) commented Sep 25, 2023

In my humble opinion, there might be a problem when loading the model checkpoint.

for name, loaded_weight in hf_model_weights_iterator(
        model_name_or_path, cache_dir, load_format, revision):
    if "rotary_emb.inv_freq" in name:
        continue

In this loop, each worker (one per GPU device) needs enough CPU memory to load a checkpoint file.

In @liulfy's case, the 9.8 GB checkpoint file (pytorch_model-00001-of-00002.bin) is loaded on all workers at the same time, so four workers need roughly 4 x 9.8 GB ≈ 39 GB of CPU memory, which is more than the 32 GB available.

@pfldy2850 (Contributor)

Indeed, after sharding my model's checkpoint into smaller pieces, it works normally for me.
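
A minimal sketch of resharding a Hugging Face checkpoint into smaller pieces with transformers (the paths and the 2GB shard size are only example values):

from transformers import AutoModelForCausalLM

# Load the original checkpoint once (this step still needs enough CPU RAM for a full load).
model = AutoModelForCausalLM.from_pretrained("/path/to/llama-13b-hf", torch_dtype="auto")

# Re-save it split into ~2GB shards, so each worker reads smaller files one at a time.
model.save_pretrained("/path/to/llama-13b-hf-sharded", max_shard_size="2GB")

Note that the tokenizer and config files still need to be copied into the new directory separately.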

@pfldy2850 (Contributor)

As far as I know, there is no way to partially load a large checkpoint file at the code level
(to load a checkpoint file, memory of the same size as the file is required).

Any ideas on how vLLM could solve this problem?

@smallmocha

Same issue here, has anyone fixed it yet?

@lonngxiang

Same here, mark.

boydfd (Contributor) commented Oct 17, 2023

I met the same issue and figured out how to fix it. I have already created PR #1395.

@smallmocha

@boydfd That does not seem to fix this issue. It's not during model loading; I get OOM after running for several days.

boydfd (Contributor) commented Dec 5, 2023

@boydfd That does not seem to fix this issue. It's not during model loading; I get OOM after running for several days.

Maybe you can share more info?

AzureSilent commented Dec 30, 2023

@boydfd That does not seem to fix this issue. It's not during model loading; I get OOM after running for several days.

Maybe you can share more info?

Same issue here. I've found some info that may help:

1. It works fine with --tensor-parallel-size 1, i.e. without Ray. CPU memory usage stays static.
2. With --tensor-parallel-size 2, vLLM uses Ray, and as the model runs inference the CPU memory increases slowly until OOM.
3. With --enforce-eager together with --tensor-parallel-size 2, the CPU memory increases much more slowly (roughly 5x slower), but it still grows until OOM.
4. This memory-leak behavior occurs whether or not it runs in a container.

Model: llama-7b
CUDA version: 12.1

chaos318 commented Jan 3, 2024

It seems that if you turn down --max-model-len, it will start.
For example, start with a command like:
python -m vllm.entrypoints.api_server --model /workspace/model/ --tensor-parallel-size 4 --max-model-len 6000

@Taiinguyenn139

If anybody runs vLLM on Triton Inference Server:
Triton will automatically run your LLM instance on every available GPU. So if you have 2 GPUs and run with --tensor-parallel-size 2, it will create 2 instances and split each of them across the 2 GPUs, which may lead to OOM.
Solution: specify which is your "main" GPU in your config.pbtxt:
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

HAN-oQo commented Jan 22, 2024

I wrote the same answer on issue #721. Can you try this?

I had this issue when using a Docker container.
I was able to circumvent it by mounting an empty directory to /tmp/ray.
I hope this solution helps someone.

For example:

mkdir ./tmp_local
docker run -v ./tmp_local:/tmp/ray ...

su-park commented Feb 6, 2024

I encountered the same OOM error message, and
I guess there is still no other solution.

  • model: Llama-2-7b
  • CUDA version: 12.2
  • vllm version: 0.3.0
  • multi GPU (8)

su-park commented Feb 7, 2024

I resolved my case with enforce_eager=True, at the cost of slower generation.
Thank you all.
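
A minimal sketch of that workaround (the model name and tensor_parallel_size are only example values):

from vllm import LLM

# enforce_eager=True disables CUDA graph capture, trading some generation speed
# for lower memory overhead.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=8,
    enforce_eager=True,
)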
