[MPT-30B] OutOfMemoryError: CUDA out of memory #372

Closed
mspronesti opened this issue Jul 5, 2023 · 9 comments
Labels: bug (Something isn't working)

@mspronesti (Contributor) commented Jul 5, 2023

Hi vLLM dev team,
is vLLM supposed to work with MPT-30B? I tried loading it on AWS SageMaker using an ml.g5.12xlarge and even an ml.g5.48xlarge instance.

from vllm import LLM, SamplingParams

llm = LLM(model="mosaicml/mpt-30b")

However, in both cases I run into this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB (GPU 0; 22.19 GiB total capacity; 21.35 GiB already allocated; 46.50 MiB free; 21.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@WilliamTambellini

Does vLLM support PyTorch DataParallel (https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)?

@zhuohan123 (Member)

@mspronesti Can you try distributed inference following this guide? https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
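
For reference, a minimal sketch of what that guide amounts to for this model (a sketch, not the guide's exact example; it assumes 4 visible GPUs, e.g. the A10Gs of a g5.12xlarge, and that ray is installed for multi-GPU serving):

from vllm import LLM, SamplingParams

# Shard the model weights across 4 GPUs via tensor parallelism
# (assumption: 4 GPUs are visible on the instance).
llm = LLM(model="mosaicml/mpt-30b", tensor_parallel_size=4)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)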

@Joejoequ commented Jul 6, 2023

I ran into the same problem and fixed it by using LLM(model="", tokenizer_mode="slow").
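
For this issue's model, that would look roughly like the sketch below (the model name is filled in here only for illustration; tokenizer_mode="slow" falls back to the non-fast Hugging Face tokenizer):

from vllm import LLM

# Fall back to the slow tokenizer instead of the fast one.
llm = LLM(model="mosaicml/mpt-30b", tokenizer_mode="slow")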

@mspronesti (Contributor, Author) commented Jul 6, 2023

@zhuohan123 thanks for your quick reply. Installing ray and setting tensor_parallel_size=4 (or =8 on bigger instances) yields

RayActorError: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=22291, ip=172.16.94.76, actor_id=9773ec3febd4775b0cb0bed101000000, repr=<vllm.worker.worker.Worker object at 0x7f4511af7e80>)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
    _init_distributed_environment(parallel_config, rank,
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
    work = default_pg.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'

@Joejoequ setting tokenizer_mode='slow' gives:

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.

@nearmax-p

Same for me. Actually, setting tensor_parallel_size works for me on 2 A100 GPUs. However, after the LLM engine says it is starting, it never finishes the setup process.

@BEpresent commented Jul 11, 2023

Also got the same PYTORCH_CUDA_ALLOC_CONF error on an A100 40GB GPU for several WizardLM 33B models (both quantized and non-quantized). Should I open a new issue for this since it's not an MPT model? The model runs on that GPU, e.g. using Exllama.

@FarziBuilder

Same here. I am able to load LLaMA 65B locally using this notebook: https://twitter.com/m_ryabinin/status/1679217067310960645?s=20

But I am unable to run it with vLLM.

@zhuohan123 added the bug label on Jul 18, 2023
@Gourang97

@mspronesti, setting os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1' should resolve the issue.
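
A minimal sketch of that workaround (assuming the variable needs to be set before vLLM spawns its distributed workers, and reusing the tensor-parallel setup suggested above):

import os

# Ask NCCL to ignore the missing peer-to-peer support between these GPUs;
# set this before vLLM initializes the distributed workers.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

llm = LLM(model="mosaicml/mpt-30b", tensor_parallel_size=4)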

@hmellor (Collaborator) commented Mar 6, 2024

Closing as stale.

The original issue was due to insufficient GPU memory on a single GPU; it should be solvable using tensor parallelism, as mentioned by @zhuohan123.

@hmellor closed this as completed on Mar 6, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this issue Oct 15, 2024