
Second GPU is not found when running --sharded true #150

Open · psych0v0yager opened this issue Dec 24, 2023 · 6 comments
Labels: question (Further information is requested)


psych0v0yager commented Dec 24, 2023

System Info

Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUs: RTX 3090 (24 GB), RTX 3060 (12 GB)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true

Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")

Expected behavior

The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03    CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   49C    P8              15W / 170W |      9MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   51C    P8              18W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```

I checked the documentation and it says that --sharded true is the server's default setting. However, when I do not pass --sharded true, I get an out-of-memory error and have to use a much smaller --max-batch-prefill-tokens (1024, to be exact). When I run nvidia-smi, I get the following output:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03    CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8              15W / 170W |     12MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
| 81%   57C    P2             114W / 350W |  23873MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      7439      C   /opt/conda/bin/python3.10                 23856MiB |
+---------------------------------------------------------------------------------------+
```

It appears that the server cannot find the 3060. I swapped the 3060 out for one of my other GPUs (a Tesla P100, 16 GB), yet I still received the same error.

tgaddair (Contributor) commented:

Hey @psych0v0yager, apologies for the late reply; I've been out on holiday.

My first suspicion is that PyTorch isn't able to discover the device for some reason. Can you try running the following and sharing the output:

python -c "import torch; print(torch.cuda.device_count())"

tgaddair self-assigned this Dec 30, 2023
tgaddair added the question label Dec 30, 2023
psych0v0yager (Author) commented Dec 30, 2023

No worries! I ran the command in my conda environment and received the following output.

python -c "import torch; print(torch.cuda.device_count())"
2

tgaddair (Contributor) commented:

Was that command run from within the lorax Docker container or outside of it? If you ran it outside the container, it would be worth testing it from within the container (by running docker exec -it <container_id> /bin/bash to get a shell inside it) to see if it gives different results.
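
For example, something like the following (the container id is a placeholder for whatever `docker ps` shows):

```bash
# Find the running lorax container and open a shell inside it
docker ps                                 # note the container id
docker exec -it <container_id> /bin/bash

# Then, inside the container, check how many CUDA devices PyTorch can see
python -c "import torch; print(torch.cuda.device_count())"
```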

Another thing you can try is setting --num-shard 2 explicitly. If it's unable to find the second GPU with that arg, it should hopefully raise a more useful error.

psych0v0yager (Author) commented:

Okay, here are my results.

Running the command inside the Docker container gave the same result as outside: it detected 2 devices.

Furthermore, --num-shard worked as well; the model was split across both GPUs. Here is the exact command I ran:

model=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize awq --max-batch-prefill-tokens 512 --max-input-length 512 --num-shard 2

However, the container errored out with the following message:

2023-12-31T07:11:04.818168Z INFO lorax_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: Some(Awq), dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "7875655a60f8", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-31T07:11:04.818186Z WARN lorax_launcher: trust_remote_code is set. Trusting that model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ do not contain malicious code.
2023-12-31T07:11:04.818189Z INFO lorax_launcher: Sharding model on 2 processes
2023-12-31T07:11:04.818250Z INFO download: lorax_launcher: Starting download process.
2023-12-31T07:11:07.026335Z INFO lorax_launcher: cli.py:103 Files are already present on the host. Skipping download.

2023-12-31T07:11:07.320835Z INFO download: lorax_launcher: Successfully downloaded weights.
2023-12-31T07:11:07.320990Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2023-12-31T07:11:07.321025Z INFO shard-manager: lorax_launcher: Starting shard rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2023-12-31T07:11:17.607159Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-0

2023-12-31T07:11:17.629335Z INFO shard-manager: lorax_launcher: Shard ready in 10.307929251s rank=0
2023-12-31T07:11:17.706473Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-1

2023-12-31T07:11:17.729408Z INFO shard-manager: lorax_launcher: Shard ready in 10.407867503s rank=1
2023-12-31T07:11:17.828920Z INFO lorax_launcher: Starting Webserver
2023-12-31T07:11:18.336074Z WARN lorax_router: router/src/main.rs:356: --revision is not set
2023-12-31T07:11:18.336086Z WARN lorax_router: router/src/main.rs:357: We strongly advise to set it to a known supported commit.
2023-12-31T07:11:18.447140Z INFO lorax_router: router/src/main.rs:378: Serving revision 9afb6f0a7d7fe9ecebdda1baa4ff4e13e73e97d7 of model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
2023-12-31T07:11:18.466622Z INFO lorax_router: router/src/main.rs:216: Warming up model
2023-12-31T07:11:18.498551Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
_, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 408, in forward
logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 978, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 911, in forward
hidden_states = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 609, in forward
out = torch.nn.functional.embedding(input, self.weight)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 11.76 GiB of which 4.19 MiB is free. Process 12234 has 11.74 GiB memory in use. Of the allocated memory 11.52 GiB is allocated by PyTorch, and 13.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 867, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 512 prefill tokens. You need to decrease --max-batch-prefill-tokens

2023-12-31T07:11:18.498837Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 512 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-31T07:12:19.522290Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-12-31T07:12:19.581935Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out. rank=0
2023-12-31T07:12:19.581956Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2023-12-31T07:12:19.590594Z ERROR lorax_launcher: Shard 0 crashed
2023-12-31T07:12:19.590617Z INFO lorax_launcher: Terminating webserver
2023-12-31T07:12:19.590628Z INFO lorax_launcher: Waiting for webserver to gracefully shutdown
2023-12-31T07:12:19.590652Z INFO lorax_launcher: webserver terminated
2023-12-31T07:12:19.590659Z INFO lorax_launcher: Shutting down shards
2023-12-31T07:12:19.936191Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: ShardFailed

This is the output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03    CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P2              38W / 170W |  12040MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   61C    P2             152W / 350W |  12251MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A     12234      C   /opt/conda/bin/python3.10                 12026MiB |
|    1   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A     12233      C   /opt/conda/bin/python3.10                 12234MiB |
+---------------------------------------------------------------------------------------+

It appears that the container is trying to split the model evenly across both GPUs, which fills up the 3060 while the 3090 still has a lot of space left over. Is there a way to change how the layers are split so that the 3090 takes the larger chunk? Which part of LoRAX is responsible for sharding?

tgaddair (Contributor) commented Jan 3, 2024

Hey @psych0v0yager, that's an interesting scenario. It might be a little tricky (though not impossible) to divide the weights differently across the GPUs.

LoRAX uses tensor parallelism, so we slice tensors along dimensions when loading them and then aggregate the results of computations at certain points during the forward pass. To make this work the way you're describing, we would need a way to chunk the tensors more granularly, and then assign different workers a different number of chunks based on how much GPU memory they have available.

Here are the various tensor parallel layer implementations: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/layers.py

And here you can see the logic that shards the weights: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/weights.py#L353

These would roughly be the sections of the code that would need to change for this.
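
To make that concrete, here is a rough sketch of what even tensor-parallel sharding looks like (illustrative only, not the actual LoRAX code; the function name, shapes, and the rank/world_size arguments are placeholders):

```python
import torch

def shard_rows(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Slice a weight evenly along dim 0: every rank gets the same-sized chunk,
    regardless of how much memory its GPU actually has."""
    out_features = weight.shape[0]
    assert out_features % world_size == 0, "even sharding assumes divisibility"
    block = out_features // world_size
    return weight[rank * block:(rank + 1) * block]

# Example: a [4096, 4096] weight split across 2 ranks -> [2048, 4096] on each GPU
w = torch.empty(4096, 4096)
print(shard_rows(w, rank=0, world_size=2).shape)  # torch.Size([2048, 4096])
print(shard_rows(w, rank=1, world_size=2).shape)  # torch.Size([2048, 4096])
```

Supporting GPUs with different amounts of memory would mean replacing that single fixed block size with a per-rank chunk count, which is roughly the change that would have to run through layers.py and weights.py.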

psych0v0yager (Author) commented Jan 3, 2024

Thank you for the reply, @tgaddair. I will look at those sections; it does seem a bit tricky.

Meanwhile, I was looking at the following code:

https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/dist.py

I was wondering if it would be simpler to keep the existing tensor parallelism and instead shard the model into 3 slices, putting 2 slices on the 3090 and one slice on the 3060. That way none of the tensor parallelism would need to be rewritten.
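
To sketch what I mean (just illustrative arithmetic, not working LoRAX code; the function and argument names are made up):

```python
def uneven_shard_sizes(total_rows: int, slices_per_rank: list[int]) -> list[int]:
    """Split a dimension proportionally to the slice count assigned to each rank,
    e.g. two slices for the 3090 and one for the 3060."""
    total_slices = sum(slices_per_rank)
    base = total_rows // total_slices
    sizes = [base * s for s in slices_per_rank]
    # Hand any remainder rows to the last rank so the sizes sum back to total_rows
    sizes[-1] += total_rows - sum(sizes)
    return sizes

# A 4096-row tensor split 2:1 between the 3090 and the 3060 -> [2730, 1366]
print(uneven_shard_sizes(4096, slices_per_rank=[2, 1]))
```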

If I wanted to implement this, what portions of the code would I need to modify?
