Second GPU is not found when running --sharded true #150
Comments
Hey @psych0v0yager, apologies for the late reply, I've been out on holiday. My first suspicion is that PyTorch isn't able to discover the device for some reason. Can you try running the following and sharing the output:
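A minimal device-discovery check along these lines, assuming the standard torch.cuda API (the exact snippet isn't preserved above), would be:

'''
# Assumed check (not necessarily the exact command from the thread): verify that
# PyTorch can see both GPUs from the same environment the server runs in.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
'''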
No worries! I ran the command in my conda environment and received the following output.
Was that command run from within the lorax Docker container or outside of it? If you ran it outside the container, it would be worth testing it from within the container as well. Another thing you can try is setting --num-shard 2 explicitly.
Okay, here are my results. Testing the command in the Docker container gave the same result as outside: it detected 2 devices. Furthermore, --num-shard worked as well; the model was split across the 2 GPUs. Here is the exact command I ran.
However, the container errored out with the following message:

'''
2023-12-31T07:11:04.818168Z INFO lorax_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: Some(Awq), dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "7875655a60f8", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-31T07:11:07.320835Z INFO download: lorax_launcher: Successfully downloaded weights.
2023-12-31T07:11:17.629335Z INFO shard-manager: lorax_launcher: Shard ready in 10.307929251s rank=0
2023-12-31T07:11:17.729408Z INFO shard-manager: lorax_launcher: Shard ready in 10.407867503s rank=1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

2023-12-31T07:11:18.498837Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 512 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
'''

This is the output from nvidia-smi (the table itself is truncated here). It appears as if the container is trying to split the model evenly over both GPUs: it is filling up the 3060 while the 3090 still has a lot of space left over. Is there a way to change how the layers are split so the 3090 takes the larger chunk? Which part of LoRAX is responsible for sharding?
Hey @psych0v0yager, that's an interesting scenario. It might be a little tricky (though not impossible) to divide the weights differently across the GPUs. LoRAX uses tensor parallelism, so we slice tensors along certain dimensions when loading them and then aggregate the results of computations at certain points during the forward pass. To make this work the way you're describing, we would need a way to chunk the tensors more granularly, and then assign different workers a different number of chunks based on how much GPU memory they have available.

Here are the various tensor parallel layer implementations: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/layers.py

And here you can see the logic that shards the weights: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/weights.py#L353

These would roughly be the sections of the code that would need to change for this.
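For intuition, here is a minimal sketch of what even tensor-parallel sharding of a weight matrix looks like; it is illustrative only and does not use LoRAX's actual Weights or layer classes.

'''
# Illustrative sketch (not LoRAX code): even tensor-parallel sharding.
# Each rank takes an equal contiguous slice of a weight along one dimension,
# computes a partial result, and the partials are later combined
# (e.g. via all-reduce or all-gather) during the forward pass.
import torch

def shard_weight(weight: torch.Tensor, rank: int, world_size: int, dim: int = 0) -> torch.Tensor:
    """Return this rank's equal-sized slice of `weight` along `dim`."""
    size = weight.shape[dim]
    assert size % world_size == 0, "even sharding assumes the dimension divides evenly"
    block = size // world_size
    return weight.narrow(dim, rank * block, block)

# Example: a 4096 x 4096 projection split across 2 GPUs
w = torch.randn(4096, 4096)
w0 = shard_weight(w, rank=0, world_size=2)  # rows 0..2047
w1 = shard_weight(w, rank=1, world_size=2)  # rows 2048..4095
'''

Because every rank gets the same-sized slice, the GPU with the least memory (the 3060 here) ends up as the bottleneck, which matches the behavior described above.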
Thank you @tgaddair for the reply. I will look at those sections; it does seem a bit tricky. Meanwhile, I was looking at the following code: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/dist.py

I was wondering if it would be simpler to keep the existing tensor parallelism and instead shard the model into 3 slices, putting 2 slices on the 3090 and one slice on the 3060. That way none of the tensor parallelism would need to be rewritten. If I wanted to implement this, what portions of the code would I need to modify?
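As a rough illustration of that proportional-split idea (hypothetical helper, not LoRAX code), the shard boundaries could be made proportional to each GPU's memory instead of equal, giving roughly a 2:1 split between a 24 GB and a 12 GB card:

'''
# Hypothetical sketch: uneven shard boundaries proportional to per-GPU memory.
# Not part of LoRAX; it only illustrates the "2 slices on the 3090, 1 on the 3060" idea.
def proportional_bounds(total_rows: int, mem_per_rank: list[int]) -> list[tuple[int, int]]:
    """Split total_rows into contiguous ranges proportional to each rank's memory."""
    mem_sum = sum(mem_per_rank)
    bounds, start = [], 0
    for i, mem in enumerate(mem_per_rank):
        stop = total_rows if i == len(mem_per_rank) - 1 else start + total_rows * mem // mem_sum
        bounds.append((start, stop))
        start = stop
    return bounds

# RTX 3090 (24 GB) and RTX 3060 (12 GB) sharing a 4096-row weight:
print(proportional_bounds(4096, [24, 12]))  # -> [(0, 2730), (2730, 4096)]
'''

In practice the slice sizes would also need to respect attention-head and quantization-block boundaries, which is part of what makes uneven splits tricky.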
System Info
Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUs: RTX 3090 (24 GB), RTX 3060 (12 GB)
Information
Tasks
Reproduction
model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true
Error Message:
'''
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("`sharded` is true but only found 1 CUDA devices")
'''

Expected behavior
The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:
'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 49C P8 15W / 170W | 9MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 51C P8 18W / 350W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
'''
I checked the documentation, and it said that --sharded true is the default setting of the server. However, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024, to be exact). When I print nvidia-smi, I get the following output:
'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 15W / 170W | 12MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 81% 57C P2 114W / 350W | 23873MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 7439 C /opt/conda/bin/python3.10 23856MiB |
+---------------------------------------------------------------------------------------+
'''
It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100, 16 GB), yet I still received the same error.