
Latency increase when run on multi-GPU #116

Open
prd-tuong-nguyen opened this issue Dec 8, 2023 · 5 comments
Labels
question Further information is requested

Comments

@prd-tuong-nguyen

prd-tuong-nguyen commented Dec 8, 2023

System Info

I run your Docker image in two cases:

  • single GPU (--sharded false)
  • multi-GPU (--sharded true --num_shard 4)

When I run on a single GPU, the total time is around 1.5 seconds and it uses ~21 GB of GPU memory, but when I run on multi-GPU it takes ~2.4 seconds and uses ~19 GB per GPU :( Performance seems lower when running on multiple GPUs.
Do you see this problem too?
Launcher config:
{
  "model_id": "Open-Orca/Mistral-7B-OpenOrca",
  "adapter_id": "",
  "source": "hub",
  "adapter_source": "hub",
  "revision": null,
  "validation_workers": 2,
  "sharded": true,
  "num_shard": 4,
  "quantize": "BitsandbytesNF4",
  "dtype": null,
  "trust_remote_code": false,
  "max_concurrent_requests": 128,
  "max_best_of": 1,
  "max_stop_sequences": 4,
  "max_input_length": 2048,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_prefill_tokens": 4096,
  "max_batch_total_tokens": 100000,
  "max_waiting_tokens": 20,
  "max_active_adapters": 10,
  "adapter_cycle_time_s": 2,
  "hostname": "0.0.0.0",
  "port": 8000,
  "shard_uds_path": "/tmp/lorax-server",
  "master_addr": "localhost",
  "master_port": 29500,
  "huggingface_hub_cache": "/data",
  "weights_cache_override": null,
  "disable_custom_kernels": false,
  "cuda_memory_fraction": 1,
  "json_output": true,
  "otlp_endpoint": null,
  "cors_allow_origin": [],
  "watermark_gamma": null,
  "watermark_delta": null,
  "ngrok": false,
  "ngrok_authtoken": null,
  "ngrok_edge": null,
  "env": false,
  "download_only": false
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run docker with --sharded true --num_shard 4

Expected behavior

Same or better performance when running on multi-GPU.

@tgaddair
Contributor

tgaddair commented Dec 9, 2023

Hey @prd-tuong-nguyen, what kind of networking do you have between these GPUs? If they're using PCIe, the frequent communication between devices will likely cause performance to degrade. To get good performance on multiple GPUs, you typically need NVLink.

My recommendation is to use a single GPU if possible, and only use multi-GPU if you have to due to memory constraints (for example, serving a 70B-parameter model in fp16 on GPUs with 40GB of VRAM).
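One way to check the interconnect is nvidia-smi topo -m, which prints the link matrix between GPU pairs: entries like NV1/NV2 indicate NVLink, while PIX/PXB/PHB/SYS indicate PCIe or host paths. A minimal sketch that just wraps the command from Python:

# Minimal sketch: print the GPU interconnect topology matrix.
# NV# entries between GPU pairs mean NVLink; PIX/PXB/PHB/SYS mean the
# traffic goes over PCIe or through the host, which is much slower for
# the all-reduce communication that tensor parallelism generates.
import subprocess

subprocess.run(["nvidia-smi", "topo", "-m"], check=True)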

@tgaddair tgaddair added the question Further information is requested label Dec 9, 2023
@tgaddair tgaddair changed the title Something wrong when run on multi-GPU Latency increase when run on multi-GPU Dec 9, 2023
@prd-tuong-nguyen
Author

@tgaddair Thanks for your reply,
I want to take advantage of multi-GPU to increase concurrency.
Do you have any solution for this? For example, if I run one instance per GPU, what is the best way to load balance between them?

@tgaddair
Contributor

Hey @prd-tuong-nguyen, in your case I would recommend using data parallelism rather than model parallelism. Specifically, I would run one replica per GPU and then put a load balancer in front of them using something like Kubernetes.

As for the best load balancing strategy, if you do not have enough load to keep the GPUs fully utilized, or the number of adapters you're using is relatively low (<25 or so), then I would suggest using a round robin load balancing strategy. That will keep the replicas equally busy, which should help keep latency low.

If, however, you're operating at very high scale, I would suggest using a load balancer with a consistent hashing policy based on the adapter ID, so that you can more efficiently batch together requests for the same adapter and maximize throughput.
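As an illustration (not part of LoRAX itself), here is a minimal sketch of that routing idea: a small consistent-hash ring that pins each adapter ID to one single-GPU replica, so requests for the same adapter land on the same replica and can be batched together. The replica URLs and adapter name below are placeholders.

import bisect
import hashlib

def _hash(key: str) -> int:
    # md5 gives a hash that is stable across processes (unlike Python's hash()).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRouter:
    def __init__(self, replicas, vnodes=64):
        # vnodes: virtual nodes per replica, which smooths the key distribution.
        self._ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def pick(self, adapter_id: str) -> str:
        # Walk clockwise from the adapter's hash to the next virtual node.
        idx = bisect.bisect(self._keys, _hash(adapter_id)) % len(self._ring)
        return self._ring[idx][1]

# Placeholder URLs: one single-GPU LoRAX replica behind each.
router = ConsistentHashRouter([
    "http://lorax-gpu0:8000",
    "http://lorax-gpu1:8000",
    "http://lorax-gpu2:8000",
    "http://lorax-gpu3:8000",
])
print(router.pick("my-org/my-lora-adapter"))  # hypothetical adapter ID

Because the ring only remaps a small fraction of adapters when a replica is added or removed, adapter-to-replica affinity is mostly preserved as you scale the number of replicas.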

@prd-tuong-nguyen
Author

@tgaddair Thank you for your suggestion, I will try it <3

@bi1101

bi1101 commented Feb 13, 2024

Hi @prd-tuong-nguyen, do you have any performance benchmarks for running it in a multi-GPU setup, in terms of throughput?
