Using the vLLM benchmark script, I evaluated Llama-3-8B-Instruct served with vLLM and with NVIDIA NIM, deployed on AWS via Hugging Face Endpoints on two instance types: L4 and A10G.
For the ShareGPT dataset on A10G, vLLM has 18% higher request throughput compared to NIM.
Request throughput (req/s), averaged over 2 runs:

Hardware | vLLM | NIM |
---|---|---|
L4 | 2.73 | 2.61 |
A10G | 4.13 | 3.49 |
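The 18% figure follows from the table; a quick sketch of the relative-throughput arithmetic (numbers taken from the table above):

```python
# Request throughput (req/s), averaged over 2 runs, from the table above.
results = {
    "L4":   {"vllm": 2.73, "nim": 2.61},
    "A10G": {"vllm": 4.13, "nim": 3.49},
}

for hw, r in results.items():
    # Relative gain of vLLM over NIM, in percent.
    gain = (r["vllm"] - r["nim"]) / r["nim"] * 100
    print(f"{hw}: vLLM is {gain:.0f}% faster than NIM")
# A10G: vLLM is 18% faster than NIM
```

On L4 the gap is much smaller (about 5%), so the headline 18% holds for A10G specifically.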
The repo contains the benchmark JSON files with full details.