Using the vLLM benchmark script, I evaluated Llama-3-8B-Instruct served with vLLM and with NVIDIA NIM, deployed on AWS via Hugging Face Endpoints on two instance types: L4 and A10G.
For the ShareGPT dataset on A10G, vLLM has 18% higher request throughput compared to NIM.
Request throughput (req/s), averaged over 2 runs:

Hardware | vLLM | NIM |
---|---|---|
L4 | 2.73 | 2.61 |
A10G | 4.13 | 3.49 |
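The 18% figure follows from the table; a quick sketch of the relative-throughput arithmetic (numbers taken from the table above):

```python
# Request throughput (req/s), averaged over 2 runs, from the table above.
results = {
    "L4":   {"vllm": 2.73, "nim": 2.61},
    "A10G": {"vllm": 4.13, "nim": 3.49},
}

for hw, r in results.items():
    # Relative gain of vLLM over NIM, in percent.
    gain = (r["vllm"] - r["nim"]) / r["nim"] * 100
    print(f"{hw}: vLLM is {gain:.0f}% faster than NIM")
# A10G: vLLM is 18% faster than NIM
```

On L4 the gap is much smaller (about 5%), so the headline 18% holds for A10G specifically.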
The repo contains the benchmark JSON files with full details.