Nanoflow

NanoFlow is a throughput-oriented high-performance serving framework for LLMs. NanoFlow consistently delivers superior throughput compared to vLLM, Deepspeed-FastGen, and TensorRT-LLM. NanoFlow achieves up to 1.91x throughput boost compared to TensorRT-LLM. The key features of NanoFlow include:

Intra-device parallelism: Maximizes hardware utilization by exploiting nano-batching and execution unit scheduling to overlap different resource demands inside a single device.
Asynchronous CPU scheduling: Achieves highly efficient CPU scheduling by adopting asynchronous control flow for GPU execution, CPU batch formation and KV-cache management.

News

[2024/09] 🚀 Nanoflow now supports Llama2 70B, Llama3 70B, Llama3.1 70B, Llama3 8B, Llama3.1 8B and Qwen2 72B models. We also released experiment scripts to reproduce the evaluation results.

Introduction

The key insight behinds NanoFlow is that traditional pipeline design of existing frameworks under-utilizes hardware resources due to the sequential execution of operations. Therefore, NanoFlow proposes intra-device parallelism (as shown in the following gif), which use nano-batches to schedule the compute-, memory-, network-bound operations for simultaneous execution. Such overlapping leaves compute-bound operations on the critical path and boost the resource utilization.

Overview of NanoFlow's key components

Illustration of intra-device parallelism

With highly utilized GPU, the overhead of CPU, which consists of KV-cache management, batch formation, and retired requests selection, takes significant part ($>10$%) of inference time. Therefore, NanoFlow adopts an asyncronous control flow as shown in the following figure. At any iteration $i$, NanoFlow makes batching decisions and allocates the KV-cache entries for the next iteration before the end of the current iteration. NanoFlow directly launches iteration $i + 1$ without detecting the end-of-sequence (EOS) tokens generated in iteration $i$ and retires completed requests at iteration $i+2$.

Explanation of asyncronous control flow scheduling

To avoid recomputation and reuse the KV-cache from multi-round conversations, NanoFlow eagerly offloads the KV-cache of finished requests to SSDs. In one iteration, NanoFlow selects the KV-cache of the retired requests and copies them to the host in parallel to the on-the-fly inference operations, via a layer-by-layer manner. Our calculation shows that only 5GB/s are needed for the offloading bandwidth of serving LLaMA2-70B, while a single SSD can reach 3GB/s.

With all mentioned techniques implemented, we now open-source NanoFlow of a Cpp-based backend and a Python-based demo frontend in ~4K lines. NanoFlow integrates state-of-the-art kernel libraries including CUTLASS for GEMM, FlashInfer for Attention, and MSCCL++ for Network. This codebase also contains necessary scripts for environment setup and experiment reproduction.

Benchmarks

We list some of the primary benchmarks. Please check our paper for more details. We evaluate on A100 80GB SXM and choose vLLM v0.5.3, Deepspeed-FastGen v0.2.3, and TensorRT-LLM v0.8.0 as baselines. Note that all frameworks turn off specific optimizations like quantization, speculative decoding, prefix cache, etc..

Offline throughput: Llama2-70B on 8xA100 (80GB)

We conduct offline througput in two settings: practical workloads from collected traces (Splitwise, LMSYS-Chat-1M, ShareGPT), and constant input/output length. NanoFlow consistently surpasses all the baselines.

Offline throughput benchmarks

Online latency: Llama2-70B on 8xA100 (80GB)

We test the normalized latency (which is the end-to-end request latency divided by number of output tokens) with the three real-world traces and set different request rate (incoming requests per second). NanoFlow is able to sustain a higher request rate with low latency compared to baselines among all the datasets.

Online latency benchmarks

Feasibility: offline throughput on different models

We ported NanoFlow to 5 representative models to showcase its flexibility. We evaluate the offline throughput of NanoFlow (tokens per second per GPU) on these LLMs with constant length of input 1024 and output 512.

Offline throughput of NanoFlow on different models

Codebase

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems’ performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance.

We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: First, NanoFlow proposes nano-batching to split requests at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping them; then, to get benefit from overlapping, NanoFlow uses a device-level pipeline with execution unit scheduling, which partitions the device’s functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which enables easily porting NanoFlow to work with different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8×7B, LLaMA-3-8B, etc. We show that NanoFlow achieves 68.5% of optimal throughput. With practical workloads, NanoFlow provides 1.91× throughput boost compared to state-of-the-art serving systems achieving 59% to 72% of optimal throughput across ported models.

Installation

Docker setup

mkdir -p ~/framework-test
docker run --gpus all --net=host --privileged -v /dev/shm:/dev/shm --name nanoflow -v ~/framework-test:/code -it nvcr.io/nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

apt update
apt install pybind11-dev
apt install liburing-dev
apt install libopenmpi-dev
sysctl -w kernel.io_uring_disabled=0
sysctl -w vm.nr_hugepages=65536

Gurobi License Setup (for Docker)

Follow these steps to obtain a Gurobi license and configure it so your Docker container can use it.

1. Request a Gurobi License

Go to the Gurobi website and create an account (https://www.gurobi.com/).
After logging in, navigate to My Gurobi → Get License.
Choose the "WLS Academic" license type and fill out any required fields.
Gurobi will email you a license file named gurobi.lic (or provide you with a license key string).

2. Place the License on Your Host Machine

mkdir -p ~/gurobi/license
mv /path/to/downloaded/gurobi.lic ~/gurobi/license/
ls ~/gurobi/license
echo "export GRB_LICENSE_FILE=$(pwd)/gurobi.lic" >> ~/.bashrc

Install Dependencies

git clone git@github.com:efeslab/Nanoflow.git
cd Nanoflow
chmod +x ./installAnaconda.sh
./installAnaconda.sh
# restart the terminal
source ~/.bashrc

# login to huggingface
mkdir -p hf
echo "export HF_HOME=$(pwd)/hf" >> ~/.bashrc
source ~/.bashrc

huggingface-cli login

# set up the environment
cd Nanoflow-python
yes | bash setup.sh

Build

cd pybind
mkdir -p build
cmake ..
make -j 256

End-to-end Test

cd entry
python run_llama3.py -load_hf_weight=True

Citation

If you use NanoFlow for your research, please cite our paper:

@misc{zhu2024nanoflowoptimallargelanguage,
      title={NanoFlow: Towards Optimal Large Language Model Serving Throughput}, 
      author={Kan Zhu and Yilong Zhao and Liangyu Zhao and Gefei Zuo and Yile Gu and Dedong Xie and Yufei Gao and Qinyu Xu and Tian Tang and Zihao Ye and Keisuke Kamahori and Chien-Yu Lin and Stephanie Wang and Arvind Krishnamurthy and Baris Kasikci},
      year={2024},
      eprint={2408.12757},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2408.12757}, 
}

Acknowledgement

NanoFlow is inspired by and reuses code from the following projects: CUTLASS, FlashInfer, MSCCL++, and Punica. Development of NanoFlow is made easier thanks to these tools: GoogleTest, NVBench, and spdlog. We thank Siqin Chen for her help in the design of NanoFlow logo.

Name		Name	Last commit message	Last commit date
Latest commit History 147 Commits
3rdparty		3rdparty
auto_search		auto_search
config_all		config_all
core		core
entry		entry
figures		figures
kvcache		kvcache
matplot		matplot
models		models
operations		operations
playground		playground
pybind		pybind
pybind_amd		pybind_amd
tests		tests
utils		utils
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
installAnaconda.sh		installAnaconda.sh
platform_config.py		platform_config.py
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Nanoflow

News

Introduction

Benchmarks

Offline throughput: Llama2-70B on 8xA100 (80GB)

Online latency: Llama2-70B on 8xA100 (80GB)

Feasibility: offline throughput on different models

Codebase

Abstract

Installation

Docker setup

Gurobi License Setup (for Docker)

1. Request a Gurobi License

2. Place the License on Your Host Machine

Install Dependencies

Build

End-to-end Test

Citation

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

efeslab/Nanoflow

Folders and files

Latest commit

History

Repository files navigation

Nanoflow

News

Introduction

Benchmarks

Offline throughput: Llama2-70B on 8xA100 (80GB)

Online latency: Llama2-70B on 8xA100 (80GB)

Feasibility: offline throughput on different models

Codebase

Abstract

Installation

Docker setup

Gurobi License Setup (for Docker)

1. Request a Gurobi License

2. Place the License on Your Host Machine

Install Dependencies

Build

End-to-end Test

Citation

Acknowledgement

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages