[Hardware][Intel CPU][DOC] Update docs for CPU backend (vllm-project#6212)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com>
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
4 people authored and Alvant committed Oct 26, 2024
1 parent 65c761d commit a3fe53d
Showing 3 changed files with 165 additions and 1 deletion.
23 changes: 22 additions & 1 deletion docs/source/getting_started/cpu-installation.rst
@@ -3,7 +3,13 @@
Installation with CPU
========================

vLLM initially supports basic model inference and serving on the x86 CPU platform, with data types FP32 and BF16. The vLLM CPU backend supports the following vLLM features:

- Tensor Parallel (``-tp = N``)
- Quantization (``INT8 W8A8, AWQ``)

.. note::
    The FP16 data type and more advanced features such as `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
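
For example, serving a quantized model on the CPU backend could look roughly like this (a sketch only; the AWQ checkpoint name and KV cache size below are illustrative assumptions, not part of this doc):

.. code-block:: console

    # sketch: substitute your own AWQ-quantized checkpoint and adjust the cache size
    $ VLLM_CPU_KVCACHE_SPACE=40 vllm serve TheBloke/Llama-2-7B-Chat-AWQ --quantization awq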

Table of contents:

@@ -141,5 +147,20 @@ Performance tips
- If using vLLM CPU backend on a multi-socket machine with NUMA, be sure to set CPU cores using ``VLLM_CPU_OMP_THREADS_BIND`` to avoid cross-NUMA-node memory access.

CPU Backend Considerations
--------------------------

- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.

- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to separate these two components for improved performance, for example by leaving a few cores unbound for serving, as in the sketch below.
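
  A minimal sketch, assuming a hypothetical 32-core host: cores 0-1 are left out of ``VLLM_CPU_OMP_THREADS_BIND`` so the HTTP serving and tokenization threads do not compete with the inference threads:

  .. code-block:: console

      # sketch: core range and KV cache size are illustrative, adjust to your machine
      $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="2-31" vllm serve meta-llama/Llama-2-7b-chat-hf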

- On a CPU-based setup with NUMA enabled, memory access performance may be largely impacted by the `topology <https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa>`_. For NUMA architecture, two optimizations are recommended: Tensor Parallel or Data Parallel.

* Using Tensor Parallel for a latency-constrained deployment: following the GPU backend design, Megatron-LM's parallel algorithm is used to shard the model based on the number of NUMA nodes (e.g. TP = 2 for a two-NUMA-node system). With the `TP feature on CPU <https://github.com/vllm-project/vllm/pull/6125>`_ merged, Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example script to enable Tensor Parallel = 2 for serving:

.. code-block:: console

    $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp

* Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer to dispatch requests to those endpoints. Common solutions such as `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. The Anyscale Ray project provides LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_ support. Here is an example of setting up scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -80,6 +80,7 @@ Documentation
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/env_vars
142 changes: 142 additions & 0 deletions docs/source/serving/deploying_with_nginx.rst
@@ -0,0 +1,142 @@
.. _nginxloadbalancer:

Deploying with Nginx Loadbalancer
=================================

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

Table of contents:

#. :ref:`Build Nginx Container <nginxloadbalancer_nginx_build>`
#. :ref:`Create Simple Nginx Config file <nginxloadbalancer_nginx_conf>`
#. :ref:`Build vLLM Container <nginxloadbalancer_nginx_vllm_container>`
#. :ref:`Create Docker Network <nginxloadbalancer_nginx_docker_network>`
#. :ref:`Launch vLLM Containers <nginxloadbalancer_nginx_launch_container>`
#. :ref:`Launch Nginx <nginxloadbalancer_nginx_launch_nginx>`
#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`

.. _nginxloadbalancer_nginx_build:

Build Nginx Container
---------------------

This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.

.. code-block:: console

    export vllm_root=`pwd`

Create a file named ``Dockerfile.nginx``:

.. code-block:: console

    FROM nginx:latest
    RUN rm /etc/nginx/conf.d/default.conf
    EXPOSE 80
    CMD ["nginx", "-g", "daemon off;"]

Build the container:

.. code-block:: console

    docker build . -f Dockerfile.nginx --tag nginx-lb

.. _nginxloadbalancer_nginx_conf:

Create Simple Nginx Config file
-------------------------------

Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the example below, we'll start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.

.. code-block:: console

    upstream backend {
        least_conn;
        server vllm0:8000 max_fails=3 fail_timeout=10000s;
        server vllm1:8000 max_fails=3 fail_timeout=10000s;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }

.. _nginxloadbalancer_nginx_vllm_container:

Build vLLM Container
--------------------

.. code-block:: console

    cd $vllm_root
    docker build -f Dockerfile . --tag vllm

If you are behind a proxy, you can pass the proxy settings to the ``docker build`` command as shown below:

.. code-block:: console

    cd $vllm_root
    docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy

.. _nginxloadbalancer_nginx_docker_network:

Create Docker Network
---------------------

.. code-block:: console

    docker network create vllm_nginx

.. _nginxloadbalancer_nginx_launch_container:

Launch vLLM Containers
----------------------

Notes:

* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below.
* If you don't have an existing HuggingFace cache, you will want to start ``vllm0`` and wait for the model to finish downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and won't have to download it again.
* The example below assumes a GPU backend is used. If you are using the CPU backend, remove ``--gpus all``, and add the ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the ``docker run`` command (see the sketch after the GPU example below).
* Adjust the model name if you don't want to use ``Llama-2-7b-chat-hf``.

.. code-block:: console

    mkdir -p ~/.cache/huggingface/hub/
    hf_cache_dir=~/.cache/huggingface/
    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
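
For reference, a CPU-backend variant of the ``vllm0`` command might look like this (a sketch; the KV cache size and core range are illustrative and should be adjusted to your machine):

.. code-block:: console

    # sketch: no --gpus all; KV cache size and core binding values are illustrative
    docker run -itd --ipc host --privileged --network vllm_nginx --shm-size=10.24gb -e VLLM_CPU_KVCACHE_SPACE=40 -e VLLM_CPU_OMP_THREADS_BIND="0-31" -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
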
.. note::
    If you are behind a proxy, you can pass the proxy settings to the ``docker run`` command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.

.. _nginxloadbalancer_nginx_launch_nginx:

Launch Nginx
------------

.. code-block:: console

    docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest

.. _nginxloadbalancer_nginx_verify_nginx:

Verify That vLLM Servers Are Ready
----------------------------------

.. code-block:: console

    docker logs vllm0 | grep Uvicorn
    docker logs vllm1 | grep Uvicorn

Both outputs should look like this:

.. code-block:: console

    INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
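
Optionally, a quick smoke test through the Nginx endpoint on port 8000 can confirm the full path (a sketch; the model name matches the containers launched above):

.. code-block:: console

    # sketch: query the OpenAI-compatible API through the load balancer
    curl http://localhost:8000/v1/models
    curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
        -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello, my name is", "max_tokens": 16}'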
