[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB #10062

@kaushikmitr

Description

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance


Setup Summary for vLLM Benchmarking with the Llama-2 Model:

  • Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)

  • Model: meta-llama/Llama-2-7b-hf

  • GPU Count: 1

  • Experiments:

    • Experiment 1: Requests using the base model meta-llama/Llama-2-7b-hf.
    • Experiment 2: vLLM deployed with LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
    • Experiment 3: vLLM deployed with LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).

    For all three experiments, we used the same set of input prompts (drawn from the ShareGPT dataset) and observed similar output lengths.
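
For reference, a run of this shape can be reproduced with the benchmark_serving.py client that ships in the vLLM repository's benchmarks/ directory (a sketch only; flag names vary across vLLM versions, and the dataset path, prompt count, and request rate below are placeholders). For the LoRA experiments, the same run targets the adapter's served name instead of the base model.

python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --host ${IP} --port ${PORT} \
  --model meta-llama/Llama-2-7b-hf \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate 4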

Settings:

  • Eager Mode: Not enabled (i.e., --enforce-eager was not passed, so CUDA graphs are used).
  • GPU Memory Utilization: Left at the default of 0.9 (--gpu-memory-utilization 0.9).

Benchmark Metrics:
We measured:

  • Latency per output token
  • Throughput (output tokens per second)
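
(Metric definitions assumed here: latency per output token = per-request end-to-end latency / that request's output token count; throughput = total output tokens generated / wall-clock duration of the run.)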

You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.


Observations and Questions:

  • Using LoRA adapters led to a notable degradation in throughput and latency relative to the base model, with maximum throughput dropping by up to 50%.
  • Is this performance degradation expected with LoRA adapters?
  • Are there parameters or tuning options that could improve LoRA performance?
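
For reference on the last question, the LoRA-specific knobs already exposed by this deployment are --max-loras, --max-cpu-loras, and --max-lora-rank. A sketch of sizing them down to the single adapter actually being served (the rank value below is a placeholder; it should match the "r" in the adapter's adapter_config.json):

python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --max-loras 1 \
  --max-lora-rank 16 \
  --lora-modules tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0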

Deployment Command:

command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
  - "--model"
  - "meta-llama/Llama-2-7b-hf"
  - "--tensor-parallel-size"
  - "1"
  - "--port"
  - "8000"
  - "--disable-log-requests"
  - "--enable-lora"
  - "--max-loras"
  - "3"
  - "--max-cpu-loras"
  - "15"
  - "--max-lora-rank"
  - "64"
  - "--gpu-memory-utilization"
  - "0.9"
  - "--lora-modules"
  - "xtuner/Llama-2-7b-qlora-moss-003-sft"
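
(Note: vLLM's --lora-modules flag takes entries in name=path form, as in the Kubernetes manifest further below; written that way, the last argument above would look like the following, where "moss-sft" is only a placeholder served name.)

  - "--lora-modules"
  - "moss-sft=xtuner/Llama-2-7b-qlora-moss-003-sft"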

Your current environment (if you think it is necessary)


Sample Query:

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "tweet-summary",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
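
For the base-model runs (Experiment 1), the same request shape presumably targets the backing model's name rather than the adapter's served name:

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "meta-llama/Llama-2-7b-hf",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'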

Deployment YAML Configuration:

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "vllm/vllm-openai:latest"
          imagePullPolicy: Always
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-hf"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - "--disable-log-requests"
            - "--enable-lora"
            - "--max-loras"
            - "3"
            - "--max-cpu-loras"
            - "15"
            - "--max-lora-rank"
            - "64"
            - "--gpu-memory-utilization"
            - "0.9"
            - "--lora-modules"
            - "tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
          env:
            - name: PORT
              value: "8000"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 240
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 600
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - name: adapters
              mountPath: "/adapters"
      initContainers:
        - name: adapter-loader
          image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo
          command: ["python"]
          args:
            - ./pull_adapters.py
            - --adapter
            - xtuner/Llama-2-7b-qlora-moss-003-sft
            - --adapter
            - yard1/llama-2-7b-sql-lora-test
            - --adapter
            - vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
            - --duplicate-count
            - "5"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: HF_HOME
              value: /adapters
          volumeMounts:
            - name: adapters
              mountPath: "/adapters"
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: adapters
          emptyDir: {}

This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
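
A request's "model" field must match one of the server's served names, i.e. the base model plus any adapters registered through --lora-modules (tweet-summary-0 in this manifest); the registered list can be checked against the OpenAI-compatible models endpoint:

curl ${IP}:${PORT}/v1/models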

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: performance (Performance-related issues), stale (Over 90 days of inactivity)