[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB #10062

@kaushikmitr

Description

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance


Setup Summary for vLLM Benchmarking with the Llama-2 Model:

  • Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)

  • Model: meta-llama/Llama-2-7b-hf

  • GPU Count: 1

  • Experiments:

    • Experiment 1: Requests using the base model meta-llama/Llama-2-7b-hf.
    • Experiment 2: vLLM deployed with LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
    • Experiment 3: vLLM deployed with LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).

    For all three experiments, we used the same set of input prompts (drawn from the ShareGPT dataset) and observed similar output lengths.
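
For reference, a run of this shape can be reproduced with the benchmark_serving.py client that ships in the vLLM repository's benchmarks/ directory (a sketch only; flag names vary across vLLM versions, and the dataset path, prompt count, and request rate below are placeholders). For the LoRA experiments, the same run targets the adapter's served name instead of the base model.

python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --host ${IP} --port ${PORT} \
  --model meta-llama/Llama-2-7b-hf \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate 4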

Settings:

  • Eager Mode: Not enabled (i.e., --enforce-eager was not passed, so CUDA graphs are used).
  • GPU Memory Utilization: Left at the default of 0.9 (--gpu-memory-utilization 0.9).

Benchmark Metrics:
We measured:

  • Latency per output token
  • Throughput (output tokens per second)
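
(Metric definitions assumed here: latency per output token = per-request end-to-end latency / that request's output token count; throughput = total output tokens generated / wall-clock duration of the run.)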

You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.


Observations and Questions:

  • Using LoRA adapters led to a notable degradation in throughput and latency relative to the base model, with maximum throughput dropping by up to 50%.
  • Is this performance degradation expected with LoRA adapters?
  • Are there parameters or tuning options that could improve LoRA performance?
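
For reference on the last question, the LoRA-specific knobs already exposed by this deployment are --max-loras, --max-cpu-loras, and --max-lora-rank. A sketch of sizing them down to the single adapter actually being served (the rank value below is a placeholder; it should match the "r" in the adapter's adapter_config.json):

python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --max-loras 1 \
  --max-lora-rank 16 \
  --lora-modules tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0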

Deployment Command:

command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
  - "--model"
  - "meta-llama/Llama-2-7b-hf"
  - "--tensor-parallel-size"
  - "1"
  - "--port"
  - "8000"
  - "--disable-log-requests"
  - "--enable-lora"
  - "--max-loras"
  - "3"
  - "--max-cpu-loras"
  - "15"
  - "--max-lora-rank"
  - "64"
  - "--gpu-memory-utilization"
  - "0.9"
  - "--lora-modules"
  - "xtuner/Llama-2-7b-qlora-moss-003-sft"
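
(Note: vLLM's --lora-modules flag takes entries in name=path form, as in the Kubernetes manifest further below; written that way, the last argument above would look like the following, where "moss-sft" is only a placeholder served name.)

  - "--lora-modules"
  - "moss-sft=xtuner/Llama-2-7b-qlora-moss-003-sft"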

Your current environment (if you think it is necessary)


Sample Query:

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "tweet-summary",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
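
For the base-model runs (Experiment 1), the same request shape presumably targets the backing model's name rather than the adapter's served name:

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "meta-llama/Llama-2-7b-hf",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'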

Deployment YAML Configuration:

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "vllm/vllm-openai:latest"
          imagePullPolicy: Always
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-hf"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - "--disable-log-requests"
            - "--enable-lora"
            - "--max-loras"
            - "3"
            - "--max-cpu-loras"
            - "15"
            - "--max-lora-rank"
            - "64"
            - "--gpu-memory-utilization"
            - "0.9"
            - "--lora-modules"
            - "tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
          env:
            - name: PORT
              value: "8000"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 240
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 600
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - name: adapters
              mountPath: "/adapters"
      initContainers:
        - name: adapter-loader
          image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo
          command: ["python"]
          args:
            - ./pull_adapters.py
            - --adapter
            - xtuner/Llama-2-7b-qlora-moss-003-sft
            - --adapter
            - yard1/llama-2-7b-sql-lora-test
            - --adapter
            - vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
            - --duplicate-count
            - "5"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: HF_HOME
              value: /adapters
          volumeMounts:
            - name: adapters
              mountPath: "/adapters"
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: adapters
          emptyDir: {}

This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
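
A request's "model" field must match one of the server's served names, i.e. the base model plus any adapters registered through --lora-modules (tweet-summary-0 in this manifest); the registered list can be checked against the OpenAI-compatible models endpoint:

curl ${IP}:${PORT}/v1/models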

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: performance (Performance-related issues), stale (Over 90 days of inactivity)