Closed as not planned
Labels: performance (Performance-related issues), stale (Over 90 days of inactivity)
Description
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
Setup Summary for vLLM Benchmarking with Llama-2 Model:
- Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)
- Model: meta-llama/Llama-2-7b-hf
- GPU Count: 1
- Experiments:
  - Experiment 1: Requests using the base model meta-llama/Llama-2-7b-hf.
  - Experiment 2: vLLM deployed with the LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
  - Experiment 3: vLLM deployed with the LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).

For all three experiments, we used the same input prompts (ShareGPT) and observed similar output lengths.
Settings:
- Eager Mode: Not enabled.
- Max GPU Utilization: Default at 90%.
Benchmark Metrics:
We measured:
- Latency per output token
- Throughput (output tokens per second)
You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.
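For reference, the sketch below shows how these two metrics can be computed against the OpenAI-compatible /v1/completions endpoint. It is only illustrative: it assumes the server from the deployment configuration later in this issue is reachable on localhost:8000, uses a placeholder prompt instead of the ShareGPT set, and sends requests serially for brevity (the actual benchmark drives concurrent requests).

# Sketch: measure latency per output token and output-token throughput.
# URL, MODEL, and the prompts are placeholders for our actual setup.
import time
import requests

URL = "http://localhost:8000/v1/completions"   # assumed server address
MODEL = "meta-llama/Llama-2-7b-hf"             # or a registered LoRA name, e.g. "tweet-summary"
prompts = ["Write as if you were a critic: San Francisco"] * 8  # stand-in for ShareGPT prompts

total_output_tokens = 0
per_token_latency = []

bench_start = time.time()
for prompt in prompts:
    t0 = time.time()
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0,
    }).json()
    elapsed = time.time() - t0
    n_out = resp["usage"]["completion_tokens"]   # output tokens reported by the server
    total_output_tokens += n_out
    per_token_latency.append(elapsed / max(n_out, 1))
wall = time.time() - bench_start

print(f"throughput: {total_output_tokens / wall:.2f} output tokens/s")
print(f"mean latency per output token: {1000 * sum(per_token_latency) / len(per_token_latency):.2f} ms")

Switching the model field to one of the registered LoRA names reproduces the adapter experiments with the same measurement logic.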
Observations and Questions:
- Using LoRA adapters led to a notable degradation in throughput and latency compared to the base model. Specifically, we observed up to a 50% drop in maximum throughput with LoRA compared to the base model.
- Is this performance degradation expected with LoRA adapters?
- Are there parameters or tuning options that could improve LoRA performance?
Deployment Command:
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- "--model"
- "meta-llama/Llama-2-7b-hf"
- "--tensor-parallel-size"
- "1"
- "--port"
- "8000"
- "--disable-log-requests"
- "--enable-lora"
- "--max-loras"
- "3"
- "--max-cpu-loras"
- "15"
- "--max-lora-rank"
- "64"
- "--gpu-memory-utilization"
- "0.9"
- "--lora-modules"
- "xtuner/Llama-2-7b-qlora-moss-003-sft"
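For context, the same LoRA setup can also be exercised through vLLM's offline Python API. This is only a minimal sketch (we deployed via the OpenAI server command above, not this way); the adapter path is a hypothetical local download location, and the engine arguments mirror the server flags.

# Sketch: base model vs. LoRA adapter with vLLM's offline API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical local path; in our deployment the init container pulls adapters under /adapters.
adapter_path = "/adapters/xtuner/Llama-2-7b-qlora-moss-003-sft"

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=3,
    max_lora_rank=64,
    gpu_memory_utilization=0.9,
)

sampling = SamplingParams(temperature=0, max_tokens=100)

# Passing a LoRARequest routes generation through the adapter; omitting it
# uses the base model, which is how the three experiments differ.
outputs = llm.generate(
    ["Write as if you were a critic: San Francisco"],
    sampling,
    lora_request=LoRARequest("moss-sft", 1, adapter_path),
)
print(outputs[0].outputs[0].text)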
Your current environment (if you think it is necessary)

Sample Query:
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'

Deployment YAML Configuration:
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "vllm/vllm-openai:latest"
          imagePullPolicy: Always
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-hf"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - "--disable-log-requests"
            - "--enable-lora"
            - "--max-loras"
            - "3"
            - "--max-cpu-loras"
            - "15"
            - "--max-lora-rank"
            - "64"
            - "--gpu-memory-utilization"
            - "0.9"
            - "--lora-modules"
            - "tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
          env:
            - name: PORT
              value: "8000"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 240
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 600
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - name: adapters
              mountPath: "/adapters"
      initContainers:
        - name: adapter-loader
          image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo
          command: ["python"]
          args:
            - ./pull_adapters.py
            - --adapter
            - xtuner/Llama-2-7b-qlora-moss-003-sft
            - --adapter
            - yard1/llama-2-7b-sql-lora-test
            - --adapter
            - vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
            - --duplicate-count
            - "5"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: HF_HOME
              value: /adapters
          volumeMounts:
            - name: adapters
              mountPath: "/adapters"
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: adapters
          emptyDir: {}

This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
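As a sanity check, the LoRA modules the server actually registered can be listed through the OpenAI-compatible /v1/models endpoint. A small sketch, assuming the Service's external IP and port are exported as IP and PORT as in the sample query above:

# Sketch: list served models; with --enable-lora and --lora-modules, the adapter
# names (e.g. "tweet-summary-0") should appear alongside the base model.
import os
import requests

base_url = f"http://{os.environ.get('IP', 'localhost')}:{os.environ.get('PORT', '8000')}"
resp = requests.get(f"{base_url}/v1/models").json()
for model in resp["data"]:
    print(model["id"])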
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.