49 changes: 35 additions & 14 deletions site-src/guides/metrics.md
@@ -4,23 +4,32 @@ This guide describes the current state of exposed metrics and how to scrape them.

## Requirements

To have response metrics, ensure the body mode is set to `Buffered` or `Streamed` (this should be the default behavior for all implementations).
=== "EPP"

If you want to include usage metrics for vLLM model server streaming requests, send the request with `include_usage` enabled:
To have response metrics, ensure the body mode is set to `Buffered` or `Streamed` (this should be the default behavior for all implementations).

If you want to include usage metrics for vLLM model server streaming requests, send the request with `include_usage` enabled:

```
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review",
"prompt": "whats your fav movie?",
"max_tokens": 10,
"temperature": 0,
"stream": true,
"stream_options": {"include_usage": true}
}'
```

=== "Dynamic LoRA Adapter Sidecar"

To have response metrics, ensure the vLLM model server is configured with the dynamic LoRA adapter as a sidecar container and a ConfigMap to configure which models to load/unload. See [this doc](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dynamic-lora-sidecar#example-configuration) for an example.

```
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review",
"prompt": "whats your fav movie?",
"max_tokens": 10,
"temperature": 0,
"stream": true,
"stream_options": {"include_usage": true}
}'
```

## Exposed metrics

### EPP

| **Metric name** | **Metric Type** | <div style="width:200px">**Description**</div> | <div style="width:250px">**Labels**</div> | **Status** |
|:---------------------------------------------|:-----------------|:------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:------------|
| inference_model_request_total | Counter | The counter of requests broken out for each model. | `model_name`=&lt;model-name&gt; <br> `target_model_name`=&lt;target-model-name&gt; | ALPHA |
@@ -38,10 +47,20 @@ curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
| inference_pool_ready_pods | Gauge | The number of ready pods for an inference server pool. | `name`=&lt;inference-pool-name&gt; | ALPHA |
| inference_extension_info | Gauge | The general information of the current build. | `commit`=&lt;hash-of-the-build&gt; <br> `build_ref`=&lt;ref-to-the-build&gt; | ALPHA |

### Dynamic LoRA Adapter Sidecar

| **Metric name** | **Metric Type** | <div style="width:200px">**Description**</div> | <div style="width:250px">**Labels**</div> | **Status** |
|:---------------------------|:-----------------|:-------------------------------------------------|:------------------------------------------|:------------|
| lora_syncer_adapter_status | Gauge | Status of LoRA adapters (1=loaded, 0=not_loaded) | `adapter_name`=&lt;adapter-id&gt; | ALPHA |
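The gauge in the table above can be exercised with a small `prometheus_client` sketch. The metric and label names follow the table; the adapter names and the private registry are illustrative, not part of the sidecar itself:

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# A private registry keeps this sketch from polluting the global default registry
registry = CollectorRegistry()
adapter_status = Gauge(
    'lora_syncer_adapter_status',
    'Status of LoRA adapters (1=loaded, 0=not_loaded)',
    ['adapter_name'],
    registry=registry,
)

# Mark one adapter as loaded and another as not loaded
adapter_status.labels(adapter_name='food-review-lora').set(1)
adapter_status.labels(adapter_name='stale-lora').set(0)

# This text is what a Prometheus scrape of the sidecar's /metrics would contain
print(generate_latest(registry).decode())
```

Each labeled time series renders as one exposition line, e.g. `lora_syncer_adapter_status{adapter_name="food-review-lora"} 1.0`.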

## Scrape Metrics

Metrics endpoint is exposed at port 9090 by default. To scrape metrics, the client needs a ClusterRole with the following rule:
The metrics endpoints are exposed on different ports by default:

- EPP exposes the metrics endpoint at port 9090
- Dynamic LoRA adapter sidecar exposes the metrics endpoint at port 8080

To scrape metrics, the client needs a ClusterRole with the following rule:
`nonResourceURLs: "/metrics", verbs: get`.
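A minimal ClusterRole granting that rule might look like the following sketch (the object name is illustrative; only the `rules` block is prescribed by the text above):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```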

Here is one example if the client needs to mount the secret to act as the service account:
@@ -86,7 +105,9 @@ metadata:
kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
type: kubernetes.io/service-account-token
```
Then, you can curl the 9090 port like following

Then, you can curl the appropriate port as follows. For EPP (port 9090):

```
TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret -o jsonpath='{.data.token}' | base64 --decode)
3 changes: 3 additions & 0 deletions tools/dynamic-lora-sidecar/README.md
@@ -121,6 +121,9 @@ spec:
- name: reconciler
image: your-image:tag
command: ["python", "sidecar.py", "--health-check-timeout", "600", "--health-check-interval", "5", "--reconcile-trigger", "10"] #optional if overriding default values
ports:
- containerPort: 8080
name: metrics
volumeMounts:
- name: config-volume
mountPath: /config
5 changes: 4 additions & 1 deletion tools/dynamic-lora-sidecar/deployment.yaml
@@ -69,7 +69,10 @@ spec:
image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
restartPolicy: Always
imagePullPolicy: Always
env:
ports:
- containerPort: 8080
name: metrics
env:
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts: # DO NOT USE subPath
1 change: 1 addition & 0 deletions tools/dynamic-lora-sidecar/requirements.txt
@@ -1,5 +1,6 @@
aiohttp==3.12.12
jsonschema==4.24.0
prometheus_client==0.22.1
PyYAML==6.0.2
requests==2.32.4
watchfiles==1.0.5
40 changes: 33 additions & 7 deletions tools/dynamic-lora-sidecar/sidecar/sidecar.py
@@ -24,9 +24,17 @@
import datetime
import os
import sys
from prometheus_client import Gauge, start_http_server
from watchdog.observers.polling import PollingObserver as Observer
from watchdog.events import FileSystemEventHandler

# Initialize Prometheus metrics
ADAPTER_STATUS_METRICS = Gauge(
'lora_syncer_adapter_status',
'Status of LoRA adapters (1=loaded, 0=not_loaded)',
['adapter_name']
)

CONFIG_MAP_FILE = os.environ.get(
"DYNAMIC_LORA_ROLLOUT_CONFIG", "/config/configmap.yaml"
)
@@ -58,6 +66,8 @@ def parse_arguments():
help=f'Path to config map file (default: {CONFIG_MAP_FILE})')
parser.add_argument('--config-validation', action='store_true', default=True,
help='Enable config validation (default: True)')
parser.add_argument('--metrics-port', type=int, default=8080,
help='Port to listen for Prometheus metrics (default: 8080)')
return parser.parse_args()


@@ -226,7 +236,7 @@ def check_health() -> bool:
time.sleep(self.health_check_interval.seconds)
return False

def load_adapter(self, adapter: LoraAdapter):
def load_adapter(self, adapter: LoraAdapter) -> None | str:
"""Sends a request to load the specified model."""
if adapter in self.registered_adapters:
logging.info(
@@ -243,10 +253,12 @@ def load_adapter(self, adapter: LoraAdapter):
response = requests.post(url, json=payload)
response.raise_for_status()
logging.info(f"loaded model {adapter.id}")
return None
except requests.exceptions.RequestException as e:
logging.error(f"error loading model {adapter.id}: {e}")
return f"error loading model {adapter.id}: {e}"

def unload_adapter(self, adapter: LoraAdapter):
def unload_adapter(self, adapter: LoraAdapter) -> None | str:
"""Sends a request to unload the specified model."""
if adapter not in self.registered_adapters:
logging.info(
@@ -284,28 +296,42 @@ def reconcile(self):
adapters_to_load_id = ", ".join(str(a.id) for a in adapters_to_load)
logging.info(f"adapter to load {adapters_to_load_id}")
for adapter in adapters_to_load:
self.load_adapter(adapter)
err = self.load_adapter(adapter)
if err is None:
self.update_adapter_status_metrics(adapter.id, is_loaded=True)
adapters_to_unload = self.ensure_not_exist_adapters - self.ensure_exist_adapters
adapters_to_unload_id = ", ".join(str(a.id) for a in adapters_to_unload)
logging.info(f"adapters to unload {adapters_to_unload_id}")
for adapter in adapters_to_unload:
self.unload_adapter(adapter)
err = self.unload_adapter(adapter)
if err is None:
self.update_adapter_status_metrics(adapter.id, is_loaded=False)

def update_adapter_status_metrics(self, adapter_id: str, is_loaded: bool):
"""Update adapter status metrics"""
status = 1 if is_loaded else 0
ADAPTER_STATUS_METRICS.labels(adapter_name=adapter_id).set(status)



async def main():
args = parse_arguments()

# Update CONFIG_MAP_FILE with argument value
config_file = args.config

reconciler_instance = LoraReconciler(
config_file=config_file,
health_check_timeout=args.health_check_timeout,
health_check_interval=args.health_check_interval,
reconcile_trigger_seconds=args.reconcile_trigger,
config_validation=args.config_validation
)


# Start metrics server
logging.info(f"Starting metrics server on port {args.metrics_port}")
start_http_server(args.metrics_port)

logging.info(f"Running initial reconcile for config map {config_file}")
reconciler_instance.reconcile()

47 changes: 45 additions & 2 deletions tools/dynamic-lora-sidecar/sidecar/test_sidecar.py
@@ -17,7 +17,7 @@
import yaml
import os
import datetime
from sidecar import LoraReconciler, LoraAdapter, CONFIG_MAP_FILE, BASE_FIELD
from sidecar import LoraReconciler, LoraAdapter, CONFIG_MAP_FILE, BASE_FIELD, ADAPTER_STATUS_METRICS

# Update TEST_CONFIG_DATA to include the new configuration parameters
TEST_CONFIG_DATA = {
@@ -227,12 +227,55 @@ def test_health_check_settings(self):
reconcile_trigger_seconds=45,
config_validation=False
)

# Check that values are properly set
self.assertEqual(reconciler.health_check_timeout, datetime.timedelta(seconds=240))
self.assertEqual(reconciler.health_check_interval, datetime.timedelta(seconds=15))
self.assertEqual(reconciler.reconcile_trigger_seconds, 45)

def test_update_adapter_status_metrics(self):
"""Test that update_adapter_status_metrics method works correctly"""
# Clear any existing metrics
ADAPTER_STATUS_METRICS.clear()

# Create reconciler
reconciler = LoraReconciler(
config_file=CONFIG_MAP_FILE,
health_check_timeout=180,
health_check_interval=10,
reconcile_trigger_seconds=30,
config_validation=False
)

# Test setting loaded status
reconciler.update_adapter_status_metrics("test-adapter-1", is_loaded=True)
reconciler.update_adapter_status_metrics("test-adapter-2", is_loaded=False)

# Get all metric samples
metric_samples = list(ADAPTER_STATUS_METRICS.collect())[0].samples

# Check that metrics were set correctly
adapter_metrics = {}
for sample in metric_samples:
adapter_name = sample.labels['adapter_name']
adapter_metrics[adapter_name] = sample.value

self.assertEqual(adapter_metrics.get('test-adapter-1'), 1.0, "test-adapter-1 should be marked as loaded")
self.assertEqual(adapter_metrics.get('test-adapter-2'), 0.0, "test-adapter-2 should be marked as not loaded")

def test_metrics_endpoint(self):
"""Test that Prometheus metrics can be collected"""
from prometheus_client import generate_latest

# Clear metrics and set a test value
ADAPTER_STATUS_METRICS.clear()
ADAPTER_STATUS_METRICS.labels(adapter_name='test-adapter').set(1)

# Test that generate_latest produces valid output
metrics_bytes = generate_latest()
metrics = metrics_bytes.decode('utf-8')
self.assertIn('lora_syncer_adapter_status{adapter_name="test-adapter"} 1.0', metrics)


if __name__ == "__main__":
unittest.main()