A reverse proxy for distributing requests across OpenAI model deployments (e.g. GPT-4) hosted in Azure OpenAI Service (AOAI).
> [!IMPORTANT]
> This is a highly experimental solution, and it's not an official Microsoft product.
Azure OpenAI model deployment throttling is based on two configurable rate limits:
- Tokens-per-minute (TPM): estimated number of tokens that can be processed over a one-minute period
- Requests-per-minute (RPM): estimated number of requests over a one-minute period
A model deployment is considered overloaded when at least one of these rate limits is reached. Azure OpenAI then returns an HTTP 429 ("Too Many Requests") response to the client with a "Retry-After" HTTP header indicating how many seconds the model deployment will be unavailable before it starts accepting requests again.
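For illustration, a throttled response looks roughly like this (the Retry-After value here is an arbitrary example):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 30
```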
What if there is increasing demand for requests and/or tokens that can't be met within the model deployment's rate limits? Currently, the alternatives are:
1. Increase the model deployment capacity by requesting Provisioned Throughput Units (PTUs).
2. Build a load balancing component that distributes requests across model deployments, hosted on a single or multiple Azure OpenAI resources, optimizing resource utilization and maximizing throughput.
3. Adopt a failover strategy that forwards requests from an overloaded model deployment to another one.
These approaches can be combined to achieve enhanced scalability, performance and availability.
This repository showcases a proof-of-concept solution for alternative #2: a reverse proxy built in ASP.NET Core with YARP.
sequenceDiagram
Client->>Load Balancer: Proxy HTTP request<br/> /<azure-openai-route><br/><azure-openai-credentials>
box Gray Reverse Proxy
participant Load Balancer
participant HTTP Forwarder
participant Passive Health Check
participant Destination Health Updater
participant Transformer
participant Custom Metrics Publisher
end
Load Balancer->>HTTP Forwarder: Selected<br/> deployment destination
par
HTTP Forwarder->>Passive Health Check: HTTP response
Note over Passive Health Check: Evaluate response and mark destination<br/> health state as healthy or unhealthy
Passive Health Check ->> Destination Health Updater: Update destination state
and
HTTP Forwarder->>Transformer: HTTP response
Note over Transformer: Append x-absolute-uri response header<br /> with the destination address
Transformer->>Client: HTTP response
and
HTTP Forwarder->>Custom Metrics Publisher: HTTP response
Note over Custom Metrics Publisher: Remaining requests + tokens
end
- Support for YARP's built-in load balancing algorithms.
- Custom passive health check middleware that intercepts the HTTP responses from the model deployments selected by the load balancer and assigns them health states. For more info, see the Passive Health Check section.
- Custom OpenTelemetry metrics, with built-in support for Prometheus and Azure Monitor exporters, to help you understand how the proxy is distributing requests. For more info, see the Metrics section.
The following diagram gives an overview of the state management logic implemented in the AzureOpenAIPassiveHealthCheckPolicy middleware.
stateDiagram-v2
state if_state <<choice>>
[*] --> AzureOpenAIPassiveHealthCheckPolicy
AzureOpenAIPassiveHealthCheckPolicy --> if_state
if_state --> Unhealthy: if HTTP status code<br/> >= 400 and <= 599
if_state --> Unhealthy: if tokens or requests<br/> threshold is reached (optional)
Unhealthy --> Unknown
note right of Unknown
On hold for X seconds<br/> from Retry-After header value
end note
Unknown --> Healthy
if_state --> Healthy : else
The proxy provides custom metrics compliant with OpenTelemetry, making it easy to integrate with many monitoring solutions (e.g. Azure Monitor, Prometheus).
These are the custom metrics the proxy emits:
| Metric name | Type | Description | Attributes (dimensions) |
|---|---|---|---|
| reverseproxy_azure_openai_remaining_requests | Gauge | Remaining HTTP requests. | |
| reverseproxy_azure_openai_remaining_tokens | Gauge | Remaining Azure OpenAI tokens. | |
These metrics can help you understand how requests are being distributed among model deployments, and run experiments to find the configuration that best fits your needs (e.g. switching load balancing algorithms, adjusting thresholds, customizing health check policies).
The reverse proxy can be used as:
- A gateway serving as an entrypoint for one or more LLM apps.
- A sidecar app running alongside an LLM app (e.g. in a Kubernetes environment such as Azure Kubernetes Service or Azure Container Apps).
- Resiliency: currently, when a model deployment request fails (i.e. the HTTP response has an error status code), the proxy returns the failed response to the client as is.
- Deployment priority: currently there's no concept of priority groups or weights for model deployments (e.g. prioritizing PTU-based deployments).
The repository provides containerized services out of the box to simplify local development. To run them, you will need:
- An Azure OpenAI Service resource with 2 or more model deployments. For more information about model deployment, see the resource deployment guide.
- Docker, or Podman with podman-compose.
Create an appsettings.Local.json file in the src/proxy directory to start the proxy configuration for your local environment. There are two options to configure the load balancer and passive health check:
- Using YARP's built-in ReverseProxy config section to manually set the route and cluster. Check out the YARP-based configuration sample below.
- Using a ModelDeploymentsDiscovery config section to dynamically discover the model deployments on the Azure OpenAI resource that match your filter pattern (e.g. discovering only GPT-3.5 deployments via a gpt-35* pattern) and create the route and cluster properties from them. Check out the model deployments discovery configuration sample below.
YARP-based configuration sample (the RemainingRequestsThreshold and RemainingTokensThreshold metadata values feed the optional threshold check in the passive health check policy):
{
"ReverseProxy": {
"Routes": {
"route1": {
"ClusterId": "cluster1",
"Match": {
"Path": "{**catch-all}"
}
}
},
"Clusters": {
"cluster1": {
"LoadBalancingPolicy": "RoundRobin",
"HealthCheck": {
"Passive": {
"Enabled": "true",
"Policy": "AzureOpenAIPassiveHealthCheckPolicy"
}
},
"Metadata": {
"RemainingRequestsThreshold": "100",
"RemainingTokensThreshold": "1000"
},
"Destinations": {
"deployment1": {
"Address": "https://my-account.openai.azure.com/openai/deployments/deployment-1"
},
"deployment2": {
"Address": "https://my-account.openai.azure.com/openai/deployments/deployment-2"
}
}
}
}
}
}
Model deployments discovery configuration sample:
{
"ModelDeploymentsDiscovery": {
"SubscriptionId": "<subscription id>",
"ResourceGroupName": "<resource group name",
"AccountId": "<azure openai account name>",
"FilterPattern": "gpt-35*",
"FrequencySeconds": 5,
"LoadBalancingPolicy": "RoundRobin",
"PassiveHealthCheck": {
"Policy": "AzureOpenAIPassiveHealthCheckPolicy",
"Metadata": {
"RemainingRequestsThreshold": "100",
"RemainingTokensThreshold": "1000"
}
}
}
}
The proxy is configured by default to export custom metrics to Prometheus via the /metrics HTTP route. If you want to export metrics to Azure Monitor, add the following ApplicationInsights section to the app settings:
{
...,
"ApplicationInsights": {
"ConnectionString": "<app-insights-connection-string"
}
}
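To quickly verify the Prometheus endpoint locally, you can scrape it directly. A minimal sketch, assuming the proxy is reachable on localhost:8080 (check compose.yml for the actual port mapping):

```bash
# List the proxy's custom Azure OpenAI gauges from the Prometheus scrape endpoint
curl -s http://localhost:8080/metrics | grep reverseproxy_azure_openai
```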
Create a .env file in the root directory and add the Azure OpenAI API key:
AZURE_OPENAI_API_KEY=<api-key>
The PROXY_ENDPOINT environment variable is set by default in the compose.yml file.
Spin up the services with Docker Compose:
docker-compose up
> [!IMPORTANT]
> For any code changes, make sure you rebuild the image before running by using the --build flag: docker-compose up --build
The repository provides the following ways of sending HTTP requests to the Azure OpenAI Chat Completions API through the proxy:

- Sequential requests via the bash script available in the scripts folder:

  ./scripts/client.sh

  or via PowerShell:

  .\scripts\client.ps1

- Concurrent requests via k6, a load testing tool:

  docker-compose run k6 run /scripts/client.js
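You can also send a single request manually. A minimal curl sketch, assuming the proxy listens on localhost:8080 and that the api-version shown is supported by your deployments (both values are assumptions, not repository defaults):

```bash
# Send one chat completion through the proxy; -i prints the response headers,
# including the x-absolute-uri header the proxy appends with the address of
# the deployment that served the request.
curl -si "http://localhost:8080/chat/completions?api-version=2024-02-01" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```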
To stop and remove the containers, networks, volumes, and images:
docker-compose down --volumes --rmi all