Is your feature request related to a problem? Please describe.
I want to load-balance between different OpenAI deployments based on their real-time percentage usage.
Describe the solution you'd like
As part of the Deployment class (or one of its underlying properties), it would be helpful to expose not only the configured rate limits but also the real-time usage, especially for TPM, not just RPM.
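To make the request concrete, here is a minimal sketch of how such a property could be consumed for load balancing. The `DeploymentUsage` class, its `tokens_used` field, and `pick_deployment` are hypothetical; the SDK does not expose live usage today, which is exactly the gap this request describes:

```python
# Hypothetical sketch: route requests to the least-utilized deployment,
# assuming the SDK exposed a live tokens-used figure per deployment.
from dataclasses import dataclass


@dataclass
class DeploymentUsage:
    name: str
    tpm_limit: int    # configured tokens-per-minute limit (available today)
    tokens_used: int  # tokens consumed in the current window (the missing piece)

    @property
    def utilization(self) -> float:
        """Fraction of the TPM limit consumed in the current window."""
        return self.tokens_used / self.tpm_limit


def pick_deployment(deployments: list[DeploymentUsage]) -> DeploymentUsage:
    """Route the next request to the deployment with the lowest utilization."""
    return min(deployments, key=lambda d: d.utilization)


pool = [
    DeploymentUsage("gpt-4-turbo-east", tpm_limit=10_000, tokens_used=8_000),
    DeploymentUsage("gpt-4-turbo-west", tpm_limit=10_000, tokens_used=2_500),
]
print(pick_deployment(pool).name)  # -> gpt-4-turbo-west
```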
Describe alternatives you've considered
I have looked into Azure Metrics, but I found two problems with this approach:
- the percentage-usage metrics are only available for Provisioned Managed Throughput, whereas I am looking for Standard TPM usage
- the metrics are not available in real time
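For reference, the Azure Monitor alternative would look roughly like the commented sketch below (the metric name and aggregation are assumptions, and as noted above the data lags real time and does not cover Standard TPM utilization). The small helper normalizes an aggregated token total to a fraction of the TPM limit:

```python
# Sketch of the Azure Monitor alternative. The metric name "TokenTransaction"
# and the aggregation are assumptions, not a confirmed API contract.
from datetime import timedelta


def tpm_utilization(tokens_in_window: float, window_minutes: float,
                    tpm_limit: float) -> float:
    """Approximate fraction of the TPM limit consumed, from aggregated metrics."""
    return (tokens_in_window / window_minutes) / tpm_limit


# The query itself would require the azure-monitor-query package, roughly:
#
# from azure.identity import DefaultAzureCredential
# from azure.monitor.query import MetricsQueryClient
#
# client = MetricsQueryClient(DefaultAzureCredential())
# result = client.query_resource(
#     resource_uri,                       # the Cognitive Services account
#     metric_names=["TokenTransaction"],  # assumed metric name
#     timespan=timedelta(minutes=5),
#     aggregations=["Total"],
# )

print(tpm_utilization(30_000, 5, 10_000))  # 6000 tokens/min vs a 10k TPM limit
```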
Additional context
Sample code:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(), subscription_id=sub_id
)
deployments = client.deployments.list(
    resource_group_name="some_rg", account_name="some_acct"
)
print("Deployments:")
for d in deployments:
    ...
```
Sample output:
Deployments:
================
gpt-4-turbo
================
{
"additional_properties": {},
"id": "...",
"name": "gpt-4-turbo",
"type": "Microsoft.CognitiveServices/accounts/deployments",
"sku": "{'additional_properties': {}, 'name': 'Standard', 'tier': None, 'size': None, 'family': None, 'capacity': 10}",
"system_data": "{'additional_properties': {}, 'created_by': '...', 'created_by_type': 'User', 'created_at': datetime.datetime(2024, 4, 10, 15, 19, 31, 6416, tzinfo=<isodate.tzinfo.Utc object at 0x10673f050>), 'last_modified_by': 'radu@gocascade.ai', 'last_modified_by_type': 'User', 'last_modified_at': datetime.datetime(2024, 4, 10, 15, 19, 31, 6416, tzinfo=<isodate.tzinfo.Utc object at 0x10673f050>)}",
"etag": "...",
"properties": "{'additional_properties': {}, 'provisioning_state': 'Succeeded', 'model': <azure.mgmt.cognitiveservices.models._models_py3.DeploymentModel object at 0x106bf7710>, 'scale_settings': None, 'capabilities': {'chatCompletion': 'true'}, 'rai_policy_name': 'Microsoft.Default', 'call_rate_limit': None, 'rate_limits': [<azure.mgmt.cognitiveservices.models._models_py3.ThrottlingRule object at 0x106bf79d0>, <azure.mgmt.cognitiveservices.models._models_py3.ThrottlingRule object at 0x106bf7a90>], 'version_upgrade_option': 'OnceCurrentVersionExpired'}"
}
================
Properties:
{
"additional_properties": {},
"provisioning_state": "Succeeded",
"model": "{'additional_properties': {}, 'format': 'OpenAI', 'name': 'gpt-4', 'version': '1106-Preview', 'source': None, 'call_rate_limit': None}",
"scale_settings": null,
"capabilities": {
"chatCompletion": "true"
},
"rai_policy_name": "Microsoft.Default",
"call_rate_limit": null,
"rate_limits": [
"{'additional_properties': {}, 'key': 'request', 'renewal_period': 10.0, 'count': 10.0, 'min_count': None, 'dynamic_throttling_enabled': None, 'match_patterns': None}",
"{'additional_properties': {}, 'key': 'token', 'renewal_period': 60.0, 'count': 10000.0, 'min_count': None, 'dynamic_throttling_enabled': None, 'match_patterns': None}"
],
"version_upgrade_option": "OnceCurrentVersionExpired"
}
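The configured limits above can already be recovered from `rate_limits`; what is missing is the live usage against them. For illustration, a small helper that normalizes the `ThrottlingRule` values shown in the dump (`key`, `count`, `renewal_period`) to per-minute figures, using plain dicts in place of the SDK model objects:

```python
# Normalize throttling rules to per-minute limits (RPM for "request",
# TPM for "token"). Dicts stand in for ThrottlingRule objects here.
def per_minute_limits(rate_limits: list[dict]) -> dict:
    """Scale each rule's count to a 60-second renewal period."""
    limits = {}
    for rule in rate_limits:
        per_minute = rule["count"] * (60.0 / rule["renewal_period"])
        limits[rule["key"]] = per_minute
    return limits


# The two rules from the dump above: 10 requests / 10 s, 10000 tokens / 60 s.
rules = [
    {"key": "request", "renewal_period": 10.0, "count": 10.0},
    {"key": "token", "renewal_period": 60.0, "count": 10000.0},
]
print(per_minute_limits(rules))  # -> {'request': 60.0, 'token': 10000.0}
```

That gives the 60 RPM / 10,000 TPM ceiling for this deployment, but nothing in the response reports how much of it is currently consumed.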
If there is an existing solution to this, please let me know.
Thank you!