Is your feature request related to a problem? Please describe.
I want to load-balance between different OpenAI deployments based on their real-time percentage usage.
Describe the solution you'd like
As part of the Deployment class (or one of its underlying properties), it would be helpful to expose not only the configured rate limits but also the real-time usage, especially for TPM, not just RPM.
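To make the request concrete, here is a minimal sketch of how such a property could be consumed for load balancing. The `DeploymentUsage` class, its `tokens_used` field, and `pick_deployment` are hypothetical; the SDK does not expose live usage today, which is exactly the gap this request describes:

```python
# Hypothetical sketch: route requests to the least-utilized deployment,
# assuming the SDK exposed a live tokens-used figure per deployment.
from dataclasses import dataclass


@dataclass
class DeploymentUsage:
    name: str
    tpm_limit: int    # configured tokens-per-minute limit (available today)
    tokens_used: int  # tokens consumed in the current window (the missing piece)

    @property
    def utilization(self) -> float:
        """Fraction of the TPM limit consumed in the current window."""
        return self.tokens_used / self.tpm_limit


def pick_deployment(deployments: list[DeploymentUsage]) -> DeploymentUsage:
    """Route the next request to the deployment with the lowest utilization."""
    return min(deployments, key=lambda d: d.utilization)


pool = [
    DeploymentUsage("gpt-4-turbo-east", tpm_limit=10_000, tokens_used=8_000),
    DeploymentUsage("gpt-4-turbo-west", tpm_limit=10_000, tokens_used=2_500),
]
print(pick_deployment(pool).name)  # -> gpt-4-turbo-west
```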
Describe alternatives you've considered
I have looked into Azure Metrics, but I found two problems with this approach:
- the percentage-usage metrics are only available for Provisioned Managed Throughput, whereas I am looking for Standard TPM usage
- the metrics are not available in real time
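For reference, the Azure Monitor alternative would look roughly like the commented sketch below (the metric name and aggregation are assumptions, and as noted above the data lags real time and does not cover Standard TPM utilization). The small helper normalizes an aggregated token total to a fraction of the TPM limit:

```python
# Sketch of the Azure Monitor alternative. The metric name "TokenTransaction"
# and the aggregation are assumptions, not a confirmed API contract.
from datetime import timedelta


def tpm_utilization(tokens_in_window: float, window_minutes: float,
                    tpm_limit: float) -> float:
    """Approximate fraction of the TPM limit consumed, from aggregated metrics."""
    return (tokens_in_window / window_minutes) / tpm_limit


# The query itself would require the azure-monitor-query package, roughly:
#
# from azure.identity import DefaultAzureCredential
# from azure.monitor.query import MetricsQueryClient
#
# client = MetricsQueryClient(DefaultAzureCredential())
# result = client.query_resource(
#     resource_uri,                       # the Cognitive Services account
#     metric_names=["TokenTransaction"],  # assumed metric name
#     timespan=timedelta(minutes=5),
#     aggregations=["Total"],
# )

print(tpm_utilization(30_000, 5, 10_000))  # 6000 tokens/min vs a 10k TPM limit
```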
Additional context
Sample code:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(), subscription_id=sub_id
)
deployments = client.deployments.list(
    resource_group_name="some_rg", account_name="some_acct"
)
print("Deployments:")
for d in deployments:
    ...
```
Sample output:
Deployments:
================
gpt-4-turbo
================
{
"additional_properties": {},
"id": "...",
"name": "gpt-4-turbo",
"type": "Microsoft.CognitiveServices/accounts/deployments",
"sku": "{'additional_properties': {}, 'name': 'Standard', 'tier': None, 'size': None, 'family': None, 'capacity': 10}",
"system_data": "{'additional_properties': {}, 'created_by': '...', 'created_by_type': 'User', 'created_at': datetime.datetime(2024, 4, 10, 15, 19, 31, 6416, tzinfo=<isodate.tzinfo.Utc object at 0x10673f050>), 'last_modified_by': 'radu@gocascade.ai', 'last_modified_by_type': 'User', 'last_modified_at': datetime.datetime(2024, 4, 10, 15, 19, 31, 6416, tzinfo=<isodate.tzinfo.Utc object at 0x10673f050>)}",
"etag": "...",
"properties": "{'additional_properties': {}, 'provisioning_state': 'Succeeded', 'model': <azure.mgmt.cognitiveservices.models._models_py3.DeploymentModel object at 0x106bf7710>, 'scale_settings': None, 'capabilities': {'chatCompletion': 'true'}, 'rai_policy_name': 'Microsoft.Default', 'call_rate_limit': None, 'rate_limits': [<azure.mgmt.cognitiveservices.models._models_py3.ThrottlingRule object at 0x106bf79d0>, <azure.mgmt.cognitiveservices.models._models_py3.ThrottlingRule object at 0x106bf7a90>], 'version_upgrade_option': 'OnceCurrentVersionExpired'}"
}
================
Properties:
{
"additional_properties": {},
"provisioning_state": "Succeeded",
"model": "{'additional_properties': {}, 'format': 'OpenAI', 'name': 'gpt-4', 'version': '1106-Preview', 'source': None, 'call_rate_limit': None}",
"scale_settings": null,
"capabilities": {
"chatCompletion": "true"
},
"rai_policy_name": "Microsoft.Default",
"call_rate_limit": null,
"rate_limits": [
"{'additional_properties': {}, 'key': 'request', 'renewal_period': 10.0, 'count': 10.0, 'min_count': None, 'dynamic_throttling_enabled': None, 'match_patterns': None}",
"{'additional_properties': {}, 'key': 'token', 'renewal_period': 60.0, 'count': 10000.0, 'min_count': None, 'dynamic_throttling_enabled': None, 'match_patterns': None}"
],
"version_upgrade_option": "OnceCurrentVersionExpired"
}
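The configured limits above can already be recovered from `rate_limits`; what is missing is the live usage against them. For illustration, a small helper that normalizes the `ThrottlingRule` values shown in the dump (`key`, `count`, `renewal_period`) to per-minute figures, using plain dicts in place of the SDK model objects:

```python
# Normalize throttling rules to per-minute limits (RPM for "request",
# TPM for "token"). Dicts stand in for ThrottlingRule objects here.
def per_minute_limits(rate_limits: list[dict]) -> dict:
    """Scale each rule's count to a 60-second renewal period."""
    limits = {}
    for rule in rate_limits:
        per_minute = rule["count"] * (60.0 / rule["renewal_period"])
        limits[rule["key"]] = per_minute
    return limits


# The two rules from the dump above: 10 requests / 10 s, 10000 tokens / 60 s.
rules = [
    {"key": "request", "renewal_period": 10.0, "count": 10.0},
    {"key": "token", "renewal_period": 60.0, "count": 10000.0},
]
print(per_minute_limits(rules))  # -> {'request': 60.0, 'token': 10000.0}
```

That gives the 60 RPM / 10,000 TPM ceiling for this deployment, but nothing in the response reports how much of it is currently consumed.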
If there is an existing solution to this, please let me know.
Thank you!