We need to understand the roofline of:
- Offline: the maximum number of queries/responses we can handle each second
- Online (concurrency): the maximum concurrency we can measure for the endpoints
- Online: the maximum SSE chunks we can stream each second (which will bound our TPS roofline)
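For the SSE case, one minimal sketch of how chunk throughput could be measured: drain a stream, count chunks, and divide by elapsed time. The `fake_sse_stream` generator here is a hypothetical stand-in for a real SSE client iterator; only the counting logic carries over to a real endpoint.

```python
import asyncio
import time


async def fake_sse_stream(n_chunks: int, delay_s: float):
    # Hypothetical stand-in for an SSE response; a real client would
    # iterate over "data: ..." lines read from the endpoint instead.
    for i in range(n_chunks):
        await asyncio.sleep(delay_s)
        yield f"data: chunk {i}\n\n"


async def measure_chunks_per_second(stream) -> float:
    # Drain the stream, counting chunks, then report chunks/second.
    start = time.perf_counter()
    count = 0
    async for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed


rate = asyncio.run(measure_chunks_per_second(fake_sse_stream(50, 0.001)))
print(f"{rate:.0f} chunks/s")
```

Running the same measurement against many concurrent streams (e.g. via `asyncio.gather`) would give the aggregate chunks/second roofline rather than the per-stream rate.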
We can use SemiAnalysis data as a reference: https://inferencemax.ai/
This will prepare us for when we need to scale horizontally to measure endpoints served on a larger cluster.