We need to understand the roofline of:
- Offline: the maximum number of queries/responses we can handle each second
- Online (concurrency): the maximum concurrency we can measure for the endpoints
- Online: the maximum SSE chunks we can stream each second (which will bound our TPS roofline)
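For the SSE case, one minimal sketch of how chunk throughput could be measured: drain a stream, count chunks, and divide by elapsed time. The `fake_sse_stream` generator here is a hypothetical stand-in for a real SSE client iterator; only the counting logic carries over to a real endpoint.

```python
import asyncio
import time


async def fake_sse_stream(n_chunks: int, delay_s: float):
    # Hypothetical stand-in for an SSE response; a real client would
    # iterate over "data: ..." lines read from the endpoint instead.
    for i in range(n_chunks):
        await asyncio.sleep(delay_s)
        yield f"data: chunk {i}\n\n"


async def measure_chunks_per_second(stream) -> float:
    # Drain the stream, counting chunks, then report chunks/second.
    start = time.perf_counter()
    count = 0
    async for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed


rate = asyncio.run(measure_chunks_per_second(fake_sse_stream(50, 0.001)))
print(f"{rate:.0f} chunks/s")
```

Running the same measurement against many concurrent streams (e.g. via `asyncio.gather`) would give the aggregate chunks/second roofline rather than the per-stream rate.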
We can use SemiAnalysis data as a reference: https://inferencemax.ai/
This will prepare us for when we need to scale horizontally to measure endpoints served on a larger cluster.