Closed
Description
Currently, the user must tune an API's CPU request for horizontal pod autoscaling to behave as expected. An approach based on the number of concurrent requests per container may work better (similar to the concurrency-based autoscaling Knative uses).
This would also make autoscaling for GPU workloads behave more as expected, since GPU-bound requests often consume little CPU and therefore under-report load to a CPU-based metric.
It may make sense to have both request-based and CPU/GPU-based autoscaling active at the same time, i.e. scale up when either threshold is met, and only scale back down once both metrics have dropped below their thresholds.
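As a rough sketch of the combined behavior: the Kubernetes HPA `autoscaling/v2` API already supports multiple metrics and computes a desired replica count per metric, taking the maximum, which gives exactly the "scale on either, back off on both" semantics described above. The manifest below is a hypothetical illustration (the Deployment name and the `concurrent_requests` custom metric are assumptions, and the custom metric would need to be exposed via a metrics adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api    # hypothetical API deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU-based autoscaling: scales when average utilization
  # across pods exceeds the target.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  # Request-based autoscaling: scales on a custom per-pod
  # concurrency metric (assumed name; requires a metrics adapter).
  - type: Pods
    pods:
      metric:
        name: concurrent_requests
      target:
        type: AverageValue
        averageValue: "50"
```

Because the HPA takes the max of the per-metric replica counts, either metric alone can trigger a scale-up, and the replica count only falls once both metrics are back under their targets.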