Description
Based on an (old) Gitter message from @black-adder, it would be good to let the Jaeger Operator automatically scale up/down Jaeger:
> As a rule of thumb, if you see that spans are being dropped, I'd go ahead and add more jaeger-collector hosts. If that doesn't mitigate, then scale out your cassandra cluster.
@objectiser then shared that:
> We should look to add some autonomic behaviour to the management of jaeger, by monitoring relevant metrics to understand when reporting/storing trace data is resulting in issues and take action (e.g. scale up).
>
> Possibly an action can initially try scaling up the collector, but if that does not change the no. of dropped spans significantly after a specified time, then it tries to scale up the storage.
>
> Trickier situation would be how to determine when to scale down. That may be a combination of jaeger metrics with possibly some other factors (e.g. cpu utilisation, etc) - but only scaling down to a predefined minimum config.
With today's knowledge and tools, I think we can start with a simple approach of just using the Kubernetes Horizontal Pod Autoscaler (HPA) to scale up/down based on CPU and memory. The idea is that when there's a shortage of workers, CPU usage will be close to the limit, and when the queues are full, memory usage will be close to the limit as well (once jaegertracing/jaeger#943 is closed).
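As a sketch of what that first step could look like, the operator could create an HPA targeting the collector Deployment, along these lines. The Deployment name, replica bounds, and utilization thresholds below are illustrative assumptions, not something decided in this issue:

```yaml
# Hypothetical HPA for the collector. Name, min/max replicas and the
# 90% thresholds are placeholders; the v2beta2 API (or later) is needed
# to combine CPU and memory metrics.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: jaeger-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jaeger-collector
  minReplicas: 1    # the "predefined minimum config" for scaling down
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 90
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 90
```

With multiple metrics, the HPA computes a desired replica count for each and takes the largest, so either CPU pressure or full (memory-backed) queues would trigger a scale-up, and scale-down only happens when both are low.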
To me, the only piece of infra that should be scaled for now is the collector. The query service isn't typically used heavily enough to require dynamic scaling, and agents are either scaled with the application (sidecar) or can't be scaled at all (daemon sets). The only other remaining component is the ingester, which could be dealt with in a second phase.
This leaves scaling of the storage out of the equation for now: we could either address it in a second phase, or delegate this action to the storage's own operator.