v0.1
- Controlled scheduling of deployments according to the CRD spec
- Allow updates of deployments (store generation of the deployment in status)
- Propagate the partition ID to the managed deployment (see the sketch below)
- Adding or removing deployments based on CRD spec
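
A minimal sketch of how the partition ID from the item above could be propagated into the managed deployment, assuming it is passed as an environment variable; the variable name and helper functions are illustrative, not the operator's actual identifiers:

```go
// Hypothetical sketch: inject the assigned partition ID into the managed
// Deployment as an environment variable so the consumer process knows which
// partition to read. All names here are assumptions.
package sketch

import (
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

const partitionEnvVarName = "KAFKA_PARTITION" // assumed variable name

// setPartitionEnv stamps the partition ID onto every container in the deployment.
func setPartitionEnv(d *appsv1.Deployment, partition int32) {
	for i := range d.Spec.Template.Spec.Containers {
		c := &d.Spec.Template.Spec.Containers[i]
		c.Env = upsertEnv(c.Env, corev1.EnvVar{
			Name:  partitionEnvVarName,
			Value: strconv.Itoa(int(partition)),
		})
	}
}

// upsertEnv replaces an existing variable with the same name or appends it.
func upsertEnv(env []corev1.EnvVar, v corev1.EnvVar) []corev1.EnvVar {
	for i := range env {
		if env[i].Name == v.Name {
			env[i] = v
			return env
		}
	}
	return append(env, v)
}
```
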
v0.2
- Update GOMAXPROCS based on the number of CPUs
- Initial resource allocation based on current Kafka metrics + static multipliers (see the estimator sketch below)
- Auto-scaling based on Production/Consumption/Offset
- Store MetricsMap from the last query in consumer object status
- Rename the static predictor to `naive`
- Load metricsProvider from the status
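
A minimal sketch of the initial allocation and GOMAXPROCS items above, assuming a `naive` predictor that divides the observed production rate by a static `ratePerCore` multiplier (that multiplier appears later in this roadmap; the function names are assumptions):

```go
// Hypothetical naive estimator: derive the CPU request from the partition's
// production rate and a static ratePerCore multiplier, then expose the result
// to the consumer process via GOMAXPROCS.
package main

import (
	"fmt"
	"math"
)

// estimateCores returns the number of cores to request for one instance.
// productionRate is messages/sec observed for the partition; ratePerCore is
// the configured messages/sec a single core is expected to handle.
func estimateCores(productionRate, ratePerCore float64) int {
	if ratePerCore <= 0 {
		return 1
	}
	cores := int(math.Ceil(productionRate / ratePerCore))
	if cores < 1 {
		cores = 1
	}
	return cores
}

func main() {
	cores := estimateCores(25000, 10000) // 25k msg/s at 10k msg/s per core -> 3 cores
	// GOMAXPROCS would be set on the managed deployment so the Go runtime
	// matches the CPU limit instead of seeing the whole node.
	fmt.Printf("request %d cores, GOMAXPROCS=%d\n", cores, cores)
}
```
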
v0.3
- Setup travis-ci
- Query multiple Prometheus instances for metrics
- Write readme
v0.4
- Validate/Fix RBAC permissions
- Build a simple service to produce pseudo-data to local Kafka/Prometheus
- Update the readme with steps to configure a dev environment on Linux/macOS
v0.5 scaling
- Scale down only when no lag is present
- Scale only after X periods of lag/no lag (see the sketch after this list)
- Introduce another deployment status, `SATURATED`, to indicate that we don't have enough resources for it
- Need a way to expose the resource saturation level (how many CPUs are lacking)
- Per-deployment auto-scaling pause
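
A minimal sketch of the "scale only after X periods" rule above, assuming the decision has to stay stable for a configurable pending period before it is applied (the `scaleStatePendingPeriod` name appears later in this roadmap; the surrounding types are assumptions):

```go
// Hypothetical sketch: a scaling decision is acted on only after it has been
// observed consistently for a configurable pending period.
package sketch

import "time"

type scaleState struct {
	direction int       // -1 scale down, 0 hold, +1 scale up
	since     time.Time // when this direction was first observed
}

// shouldApply records the latest decision and reports whether it has been
// stable long enough to act on.
func (s *scaleState) shouldApply(direction int, now time.Time, pending time.Duration) bool {
	if direction != s.direction {
		// Direction changed: restart the pending timer.
		s.direction = direction
		s.since = now
		return false
	}
	return direction != 0 && now.Sub(s.since) >= pending
}
```
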
v0.6 - observability
- Post behaviour updates to Kubernetes events
- Clean up logging
- Expose metrics about the operator's own health and behaviour (see the sketch below)
- Grafana dashboard
- Update the spec to deploy 3 instances of the operator
- Add `totalMaxAllowed`, which will limit the total number of cores available to the consumer
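
A minimal sketch of exposing the operator's own health and behaviour metrics with the standard Prometheus Go client; the metric names and labels are illustrative only:

```go
// Hypothetical sketch: register operator-level metrics with the standard
// Prometheus Go client. Metric names and labels are assumptions.
package sketch

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// How many managed deployments are currently in each status (RUNNING, SATURATED, ...).
	deploymentsByStatus = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "consumer_operator_deployments",
		Help: "Managed deployments by status.",
	}, []string{"status"})

	// How long a single reconcile pass takes; useful for the slow-reconcile item in v0.8.
	reconcileDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "consumer_operator_reconcile_seconds",
		Help:    "Duration of a reconcile pass.",
		Buckets: prometheus.DefBuckets,
	})
)
```
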
v0.7 testing
- Guest mode: ability to run the operator without cluster-wide permissions
- Verify that scaling works with multi-container pods
- Verify that disabling all auto-scaling and setting resources in the deployment itself works
- Verify that HA mode works
- Verify that the system operates as expected when autoscaling is disabled
- Scale up only if there is lag, scale down only if there is no lag
- Update owner: if the operator is restarted it gets a new UID, so we need to update the ownerRef during reconcile (see the sketch after this list)
- [TEST] Add more integration tests
- [TEST] Add a test to verify that env variables are always set
- [BUG] Updating the operator spec to scale all deployments down works, but resuming doesn't
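
A hedged sketch of the ownerRef refresh mentioned above, assuming controller-runtime's `SetControllerReference` helper; the surrounding function and the way stale references are detected are assumptions:

```go
// Hypothetical sketch: refresh the ownerReference on a managed deployment so
// it points at the owner's current UID after an operator restart.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureOwnerRef drops any stale reference to the owner (same name, old UID)
// and sets a fresh controller reference during reconcile.
func ensureOwnerRef(owner metav1.Object, d *appsv1.Deployment, scheme *runtime.Scheme) error {
	refs := d.GetOwnerReferences()
	kept := refs[:0]
	for _, r := range refs {
		if r.Name == owner.GetName() && r.UID != owner.GetUID() {
			continue // stale reference from a previous owner instance
		}
		kept = append(kept, r)
	}
	d.SetOwnerReferences(kept)
	return controllerutil.SetControllerReference(owner, d, scheme)
}
```
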
v0.8 bugfixes
- Verify that the system works without a ResourcePolicy set
- Make `scaleStatePendingPeriod` configurable
- Profile the slow reconcile (15s for ~300 deployments)
- Fix statuses after the change in scaling logic (scale based on lag)
v0.9 Multi-partition assignment
- Static multi-partition assignment (see the sketch below)
- Improve logging
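
A minimal sketch of static multi-partition assignment, assuming each instance owns a contiguous range of `numPartitionsPerInstance` partitions (that field name appears later in this roadmap):

```go
// Hypothetical sketch of static multi-partition assignment: instance i gets a
// contiguous range of partitions of size numPartitionsPerInstance. The last
// instance may get a shorter range. Names are assumptions, not the real spec.
package main

import "fmt"

// partitionRange returns the inclusive range [first, last] for one instance.
func partitionRange(instance, numPartitionsPerInstance, totalPartitions int) (first, last int) {
	first = instance * numPartitionsPerInstance
	last = first + numPartitionsPerInstance - 1
	if last >= totalPartitions {
		last = totalPartitions - 1
	}
	return first, last
}

func main() {
	// 10 partitions, 3 per instance -> instances get 0-2, 3-5, 6-8, 9-9.
	for i := 0; i*3 < 10; i++ {
		first, last := partitionRange(i, 3, 10)
		fmt.Printf("instance %d: partitions %d-%d\n", i, first, last)
	}
}
```
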
v1.0
- Promote v1alpha1 version to v1
- Resync all metrics on reconcile. The status metric is wrong (it seems to be updated only during creation of deployments)
- Replace partition labels with ranges, validate length
v1.1
- If there are no metrics at all, allocate the avg/mean instead of the minimum
- The CPU increment should be configurable
- Clean up logging
- Log not only the scaling cmp value (-1, 0, 1), but also how many cores were estimated per instance
- Introduce a verbose mode to make it easier to debug single-instance issues
vNext
- Implement `bounce` mode: keep track of the last few node names per instance and add an anti-affinity rule to the deployment to avoid scheduling to those nodes during the next scale up (see the sketch after this list)
- Implement progressive updates (canary)
- Rework configuration of RAM estimation; make it possible to provide a formula, e.g. (fixed + ramPerCore)
- Consider replacing DeploymentSpec with PodSpec/PodLabels/PodAnnotations
- Ability to set additional deployment-level annotations/labels
- Recreate deployments from scratch if any of the immutable fields in the deploymentSpec were changed; currently this requires manually deleting all deployments
- Use annotations to pause/resume configmap-based consumers
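
A minimal sketch of the `bounce` anti-affinity rule above, assuming the pod is kept off recently used nodes via a `NotIn` node-affinity match on `kubernetes.io/hostname`; the real implementation may differ:

```go
// Hypothetical sketch of "bounce" mode: keep the pod off the nodes it ran on
// recently by adding a NotIn match on kubernetes.io/hostname.
package sketch

import corev1 "k8s.io/api/core/v1"

// avoidNodes returns an affinity that prevents scheduling onto recentNodes.
func avoidNodes(recentNodes []string) *corev1.Affinity {
	if len(recentNodes) == 0 {
		return nil
	}
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: corev1.NodeSelectorOpNotIn,
						Values:   recentNodes,
					}},
				}},
			},
		},
	}
}
```
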
Unsorted
- Consider using the number of messages in all estimates instead of projected lag time
- Add jitter to the scaling time
- Dynamic multi-partition assignment. Instead of a static `numPartitionsPerInstance`:
  - Configure min/max values for `numPartitionsPerInstance`
  - Configure min/max number of pods per consumer
  - Based on the production rate, decide how many partitions to assign. Scale each instance vertically until it fits; if the per-pod resource limit is exhausted but the global one is not, scale horizontally and reduce the number of partitions per instance (see the sketch after this list)
- [BUG] Update of the auto-scaler spec (ratePerCore, ramPerCore) should ? trigger reconciliation
- Reset status annotation if MANUAL mode is enabled
- Consider making number of partitions optional in the spec
- [Feature] implement defaulting/validating webhooks
- [Feature] call external webhooks on scaling events
- [Feature] Vertical auto-scaling of balanced workloads (single deployment)
- [Feature] Fully dynamic resource allocations based on historic data
- [Feature] ? consider adding support for VPA/HPA
- [Feature] ? Tool for operations: `consumerctl stop/start consumer`
- [Feature] ? Consider getting all the pods to estimate uptime and avoid frequent restarts.
- [Feature] Implement second metrics provider (Kafka)
- [Feature] Scale up without restart (blocked)
- [Feature] Get Kafka lag directly from Prometheus (blocked)
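
A minimal sketch of the dynamic multi-partition sizing described in the list above: pick the largest partitions-per-instance value whose estimated per-pod CPU still fits under the per-pod limit, otherwise fall back to more pods with fewer partitions each. All names and the estimator itself are assumptions:

```go
// Hypothetical sketch of dynamic partitions-per-instance sizing: prefer
// vertical scaling (more partitions per pod) while the per-pod CPU limit
// allows it; otherwise scale horizontally with fewer partitions per pod.
package sketch

import "math"

// choosePartitionsPerInstance returns how many partitions each pod should own.
// ratePerPartition is msg/s per partition, ratePerCore the static multiplier,
// maxCoresPerPod the per-pod limit, and the min/max bounds come from the spec.
func choosePartitionsPerInstance(ratePerPartition, ratePerCore, maxCoresPerPod float64, minPer, maxPer int) int {
	for per := maxPer; per > minPer; per-- {
		cores := math.Ceil(ratePerPartition * float64(per) / ratePerCore)
		if cores <= maxCoresPerPod {
			return per // vertical scaling still fits within the per-pod limit
		}
	}
	// Per-pod limit exhausted: scale horizontally with the minimum assignment.
	return minPer
}
```
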