Added auto-scale to the collector #856

Merged: 3 commits, Jan 20, 2020

Conversation

@jpkrohling jpkrohling commented Jan 16, 2020

This PR adds a Horizontal Pod Autoscaler (HPA) for the collector, along with two new properties, MinReplicas and MaxReplicas. With that, the collector should now scale up and down automatically based on CPU and/or memory consumption. When neither property is specified, the minimum number of replicas is 1 and the maximum is 100. The HPA configuration is added only when the deployment strategy is either production or streaming.
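
The defaulting rules described above can be sketched roughly as follows. This is a minimal illustration and not the operator's actual code: the `collectorSpec` type, the `autoscaleBounds` function, and the constant names are hypothetical; only the default values (1 and 100) and the two strategy names come from the description.

```go
package main

import "fmt"

// Defaults as stated in the PR description. The constant names are hypothetical.
const (
	defaultMinReplicas int32 = 1
	defaultMaxReplicas int32 = 100
)

// collectorSpec is a hypothetical, trimmed-down stand-in for the collector
// options. A nil pointer means the property was not specified.
type collectorSpec struct {
	MinReplicas *int32
	MaxReplicas *int32
}

// autoscaleBounds returns the replica bounds for the HPA and reports whether
// an HPA should be created at all for the given deployment strategy.
func autoscaleBounds(strategy string, spec collectorSpec) (lo, hi int32, enabled bool) {
	// Per the description, only these two strategies get an HPA.
	if strategy != "production" && strategy != "streaming" {
		return 0, 0, false
	}
	lo, hi = defaultMinReplicas, defaultMaxReplicas
	if spec.MinReplicas != nil {
		lo = *spec.MinReplicas
	}
	if spec.MaxReplicas != nil {
		hi = *spec.MaxReplicas
	}
	return lo, hi, true
}

func main() {
	five := int32(5)
	fmt.Println(autoscaleBounds("production", collectorSpec{}))                   // 1 100 true
	fmt.Println(autoscaleBounds("production", collectorSpec{MaxReplicas: &five})) // 1 5 true
	fmt.Println(autoscaleBounds("allInOne", collectorSpec{}))                     // 0 0 false
}
```

Pointers are used here so that "not specified" can be told apart from an explicit value; whether the operator models the properties the same way is an assumption.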

Closes #848, even though the scaling of the storage isn't implemented by this one.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

@jpkrohling
Contributor Author

This shows how the auto-scaling works in OpenShift. Note that this should also work in plain Kubernetes, but I'm not able to generate enough load on my local machine with minikube + ES (1 GiB) + tracegen.

image

@jpkrohling
Contributor Author

@kevinearls not sure we want to add a new e2e test for this, but perhaps you might have a good idea that wouldn't be too fragile?

@jpkrohling jpkrohling changed the title Added auto-scale to the collector WIP - Added auto-scale to the collector Jan 16, 2020
@jpkrohling jpkrohling changed the title WIP - Added auto-scale to the collector Added auto-scale to the collector Jan 17, 2020
@jpkrohling
Contributor Author

I ran a longer test this morning, showing that it scales both up and down. First, I deployed the simple-prod instance along with tracegen (10 replicas). After about 20 minutes, 10 collector replicas were available. Removing the tracegen deployment brought the number of replicas back down to 1 after about 20 minutes. Then I changed simple-prod to set maxReplicas to 5, deployed tracegen again, and verified that the collector scales up only to 5 replicas. Removing tracegen causes the collector to eventually settle at 1 replica again.

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   1/1     1            1           2m2s
simple-prod-query       1/1     1            1           2m2s
tracegen                10/10   10           10          30s

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   10/10   10           10          19m
simple-prod-query       1/1     1            1           19m
tracegen                10/10   10           10          17m

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   1/1     1            1           44m
simple-prod-query       1/1     1            1           44m

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   5/5     5            5           59m
simple-prod-query       1/1     1            1           59m
tracegen                10/10   10           10          11m

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   5/5     5            5           71m
simple-prod-query       1/1     1            1           71m
tracegen                10/10   10           10          24m

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   5/5     5            5           71m
simple-prod-query       1/1     1            1           71m

$ kubectl get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
simple-prod-collector   1/1     1            1           88m
simple-prod-query       1/1     1            1           88m
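
For context on why the collector stops at 5 replicas once maxReplicas is set, here is a rough sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is ceil(currentReplicas * currentMetric / targetMetric), clamped to the configured bounds. The function name and the utilization numbers are hypothetical, and the real controller also applies a tolerance and stabilization delays that are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA formula,
// ceil(current * currentMetric / targetMetric),
// and clamps the result to [minReplicas, maxReplicas].
// Tolerance and stabilization handling are intentionally omitted.
func desiredReplicas(current int32, currentMetric, targetMetric float64, minReplicas, maxReplicas int32) int32 {
	if targetMetric <= 0 {
		// No meaningful target: keep the current replica count.
		return current
	}
	desired := int32(math.Ceil(float64(current) * currentMetric / targetMetric))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// Hypothetical load: 4 replicas at 180% utilization against a 90% target.
	fmt.Println(desiredReplicas(4, 180, 90, 1, 100)) // 8
	// Same load with the upper bound set to 5: the result is capped.
	fmt.Println(desiredReplicas(4, 180, 90, 1, 5)) // 5
	// Load removed: the count falls back to the lower bound.
	fmt.Println(desiredReplicas(5, 10, 90, 1, 5)) // 1
}
```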

And here's a set of screenshots from OpenShift (older events are at the bottom):

image

image

image

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@objectiser objectiser left a comment

Looks good. Just some minor comments.

deploy/crds/jaegertracing.io_jaegers_crd.yaml (resolved)
deploy/crds/jaegertracing.io_jaegers_crd.yaml (resolved)
spec:
  containers:
  - name: tracegen
    image: jaegertracing/jaeger-tracegen:latest
Contributor

Shouldn't the tracegen image be versioned in line with the other Jaeger components?

Contributor Author

Yes, opened an issue (#866) to track this. The first image will probably be 1.17 (next release).

pkg/apis/jaegertracing/v1/jaeger_types.go (outdated, resolved)
pkg/deployment/collector.go (outdated, resolved)
pkg/deployment/collector.go (outdated, resolved)
@@ -21,7 +21,7 @@ func init() {

 func TestNegativeReplicas(t *testing.T) {
 	size := int32(-1)
-	jaeger := v1.NewJaeger(types.NamespacedName{Name: "TestNegativeReplicas"})
+	jaeger := v1.NewJaeger(types.NamespacedName{Name: "my-instance"})
Contributor

What happened to the convention of naming instances after the test?

Contributor Author

We started naming them all "my-instance" some time ago, as individual names don't bring much value and we had a few copy/paste mistakes.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@objectiser
Contributor

@jpkrohling Approval subject to tests passing :)

@jpkrohling
Contributor Author

A local run has shown that a new permission is missing. I'm testing it locally and will update the PR once I confirm the tests are passing.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>