Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chore] Fix E2E autoscale test for OpenShift #1365

Merged
merged 2 commits into from
Jan 31, 2023

Conversation

iblancasa
Copy link
Contributor

Signed-off-by: Israel Blancas iblancasa@gmail.com

Fixes #1364
Also, this PR makes the test more reliable.

@iblancasa iblancasa requested a review from a team January 12, 2023 09:12
@frzifus frzifus added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jan 12, 2023
Copy link
Member

@frzifus frzifus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until when do we support hpav1?

@iblancasa
Copy link
Contributor Author

@iblancasa while we support K8s 1.23, if I'm not wrong

hack/wait-until-hpa-ready/main.go Outdated Show resolved Hide resolved
hack/wait-until-hpa-ready/main.go Outdated Show resolved Hide resolved
@@ -0,0 +1,7 @@
apiVersion: kuttl.dev/v1beta1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? The

should be cleaned automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is: depending on the cluster where you are running the test, the previous duration parameter provided to tracegen can be not enough to trigger the HPA and scale the collector.

With this approach, we ran tracegen until the number of replicas is increased. Later, we stop tracegen to reduce the metrics and make the HPA to scale down.

tests/e2e/autoscale/01-check-simplest-collector-hpa.yaml Outdated Show resolved Hide resolved
@pavolloffay
Copy link
Member

From the issue

Later, it tries to match them with the CRs from tests/e2e/autoscale/00-assert.yaml. Those CRs check for HPAs whose apiVersion are autoscaling/v1 (simplest-collector) and autoscaling/v2beta2 (simplest-set-utilization-collector). In OpenShift (4.11) both are created with an apiVersion with the value autoscaling/v2

@iblancasa do we know the root cause why only autoscaling/v2 was created? The OCP 4.11 is based on k8s 1.24. autoscaling/v2beta1 was removed 1.25

@iblancasa
Copy link
Contributor Author

iblancasa commented Jan 25, 2023

@pavolloffay

From the issue

Later, it tries to match them with the CRs from tests/e2e/autoscale/00-assert.yaml. Those CRs check for HPAs whose apiVersion are autoscaling/v1 (simplest-collector) and autoscaling/v2beta2 (simplest-set-utilization-collector). In OpenShift (4.11) both are created with an apiVersion with the value autoscaling/v2

@iblancasa do we know the root cause why only autoscaling/v2 was created? The OCP 4.11 is based on k8s 1.24. autoscaling/v2beta1 was removed 1.25

I didn't research too much but I think it can be related to this comment in 00-install.yaml:

# TODO: these tests use .Spec.MaxReplicas and .Spec.MinReplicas. These fields are
# deprecated and moved to .Spec.Autoscaler. Fine to use these fields to test that old CRD is
# still supported but should eventually be updated.

If you want, I can create a new issue for that.

"k8s.io/client-go/util/homedir"
)

func main() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add \n as a last character to all Printf statements

pollInterval := time.Second

// Search in v2 and v1 for an HPA with the given name
err = wait.Poll(pollInterval, 0, func() (done bool, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not sure why actually v1 and v2 HPAs are created in a test.

My understanding is that only a single HPA version should be used in a given cluster. Could you please explain why both are created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write the test. I'm just trying to make it work on OpenShift. But this is what I found while checking for the purpose of this E2E test:

  • The test creates 2 OpenTelemetryCollector instances to test 2 ways of creating HPAs. From the 00-install.yaml file:
# This creates two different deployments:
#   * The first one will be used to see if we scale properly
#   * The second is to check the targetCPUUtilization option
  • This creates 2 HPAs (I wait for their creation and metrics reporting in steps 1 and 2)
  • We start tracegen in step 3 and wait for one of the OpenTelemetryCollector instances to scale up to 2 replicas
  • We remove the tracegen deployment to stop reporting traces in step 4
  • Wait until the OpenTelemetryCollector scales down in step 5

When the HPAs are created, they will be created using autoscaling/v1 or autoscaling/v2beta2. If you check how this test was written before my changes, you can see how in 00-assert.yaml , the test tries to assert the simplest-collector HPA with autoscaling/v1 and the simplest-set-utilization-collector HPA with autoscaling/v2beta2.

When I ran this in OpenShift 4.11, both of them were created using autoscaling/v2beta2 (as I pointed in this comment).

So, since in KUTTL there is no way to conditionally check for one resource or another, I created the wait-until-hpa-ready.go script to (given a name) dynamically:

  • Look for an HPA in the autoscaling/v2beta2 API. If found, check if the HPA status is different from unknown
  • If the HPA was not found in autoscaling/v2beta2, look for it in autoscaling/v1. If found, check if the HPA status is different from unknown

Another thing: why the HPAs are created using different autoscaling API versions (as we can see in 00-assert.yaml) in the Kubernetes versions tested during the CI? I think this is because one is setting the .spec.minReplicas and .spec.maxReplicas values and the other is setting them in .spec.autoscaler (as the comment in 00-install.yaml explains). If you want, I can do a deeper investigation about why this happens, but in a separate issue since is not related to the current PR.

@pavolloffay
Copy link
Member

@iblancasa could you please fix the CI?

@iblancasa iblancasa requested a review from a team January 31, 2023 09:37
@iblancasa
Copy link
Contributor Author

I broke something in my branch. Fixing...

@iblancasa iblancasa marked this pull request as draft January 31, 2023 09:39
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
@iblancasa iblancasa marked this pull request as ready for review January 31, 2023 12:04
@iblancasa
Copy link
Contributor Author

Fixed!

Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this and book a ticket to make sure only a single HPA version is used by the operator for a given k8s version.

@pavolloffay pavolloffay merged commit 55c37bc into open-telemetry:main Jan 31, 2023
@iblancasa iblancasa deleted the fix/1364 branch January 31, 2023 13:06
iblancasa added a commit to iblancasa/opentelemetry-operator that referenced this pull request Feb 1, 2023
* Improve the reliability of the autoscale E2E test

Signed-off-by: Israel Blancas <iblancasa@gmail.com>

* Revert change

Signed-off-by: Israel Blancas <iblancasa@gmail.com>

---------

Signed-off-by: Israel Blancas <iblancasa@gmail.com>
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
* Improve the reliability of the autoscale E2E test

Signed-off-by: Israel Blancas <iblancasa@gmail.com>

* Revert change

Signed-off-by: Israel Blancas <iblancasa@gmail.com>

---------

Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

autoscale E2E test doesn't pass in OpenShift
3 participants