[chore] Fix E2E autoscale test for OpenShift #1365

Merged (2 commits) on Jan 31, 2023
27 changes: 4 additions & 23 deletions tests/e2e/autoscale/00-assert.yaml
@@ -5,28 +5,9 @@ metadata:
 status:
   readyReplicas: 1
 ---
-apiVersion: autoscaling/v1
-kind: HorizontalPodAutoscaler
-metadata:
-  name: simplest-collector
-spec:
-  minReplicas: 1
-  maxReplicas: 2
-# This is not neccesarily exact. We really just want to wait until this is no longer <unknown>
-status:
-  currentCPUUtilizationPercentage: 20
----
-apiVersion: autoscaling/v2beta2
-kind: HorizontalPodAutoscaler
+apiVersion: apps/v1
+kind: Deployment
 metadata:
   name: simplest-set-utilization-collector
-spec:
-  minReplicas: 1
-  maxReplicas: 2
-  metrics:
-  - type: Resource
-    resource:
-      name: cpu
-      target:
-        type: Utilization
-        averageUtilization: 50
+status:
+  readyReplicas: 1

5 changes: 3 additions & 2 deletions tests/e2e/autoscale/00-install.yaml
@@ -1,5 +1,6 @@
-# This creates two different deployments. The first one will be used to see if we scale properly. (Note that we are
-# only scaling up to 2 because of limitations of KUTTL). The second is to check the targetCPUUtilization option.
+# This creates two different deployments:
+#   * The first one will be used to see if we scale properly
+#   * The second is to check the targetCPUUtilization option
 #
 apiVersion: opentelemetry.io/v1alpha1
 kind: OpenTelemetryCollector
4 changes: 4 additions & 0 deletions tests/e2e/autoscale/01-check-simplest-collector-hpa.yaml
@@ -0,0 +1,4 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
- script: go run ./wait-until-hpa-ready.go --hpa simplest-collector
4 changes: 4 additions & 0 deletions (file name not shown)
@@ -0,0 +1,4 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
- script: go run ./wait-until-hpa-ready.go --hpa simplest-set-utilization-collector
11 changes: 0 additions & 11 deletions (file name not shown)
@@ -1,17 +1,6 @@
 # Wait until tracegen has completed and the simplest deployment has scaled up to 2
-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: tracegen
-status:
-  conditions:
-  - status: "True"
-    type: Complete
-
----
 apiVersion: opentelemetry.io/v1alpha1
 kind: OpenTelemetryCollector
-
 metadata:
   name: simplest
 status:
2 changes: 2 additions & 0 deletions (file name not shown)
@@ -13,6 +13,8 @@ spec:
         args:
         - -otlp-endpoint=simplest-collector-headless:4317
         - -otlp-insecure
+        # High duration to ensure the trace creation doesn't stop.
+        # It'll be stopped in step 4
         - -duration=1m
         - -workers=20
       restartPolicy: Never
7 changes: 7 additions & 0 deletions tests/e2e/autoscale/04-delete.yaml
@@ -0,0 +1,7 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
delete:
- apiVersion: batch/v1
  kind: Job
  metadata:
    name: tracegen

Member:
Why is this needed? The tracegen Job should be cleaned automatically.

Contributor Author:
The problem is that, depending on the cluster where you run the test, the duration parameter previously provided to tracegen may not be enough to trigger the HPA and scale the collector.

With this approach, we run tracegen until the number of replicas has increased. Then we stop tracegen to reduce the metrics and make the HPA scale down.
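
To make the scale-up part of the flow described in the reply above concrete, here is a minimal client-go sketch (not part of this PR; the test itself relies on KUTTL asserts for this step) that polls the collector Deployment until it reports 2 ready replicas. The simplest-collector name and the NAMESPACE environment variable mirror what the test already uses; the 10-minute timeout is an arbitrary value chosen for illustration.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumed names: the Deployment created for the "simplest" collector instance
	// and the test namespace exported in the NAMESPACE environment variable.
	deploymentName := "simplest-collector"
	namespace := os.Getenv("NAMESPACE")

	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		fmt.Printf("Error reading the kubeconfig: %s\n", err)
		os.Exit(1)
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	// Poll every second until the Deployment reports 2 ready replicas.
	err = wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
		dep, err := client.AppsV1().Deployments(namespace).Get(context.Background(), deploymentName, metav1.GetOptions{})
		if err != nil {
			fmt.Printf("Deployment %s not found yet\n", deploymentName)
			return false, nil
		}
		fmt.Printf("%s has %d ready replicas\n", deploymentName, dep.Status.ReadyReplicas)
		return dep.Status.ReadyReplicas >= 2, nil
	})
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}
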
1 change: 0 additions & 1 deletion (file name not shown)
@@ -1,7 +1,6 @@
 # Wait for the collector to scale back down to 1
 apiVersion: opentelemetry.io/v1alpha1
 kind: OpenTelemetryCollector
-
 metadata:
   name: simplest
 status:
109 changes: 109 additions & 0 deletions tests/e2e/autoscale/wait-until-hpa-ready.go
@@ -0,0 +1,109 @@
// Copyright The OpenTelemetry Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	"github.com/spf13/pflag"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {

Member:
please add \n as a last character to all Printf statements

	var hpaName string
	var timeout int
	var kubeconfigPath string

	defaultKubeconfigPath := filepath.Join(homedir.HomeDir(), ".kube", "config")

	pflag.IntVar(&timeout, "timeout", 600, "The timeout for the check.")
	pflag.StringVar(&hpaName, "hpa", "", "HPA to check")
	pflag.StringVar(&kubeconfigPath, "kubeconfig-path", defaultKubeconfigPath, "Absolute path to the KubeconfigPath file")
	pflag.Parse()

	if len(hpaName) == 0 {
		fmt.Println("hpa flag is mandatory")
		os.Exit(1)
	}

	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		fmt.Printf("Error reading the kubeconfig: %s\n", err)
		os.Exit(1)
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	namespace, err := client.CoreV1().Namespaces().Get(context.Background(), os.Getenv("NAMESPACE"), metav1.GetOptions{})
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	hpaClientV2 := client.AutoscalingV2().HorizontalPodAutoscalers(namespace.Name)
	hpaClientV1 := client.AutoscalingV1().HorizontalPodAutoscalers(namespace.Name)

	pollInterval := time.Second

	// Search in v2 and v1 for an HPA with the given name
	err = wait.Poll(pollInterval, 0, func() (done bool, err error) {

Member:
I am still not sure why both v1 and v2 HPAs are actually created in a test.

My understanding is that only a single HPA version should be used in a given cluster. Could you please explain why both are created?

Contributor Author:
I didn't write the test; I'm just trying to make it work on OpenShift. But this is what I found while checking the purpose of this E2E test:

  • The test creates 2 OpenTelemetryCollector instances to test 2 ways of creating HPAs. From the 00-install.yaml file:
      # This creates two different deployments:
      #   * The first one will be used to see if we scale properly
      #   * The second is to check the targetCPUUtilization option
  • This creates 2 HPAs (I wait for their creation and metrics reporting in steps 1 and 2).
  • We start tracegen in step 3 and wait for one of the OpenTelemetryCollector instances to scale up to 2 replicas.
  • We remove the tracegen Job in step 4 to stop reporting traces.
  • We wait until the OpenTelemetryCollector scales down in step 5.

When the HPAs are created, they will be created using autoscaling/v1 or autoscaling/v2beta2. If you check how this test was written before my changes, you can see that, in 00-assert.yaml, the test tries to assert the simplest-collector HPA with autoscaling/v1 and the simplest-set-utilization-collector HPA with autoscaling/v2beta2.

When I ran this on OpenShift 4.11, both of them were created using autoscaling/v2beta2 (as I pointed out in this comment).

So, since in KUTTL there is no way to conditionally check for one resource or another, I created the wait-until-hpa-ready.go script to, given a name, dynamically:

  • Look for an HPA in the autoscaling/v2beta2 API. If found, check whether the HPA status is different from unknown.
  • If the HPA was not found in autoscaling/v2beta2, look for it in autoscaling/v1. If found, check whether the HPA status is different from unknown.

Another thing: why are the HPAs created using different autoscaling API versions (as we can see in 00-assert.yaml) across the Kubernetes versions tested in CI? I think this is because one instance sets the .spec.minReplicas and .spec.maxReplicas values and the other sets them in .spec.autoscaler (as the comment in 00-install.yaml explains). If you want, I can do a deeper investigation of why this happens, but in a separate issue since it is not related to the current PR.

		hpav2, err := hpaClientV2.Get(
			context.Background(),
			hpaName,
			metav1.GetOptions{},
		)
		if err != nil {
			hpav1, err := hpaClientV1.Get(
				context.Background(),
				hpaName,
				metav1.GetOptions{},
			)
			if err != nil {
				fmt.Printf("HPA %s not found\n", hpaName)
				return false, nil
			}

			if hpav1.Status.CurrentCPUUtilizationPercentage == nil {
				fmt.Printf("Current metrics are not set yet for HPA %s\n", hpaName)
				return false, nil
			}
			return true, nil
		}

		if hpav2.Status.CurrentMetrics == nil {
			fmt.Printf("Current metrics are not set yet for HPA %s\n", hpaName)
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	fmt.Printf("%s is ready!\n", hpaName)
}
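
As a footnote to the v1/v2 discussion in the review thread above: which autoscaling API versions a cluster serves differs between environments (e.g. OpenShift 4.11 vs. the Kubernetes versions used in CI), which is why the script checks v2 first and falls back to v1. Below is a minimal sketch, not part of this PR, of how one could list the served autoscaling versions with client-go's discovery client; the kubeconfig handling mirrors the script above.

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		fmt.Printf("Error reading the kubeconfig: %s\n", err)
		os.Exit(1)
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	// Ask the API server which API groups and versions it serves.
	groups, err := client.Discovery().ServerGroups()
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	for _, group := range groups.Groups {
		if group.Name != "autoscaling" {
			continue
		}
		for _, version := range group.Versions {
			// e.g. autoscaling/v1, autoscaling/v2, autoscaling/v2beta2 (cluster dependent)
			fmt.Printf("served: %s\n", version.GroupVersion)
		}
	}
}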