Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement a Keptn metrics provider #1630

Merged
merged 1 commit into from
Jun 12, 2024

Conversation

bacherfl
Copy link
Contributor

Closes #1627

This PR adds the implementation of a keptn metrics provider

The main difference to other providers in Flagger is that this one is using a Kubernetes client to interact with Keptn resources (KeptnMetrics and Analyses). Therefore, in addition to the code in keptn.go, the ClusterRole needed to be adapted to be able to access the required Keptn resources.
The provider supports the use of the following resources:

  • KeptnMetric: This resource represents the current value of a certain metric. The metric is retrieved by Keptn from one of the following data sources: prometheus, thanos, cortex, dynatrace, datadog.
  • Analysis: This Keptn resource can be used to apply a more complex grading logic over a set of different metrics. More information about this concept can be found in the Keptn Docs

Comment on lines 694 to 695
query: |
keptnmetric/my-namespace/response-time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
query: |
keptnmetric/my-namespace/response-time
query: "keptnmetric/my-namespace/response-time"

I would make this a one-liner to suggest you can't have more than one object reference in there. I would also add an example for a real Istio metric like Istio 404 rate https://docs.flagger.app/usage/metrics#prometheus

@@ -18,6 +18,7 @@ package controller

import (
"fmt"
"k8s.io/client-go/rest"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sort the imports into groups, run make fmt

pkg/metrics/providers/keptn.go Show resolved Hide resolved
Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @bacherfl for this! i did a quick first pass, will take a second look soon.


// delete the created analysis at the end of the function
defer func() {
_ = k.client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of ignoring the error, we should log it.

}
return &KeptnProvider{
client: client,
analysisTimeout: 10 * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has this been chosen randomly? how does this play with the Canary analysis's interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the maximum time the provider will wait for the result of a Keptn Analysis to be available - I added this to prevent the provider from blocking the flagger controller indefinitely in case the created Analysis cannot be completed.
I chose this value as this should be sufficient in most cases for the created Keptn Analysis to be reconciled by the Keptn Analysis controller, however I can also adjust this If you'd like, or if there are some guidelines on how long a provider might take to retrieve a value.

@bacherfl
Copy link
Contributor Author

thanks @bacherfl for this! i did a quick first pass, will take a second look soon.

Thanks for the review @aryan9600! I just addressed your comments and adapted the code accordingly - If there are some further questions feel free to reach out :)

Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a second pass, its looking good, just had some minor nits.

can the docs be improved a bit more? a few suggestions:

  • the query field in MetricTemplate example mentions the timeframe and arguments as well
  • make it clearer that resource name is actually that of the AnalysisDefinition and not the to be created Analysis.
  • an example around Analysis, which shows a sample AnalysisDefinition and what the Analysis object generated by Flagger would look like

}

var analysisResource = schema.GroupVersionResource{
Group: "metrics.keptn.sh",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets define a constant for this as well

}
}
}
return 0, fmt.Errorf("could not retrieve KeptnMetric - no value found int resource %s/%s", queryObj.Namespace, queryObj.ResourceName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 0, fmt.Errorf("could not retrieve KeptnMetric - no value found int resource %s/%s", queryObj.Namespace, queryObj.ResourceName)
return 0, fmt.Errorf("could not retrieve KeptnMetric - no value found in resource %s/%s", queryObj.Namespace, queryObj.ResourceName)

select {
case <-ctx.Done():
return 0, fmt.Errorf("encountered timeout while waiting for Keptn Analysis %s/%s to be finished", obj.Namespace, obj.ResourceName)
case <-time.After(500 * time.Millisecond):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case <-time.After(500 * time.Millisecond):
case <-time.After(time.Second):

as mentioned in the above comment

require.NotNil(t, provider)

isOnline, err := provider.IsOnline()
require.Nil(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.Nil(t, err)
require.NoError(t, err)

func TestNewKeptnProviderNoKubeConfig(t *testing.T) {
provider, err := NewKeptnProvider(nil)

require.NotNil(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.NotNil(t, err)
require.Error(t, err)

require.True(t, isOnline)
}

func TestNewKeptnProviderNoKubeConfig(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func TestNewKeptnProviderNoKubeConfig(t *testing.T) {
func TestNewKeptnProvider_NoKubeConfig(t *testing.T) {

require.Nil(t, provider)
}

func TestKeptnProvider_RunQueryKeptnMetric(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func TestKeptnProvider_RunQueryKeptnMetric(t *testing.T) {
func TestKeptnProvider_RunQuery_KeptnMetric(t *testing.T) {

tests := []struct {
name string
setupClient func() *fake.FakeDynamicClient
args args
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just put query string here?

tests := []struct {
name string
setupClient func() *fake.FakeDynamicClient
// verificationFunc() will run in a separate go routing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// verificationFunc() will run in a separate go routing
// verificationFunc() will run in a separate go routine

// verificationFunc() will run in a separate go routing
// and check if the expected resources are created
verificationFunc func(fakeClient *fake.FakeDynamicClient) error
args args
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

@stefanprodan
Copy link
Member

@aryan9600 could you please push your suggestions on top of this PR. I see that @bacherfl has enabled us to push commit.

@bacherfl
Copy link
Contributor Author

bacherfl commented Jun 5, 2024

Hi @stefanprodan @aryan9600 sorry for the late reply - I had a busy schedule recently. I'm not sure if i can get to incorporate the suggested changes this week, but I may have some time to do so next week - otherwise, feel free to add directly to the PR, as @stefanprodan suggested

@aryan9600 aryan9600 force-pushed the poc/keptn-provider branch 2 times, most recently from c559cdf to 29c8bf5 Compare June 10, 2024 14:47
@aryan9600
Copy link
Member

PR updated, the docs could still use a bit more info around Anlysis/AnlysisDefinition, but we can tackle that in another PR @bacherfl

@stefanprodan
Copy link
Member

@aryan9600 please sync the CRDs, the helm chart has the old one.

Add a Keptn metrics provider for two resources:
* KeptnMetric: Verify the value of a single metric.
* Analysis (via AnalysisDefinition): Run a Keptn analysis over an
  interval validating SLOs.

Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
@aryan9600
Copy link
Member

aryan9600 commented Jun 11, 2024

kuma e2e test failure is unrelated, happening because the tarball has moved to another location. i think we should not let it block this PR. we can fix and bump the kuma version later.

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @bacherfl and @aryan9600

@aryan9600 aryan9600 merged commit 97d1ef0 into fluxcd:main Jun 12, 2024
16 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keptn Metrics Provider
3 participants