feat(helm): implementation of integrations chart #7356

Closed · wants to merge 11 commits

Conversation

pkoutsovasilis
Contributor

I'm opening this draft PR to capture the footprint of the changes required to satisfy elastic/elastic-agent#3848 and to kick off further discussion of elastic/elastic-agent#3847. Please don't merge this PR, as nothing is concrete yet. That said, any comments are more than welcome.

@pkoutsovasilis added the "discuss" (We need to figure this out) label on Nov 30, 2023
@botelastic bot added the triage label on Nov 30, 2023

cla-checker-service bot commented Nov 30, 2023

💚 CLA has been signed

@thbkrkr added the ">feature" (Adds or discusses adding a feature to the product) label on Nov 30, 2023
@botelastic bot removed the triage label on Nov 30, 2023
@joshdover

Overall this is looking great.

@joshdover left a comment

Tested and found the following issues on my cluster:

  • Many errors from metricbeat in the form of Error fetching data for metricset kubernetes.system: error doing HTTP request to fetch 'system' Metricset data: error making http request: Get \"https://jophia-01:10250/stats/summary\": lookup jophia-01 on 10.152.183.10:53: no such host
  • host.name on logs-elastic_agent.* was being reported as the pod name instead of the node name
  • No kubectl or kube-state metrics were being ingested

I added these to the pod spec for the daemonset agent and it resolved most of the issues above:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
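
For context, a rough sketch of where these two fields would end up, assuming the chart renders an ECK Agent resource with a daemonSet.podTemplate (the resource name and version are illustrative; hostNetwork and dnsPolicy are standard Kubernetes PodSpec fields):

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: agent-pernode            # illustrative name for the per-node (DaemonSet) agent
spec:
  version: 8.13.2                # illustrative version
  daemonSet:
    podTemplate:
      spec:
        hostNetwork: true                    # share the node's network namespace so node hostnames (e.g. jophia-01) resolve
        dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS lookups working while hostNetwork is enabled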

What remains after that fix:

  • No container logs are ingested - it seems the /var/log volume mounts weren't applied due to a bug in the if condition. I've included a fix for that in the PR linked below. Even with that fix, though, I'm still not getting any files mounted under /var/log/containers, for a reason I can't yet figure out.
  • System metrics are missing, but I think this just isn't in the helm chart yet.

Here's a PR with some of the fixes: #7363

@gizas

gizas commented Dec 5, 2023

@pkoutsovasilis really nice start here. For me everything worked right away. Adding some general comments here for further discussion:

- CCing https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/values.yaml as an example. Shall we start using this naming convention and schema from day one? I know it's a big change at this point, but it may be worth the effort in order to be compliant with OTel from the start.

@pkoutsovasilis changed the title from "feat: (#3848) POC implementation of eck-integrations Helm chart" to "feat: POC implementation of eck-integrations Helm chart" on Dec 5, 2023

### Installing Kubernetes ECK Integration

// TODO(panosk)

Some possible examples to put here:

Suggested change (replacing the "// TODO(panosk)" placeholder):

To enable all Kubernetes monitoring capabilities for metrics and logs:

helm install k8s-monitoring elastic/eck-integrations -n elastic-system --create-namespace \
  --set elasticsearch.host=https://elasticsearch:9200,elasticsearch.api_key=XXXX \
  --set kubernetes.enabled=true

To only enable logs:

helm install k8s-monitoring elastic/eck-integrations -n elastic-system --create-namespace \
  --set elasticsearch.host=https://elasticsearch:9200,elasticsearch.api_key=XXXX \
  --set kubernetes.enabled=true,kubernetes.metrics.enabled=false

To enable Cloud Defend:

helm install k8s-monitoring elastic/eck-integrations -n elastic-system --create-namespace \
  --set elasticsearch.host=https://elasticsearch:9200,elasticsearch.api_key=XXXX \
  --set cloudDefend.enabled=true

Collaborator

I think these examples need to mention the secret or the elasticsearchRef setup required.
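
For illustration, a minimal sketch of the kind of setup the examples above leave implicit, assuming the ECK convention for referencing an external Elasticsearch cluster via a secret (the url/username/password keys follow that convention; the secret name es-ref-secret and the namespace are illustrative and should match whatever the chart's elasticsearchRef values point at):

kubectl create secret generic es-ref-secret -n elastic-system \
  --from-literal=url=https://elasticsearch:9200 \
  --from-literal=username=elastic \
  --from-literal=password=changeme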

deploy/eck-integrations/values.yaml (outdated review thread, resolved)
Member

@cmacknz left a comment

I tried this following the multiple integration example and it worked as expected.

I think it would be nice to have an example of how to set up an equivalent set of Fleet-managed agents with the separate DaemonSet and Deployment, with the initial agent container configuration overridden (e.g. with leader election disabled everywhere). Perhaps this is intended as a follow-up, but users regularly ask for this and it would have simplified several recent support cases.

The only trouble I ran into was when experimenting with the es-ref-secret. If you omit the expected values (for example trying to use the not yet implemented api-key instead of password) the operator doesn't create anything and there's no obvious error.

You can figure out what happened by looking at the operator logs, but users coming from the current deployment pattern aren't going to expect this or know to look for it. Is there a way to make these failures more immediately obvious? Or is this something we have to solve with documentation?

For example this is the error message the controller gave me, which was very clear, but you have to know to look at the controller logs to get it.

{"log.level":"error","@timestamp":"2024-04-05T19:23:47.829Z","log.logger":"manager.eck-operator","message":"Reconciler error","service.version":"2.12.1+a56215f7","service.type":"eck","ecs.version":"1.4.0","controller":"agent-es-association-controller","object":{"name":"agent-pernode","namespace":"default"},"namespace":"default","name":"agent-pernode","reconcileID":"5ee3a9cb-cafb-4aaa-af40-d8a2e221711f","error":"username secret key doesn't exist in secret es-ref-secret","errorCauses":[{"error":"username secret key doesn't exist in secret es-ref-secret"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

Comment on lines +490 to +491
kubernetes_leaderelection:
enabled: false
Member

This works to disable leader election properly via the agent CRD's config section? Does this work to override the initial configurations for Fleet managed agents as well?

If so it is significantly easier than the hoops users are jumping through to do this without ECK today, see https://www.elastic.co/guide/en/fleet/current/advanced-kubernetes-managed-by-fleet.html.
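
If it helps, a minimal sketch of what that override looks like when written directly in the Agent resource's config section, assuming the kubernetes_leaderelection key sits under the standalone agent's providers block as in the suggested values above (the chart itself goes through a generated configRef secret, per the reply below; resource name and version are illustrative):

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: agent-pernode    # illustrative
spec:
  version: 8.13.2        # illustrative
  config:
    providers:
      kubernetes_leaderelection:
        enabled: false   # no leader election needed when a dedicated Deployment handles cluster-wide datasets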

Contributor Author

Essentially yes, via configRef, which is a secret that the chart populates accordingly; the operator then parses it and constructs the agent.yml properly. But I think we mean the same thing here 🙂 I'm not fully aware of how something similar could look for Fleet-managed agents, but we can definitely have a look.

Member

What I hope happens is the same mechanism overrides the initial configuration of the agent to seed it with the providers disabled, and then Fleet takes over from that starting base configuration.

The providers can't be configured from Fleet right now so this is a pain point.

Member

Also, with the Helm chart being able to create both a Deployment and a DaemonSet, I don't think we need leader election at all anymore, so it should be disabled everywhere by default (pending a closer review of the configurations to make sure we did this right).

The only reason leader election exists is so you can just deploy a DaemonSet and emulate having a dedicated deployment to simplify the configuration needed (AFAIK).

@pkoutsovasilis
Contributor Author

> [@cmacknz's review comment above, quoted in full]

Thanks for trying this out @cmacknz. The case you stumbled upon comes from the operator itself. You are right that, in terms of plain k8s objects, nothing appears to be created, but the Agent CRDs (which is what we deploy here) should be there, and their status should indicate that something is happening but not becoming ready. With that said, we will of course fortify this through documentation and look into how we could make such cases more "visible".

@cmacknz
Member

cmacknz commented Apr 9, 2024

Agent CRDs (which is what we deploy here) should be there, and their status should indicate that something is happening but not becoming ready.

Yes, I actually forgot to check the Agent CRDs and was just doing kubectl get pods, since working with ECK isn't second nature to me yet.
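
For anyone else debugging this, a quick sketch of commands that would surface that state without going to the operator logs (the resource group and names are taken from the error message above; exact output columns depend on the ECK version):

kubectl get agents.agent.k8s.elastic.co -n default
kubectl describe agents.agent.k8s.elastic.co agent-pernode -n default   # the status (and any events) should point at the es-ref-secret problem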

@pkoutsovasilis force-pushed the pkoutsovasilis/eck_integrations branch 2 times, most recently from 4daedf8 to 41fbe2c on April 10, 2024
@pkoutsovasilis changed the title from "feat: POC implementation of eck-integrations Helm chart" to "feat: implementation of integrations Helm chart" on Apr 10, 2024
@pkoutsovasilis changed the title from "feat: implementation of integrations Helm chart" to "feat(helm): implementation of integrations chart" on Apr 10, 2024
@pkoutsovasilis force-pushed the pkoutsovasilis/eck_integrations branch from 41fbe2c to ec4db68 on April 10, 2024
@pkoutsovasilis force-pushed the pkoutsovasilis/eck_integrations branch from ec4db68 to b2da0a9 on April 18, 2024
@pkoutsovasilis marked this pull request as ready for review on April 18, 2024
Collaborator

@pebrc left a comment

I was not able to review everything, as this is a big PR, but what I tested looks good and worked 👍 including the e2e test. We have a feature freeze coming up at the end of this month, so if we want to ship this in the next ECK version we need to get it into a mergeable state soon. There are a few open TODOs, mostly around the READMEs etc., that we should address.

@@ -0,0 +1,493 @@
## @section elasticsearchRef define outputs of elasticsearch type and connection parameters
Collaborator

I am not familiar with this syntax, what is it?

Contributor Author

This was inspired by the Bitnami charts, e.g. this one. In theory, having such comment metadata should help in auto-generating the documentation for the values.yaml. That said, I am not sure it adds value for us here.
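
For context, a rough sketch of the Bitnami-style comment metadata referenced here, as consumed by their readme-generator-for-helm tooling (the parameter names and values below are purely illustrative, not this chart's actual values.yaml):

## @section elasticsearchRef define outputs of elasticsearch type and connection parameters
##

## @param elasticsearch.host Elasticsearch endpoint to ship data to
## @param elasticsearch.api_key API key used to authenticate against Elasticsearch
##
elasticsearch:
  host: ""
  api_key: ""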

deploy/eck-integrations/values.yaml (outdated review thread, resolved)
test/e2e/helm/integrations/builder.go (outdated review thread, resolved)
test/e2e/helm/integrations/chart_test.go (outdated review thread, resolved)
@pkoutsovasilis force-pushed the pkoutsovasilis/eck_integrations branch from 8f2c159 to 80c62c5 on April 26, 2024
@thbkrkr
Contributor

thbkrkr commented Apr 26, 2024

buildkite test this -f p=gke,t=TestIntegrationsChartDefaultKubernetes,s=8.13.2

@pkoutsovasilis force-pushed the pkoutsovasilis/eck_integrations branch from b887313 to 6d886b2 on May 21, 2024
@pkoutsovasilis
Contributor Author

pkoutsovasilis commented Aug 16, 2024

Sorry for keeping this open for so long, but based on some internal discussions it was decided that this Helm chart will find its home in the elastic-agent repo. The ECK operator will still be supported by the "new" Helm chart there, but there will also be an option to deploy elastic-agent as plain k8s objects. Thank you for all the help @thbkrkr and @pebrc 🙏

Labels: discuss (We need to figure this out), >feature (Adds or discusses adding a feature to the product), :helm-charts