Fix otel extension status reporting #8696
Conversation
This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Force-pushed from b5a7833 to 8e92e8a.
I merged main and ran this locally in subprocess mode, and I still see this behaviour every time I "kill" the collector subprocess:
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   └─ extensions
      ├─ status: StatusOK
      ├─ extension:healthcheckv2/0018a7f6-cbea-4a2f-b314-9e75eb9aa55f
      │  └─ status: StatusOK
      ├─ extension:healthcheckv2/5c5f54ad-54f4-4ed9-bf9d-a7e8a8b3a161
      │  └─ status: StatusOK
      └─ extension:healthcheckv2/6c2bbefd-f0cf-4a76-aba5-5df3412d0185
         └─ status: StatusOK
Is the above something that this PR should handle? I am still not quite sure why I even see extensions in the elastic-agent status 😄
And this only happens in subprocess mode?
Is there any other case where we actively have an extension loaded at the moment?! IIUC, as this regards only hybrid elastic agent, for the embedded mode no extensions are loaded at the moment, right?
@pkoutsovasilis I found the root cause of that bug. It has nothing to do with status reporting; we're actually running multiple extensions in the subprocess collector. The reason is that we don't make a copy of the collector config at any point, so we keep adding a new healthcheckv2 extension every time the collector is restarted. #8529 fixes this by accident because it always creates its own config file. Up to you if we want to fix it separately first. In any case, this PR fixes a different, unrelated issue, and we shouldn't block it on that one.
okkk now I see, under the subprocess mode we mutate the cfg, but it is always the same cfg that gets mutated. If we didn't generate a random UUID for the healthcheck extension, that would be fine as well I guess. Agreed, let's deal with that separately
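For readers following along, here is a minimal Go sketch of the failure mode described above. It is not the agent's code: `collectorConfig`, `injectHealthcheck`, `restartShared`, and `restartCopied` are hypothetical names, and the config is reduced to a list of extension IDs. It only illustrates why mutating one shared config on every restart accumulates extensions, while mutating a copy does not.

```go
package main

import "fmt"

// collectorConfig is a hypothetical, heavily simplified stand-in for the
// collector configuration; it only tracks extension IDs.
type collectorConfig struct {
	Extensions []string
}

// clone returns a deep copy, so the caller can mutate the result without
// touching the original config.
func (c *collectorConfig) clone() *collectorConfig {
	return &collectorConfig{Extensions: append([]string(nil), c.Extensions...)}
}

// injectHealthcheck appends a healthcheck extension with the given ID to cfg
// in place (hypothetical helper; the ID stands in for a random UUID).
func injectHealthcheck(cfg *collectorConfig, id string) {
	cfg.Extensions = append(cfg.Extensions, "healthcheckv2/"+id)
}

// restartShared simulates restarts that all mutate the same config and
// returns how many extensions the config ends up with.
func restartShared(base *collectorConfig, restarts int) int {
	for i := 0; i < restarts; i++ {
		injectHealthcheck(base, fmt.Sprintf("id-%d", i))
	}
	return len(base.Extensions)
}

// restartCopied simulates restarts that each mutate a fresh copy of the base
// config and returns the extension count seen by the last restart.
func restartCopied(base *collectorConfig, restarts int) int {
	n := len(base.Extensions)
	for i := 0; i < restarts; i++ {
		cfg := base.clone()
		injectHealthcheck(cfg, fmt.Sprintf("id-%d", i))
		n = len(cfg.Extensions)
	}
	return n
}

func main() {
	fmt.Println(restartShared(&collectorConfig{}, 3)) // prints 3
	fmt.Println(restartCopied(&collectorConfig{}, 3)) // prints 1
}
```

With a fixed extension ID instead of a fresh one per restart, the shared-config variant could overwrite rather than accumulate, which matches the remark about the random UUID.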
* Fix otel extension status reporting
* Explicitly handle errors from otel status id parsing
(cherry picked from commit de39cae)
* upstream: (39 commits)
  Fix otel extension status reporting (#8696)
  Refactor user change on service (#8347)
  [AutoOps] Add `autoops-es.yml` to Packages (#8728)
  EDOT collector: include the forward connector. (#8753)
  Revert "ci: pin elastic-agent version (#8736)" (#8754)
  bk: retry Start ESS stack for integration tests (#8553)
  Re-enable TestStandaloneUpgradeRollbackOnRestarts on windows (#8718)
  removed reviewers from dependabot.yml (#8709)
  Pass `--header` enrollment option to fleet-server (#8071)
  Add ability for local output configuration to add to policy configuration (#8766)
  Bump up github.com/go-viper/mapstructure/v2 dependency (#8764)
  [Synthetics] Upgrade node to latest lts v20 (#8712)
  [CI] BK Vault plugin for EC access (#8377)
  feat: singleTest mage target for each integration test package (#8691)
  ci: always include 8.19 LTS release branch in snapshots of test versions (#8761)
  build(deps): bump github.com/elastic/mito from 1.19.0 to 1.20.0 (#8755)
  chore: fix elastic-agent helm chart examples (#8765)
  feat: support onboarding-id for kubernetes (#8692)
  [main][Automation] Bump VM Image version to 1751072471 (#8734)
  ci: revert deployment_csp_configuration.yaml to create_deployment_csp_configuration.yaml (#8746)
  ...

What does this PR do?
Fix a bug where Elastic Agent would enter a failed state if components were running as beats receivers and the otel collector also had an extension defined via hybrid mode.
The result would be the following agent status:
Why is it important?
We should report status for otel extensions correctly.
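The status output quoted in the conversation uses IDs of the form `extension:healthcheckv2/<uuid>`. As an illustration only, the sketch below shows one way to parse such an ID and return an explicit error for malformed input. It is not the code from this PR: `statusID`, `parseStatusID`, and `errInvalidStatusID` are hypothetical names, and the `kind:type/name` layout is an assumption inferred from that output.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// statusID is a hypothetical parsed form of an ID such as
// "extension:healthcheckv2/0018a7f6-cbea-4a2f-b314-9e75eb9aa55f".
type statusID struct {
	Kind string // component kind, e.g. "extension"
	Type string // component type, e.g. "healthcheckv2"
	Name string // optional instance name; empty if absent
}

// errInvalidStatusID is returned (wrapped) for IDs that do not match the
// assumed "kind:type[/name]" layout.
var errInvalidStatusID = errors.New("invalid status id")

// parseStatusID splits s into kind, type, and optional name. Malformed input
// yields an error instead of being silently treated as some default kind.
func parseStatusID(s string) (statusID, error) {
	kind, rest, ok := strings.Cut(s, ":")
	if !ok || kind == "" || rest == "" {
		return statusID{}, fmt.Errorf("%w: %q", errInvalidStatusID, s)
	}
	typ, name, _ := strings.Cut(rest, "/")
	if typ == "" {
		return statusID{}, fmt.Errorf("%w: %q", errInvalidStatusID, s)
	}
	return statusID{Kind: kind, Type: typ, Name: name}, nil
}

func main() {
	id, err := parseStatusID("extension:healthcheckv2/0018a7f6-cbea-4a2f-b314-9e75eb9aa55f")
	fmt.Println(id.Kind, id.Type, err == nil)

	_, err = parseStatusID("extensions")
	fmt.Println(errors.Is(err, errInvalidStatusID))
}
```

Returning an error lets the caller decide whether to skip or surface an unrecognised ID, rather than letting a parsing failure turn into an incorrect component state.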
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test
How to test this PR locally
Build agent locally and run it with the following configuration:
Then run elastic-agent status. You should see the extension status and the statuses of monitoring components:
Related issues