@swiatekm swiatekm commented Jun 26, 2025

What does this PR do?

Fix a bug where Elastic Agent would enter a failed state if components were running as beats receivers and the otel collector also had an extension defined via hybrid mode.

The result would be the following agent status:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (FAILED) OTel manager failed: pipeline status id extensions is not a pipeline
   ├─ extensions
   │  ├─ status: StatusOK
   │  └─ extension:health_check/v2
   │     └─ status: StatusOK

Why is it important?

We should report status for otel extensions correctly.

Checklist

  • [x] I have read and understood the pull request guidelines of this project.
  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Build agent locally and run it with the following configuration:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    username: "elastic"
    password: ""

agent:
  monitoring:
    enabled: true
    _runtime_experimental: otel

inputs: []

# Embedded Otel configuration
receivers:
  nop:
exporters:
  nop:
extensions:
  health_check/v2:
service:
  extensions: [health_check/v2]
  pipelines:
    logs:
      receivers: [nop]
      exporters: [nop]

Then run elastic-agent status. You should see the extension status and the statuses of monitoring components:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   ├─ beat/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  └─ beat/metrics-monitoring
   │     └─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   ├─ filestream-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  └─ filestream-monitoring
   │     └─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   ├─ http/metrics-monitoring
   │  ├─ status: (DEGRADED) DEGRADED
   │  └─ http/metrics-monitoring
   │     └─ status: (DEGRADED) Elasticsearch request failed: dial tcp 127.0.0.1:9200: connect: connection refused
   ├─ extensions
   │  ├─ status: StatusOK
   │  └─ extension:health_check/v2
   │     └─ status: StatusOK
   └─ pipeline:logs
      ├─ status: StatusOK
      ├─ exporter:nop
      │  └─ status: StatusOK
      └─ receiver:nop
         └─ status: StatusOK

mergify bot commented Jun 26, 2025

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@swiatekm swiatekm added the skip-changelog and backport-8.19 (Automated backport to the 8.19 branch) labels Jun 26, 2025
@swiatekm swiatekm marked this pull request as ready for review June 26, 2025 14:54
@swiatekm swiatekm requested a review from a team as a code owner June 26, 2025 14:54
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Jun 26, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm force-pushed the fix/otel-extension-status branch from b5a7833 to 8e92e8a June 27, 2025 12:00
@swiatekm swiatekm requested a review from pkoutsovasilis June 27, 2025 13:53
@pkoutsovasilis pkoutsovasilis left a comment

I merged main and ran this locally in subprocess mode, and I still see this behaviour every time I "kill" the collector subprocess:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   └─ extensions
      ├─ status: StatusOK
      ├─ extension:healthcheckv2/0018a7f6-cbea-4a2f-b314-9e75eb9aa55f
      │  └─ status: StatusOK
      ├─ extension:healthcheckv2/5c5f54ad-54f4-4ed9-bf9d-a7e8a8b3a161
      │  └─ status: StatusOK
      └─ extension:healthcheckv2/6c2bbefd-f0cf-4a76-aba5-5df3412d0185
         └─ status: StatusOK

Is the above something that this PR should handle? I am still not quite sure why I even see extensions in the elastic-agent status 😄

swiatekm commented Jul 2, 2025

And this only happens in subprocess mode?

@pkoutsovasilis
Contributor

Is there any other case where we actively have an extension loaded at the moment? IIUC, this only concerns the hybrid Elastic Agent; in embedded mode no extensions are loaded at the moment, right?

swiatekm commented Jul 4, 2025

@pkoutsovasilis I found the root cause of that bug. It has nothing to do with status reporting: we're actually running multiple extensions in the subprocess collector. The reason is that we don't make a copy of the collector config at any point, so we keep adding new healthcheckv2 extensions every time the collector is restarted.

#8529 fixes this by accident because it always creates its own config file. Up to you if we want to fix it separately first.

In any case, this PR fixes a different unrelated issue, and we shouldn't block it on that one.
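
The accumulation described above can be reproduced in isolation. The following is a minimal sketch, assuming the config is held as a nested `map[string]any` and that a uniquely named `healthcheckv2` extension is injected before each start. All function names are hypothetical, and a random hex string stands in for the UUID; this is not the agent's actual code.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// deepCopy clones nested maps and slices so the result can be mutated
// without touching the original.
func deepCopy(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			out[k] = deepCopy(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = deepCopy(val)
		}
		return out
	default:
		return v
	}
}

// randomID returns a random hex string standing in for a UUID.
func randomID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// injectHealthCheck adds a uniquely named healthcheckv2 extension to cfg
// in place (hypothetical helper).
func injectHealthCheck(cfg map[string]any) {
	exts, _ := cfg["extensions"].(map[string]any)
	if exts == nil {
		exts = map[string]any{}
		cfg["extensions"] = exts
	}
	exts["healthcheckv2/"+randomID()] = map[string]any{}
}

// prepareForStart copies the base config before injecting, so the base
// config never accumulates extensions across restarts.
func prepareForStart(base map[string]any) map[string]any {
	cfg := deepCopy(base).(map[string]any)
	injectHealthCheck(cfg)
	return cfg
}

// countExtensions reports how many extensions are defined in cfg.
func countExtensions(cfg map[string]any) int {
	exts, _ := cfg["extensions"].(map[string]any)
	return len(exts)
}

func main() {
	// Simulated restarts that reuse and mutate the same config.
	shared := map[string]any{"extensions": map[string]any{}}
	for i := 0; i < 3; i++ {
		injectHealthCheck(shared)
	}
	fmt.Println("mutating shared config:", countExtensions(shared))

	// Simulated restarts that copy the base config first.
	base := map[string]any{"extensions": map[string]any{}}
	var last map[string]any
	for i := 0; i < 3; i++ {
		last = prepareForStart(base)
	}
	fmt.Println("copying per restart:", countExtensions(last), "base:", countExtensions(base))
}
```

In the first loop each simulated restart leaves one more extension behind, which matches the three `healthcheckv2` entries in the status output earlier in this thread. In the second loop the base config is never modified.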

@pkoutsovasilis
Contributor

OK, now I see: in subprocess mode we mutate the cfg, and it is always the same cfg that gets mutated. If we didn't generate a random UUID for the healthcheck extension, that would be fine as well, I guess. Agreed, let's deal with that separately.
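
The alternative mentioned here, a stable ID instead of a random UUID, makes the injection idempotent even when the same config is mutated repeatedly. A minimal sketch, assuming a nested `map[string]any` config; the key `healthcheckv2/agent` and the function name are made up for illustration:

```go
package main

import "fmt"

// healthCheckKey is a hypothetical fixed component ID used instead of a
// random UUID.
const healthCheckKey = "healthcheckv2/agent"

// injectHealthCheckFixed adds the extension under a stable key, so
// repeated calls overwrite the same entry rather than accumulate.
func injectHealthCheckFixed(cfg map[string]any) {
	exts, ok := cfg["extensions"].(map[string]any)
	if !ok {
		exts = map[string]any{}
		cfg["extensions"] = exts
	}
	exts[healthCheckKey] = map[string]any{}
}

func main() {
	cfg := map[string]any{}
	for i := 0; i < 3; i++ {
		injectHealthCheckFixed(cfg) // simulated restart on the same config
	}
	fmt.Println(len(cfg["extensions"].(map[string]any)))
}
```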

@swiatekm swiatekm merged commit de39cae into main Jul 4, 2025
19 checks passed
@swiatekm swiatekm deleted the fix/otel-extension-status branch July 4, 2025 13:20
mergify bot pushed a commit that referenced this pull request Jul 4, 2025
* Fix otel extension status reporting

* Explicitly handle errors from otel status id parsing

(cherry picked from commit de39cae)
swiatekm added a commit that referenced this pull request Jul 4, 2025
* Fix otel extension status reporting

* Explicitly handle errors from otel status id parsing

(cherry picked from commit de39cae)

Co-authored-by: Mikołaj Świątek <mail@mikolajswiatek.com>
v1v added a commit that referenced this pull request Jul 6, 2025
* upstream: (39 commits)
  Fix otel extension status reporting (#8696)
  Refactor user change on service (#8347)
  [AutoOps] Add `autoops-es.yml` to Packages (#8728)
  EDOT collector: include the forward connector. (#8753)
  Revert "ci: pin elastic-agent version (#8736)" (#8754)
  bk: retry Start ESS stack for integration tests (#8553)
  Re-enable TestStandaloneUpgradeRollbackOnRestarts on windows (#8718)
  removed reviewers from dependabot.yml (#8709)
  Pass `--header` enrollment option to fleet-server (#8071)
  Add ability for local output configuration to add to policy configuration (#8766)
  Bump up github.com/go-viper/mapstructure/v2 dependency (#8764)
  [Synthetics] Upgrade node to latest lts v20 (#8712)
  [CI] BK Vault plugin for EC access (#8377)
  feat: singleTest mage target for each integration test package (#8691)
  ci: always include 8.19 LTS release branch in snapshots of test versions (#8761)
  build(deps): bump github.com/elastic/mito from 1.19.0 to 1.20.0 (#8755)
  chore: fix elastic-agent helm chart examples (#8765)
  feat: support onboarding-id for kubernetes (#8692)
  [main][Automation] Bump VM Image version to 1751072471 (#8734)
  ci: revert deployment_csp_configuration.yaml to create_deployment_csp_configuration.yaml (#8746)
  ...
@khushijain21 khushijain21 added the backport-9.1 (Automated backport to the 9.1 branch) label Sep 29, 2025
mergify bot pushed a commit that referenced this pull request Sep 29, 2025
* Fix otel extension status reporting

* Explicitly handle errors from otel status id parsing

(cherry picked from commit de39cae)
khushijain21 pushed a commit that referenced this pull request Sep 29, 2025
* Fix otel extension status reporting

* Explicitly handle errors from otel status id parsing

(cherry picked from commit de39cae)

Co-authored-by: Mikołaj Świątek <mail@mikolajswiatek.com>
Labels

  • backport-8.19: Automated backport to the 8.19 branch
  • backport-9.1: Automated backport to the 9.1 branch
  • skip-changelog
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

5 participants