monitoring: update apm-server metrics collection to avoid conflicts #17512

Merged

Conversation

isaacaflores2
Contributor

Motivation/summary

Currently apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.dynamic_service_groups are never reported together.

Local testing shows the cause: both sets of metrics use the same namespace (metric name prefix), apm-server.sampling, but they are created using different instances of a Meter.

The related monitoring func calls addAPMServerMetrics repeatedly, once per scoped set of metrics. Metric names that share a prefix but come from different scopes end up overwriting each other; the exact mechanism has not been pinned down. This PR instead collects all "apm-server" metrics and adds them to the snapshot once. An alternative would be to update elastic-agent-lib so that metrics cannot overwrite each other.
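
To illustrate the failure mode described above, here is a minimal, self-contained Go sketch. It is not the apm-server or elastic-agent-lib code: the helper names (setNested, buildTree, reportPerScope, reportOnce, lookup) are hypothetical, and the overwrite mechanism it models (each scope replacing the whole top-level subtree in the snapshot) is only an assumption about one way the collision could arise.

```go
package main

import (
	"fmt"
	"strings"
)

// setNested stores value under a dotted metric name, creating intermediate
// maps as needed ("a.b.c" becomes {"a": {"b": {"c": value}}}).
func setNested(dst map[string]any, name string, value int64) {
	parts := strings.Split(name, ".")
	cur := dst
	for _, p := range parts[:len(parts)-1] {
		next, ok := cur[p].(map[string]any)
		if !ok {
			next = map[string]any{}
			cur[p] = next
		}
		cur = next
	}
	cur[parts[len(parts)-1]] = value
}

// buildTree converts a flat set of dotted metric names into a nested map.
func buildTree(metrics map[string]int64) map[string]any {
	tree := map[string]any{}
	for name, v := range metrics {
		setNested(tree, name, v)
	}
	return tree
}

// reportPerScope models the assumed problem: every scope builds its own tree
// and assigns it to the same top-level key, replacing the previous subtree.
func reportPerScope(scopes []map[string]int64) map[string]any {
	snapshot := map[string]any{}
	for _, scope := range scopes {
		for k, v := range buildTree(scope) {
			snapshot[k] = v // the whole "apm-server" subtree is replaced
		}
	}
	return snapshot
}

// reportOnce models the approach taken here: gather the metrics from all
// scopes first, then build and report the tree a single time.
func reportOnce(scopes []map[string]int64) map[string]any {
	all := map[string]int64{}
	for _, scope := range scopes {
		for name, v := range scope {
			all[name] = v
		}
	}
	return buildTree(all)
}

// lookup walks a nested map along a dotted path and returns the leaf value.
func lookup(tree map[string]any, path string) (int64, bool) {
	parts := strings.Split(path, ".")
	cur := tree
	for _, p := range parts[:len(parts)-1] {
		next, ok := cur[p].(map[string]any)
		if !ok {
			return 0, false
		}
		cur = next
	}
	v, ok := cur[parts[len(parts)-1]].(int64)
	return v, ok
}

func main() {
	// Two scopes (different Meter instances) sharing the same name prefix.
	scopes := []map[string]int64{
		{"apm-server.sampling.tail.dynamic_service_groups": 1},
		{"apm-server.sampling.tail.storage.lsm_size": 5407},
	}
	names := []string{
		"apm-server.sampling.tail.dynamic_service_groups",
		"apm-server.sampling.tail.storage.lsm_size",
	}
	for _, name := range names {
		_, perScope := lookup(reportPerScope(scopes), name)
		_, once := lookup(reportOnce(scopes), name)
		fmt.Printf("%s: per-scope=%t once=%t\n", name, perScope, once)
	}
}
```

In this model the per-scope variant keeps only the metrics of the last scope, so dynamic_service_groups is lost while lsm_size survives, whereas the collect-once variant reports both.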

Checklist

  • View individual metrics documents to confirm reported metrics are correct

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

  1. Unit tests cover the expected behavior

Manual test

  1. Run apm-server with TBS enabled
  2. Send data to the server: ./sendotlp -insecure -endpoint=http://localhost:8200 -secret-token=<token>
  3. View metrics via the stats endpoint (http://localhost:5066/stats?pretty). storage should be visible along with the other metrics under apm-server.sampling.tail:
"sampling": {
  "tail": {
    "dynamic_service_groups": 0,
    "events": {
      "processed": 3,
      "sampled": 6,
      "stored": 3
    },
    "storage": {
      "lsm_size": 7431,
      "value_log_size": 0
    }
  }
}
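
The manual check in step 3 could also be scripted. The sketch below is only an illustration: missingPaths and hasPath are hypothetical helpers, and the sample document is the output above wrapped in an "apm-server" object, which is an assumption based on the apm-server.sampling.tail path mentioned in the steps. In practice the bytes would come from the stats endpoint response body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sample mirrors the stats output shown in the manual test, wrapped in an
// "apm-server" object (assumed layout of the full stats document).
const sample = `{
  "apm-server": {
    "sampling": {
      "tail": {
        "dynamic_service_groups": 0,
        "events": {"processed": 3, "sampled": 6, "stored": 3},
        "storage": {"lsm_size": 7431, "value_log_size": 0}
      }
    }
  }
}`

// hasPath reports whether a dotted path exists in a decoded JSON object.
func hasPath(doc map[string]any, path string) bool {
	var cur any = doc
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return false
		}
		cur, ok = m[key]
		if !ok {
			return false
		}
	}
	return true
}

// missingPaths decodes raw JSON and returns the dotted paths that are absent.
func missingPaths(raw []byte, paths []string) ([]string, error) {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	var missing []string
	for _, p := range paths {
		if !hasPath(doc, p) {
			missing = append(missing, p)
		}
	}
	return missing, nil
}

func main() {
	want := []string{
		"apm-server.sampling.tail.dynamic_service_groups",
		"apm-server.sampling.tail.storage.lsm_size",
	}
	missing, err := missingPaths([]byte(sample), want)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("missing metrics:", missing)
}
```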

Related issues

Closes #17342
Alternate approach to #17427

isaacaflores2 requested a review from a team as a code owner July 7, 2025 20:07
isaacaflores2 added labels Jul 7, 2025: backport-9.0 (Automated backport to the 9.0 branch), backport-8.19 (Automated backport to the 8.19 branch), backport-9.1 (Automated backport to the 9.1 branch)
Contributor

mergify bot commented Jul 8, 2025

This pull request now has conflicts. Could you fix it @isaacaflores2? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b monitoring-stats-apm-server upstream/monitoring-stats-apm-server
git merge upstream/main
git push upstream monitoring-stats-apm-server

Contributor

rubvs commented Jul 8, 2025

Quickly tested this manually.

  • Stats before the PR change
"sampling": {
  "tail": {
    "dynamic_service_groups": 1,
    "events": {
      "processed": 3,
      "stored": 3
    }
  }
}
  • Stats after the PR change
"sampling": {
  "tail": {
    "dynamic_service_groups": 1,
    "events": {
      "processed": 3,
      "sampled": 3,
      "stored": 3
    },
    "storage": {
      "lsm_size": 5407,
      "value_log_size": 0
    }
  }
}
  • Cat of APM Server config
apm-server:
  host: "127.0.0.1:8200"

  sampling.tail:
    enabled: true
    policies:
      - sample_rate: 1.0

output.elasticsearch:
  hosts: ["https://216c4d53f99e4e48a556e4cfe72cb680.us-central1.gcp.qa.cld.elstc.co:443"]
  username: "elastic"
  password: "<REDACTED>"

http:
  enabled: true
  host: localhost
  port: 5066
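
The before/after comparison in this comment can be made mechanical by flattening both stats fragments into dotted keys and listing what the change added. This is a rough sketch, not part of the PR: flatten and addedKeys are hypothetical helpers, and the two fragments are the outputs above wrapped in braces so that they parse as JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Stats fragments from before and after the change, wrapped in braces.
const before = `{"sampling": {"tail": {
  "dynamic_service_groups": 1,
  "events": {"processed": 3, "stored": 3}
}}}`

const after = `{"sampling": {"tail": {
  "dynamic_service_groups": 1,
  "events": {"processed": 3, "sampled": 3, "stored": 3},
  "storage": {"lsm_size": 5407, "value_log_size": 0}
}}}`

// flatten records the dotted path of every leaf value under node.
func flatten(prefix string, node any, out map[string]bool) {
	m, ok := node.(map[string]any)
	if !ok {
		out[prefix] = true
		return
	}
	for k, v := range m {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		flatten(key, v, out)
	}
}

// addedKeys returns the sorted leaf paths present in newDoc but not in oldDoc.
func addedKeys(oldDoc, newDoc []byte) ([]string, error) {
	var o, n map[string]any
	if err := json.Unmarshal(oldDoc, &o); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(newDoc, &n); err != nil {
		return nil, err
	}
	oldKeys, newKeys := map[string]bool{}, map[string]bool{}
	flatten("", o, oldKeys)
	flatten("", n, newKeys)
	var added []string
	for k := range newKeys {
		if !oldKeys[k] {
			added = append(added, k)
		}
	}
	sort.Strings(added)
	return added, nil
}

func main() {
	added, err := addedKeys([]byte(before), []byte(after))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, k := range added {
		fmt.Println("added:", k)
	}
}
```

For the two fragments above this lists sampling.tail.events.sampled, sampling.tail.storage.lsm_size, and sampling.tail.storage.value_log_size.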

rubvs previously approved these changes Jul 8, 2025
carsonip previously approved these changes Jul 8, 2025
Member

@carsonip carsonip left a comment

lgtm, thanks for adding the test

isaacaflores2 enabled auto-merge (squash) July 8, 2025 22:54
isaacaflores2 merged commit f1c279b into elastic:main Jul 8, 2025
19 checks passed
mergify bot pushed a commit that referenced this pull request Jul 8, 2025
…17512)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)

# Conflicts:
#	internal/beatcmd/beat_test.go
mergify bot pushed a commit that referenced this pull request Jul 8, 2025
…17512)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)
mergify bot pushed a commit that referenced this pull request Jul 8, 2025
…17512)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)
isaacaflores2 added a commit that referenced this pull request Jul 8, 2025
…17512) (#17527)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)

Co-authored-by: Isaac Flores <34590010+isaacaflores2@users.noreply.github.com>
isaacaflores2 added a commit that referenced this pull request Jul 9, 2025
…ion to avoid conflicts (#17526)

* monitoring: update apm-server metrics collection to avoid conflicts (#17512)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)

* fix merge conflict issues caused by global registries

---------

Co-authored-by: Isaac Flores <34590010+isaacaflores2@users.noreply.github.com>
Co-authored-by: Isaac Flores <isaac.flores@elastic.co>
// register all metrics once
// this prevents metrics with the same prefix in the name
// from different scoped meters from overwriting each other
reportOnKey(v, beatsMetrics)

Great finding!

carsonip self-assigned this Jul 11, 2025
@carsonip
Member

✔️ test-plan-ok

Testing notes

Tested with the 9.0.4 BC on ECH. lsm_size and dynamic_service_groups now appear in the same monitoring metrics document. On 9.0.3 they do not appear together.

mergify bot added a commit that referenced this pull request Jul 14, 2025
…tion to avoid conflicts (#17525)

* monitoring: update apm-server metrics collection to avoid conflicts (#17512)

* monitoring: update apm-server metrics collection to avoid conflicts

* refactor beats monitoring since there are no more global registries

* removed redundant temp slice from apm-server monitoring func

(cherry picked from commit f1c279b)

# Conflicts:
#	internal/beatcmd/beat_test.go

* fix merge conflict in beat_test.go

---------

Co-authored-by: Isaac Flores <34590010+isaacaflores2@users.noreply.github.com>
Co-authored-by: Isaac Flores <isaac.flores@elastic.co>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Labels
backport-8.19 (Automated backport to the 8.19 branch), backport-9.0 (Automated backport to the 9.0 branch), backport-9.1 (Automated backport to the 9.1 branch), test-plan, test-plan-ok, v9.0.4
Development

Successfully merging this pull request may close these issues.

monitoring: TBS metrics are not reported together
5 participants