Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "cluster" scope in Metricbeat elasticsearch module #18547

Merged
merged 26 commits into from
Aug 3, 2020
Merged

Support "cluster" scope in Metricbeat elasticsearch module #18547

merged 26 commits into from
Aug 3, 2020

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented May 14, 2020

What does this PR do?

This PR introduces a new scope setting for the elasticsearch Metricbeat module. This setting can take one of two values:

  • node (default): indicates that each item in the hosts list points to a distinct Elasticsearch node in a cluster, or
  • cluster: indicates that each item in the hosts lists points to a single endpoint for a distinct Elasticsearch cluster (e.g. a load-balancing proxy fronting the cluster).

Why is it important?

Sometimes it may not be possible for Metricbeat to reach individual Elasticsearch nodes. It might only have access to a single endpoint that fronts the entire Elasticsearch cluster.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Manual testing

Testing the new scope: cluster functionality introduced in this PR requires an Elasticsearch cluster with multiple nodes. The easiest way is probably to spin up an Elastic Cloud cluster.

  1. Enable the elasticsearch-xpack module.

    ./metricbeat modules enable elasticsearch-xpack
    
  2. Configure the module (./modules.d/elasticsearch-xpack.yml) like so:

    - module: elasticsearch
      xpack.enabled: true
      period: 10s
      hosts:
        - XXXXX
      scope: cluster
      username: XXXXX
      password: XXXXX
    

    Obviously, replace the XXXXX placeholders with your own.

    The key is that there should be exactly one item under hosts, pointing to single endpoint for your cluster. This could be the Elasticsearch endpoint obtained from Elastic Cloud or the address of a single node in your on-prem/local Elasticsearch cluster.

  3. Configure Metricbeat to send the collected stats to an Elasticsearch cluster. This will act as your Monitoring Cluster. It could be the same cluster you're using to collect stats from (as configured in your elasticsearch-xpack module configuration above) or it could be an entirely separate cluster.

    metricbeat.config.modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
    
    output.elasticsearch:
      enabled: true
      hosts:
        - XXXXXX
      username: XXXXXX
      password: XXXXXX
    
  4. Start Metricbeat.

    ./metricbeat -e
    
  5. Let Metricbeat run for ~30 seconds. Make sure there are no errors in the Metricbeat logs.

  6. Perform the following query against your Monitoring Cluster.

    POST .monitoring-es-7-mb-*/_search
    {
      "size": 0,
      "aggs": {
        "by_cluster_uuid": {
          "terms": {
            "field": "cluster_uuid"
          }
        },
        "cluster_stats": {
          "filter": {
            "term": {
              "type": "cluster_stats"
            }
          },
          "aggs": {
            "by_period": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "10s"
              }
            }
          }
        },
        "node_stats": {
          "filter": {
            "term": {
              "type": "node_stats"
            }
          },
          "aggs": {
            "by_period": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "10s"
              }
            }
          }
        }
      }
    }
    

    This query checks 3 things:

    1. The aggs.by_cluster_uuid aggregation checks that we are only seeing data for a single cluster and that all documents contain that single cluster UUID.
      • Verify that aggregations.by_cluster_uuid.buckets only contains a single bucket, for the single cluster UUID of the Elasticsearch cluster you are monitoring.
      • Verify that aggregations.by_cluster_uuid.buckets[0].doc_count is the same as the hits.total.value.
    2. The aggs.cluster_stats aggregation checks that type: cluster_stats documents are only indexed once every collection period (10 seconds) and that there is at most one document per collection period. We are using type: cluster_stats here as an example; the same should be true for any types other than type: node_stats.
      • Verify that aggregations.cluster_stats.by_period.buckets have several buckets, each corresponding to a time period. Each buckets should be 10 seconds "wide". Within each bucket, verify that doc_count is <= 1.
    3. The aggs.node_stats aggregation checks that type: node_stats documents are only indexed once every collection period (10 seconds) and that there are at most N documents per collection period, where N is the number of nodes in the cluster you are monitoring.
      • Verify that aggregations.node_stats.by_period.buckets have several buckets, each corresponding to a time period. Each buckets should be 10 seconds "wide". Within each bucket, verify that doc_count is <= N, where N is the number of nodes in the cluster you are monitoring.

Related issues

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 14, 2020
@elasticmachine
Copy link
Collaborator

elasticmachine commented May 14, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

Test stats 🧪

Test Results
Failed 0
Passed 3975
Skipped 935
Total 4910

@andresrc andresrc added the Team:Integrations Label for the Integrations team label May 15, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 15, 2020
Comment on lines 59 to 65
// If we're talking to a set of ES nodes directly, only collect stats from the master node so
// we don't collect the same stats from every node and end up duplicating them.
if m.HostsMode == elasticsearch.HostsModeNode {
isMaster, err := elasticsearch.IsMaster(m.HTTP, m.GetServiceURI())
if err != nil {
return errors.Wrap(err, "error determining if connected Elasticsearch node is master")
}

// Not master, no event sent
if !isMaster {
m.Logger().Debug("trying to fetch ccr stats from a non-master node")
return nil
// Not master, no event sent
if !isMaster {
m.Logger().Debug("trying to fetch ccr stats from a non-master node")
return nil
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic repeats across the board, I wonder if it should go into a helper function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored in 348c9b9.

// it will provid the data for multiple nodes. This will mean the detection of the
// master node will not be accurate anymore as often in these cases a proxy is in front
// of ES and it's not know if the request will be routed to the same node as before.
// TODO: call GET _nodes/_master?filter_path=nodes.*.name to figure out the ID of the master node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this TODO is done by GetMasterNodeID

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 34cd849.

# a distinct Elasticsearch cluster, e.g. a load-balancer fronting the cluster.
# Set to node (default) to treat each item in the hosts list as an individual
# node in the Elasticsearch cluster.
#hosts_mode: node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if mode is enough, in other places we have used scope too. Anyway I don't have a strong opinion here

Copy link
Contributor Author

@ycombinator ycombinator May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reserving mode for this change: #9424 (comment).

Using scope here instead of hosts_mode sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in f2a5618.

Copy link
Contributor

@exekias exekias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great Shaunak! thank you for working on this

@ycombinator ycombinator requested a review from dedemorton May 15, 2020 11:36
@ycombinator
Copy link
Contributor Author

I tested the changes in this PR for heap usage w.r.t # of nodes in the ES cluster being monitored. There was no significant difference in heap usage, no matter how many nodes were in the the ES cluster being monitored. Details below.

Setup

  1. Spun up an Elastic Cloud deployment, initially with just 1 node in 1 zone. Pre-allocated 64GB RAM since that would be necessary later to scale up the number of nodes.

  2. Modified the elasticsearch module code to only load the index metricset.

  3. Enabled the elasticsearch-xpack module and configured it like so:

- module: elasticsearch
  xpack.enabled: true
  period: 1s
  hosts:
  - https://XXXXXX.us-central1.gcp.cloud.es.io:9243
  username: elastic
  password: XXXXXXXXXX
  scope: cluster
  1. Configured Metricbeat like so:
metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

output.console.enabled: true
  1. Ran Metricbeat with the HTTP profiling endpoint enabled.
./metricbeat -c metricbeat.test.yml --httpprof localhost:9000 -e
  1. Collected Metricbeat heap usage every 200ms.
while true; do curl -s 'http://localhost:9000/debug/vars' | jq '.["beat.memstats.memory_alloc"]'; sleep 0.2; done
  1. Stopped collecting heap usage and Metricbeat. Recorded usage in Google sheet (see below).

  2. Incremented number of nodes in ES cluster by 1, repeated above steps.

Results

https://docs.google.com/spreadsheets/d/1Boi0uw846OSY3vnGqC604Dj8Lh28lD3G9OsKFgPGz8I/edit?usp=sharing

@ycombinator ycombinator marked this pull request as ready for review May 26, 2020 20:09
@ycombinator ycombinator requested a review from exekias May 26, 2020 20:09
@elasticmachine
Copy link
Collaborator

Pinging @elastic/stack-monitoring (Stack monitoring)

@ycombinator ycombinator added enhancement v7.9.0 v8.0.0 needs_backport PR is waiting to be backported to other branches. labels May 26, 2020
@ycombinator ycombinator changed the title Support "cluster" mode in Metricbeat elasticsearch module Support "cluster" scope in Metricbeat elasticsearch module May 26, 2020
@ycombinator ycombinator merged commit aaeead0 into elastic:master Aug 3, 2020
@ycombinator ycombinator deleted the mb-es-cluster-mode branch August 3, 2020 19:25
@ycombinator ycombinator removed needs_backport PR is waiting to be backported to other branches. Team:Automation Label for the Observability productivity team labels Aug 3, 2020
ycombinator added a commit that referenced this pull request Aug 4, 2020
…20413)

* Adding configuration for hosts_mode

* Only perform master check in HostsModeNode

* Only ask the node if it's the master node if we're in HostsModeNode

* Unpack host_mode string into enum

* Adding some specific TODOs in node_stats code

* Updating x-pack/metricbeat reference config

* Set correct service URI

* Get master node ID

* Adding CHANGELOG entry

* Rename hosts_mode => scope

* Removing stale TODO comment

* Adding docs

* Refactoring common code into helper method

* Do not set service URI up front

* Updating documentation per review

* Remove comments from doc examples

* Adding configuration for hosts_mode

* Set correct service URI

* Adding CHANGELOG entry

* Rename hosts_mode => scope

* Do not set service URI up front

* Update metricbeat/docs/modules/elasticsearch.asciidoc

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

* Update metricbeat/module/elasticsearch/_meta/docs.asciidoc

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

* Update reference config

* Cleaning up CHANGELOG

* Updating generated files

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

Co-authored-by: DeDe Morton <dede.morton@elastic.co>
v1v added a commit to v1v/beats that referenced this pull request Aug 6, 2020
…ne-2.0

* upstream/master:
  [docs] Promote ingest management to beta (elastic#20295)
  Upgrade elasticsearch client library used in tests (elastic#20405)
  Disable logging when pulling on python integration tests (elastic#20397)
  Remove pillow from testing requirements.txt (elastic#20407)
  [Filebeat][ATP Module]Setting user agent field required by the API (elastic#20440)
  [Ingest Manager] Send datastreams fields (elastic#20402)
  Add event.ingested to all Filebeat modules (elastic#20386)
  [Elastic Agent] Fix agent control socket path to always be less than 107 characters (elastic#20426)
  Improve cgroup_regex docs with examples (elastic#20425)
  Makes `metrics` config option required in app_insights (elastic#20406)
  Ensure install scripts only install if needed (elastic#20349)
  Update container name for the azure filesets (elastic#19899)
  Group same timestamp metrics values in app_insights metricset (elastic#20403)
  add_process_metadata processor adds container id even if process metadata not accessible (elastic#19767)
  Support "cluster" scope in Metricbeat elasticsearch module (elastic#18547)
  [Filebeat][SophosXG Module] Renaming module and fileset (elastic#20396)
  Update Suricata dashboards (elastic#20394)
  [Elastic Agent] Improve version, restart, enroll CLI commands (elastic#20359)
  Prepare home directories for docker images in a different stage (elastic#20356)
v1v added a commit to v1v/beats that referenced this pull request Aug 6, 2020
…allation

* upstream/master: (23 commits)
  [docs] Promote ingest management to beta (elastic#20295)
  Upgrade elasticsearch client library used in tests (elastic#20405)
  Disable logging when pulling on python integration tests (elastic#20397)
  Remove pillow from testing requirements.txt (elastic#20407)
  [Filebeat][ATP Module]Setting user agent field required by the API (elastic#20440)
  [Ingest Manager] Send datastreams fields (elastic#20402)
  Add event.ingested to all Filebeat modules (elastic#20386)
  [Elastic Agent] Fix agent control socket path to always be less than 107 characters (elastic#20426)
  Improve cgroup_regex docs with examples (elastic#20425)
  Makes `metrics` config option required in app_insights (elastic#20406)
  Ensure install scripts only install if needed (elastic#20349)
  Update container name for the azure filesets (elastic#19899)
  Group same timestamp metrics values in app_insights metricset (elastic#20403)
  add_process_metadata processor adds container id even if process metadata not accessible (elastic#19767)
  Support "cluster" scope in Metricbeat elasticsearch module (elastic#18547)
  [Filebeat][SophosXG Module] Renaming module and fileset (elastic#20396)
  Update Suricata dashboards (elastic#20394)
  [Elastic Agent] Improve version, restart, enroll CLI commands (elastic#20359)
  Prepare home directories for docker images in a different stage (elastic#20356)
  New multiline mode in Filebeat: while_pattern (elastic#19662)
  ...
@andresrc andresrc added the test-plan-added This PR has been added to the test plan label Oct 3, 2020
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
…8547)

* Adding configuration for hosts_mode

* Only perform master check in HostsModeNode

* Only ask the node if it's the master node if we're in HostsModeNode

* Unpack host_mode string into enum

* Adding some specific TODOs in node_stats code

* Updating x-pack/metricbeat reference config

* Set correct service URI

* Get master node ID

* Adding CHANGELOG entry

* Rename hosts_mode => scope

* Removing stale TODO comment

* Adding docs

* Refactoring common code into helper method

* Do not set service URI up front

* Updating documentation per review

* Remove comments from doc examples

* Adding configuration for hosts_mode

* Set correct service URI

* Adding CHANGELOG entry

* Rename hosts_mode => scope

* Do not set service URI up front

* Update metricbeat/docs/modules/elasticsearch.asciidoc

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

* Update metricbeat/module/elasticsearch/_meta/docs.asciidoc

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

* Update reference config

* Cleaning up CHANGELOG

* Updating generated files

Co-authored-by: DeDe Morton <dede.morton@elastic.co>
@zube zube bot removed the [zube]: Done label Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature:Stack Monitoring Team:Integrations Label for the Integrations team test-plan Add this PR to be manual test plan test-plan-added This PR has been added to the test plan v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metricbeat elasticsearch module should be able to collect from a single cluster endpoint
5 participants