@MichelLosier (Contributor) commented Oct 6, 2025

Proposed commit message

  • Adds an initial set of alerting rule templates to the Elastic Agent package.

Extended description

This is an initial exploration of alerting rule templates for monitoring Elastic Agent health. This PR can include only the templates we feel most confident about and defer the others for further refinement and exploration.

Install the rules

How to install the rules:

  • Pull this PR locally: Add elastic agent alerting rule templates #15572
  • Go to the Elastic Agent package directory: your-local-dir/integrations/packages/elastic_agent
    • If on a remote cluster: change the version in packages/elastic_agent/manifest.yml from 2.6.4 to 2.6.3
      • This way the actual 2.6.4 release will still show up as an upgrade later
    • Build the package by running elastic-package build --skip-validation in the elastic_agent package directory
      • This should produce build/packages/elastic_agent-2.6.3.zip
  • Install the package in your cluster:
    • Upload the package through the Integrations UI
      • Click the Create new integration CTA at the top right
      • Click the upload it as a .zip link, and upload the zip you built
    • Once complete, check the Rules management UI for the created rules
      • All titles should start with [Elastic Agent] and be tagged with Elastic Agent for filtering.

Rule templates:

To make the ES|QL easier to follow, here is a summary of each rule's definition.

Resource Utilization

  • [Elastic Agent] CPU usage spike
    • Checks if individual processes launched from a directory like *elastic*agent* are above 80% of total CPU utilization. Computes the max per 1-minute bucket and checks whether 5 or more buckets breach the threshold when looking back 7 minutes. Rows are distinct by agent id and process name.
    FROM metrics-*
    | WHERE process.executable LIKE "*elastic*agent*"
    | STATS cpu_process_pct = MAX(system.process.cpu.total.pct) * 100
        BY elastic_agent.id, process.name,
          time_bucket = BUCKET(@timestamp, 1 minute)
    // Count the 1-minute time buckets that are at or above 80% by process and agent
    | WHERE cpu_process_pct >= 80
    | STATS count_above_threshold = COUNT(*)
        BY elastic_agent.id, process.name
    // Alert if there are 5 or more occurrences
    | WHERE count_above_threshold >= 5
  • [Elastic Agent] Excessive memory usage
    • Checks if the sum of the individual processes launched from a directory like *elastic*agent* is above 50% of total memory usage. Rows are distinct by agent id.
    • Mileage may vary on this one, and it may need fine-tuning. The assumption here is that agent processes should not exceed 50% of memory usage on a node.
    FROM metrics-*
    | WHERE process.executable LIKE "*elastic*agent*"
    // Peak memory per process, then summed across processes for each agent
    | STATS max_memory_per_process = MAX(system.process.memory.rss.pct * 100)
        BY agent.id, process.name
    | STATS total_memory_usage = SUM(max_memory_per_process) BY agent.id
    | WHERE total_memory_usage > 50
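
To make the aggregation stages of the two resource rules above easier to reason about, here is a minimal plain-Python sketch of what the queries compute. It is not how Kibana or Elasticsearch evaluates ES|QL; the function names, the row layout, and the sample data are all made up for illustration, and the 7-minute lookback is assumed to have been applied before the rows are passed in.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cpu_spike_alerts(rows, threshold_pct=80.0, min_buckets=5):
    """rows: iterable of (timestamp, agent_id, process_name, cpu_fraction)."""
    # Stage 1: MAX(cpu) * 100 per agent, process and 1-minute bucket.
    bucket_max = defaultdict(float)
    for ts, agent_id, process_name, cpu_fraction in rows:
        minute = ts.replace(second=0, microsecond=0)
        key = (agent_id, process_name, minute)
        bucket_max[key] = max(bucket_max[key], cpu_fraction * 100)
    # Stage 2: count the buckets at or above the threshold per agent and process.
    counts = defaultdict(int)
    for (agent_id, process_name, _minute), pct in bucket_max.items():
        if pct >= threshold_pct:
            counts[(agent_id, process_name)] += 1
    # Stage 3: keep only the groups with enough breaching buckets.
    return {key: n for key, n in counts.items() if n >= min_buckets}


def memory_alerts(rows, threshold_pct=50.0):
    """rows: iterable of (agent_id, process_name, rss_fraction)."""
    # Stage 1: peak memory percentage per agent and process.
    per_process = defaultdict(float)
    for agent_id, process_name, rss_fraction in rows:
        key = (agent_id, process_name)
        per_process[key] = max(per_process[key], rss_fraction * 100)
    # Stage 2: sum the per-process peaks for each agent.
    per_agent = defaultdict(float)
    for (agent_id, _process_name), pct in per_process.items():
        per_agent[agent_id] += pct
    # Stage 3: keep only agents strictly above the threshold.
    return {agent_id: total for agent_id, total in per_agent.items() if total > threshold_pct}


# Hypothetical sample data: metricbeat breaches in 5 buckets, filebeat in only 4.
start = datetime(2025, 1, 1, 12, 0)
cpu_rows = [(start + timedelta(minutes=i), "agent-1", "metricbeat", 0.9) for i in range(5)]
cpu_rows += [(start + timedelta(minutes=i), "agent-1", "filebeat", 0.9) for i in range(4)]
mem_rows = [
    ("agent-1", "metricbeat", 0.30),
    ("agent-1", "filebeat", 0.25),
    ("agent-2", "filebeat", 0.10),
]

print(cpu_spike_alerts(cpu_rows))
print(sorted(memory_alerts(mem_rows)))
```

With this sample data only the metricbeat process on agent-1 passes the CPU rule, and only agent-1 passes the memory rule. Note that the memory rule sums per-process peaks, which may come from different moments in the window, so the total can overstate the agent's simultaneous memory use.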

Beats Pipelines and Queues

  • [Elastic Agent] High pipeline queue
    • Checks if the max of beat.stats.libbeat.pipeline.queue.filled.pct is at or above 90%. Rows are distinct by agent id and process name.
    TS metrics-*
    // LIKE is needed here: == would compare against the literal string, wildcard included
    | WHERE data_stream.dataset LIKE "elastic_agent.*beat"
    | STATS pipeline_queue_pct = MAX(beat.stats.libbeat.pipeline.queue.filled.pct) * 100
        BY elastic_agent.id, process.name
    | WHERE pipeline_queue_pct >= 90
  • [Elastic Agent] Dropped events
    • Checks if the ratio of dropped events to acked events from the pipeline is greater than or equal to 5%. Rows are distinct by agent id and component id.
    TS metrics-*
    | WHERE data_stream.dataset LIKE "elastic_agent.*beat"
    | STATS events_dropped_rate = MAX(RATE(beat.stats.libbeat.pipeline.events.dropped)),
            pipeline_acked_rate = MAX(RATE(beat.stats.libbeat.pipeline.queue.acked))
        BY time_bucket = BUCKET(@timestamp, 5 minute), elastic_agent.id, component.id
    // Ratio of dropped to acked events; 0.05 corresponds to 5%
    | EVAL percent_drop_rate = (events_dropped_rate / pipeline_acked_rate)
    | WHERE percent_drop_rate >= 0.05
  • [Elastic Agent] Output errors
    • Checks if the errors per minute from an agent component are greater than 5. Rows are distinct by agent id and component id.
    TS metrics-*
    | WHERE data_stream.dataset LIKE "elastic_agent.*beat"
    | STATS errors_rate = MAX(RATE(beat.stats.libbeat.output.write.errors))
        BY time_bucket = BUCKET(@timestamp, 5 minute), elastic_agent.id, component.id
    // RATE is per second, so multiply by 60 to get errors per minute
    | EVAL errors_per_min = errors_rate * 60
    | WHERE errors_per_min > 5
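
The arithmetic behind the Dropped events and Output errors rules above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ES|QL implementation: the helper names and counter samples are hypothetical, the rate is computed from only the first and last sample, and counter resets are ignored.

```python
def per_second_rate(samples):
    """samples: list of (epoch_seconds, counter_value), sorted by time.

    Simplified stand-in for a per-second counter rate; it ignores counter resets.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    return (v1 - v0) / (t1 - t0)


def drop_ratio(dropped_rate, acked_rate):
    """Ratio of dropped to acked events; None when it cannot be computed."""
    if dropped_rate is None or not acked_rate:
        return None
    return dropped_rate / acked_rate


def errors_per_minute(errors_rate):
    """Convert a per-second error rate into errors per minute."""
    return None if errors_rate is None else errors_rate * 60


# Hypothetical counter samples over one 5-minute (300 second) bucket.
dropped = [(0, 0), (300, 30)]    # 30 dropped events
acked = [(0, 0), (300, 300)]     # 300 acked events
errors = [(0, 0), (300, 30)]     # 30 output write errors

ratio = drop_ratio(per_second_rate(dropped), per_second_rate(acked))
per_min = errors_per_minute(per_second_rate(errors))

print(ratio is not None and ratio >= 0.05)   # would the Dropped events rule match?
print(per_min is not None and per_min > 5)   # would the Output errors rule match?
```

In this sample 30 dropped against 300 acked events gives a ratio of about 0.1, and 30 errors in 5 minutes is about 6 per minute, so both conditions hold. A component that acks nothing in the bucket yields no ratio at all in this sketch, which is worth keeping in mind when validating the rule against idle agents.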

Agent Stability

  • [Elastic Agent] Excessive restarts
    • Checks if there are more than 10 distinct startup timestamps for an agent or component process in a 5-minute window. Rows are distinct by host name and process name.
    FROM metrics-*
    | WHERE process.executable LIKE "*elastic*agent*"
    | STATS restart_count = COUNT_DISTINCT(process.cpu.start_time)
        BY host.name, process.name, BUCKET(@timestamp, 5 minute)
    | WHERE restart_count > 10
  • [Elastic Agent] Unhealthy status
    • Checks for log occurrences of a non-agentless agent's status changing to "error", using the new elastic_agent.status_change data stream.
    FROM logs-*
    | WHERE data_stream.dataset == "elastic_agent.status_change" AND agentless == false AND status == "error"
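
The Excessive restarts rule above boils down to counting distinct process start times per 5-minute bucket. A minimal plain-Python sketch of that idea follows; the function name, row layout, and sample data are hypothetical, and it counts exactly, whereas ES|QL's COUNT_DISTINCT is approximate.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def excessive_restarts(rows, bucket_minutes=5, threshold=10):
    """rows: iterable of (timestamp, host_name, process_name, start_time).

    bucket_minutes is assumed to divide 60 evenly so buckets align to the hour.
    """
    distinct_starts = defaultdict(set)
    for ts, host_name, process_name, start_time in rows:
        # Floor the timestamp to the start of its bucket.
        bucket = ts - timedelta(
            minutes=ts.minute % bucket_minutes,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        distinct_starts[(host_name, process_name, bucket)].add(start_time)
    # Keep only groups with strictly more distinct start times than the threshold.
    return {key: len(starts) for key, starts in distinct_starts.items() if len(starts) > threshold}


# Hypothetical sample: a process that restarts every 20 seconds, 11 times,
# with each metric document carrying the start time of the new process.
base = datetime(2025, 1, 1, 12, 0)
rows = [
    (base + timedelta(seconds=20 * i), "host-a", "filebeat", base + timedelta(seconds=20 * i))
    for i in range(11)
]

print(excessive_restarts(rows))
```

Because the comparison is strict, 11 distinct start times in one bucket match while exactly 10 do not. Restarts that straddle a bucket boundary are split across two buckets, so a burst can go undetected if it is spread evenly over both.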

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices.

How to test this PR locally

Build and install the Elastic Agent package locally:

# In the elastic_agent package directory:
elastic-package build
elastic-package install --zip /dir/to/integrations/build/packages/elastic_agent-2.6.4.zip

@MichelLosier requested a review from a team as a code owner October 6, 2025 20:27
@MichelLosier added the enhancement (New feature or request) label Oct 6, 2025
@MichelLosier requested a review from a team October 6, 2025 21:45
@andrewkroh added the Integration:elastic_agent (Elastic Agent) and Team:Elastic-Agent (Platform - Ingest - Agent [elastic/elastic-agent]) labels Oct 7, 2025
@elasticmachine

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine

💚 Build Succeeded

@MichelLosier
Contributor Author

Putting this back in draft temporarily to avoid an accidental merge. We want to validate these further against running agents, but the PR is still open for config review.

@MichelLosier MichelLosier marked this pull request as draft October 8, 2025 17:13