[APM][Stack Monitoring] Changes for integrating APM with Elastic Agent

## Motivation and Overview
Integrating APM Server with Elastic Agent has some impact on the collected metrics. Continuing to provide users with useful insights into running deployments requires some changes to the APM Stack Monitoring UI. The focus will stay on APM Server specific metrics where an isolated view on APM Server makes sense (processed events, number of requests, etc.), and on Elastic Agent aggregated metrics otherwise (system metrics when running inside a container). The system related metrics are the most important metrics for scaling decisions, so showing them for the overall group seems most useful when running inside a container. 
There is an existing issue to switch to using `cgroups` data for system metrics: https://github.com/elastic/kibana/issues/79050 (planned for `7.12`). Container resource limits are reflected in the `cgroup` data, giving better insight into how much of the actually available resources is used. When running inside a container and as an Elastic Agent integration, potential resource limits will be set for the whole group (Elastic Agent + sub processes). To be clear about the semantics of the system resource metrics, it is important to use correct and precise terminology. 

Adding other information specific to Elastic Agent or its integrations to the Stack Monitoring UI is out of scope for this issue, and not generally planned. For more details on Elastic Agent related visualisations, refer to [kibana#81872](https://github.com/elastic/kibana/issues/81872). 

### Problems to Solve with the Stack Monitoring UI
* When to scale up APM Server, or in the future Elastic Agent?
 - system resource usage: CPU, memory
 - Response Errors Intake - 503 Queue is Full
* When to change the internal memory queue settings
 - system resource usage: CPU, memory - not using 100% CPU, while seeing `503 Queue is Full` Response Errors Intake
* Identify potential issues between APM Agents and Server, some examples:
 - Response Errors Intake - Validate (e.g. version incompatibility)
 - Response Errors Intake - Unauthorized (invalid secret_token or API Key configured in APM Agent)
 - Response Errors Intake - Too large (events are larger than usually allowed, users can customize configuration)
 - Response Errors Intake - Rate limit (more RUM requests than expected, rate limiting settings can be customized by users)
* Identify potential APM Server internal issues (e.g. invalid events filling up the queue)
 - Output Events Rate, Output Failed Events Rate, Processed Events
* Identify potential issues with agent remote configuration (APM Server querying information from Kibana)
 - Response Count Agent Config Management, Response Errors Agent Config Management
* Troubleshoot APM Server and APM Agents even when the Observability Cluster is severely damaged
 * Use a monitoring system different from the Elastic Observability cluster to monitor this Elastic Observability cluster

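The first two questions above (scale up vs. change the memory queue settings) can be sketched as a small decision helper. This is only an illustration of the reasoning, not existing Kibana code: the type and function names are hypothetical, the `0.9` CPU saturation threshold is an arbitrary assumption, and memory usage is left out for brevity.

```typescript
// Hypothetical types and names, for illustration only.
interface IntakeSnapshot {
  /** CPU utilization as a fraction between 0 and 1 (assumed normalization). */
  cpuUtilization: number;
  /** Number of "503 Queue is Full" intake response errors in the time window. */
  queueFullErrors: number;
}

type Hint = 'scale-up' | 'tune-memory-queue' | 'none';

// The saturation threshold is an assumption; the issue only speaks of "100% CPU".
function suggestAction(snapshot: IntakeSnapshot, cpuSaturation = 0.9): Hint {
  if (snapshot.queueFullErrors === 0) {
    // No queue pressure visible in the intake errors: nothing to suggest here.
    return 'none';
  }
  // Queue is full AND CPU is saturated -> more capacity is needed.
  // Queue is full but CPU is NOT saturated -> queue settings are the likely bottleneck.
  return snapshot.cpuUtilization >= cpuSaturation ? 'scale-up' : 'tune-memory-queue';
}
```

For example, `suggestAction({ cpuUtilization: 0.4, queueFullErrors: 12 })` points at the memory queue settings rather than at scaling.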

Changes mostly concern renaming and moving around components, but also involve some conditional logic for deciding on the right terminology and metrics to show. 

## Breakdown per View
<details>
<summary> Cluster Listing (no changes required)</summary>
 No changes are required for the Cluster Listing. 
<img width="1902" alt="Screenshot 2021-02-03 at 10 57 48" src="https://user-images.githubusercontent.com/5555349/106730567-d2f85f00-660e-11eb-93bf-169de13aa7cc.png">
</details>

<details>
<summary> Cluster Overview</summary>
This overview is designed to act as a high level health indicator for the APM Server instances. Currently it shows Processed Events and Last Events for the APM Server overview (all instances combined) and Memory Usage for a concrete APM Server instance. 

When running as an Elastic Agent sub process, the system resources might be shared with other Agent sub processes. Showing the Memory Usage of APM Server would still be possible, but seems less important. The suggested change is to keep this overview focused on APM Server and to also show the Processed Events and Last Events for the concrete APM Server instance. See the mock-up below. 
<img width="884" alt="Screenshot 2021-01-29 at 21 43 19" src="https://user-images.githubusercontent.com/5555349/106736118-5452f000-6615-11eb-93f5-1d2bdd9be156.png">
</details>


<details>
<summary> APM server overview</summary>
* Move resource related metrics (CPU, memory, load) up in the page into a dedicated section (between Alerts and Response Count metrics)
<img width="1781" alt="Screenshot 2021-02-03 at 12 01 39" src="https://user-images.githubusercontent.com/5555349/106738084-cd534700-6617-11eb-917d-7bf5a6d75208.png">

* Show APM Server - Resource Usage or Elastic Agent Group - Resource Usage as the title of the resource usage section, based on the conditional logic described below
<img width="1780" alt="Screenshot 2021-02-03 at 12 04 25" src="https://user-images.githubusercontent.com/5555349/106738563-5c605f00-6618-11eb-93e1-ed6b9d2e4a09.png">

* Show the rest of the metrics in a dedicated section with the header APM Server - Custom Metrics
<img width="1762" alt="Screenshot 2021-02-03 at 12 08 07" src="https://user-images.githubusercontent.com/5555349/106738874-c5e06d80-6618-11eb-9a84-128647dac8ac.png">

* Nice to have: if available, calculate the relative resource usage for CPU and memory per process and show it inside the CPU and Memory graphs. The data would be distinguishable by beat.type (e.g. fleet-server, apm, ..). 
 If this can be added to the Stack Monitoring UI, it requires some small additional changes to the metrics collection, so it would be good to know whether this will be planned. 

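A minimal sketch of the nice-to-have calculation, assuming per-process samples tagged with `beat.type` were available (which, as noted, would need additional changes to the metrics collection). The `ProcessSample` shape, its field names, and the function name are hypothetical and do not reflect the actual monitoring documents.

```typescript
// Hypothetical sample shape; real monitoring documents will look different.
interface ProcessSample {
  beatType: string; // e.g. "apm", "fleet-server"
  cpuPct: number; // CPU usage of the process, any consistent unit
  memoryBytes: number; // memory usage of the process
}

interface RelativeUsage {
  beatType: string;
  cpuShare: number; // fraction of the group's total CPU usage (0..1)
  memoryShare: number; // fraction of the group's total memory usage (0..1)
}

function relativeUsageByBeatType(samples: ProcessSample[]): RelativeUsage[] {
  const totals = new Map<string, { cpu: number; mem: number }>();
  let cpuTotal = 0;
  let memTotal = 0;

  // Sum usage per beat type and for the whole group.
  for (const sample of samples) {
    const current = totals.get(sample.beatType) || { cpu: 0, mem: 0 };
    current.cpu += sample.cpuPct;
    current.mem += sample.memoryBytes;
    totals.set(sample.beatType, current);
    cpuTotal += sample.cpuPct;
    memTotal += sample.memoryBytes;
  }

  // Convert absolute sums into shares; guard against division by zero.
  return Array.from(totals.entries()).map(([beatType, t]) => ({
    beatType,
    cpuShare: cpuTotal > 0 ? t.cpu / cpuTotal : 0,
    memoryShare: memTotal > 0 ? t.mem / memTotal : 0,
  }));
}
```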
Conditional Logic to distinguish between apm-server and elastic-agent-group:
* not running inside a container -> apm-server
* running inside a container but not detecting Elastic Agent integration -> apm-server 
* running inside a container and detecting Elastic Agent integration -> elastic-agent-group

For detecting whether `cgroup` values should be used, @chrisronline mentioned that other apps set a flag in the Kibana config options. We could do something similar for APM. Open question: how does this work with a dedicated monitoring cluster that receives data from multiple other clusters, some of which run inside containers and some directly on a host system? 
For the Elastic Agent detection, let's follow a similar approach as for the `cgroup`/container decision.

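The conditional logic above could look roughly like the following sketch. All names are hypothetical; in particular, how `runsInContainer` and `isElasticAgentIntegration` get their values (config flag or monitoring data) is exactly the open detection question and is simply assumed as input here.

```typescript
// Hypothetical names; none of these exist in Kibana today.
type ResourceUsageScope = 'apm-server' | 'elastic-agent-group';

interface DeploymentContext {
  /** Assumed to come from a Kibana config flag or from the monitoring data. */
  runsInContainer: boolean;
  /** Assumed detection result; the detection mechanism is still undecided. */
  isElasticAgentIntegration: boolean;
}

// Only the combination "container + Elastic Agent integration" switches to the
// group view; every other case keeps the isolated APM Server view.
function resolveResourceUsageScope(ctx: DeploymentContext): ResourceUsageScope {
  return ctx.runsInContainer && ctx.isElasticAgentIntegration
    ? 'elastic-agent-group'
    : 'apm-server';
}

// Section titles as proposed for the resource usage section.
function resourceUsageTitle(scope: ResourceUsageScope): string {
  return scope === 'elastic-agent-group'
    ? 'Elastic Agent Group - Resource Usage'
    : 'APM Server - Resource Usage';
}
```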
</details>

<details>
<summary> APM server instances (no changes required)</summary>
 No changes are required.
<img width="1784" alt="Screenshot 2021-02-03 at 12 22 50" src="https://user-images.githubusercontent.com/5555349/106740338-8dda2a00-661a-11eb-920a-f3d5c411e6dd.png">

</details>

<details>
<summary> APM server instance xyz</summary>
The same changes should be made as for the APM Server overview page (moving the system resource usage up into a dedicated section and conditionally changing the title).

</details>

## Timeline
* `7.13`: APM Server integration with Elastic Agent (beta)
* `7.14`: APM Server integration with Elastic Agent (GA)

It would be great to get the changes in for `7.13`.

@cyrille-leclerc could you review the proposed changes, with a focus on the terminology used and the design changes involved? 
cc @ruflin and @elastic/apm-server 
cc @jasonrhodes @ravikesarwani @chrisronline
