Description
NOTE: this issue will evolve as we scope out this work.
"Why is Elasticsearch slow?" is a common question from users. We have tools to investigate certain aspects of this question already, for instance the search slowlog (good if the shard-level searches are slow) and the hot threads API (good if the slowness is an ongoing thing) but there are many gaps too. For instance, how would we discover that a Kibana dashboard triggers unreasonably many searches if each of those searches completes fairly quickly? How would we discover that requests are spending unexpectedly long in queues? How do we see if the slow steps all involve a particular node? What if that node is on a remote cluster? It's hard to take a structured approach to performance questions with the tools we have today.
Distributed tracing is a great way to answer questions of this nature. Elastic has a distributed tracing product, APM, which sits on top of Elasticsearch, but today Elasticsearch itself is opaque to APM: we cannot trace the execution of a request through Elasticsearch. Let's fix that.
This work will build on an existing exploratory project that instrumented a number of "tasks" in Elasticsearch. More types of tasks will be instrumented, as well as requests / responses at the REST level.
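To make the idea of instrumenting a task concrete, here is a minimal sketch of wrapping a task in a span. The names `Tracer`, `RecordingTracer`, and `TaskRunner` are hypothetical and chosen for illustration only; they are not the actual Elasticsearch interfaces, and the real design may differ considerably. The no-op tracer shows one way the disabled path could avoid doing any work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical tracing interface (an assumption, not the real Elasticsearch API).
interface Tracer {
    void startSpan(String spanId, String name, Map<String, Object> attributes);

    void stopSpan(String spanId);

    // No-op tracer for when tracing is disabled: every call returns immediately.
    Tracer NOOP = new Tracer() {
        @Override
        public void startSpan(String spanId, String name, Map<String, Object> attributes) {}

        @Override
        public void stopSpan(String spanId) {}
    };
}

// Hypothetical in-memory tracer that records span events, useful in tests.
final class RecordingTracer implements Tracer {
    final List<String> events = new ArrayList<>();

    @Override
    public void startSpan(String spanId, String name, Map<String, Object> attributes) {
        events.add("start:" + spanId + ":" + name);
    }

    @Override
    public void stopSpan(String spanId) {
        events.add("stop:" + spanId);
    }
}

// Hypothetical runner that opens a span around a task and always closes it.
final class TaskRunner {
    private final Tracer tracer;

    TaskRunner(Tracer tracer) {
        this.tracer = tracer;
    }

    void run(String taskId, String action, Runnable body) {
        tracer.startSpan(taskId, action, Map.of("task.id", taskId));
        try {
            body.run();
        } finally {
            // Close the span even if the task throws.
            tracer.stopSpan(taskId);
        }
    }
}
```

In this sketch the span is keyed by the task ID, so the code that starts a span and the code that stops it do not need to share a span object. That is one possible design, not necessarily the one this work will adopt.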
Tasks
- Technical integration
- Update the proof-of-concept branch of APM in ES
- Engage with the APM team to confirm the currently recommended way of integrating a Java application with APM
- Benchmark the overhead of running with APM disabled, to determine if we can merge safely
- Ensure that distributed tracing works with other Elastic stack components
- Add testing that uses the APM agent
- Make sure using APM agent gives the same results
- Code changes
- Refactor tasks to improve APM support #87917
- Introduce tracing interfaces #87921
- Provide tracing implementation using OpenTelemetry + APM agent #88443
- Add docs for tracing
⚠️ Need to check whether we want to document APM at this time ⚠️
- Developer education / awareness
- Raise Cloud issue for wiring up APM
- Documentation
- Document how to extend tracing in Elasticsearch e.g. coding conventions, what attributes to apply, how to model spans.
- Span / transaction modelling
- Check whether the way we are capturing spans in Elasticsearch works with what Kibana etc. are doing. The end-to-end picture needs to be coherent.
- Configuration:
- Provide richer options for filtering captured spans
- Make sampling rate configurable (handled by APM agent)
- More flexible configuration of connection to APM server (TLS features, proxy support, protocol selection etc.) (handled by APM agent)
- Ensure configuration can be dynamically changed, whether the option sits in ES or on the APM agent.
- Ensure potentially sensitive fields are filtered out (depends on difference between OTLP intake and OTel bridge apm-server#8067)
Out-of-scope
The focus of this work is instrumenting Elasticsearch for Elastic's own purposes. Making it available to users and licensing it for that purpose is not currently in scope.