Skip to content

RFC for Open Telemetry Implementation for Presto. #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

sureshbabu-areekara
Copy link

No description provided.

span-sampling=true
```

***otel-factory.name***: unique identifier for OpenTelemetry factory implementation to be registered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to imply that the only tracing providers supported must implement the Open Telemetry spec. The runtime should have no knowledge of any aspect of the Open Telemetry spec. The Open Telemetry spec should be implemented as a plugin and the SPI integration with the runtime should have a generic concept of tracing without any knowledge of any particular spec.

As an example, the Presto runtime and execution engines do not have any knowledge of the Iceberg spec either. They communicate through the connector SPI.

Please update your proposal with specifics on the abstractions that will be used within the SPI to accomplish this richer tracing functionality. Thank you.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tdcmeehan The abstraction in SPI is generic and can be implemented by any tracing providers. If the key with "otel" and "OpenTelemetry factory" in the description are confusing here we could modify to some generic name like "tracing-factory-name" and "factory".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add details on the changes being made to the SPI in this document.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tdcmeehan Added the SPI details. Please review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it. Can you add actual interfaces and update the flow chart to denote runtime/engine -> libray, with the arrow indicating which SPI method will be invoked?

Copy link
Author

@sureshbabu-areekara sureshbabu-areekara Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tdcmeehan Added the flow chart. Could you please check. Thank you!!

@tdcmeehan tdcmeehan added the from:IBM PRs from IBM label Nov 13, 2024
@ethanyzhang ethanyzhang added from:IBM PRs from IBM and removed from:IBM PRs from IBM labels Nov 19, 2024
@prestodb-ci
Copy link

Saved that user @sureshbabu-areekara is from IBM

@sureshbabu-areekara
Copy link
Author

Hi @tdcmeehan Please find the draft PR prestodb/presto#24133

@@ -0,0 +1,113 @@
# **RFC0009 for Presto**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to change this RFC0011 since 9 & 10 were already commited

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 70 to 79
`BaseSpan` - SPI can be extended with the specific implementation which contains the actual SDK Span. BaseSpan propagates through the main presto code.
`Tracer` - SPI for the span operations which can be implemented by specific serviceability framework.
`TraceProvider` - SPI to supply different `Tracer` implementations.

As of now we have one implementation for Open Telemetry can be configured by modifying the values in presto-main/etc/telemetry.properties
`BaseSpan` introduced as part of new design. Hooks inside `Tracer` and `TraceProvider` is added/updated to match the design.
`NoopTracer` removed as we have a default `Tracer` implementation in `presto-main/..tracing/DefaultTelemetryTracer`

`Tracer` implementation get loaded by `PluginManager` at startup as before.

Trace can be configured by modifying the values in presto-main/etc/telemetry-tracing.properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line endings are incorrect here. You need to add double spaces to get a newline in the rendered markdown

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of pasting the various fields of the SPI, please paste the additions and deletions made to the SPI classes as code blocks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


***exporter-timeout***: how long the span exporter will wait for a batch of spans to be successfully sent before timing out
***trace-sampling-ratio***: Double between 0.0 and 1.0 to specify the percentage of queries to be traced.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is sampling done ? Is there a way to control which queries get sampled ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.0 : No requests are traced.
1.0 : Every request is traced.
Values between 0.0 to 1.0 : A fraction of requests are randomly traced. So we can decide how much % of queries to be sampled to reduce overhead.

Custom Sampling can be implemented for a particular type of query by overriding shouldSample method in Sampler class. But currently this is not implemented.

Propagation is usually handled by instrumentation libraries and is transparent to the user. In the event that you need to manually propagate context, you can use the Propagators API.

OpenTelemetry maintains several official propagators. The default propagator is using the headers specified by the W3C TraceContext specification.
- In Presto in areas where REST calls involved, we use the header for context propagation as per the above image.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe the format of the header ? What are the key and value storing ? How big is the header ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format :
traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}

Ex :
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Length :
image

55 bytes is the total length.


OpenTelemetry is a powerful serviceability framework that helps to gain insights into the performance and behaviour of the systems. It facilitates generation, collection, and management of telemetry data such as traces.

The OSS Presto had a basic implementation of Open Telemetry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a new section called Existing Open Telemetry implementation and call out

  • What flows are traced via the existing implementation
  • What flows you think are missing and should be traced

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


The OSS Presto had a basic implementation of Open Telemetry.

![Traces existing implementation](/RFC-0009-open-telemetry/traces-existing-implementation-oss-presto.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This screenshot is not helpful. Instead let's use the list of flows traced (describe them in a few sentences) to motivate the problem

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


## Proposed Implementation

Presto can be manually instrumented and will have the following advantages.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have an existing OpenTelemetry related SPI. Let's change this to -

Suggested change
Presto can be manually instrumented and will have the following advantages.
With additional changes to the OpenTelemetry SPI, we get the following advantages -

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


![Instrumentation flow](/RFC-0009-open-telemetry/tracing-instrumentation-flow.png)
- Open Telemetry SDK provides libraries for instrumenting applications to capture telemetry data(traces). It includes built-in integrations for common frameworks and supports custom instrumentation.
- Presto application is getting instrumented using OpenTelemetry API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line, adds no value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

- Ability to pass additional information as span attributes and events

![Instrumentation flow](/RFC-0009-open-telemetry/tracing-instrumentation-flow.png)
- Open Telemetry SDK provides libraries for instrumenting applications to capture telemetry data(traces). It includes built-in integrations for common frameworks and supports custom instrumentation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're describing the interaction diagram from the image. Can you add a subsection heading like ### Interactions in the new flow ?
Instead of describing the full interaction (which the diagram is covering), only add details that are missing or not covered in the diagram

Copy link
Author

@sureshbabu-areekara sureshbabu-areekara Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

- The SDK continuously checks the flush trigger, and when it reaches the batch size, all spans are batched and sent to the backend.
- Backend is a system to store, analyse and visualize this telemetry data. Common backends include systems like Jaeger, Instana, Grafana stack, etc.

![Context propagation](/RFC-0009-open-telemetry/context-propagation-coordinator-to-worker.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use sub-headings to describe what you're covering.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@@ -67,9 +67,16 @@ OpenTelemetry maintains several official propagators. The default propagator is
Based on the discussion, this may need to be updated with feedback from reviewers.

## Adoption Plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a sub-heading called SPI Changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.



## Summary
The existing Open Telemetry implementation https://github.com/prestodb/presto/pull/18534 was an experimental feature, had a limited set of telemetry data(Query state changes) and did not include a child span concept. The recent implementation will make Presto more flexible, allowing support for both parent and child spans. Additionally, traces can now be propagated to the worker nodes as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should summarize what the document is proposing, there can be a sample implementation, but the design is more important than implementation in the RFC. We need to highlight the shortcomings of the current tracing provider SPI and propose changes to the current SPI to support the desired features.

It seems like the desired features are:

  1. child spans
  2. tracing propagated to workers

In the later sections we need to explain why the current interface does not allow these two features.

If there are more features required, they should be called out explicitly

Suggested change
The existing Open Telemetry implementation https://github.com/prestodb/presto/pull/18534 was an experimental feature, had a limited set of telemetry data(Query state changes) and did not include a child span concept. The recent implementation will make Presto more flexible, allowing support for both parent and child spans. Additionally, traces can now be propagated to the worker nodes as well.
Presto's original tracing implementation was merged in https://github.com/prestodb/presto/pull/18534. It was an experimental feature, had a limited set of telemetry data, and did not include a child span concept. This document proposes an expansion to the tracing provider SPI to
1. Support child spans in traces to make understanding nested operations simpler
2. Allow traces to be propagated to workers so that the entire lifespan of a query can be traced from analysis through execution

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.



## Background
OpenTelemetry is a powerful serviceability framework that helps to gain insights into the performance and behaviour of the systems. It facilitates generation, collection, and management of telemetry data such as traces.
Copy link
Contributor

@ZacBlanco ZacBlanco Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should drop the term "Open Telemetry" as much as possible from the document because we're trying to design an SPI interface that open telemetry is capable of implementing. The SPI tracing provider interface is the important thing to highlight.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

OpenTelemetry is a powerful serviceability framework that helps to gain insights into the performance and behaviour of the systems. It facilitates generation, collection, and management of telemetry data such as traces.


### Existing Open Telemetry
Copy link
Contributor

@ZacBlanco ZacBlanco Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update this to describe the current TracerProvider/Tracer interfaces and specifically describe why they are insufficient

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

![Traces existing implementation](/RFC-0011-open-telemetry/traces-existing-implementation-oss-presto.png)


The OSS Presto had a basic implementation of Open Telemetry with few limitations listed below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The OSS Presto had a basic implementation of Open Telemetry with few limitations listed below.
Presto has a basic `TracerProvider` interface which makes it difficult to implement the following desirable features:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

- No correlation between traceId and queryId.
- Does not identify failed queries
- No context propagation to track worker nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should point out specifically which SPI methods used for tracing make implementing the particular features difficult

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Comment on lines 106 to 118
Runnable getCurrentContextWrap(Runnable runnable);
boolean isRecording();
Map<String, String> getHeadersMap(T span);
void endSpanOnError(T querySpan, Throwable throwable);
void addEvent(T span, String eventName);
void setAttributes(T span, Map<String, String> attributes);
void recordException(T span, String message, RuntimeException runtimeException, ErrorCode errorCode);
void setSuccess(T span);
T getInvalidSpan();
T getRootSpan(String traceId);
T getSpan(String spanName);
U scopedSpan(String name, Boolean... skipSpan);
Optional<String> spanString(T span);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These all should have javadoc comments explaining what they do. Ideally the RFC should also explain how and where they are used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Comment on lines 121 to 129
#### Deletions
```java
String tracerName = "com.facebook.presto";
void addPoint(String annotation);
void startBlock(String blockName, String annotation);
void addPointToBlock(String blockName, String annotation);
void endBlock(String blockName, String annotation);
void endTrace(String annotation);
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot just delete from the SPI, so we must add to the transition plan how we will either (1) maintain compatibility with the existing tracing methods and (2) eventually transition to the new API entirely. Will there be a feature flag?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZacBlanco @aaneja As the existing OTel feature was and experimental one with limited spans, coexistence of both is relevant and a feasible approach?

Comment on lines 132 to 135
```java
class TraceProvider {}
```
SPI to supply different Tracer implementations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot delete this because we must be able to supply multiple tracer implementations

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TraceProvider completely not removed. only some methods in it.



### Configuration
Trace can be configured by modifying the values in presto-main/etc/telemetry-tracing.properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems specific to the OpenTelemetry implementation. I think it's fine to propose changes to how we configure the existing OpenTelemtry tracing plugin, but we need to be very clear about the any configuration properties that would be added to TracingConfig since it affects config.properties

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! We'll make sure to clearly document it. Also Added the current configuration along with proposed configuration in RFC.

Comment on lines 35 to 44
## Proposed Implementation
With additional changes to the OpenTelemetry SPI, we get the following advantages -
- Presto can be manually instrumented which gives more flexibility and control. Easier to customize what operations can be monitored.
- More number of spans with parent and child span concept which helps to get the insight on internal operations.
- Able to pass queryId as attribute so that traces and query get connected each other.
- Can identify failed queries and filter it out.
- Ability to pass additional information as span attributes and events.
- Context propagation from coordinator to worker nodes.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section needs to spell out exactly what changes are being made to the TracerProvider and Tracer interfaces in java code, and also any additional architectural changes, such as things that the C++ and Java workers might have to implement. The diagrams are good, but we need to understand the exact changes to the existing interfaces. It isn't clear what the changes are just looking at the diagrams.

I see there is an SPI section below, but it needs to be moved up here and the exact changes to the flow of tracing spans/blocks must be described in text. The descriptions also need to spell out why the existing Tracer/TracerProvider interfaces are insufficient and how the new SPI changes make the desired features possible

Copy link
Author

@sureshbabu-areekara sureshbabu-areekara Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ZacBlanco @aaneja Moved SPI changes as per the comment #33 (comment) . Please let me know where to keep. Under Adoption Plan or Proposed implementation ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated what all additions in instrumentation flow with the new proposal. Context propagation from Coordinator to Worker is entirely new for the proposed implementation.

scheduling. ie. it does not cover most of the internal operations during a query execution(The implementation for methods startBlock and endBlock are based on Map and cannot be used to handle large number of spans as the span name can be same for many.).
- No additional information is being passed as query attributes or events. No suitable methods in SPI to set events and attributes.
- No correlation between traceId and queryId. No suitable methods in SPI to set queryId.
- Does not identify failed queries. No suitable methods in SPI to pass Error/Exception information and deals with failed query spans.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Does not identify failed queries. No suitable methods in SPI to pass Error/Exception information and deals with failed query spans.
- Does not identify failed queries. No suitable methods in SPI to pass Error/Exception information and details with failed query spans.

Is this supposed to be "details"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

SPI can be extended with the specific implementation which contains the actual SDK Span. BaseSpan propagates through the main presto code. Introduced as part of new design.

```java
default void close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these methods on the class of BaseSpan? It is not clear. Also, default is not valid unless BaseSpan is defined as an interface.

How do the methods close() and end() differ? When should they be used?

Copy link
Author

@sureshbabu-areekara sureshbabu-areekara Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseSpan is an SPI interface which is autoclosable. close() came from AutoCloseable and end() part of BaseSpan.
close() can be used to end span in a try-with-resource and end() to manually end the span.

```java
class Tracer {}
```
SPI for the span operations which can be implemented by specific serviceability framework.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to be a comment on the Tracer class, can you add it as a javadoc above the class within the code block?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

```java
class TraceProvider {}
```
SPI to supply different Tracer implementations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment on moving this to a javadoc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Comment on lines 222 to 224
```java
T create();
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this belongs on TracerProvider please put it in the same code block

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

* @param runtimeException the runtime exception
* @param errorCode the error code
*/
void recordException(T span, String message, RuntimeException runtimeException, ErrorCode errorCode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Exception instead of RuntimeException to make this more generic?

Also, this method seems similar to addEvent. It seems more like a convenience method. We could define the interface in a different way that would make it more generic and allow flexibility in the event implementation. If we remove recordException but maintain addEvent we could have

public void addEvent(T span, Event event);

public interface Event {}

public class ExceptionEvent implements Event {
    private final String message;
    private final Exception exception;
    ....
}

public class GenericEvent implements event {
    private final String eventName;
    ....
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ZacBlanco Thanks for the feedback. Updated accordingly.

*
* @return the invalid span
*/
T getInvalidSpan();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a different method for this? If tracing is disabled, the implementation should handle returning an implementation of T which is an invalid span. Otherwise, users would need to write a conditional every time this is called?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes the method is needed @ZacBlanco
If plugin loaded & tracing disabled --> no-op sdk span
If plugin not loaded & tracing enabled/disabled --> default implementation of .

*
* @return the root span
*/
T getRootSpan(String traceId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it always create it? Or only if it doesn't exist yet?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each query 1 root span will always be created.

* Creates and returns the ScopedSpan with input name.
*
* @param name the name
* @param skipSpan the skip span
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a `skipSpan?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipSpan corresponds to span-sampling in telemetry-tracing.properties to skip the spans which are less required. These were added to reduce the number of spans so that we can improve the performance and increase throughput, especially in high-load environments.

* @param span the span
* @return the optional
*/
Optional<String> spanString(T span);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a separate method for this? Why can't this just be the implementation of T.toString()?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. Moved the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PRs from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants