-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-15566 Opentelemetry changes using java agent #3445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
How to Test locally. (Instructions for linux)
|
@minni31 please follow the above mentioned instructions to setup locally |
Thanks @ArkenKiran hadoop jar C:\RMCosmos\OpenTelemtryHadoop\hadoop\hadoop-dist\target\hadoop-3.4.0-SNAPSHOT\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar pi 2 2 Number of Maps = 2 Please help with debugging of the issue. |
@minni31 Please build with the new changes. I was able to run the example you shared. |
💔 -1 overall
This message was automatically generated. |
Heya @ArkenKiran , I'm testing out OpenTelemetry tracing in HBase and I'd like to extend spans down into HDFS. Is this patch is ready for user-testing? Thanks. |
@ndimiduk yes you can use this patch for testing with hbase. Can you please share the details how you plan to test I can also try locally. Thanks |
@steveloughran can you please review this PR. I will try to share some samples of the traces by tomorrow. |
@ArkenKiran I'm working with HBase from branch-2, compiled with JDK11 against hadoop-3.3.1. I have a zookeeper, name node, data node, hbase master, and hbase region server each running as separate processes. i'm using jaeger with cassandra as my span store. i've been posting progress and screenshots in the HBase Slack (apache-hbase.slack.com, invitation only, but you can request an invite by mailing our dev list, or send me your email and I'll add you), and filing/fixing bugs as I go. I'm also on the-asf.slack.com if that's easier. |
1 similar comment
@ArkenKiran I'm working with HBase from branch-2, compiled with JDK11 against hadoop-3.3.1. I have a zookeeper, name node, data node, hbase master, and hbase region server each running as separate processes. i'm using jaeger with cassandra as my span store. i've been posting progress and screenshots in the HBase Slack (apache-hbase.slack.com, invitation only, but you can request an invite by mailing our dev list, or send me your email and I'll add you), and filing/fixing bugs as I go. I'm also on the-asf.slack.com if that's easier. |
@ArkenKiran i'm delegating review and test to @ndimiduk for now |
hadoop-common-project/hadoop-common/src/main/conf/hadoop-env.sh
Outdated
Show resolved
Hide resolved
@ndimiduk Thanks for sharing your approach. I was able to generate the traces for hbase. I could see the hadoop components only in few paths. Please let me know the next steps for HADOOP-15566 |
💔 -1 overall
This message was automatically generated. |
That's great progress, @ArkenKiran ! You got there before I could. On the read path, any block cache miss will go to the hdfs client for a read (can still be local via short-circuit read). On the write path, we should see hdfs interaction via WAL writes and by MemStore flushes. WAL writes are inline with the client activity, MemStore flushes happen in the background. |
@ArkenKiran where do you get the opentelemetry-agent jar for your testing? I've built this patch and it appears that the agent jar is not included.
|
I think you need to add the otel-agent jar as a dependency to hadoop-common and exclude it as a dependency from hadoop-client. then it'll show up in |
This isn't right, but it's along the path,
|
Oh I see, you already have the dependency in the |
Okay, this patch to your branch,
Give me this result in the tarball
|
@ndimiduk thanks for looking into this. I had a todo for this to move it some other location other than lib (like a trace folder). I am not sure how we can do it. I tried few options of adding it in hadoop-dist but that didn't work. I will add your changes to the current PR. Please let me know how we can move it to another folder (in case it its needed) |
Paging @steveloughran @aw-was-here @busbey ... we'd like to ship this opentelemetry-agent.jar in the distribution but we don't want it on the class path. Can you advise on how we might manage the dependency in the pom, but relocate it for distribution? Maybe we add an invocation of the |
I haven't look at this code in forever, but I'm assuming no one has messed up the 'magic bits' in hadoop-dist that allows users to add things via HADOOP_OPTIONAL_TOOLS. See dev-support/bin/dist-tools-hooks-maker. |
…s and core-default.xml (apache#3099) Reviewed-by: Ayush Saxena <ayushsaxena@apache.org>
Reviewed-by: Ayush Saxena <ayushsaxena@apache.org> Reviewed-by: Akira Ajisaka <aajisaka@apache.org> Reviewed-by: Inigo Goiri <inigoiri@apache.org> Reviewed-by: Hui Fei <ferhui@apache.org> Reviewed-by: Viraj Jasani <vjasani@apache.org> Reviewed-by: litao <tomleescut@gmail.com>
💔 -1 overall
This message was automatically generated. |
@ndimiduk @cijothomas @iwasakims can you please help with the review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, let's get this into trunk. We know we need it.
naming
It's very confusing that the same last names are used everywhere (span, spancontext); that is something we will have to live with. However, we can at least use different names for fields and parameters. this PR does it in some places (OTelTracer) but not consistently
I'm going to propose openSpan, openSpanContext. ...
i think for scope we should use openScope for consistency, even when there is no conflict.
marshalling.
I think the span info should continue to be marshalled as optional bytes
.
- compatibility: nothing will break on the wire.
- flexibility to change again in future.
There is actually a fun little problem here. How to handle the situation where two services are built with different versions of tracing, and a client is sending/expecting htrace values?
Maybe a new optional byte array is needed for opentrace data; all ambiguity goes away then.
testing
There is scope for many more tests. I think the initial ones should be let's see what happens if hey span context has come over the wire and cannot be successfully unmarshalled. That is going to be the critical one for wire compatibility across versions.
optional use.
i think what nick dimiduk was asking for was the ability for things to work if the opentelemetry jar is not on the classpath. Is this right? if so, what to do?
We could hide all use of open telemetry with an optional in a class to our own tracing classes; if opentrace was found/enabled we would use that. otherwise it'd all be no-ops. We would keep all uses of the open telemetry class in one single class and use some reflection tricks to decide which to use (i.e. if the jar was found, use that binding...if not fall back to the noop).
import java.io.Closeable; | ||
|
||
public class Span implements Closeable { | ||
|
||
private io.opentelemetry.api.trace.Span span = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- add some javadoc to line 24 to warn this wraps the opentelemetry span.
- skip the =null assignment, as it will save the jvm from some needless init. assuming a lot of spans get created, this may matter
- give the field a different name. like 'openspan'
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/tracing/Span.java
Show resolved
Hide resolved
} | ||
|
||
public void close() { | ||
if(span != null){ | ||
span.end(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would span need to be nullified here. or is it ok to invoke it after being ended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would span need to be nullified here. or is it ok to invoke it after being ended?
I think we need not nullify span after it has ended. In general we call span.end in a finally block so no further operation takes place. If we end the span before that and try adding events or call end it won't have any impact.
private static final String TRACE_FLAGS = "TRACE_FLAGS"; | ||
|
||
|
||
private io.opentelemetry.api.trace.SpanContext spanContext = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, give a different field name to avoid more confusion; skip the assignment to null
String traceId = kvMap.get(TRACE_ID); | ||
String spanId = kvMap.get(SPAN_ID); | ||
String traceFlagsHex = kvMap.get(TRACE_FLAGS); | ||
TraceFlags traceFlags = TraceFlags.fromHex(traceFlagsHex, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if traceFlagsHex isn't found? exit fast
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/tracing/TraceUtils.java
Show resolved
Hide resolved
|
||
import org.junit.Test; | ||
|
||
import static org.junit.Assert.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* should extend AbstractHadoopTestBase or HadoopTestBase here.
- there is scope for a lot more tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steveloughran sure will start adding more tests
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for these changes. now as far as big issues we need to worry about, they are
- wire marshalling in protobuf: efficiency and compatibility
- what our story about dependencies are.
for dependencies we either embrace and mandate a new jar on the cp for iPC (easiest in our code) or do some reflection games to downgrade if it is not on the classpath. which, given telemetry isn't normally a critical feature, could be justified.
but having all of the opentelemetry APIs present makes it easier to pick up and use elsewhere (s3a auditing...).
how much extra pain in terms of transient dependencies would it be for us to always make the jar a dependency of hadoop common?
@@ -21,32 +21,36 @@ | |||
|
|||
import java.io.Closeable; | |||
|
|||
/*** | |||
* This class is a wrapper class on top of opentelemetry Span class | |||
* avoiding direct dependency on opentelemetry API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afraid you will need a "." at the end to keep all versions of javadoc happy. longstanding PITA
My thought for having a map was, the trace_id, parent_id, trace_flags, trace_state have fixed lengths the serialization and deserialization is still in progress for binary format. It might be good to wait for sometime based on this
Currently we need the agent jar only at runtime and its shaded already. We need to have in seperate path other than the common which will not be picked by default (ex $HADOOP_HOME/share/hadoop/trace). probably somewhere outside the common. Opentelemetry APIs are light . The agent jar will only be picked from env when enabled.
I am not sure if we need this
We can keep it common the whole size was close to 28MB and its shaded. @steveloughran Please share your thoughts |
… rpcheader for open span context
💔 -1 overall
This message was automatically generated. |
@steveloughran I have made change to add new field to propagate the span context at RPC Layer. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Description of PR
Changes in the PR are related to enabling tracing using opentelemetry. I am using opentelemetry-javaagent to initialize the Tracer object
How was this patch tested?
I created a build locally and got single node cluster setup and tested the changes.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?