-
Notifications
You must be signed in to change notification settings - Fork 202
Implementation of OC-Agent Trace Service clients. #1476
Implementation of OC-Agent Trace Service clients. #1476
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1476 +/- ##
============================================
- Coverage 82.46% 81.57% -0.89%
- Complexity 1590 1595 +5
============================================
Files 243 244 +1
Lines 7493 7610 +117
Branches 702 718 +16
============================================
+ Hits 6179 6208 +29
- Misses 1098 1185 +87
- Partials 216 217 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some high-level comments about the overall design.
...agent/src/test/java/io/opencensus/exporter/trace/ocagent/OcAgentTraceServiceClientsTest.java
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public void onError(Throwable t) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we at least log this? Should we try to reconnect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this error is logged and we should retry on errors.
"Failed to connect to OC-Agent because of %s. Retried %d time(s).", e, i)); | ||
// Exponential backoff. | ||
try { | ||
Thread.sleep(0, getBackoffTimeInNanos(i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should backoff with more than 1ms. Based on the definition of the Sleep the number of nanos has to be <1M.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed exponential backoff since we'll retry every 5 minutes (or a custom time interval).
if (useInsecure == null) { | ||
useInsecure = false; | ||
} | ||
client = OcAgentTraceServiceClients.openStreams(endPoint, useInsecure, serviceName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably I miss understand the current design, but I want to make sure we are on the same page. I think the whole idea was to watch for the oc-agent to run. So if we fail to connect we should retry every 1min (or configurable time). The main workflow is if somebody deploys this then after 1 week they decide that they need traces they will be able to just deploy the ocagent without restarting their app.
High-level design:
- In the application (or other library) we will have something like "registerOcAgentExporterWhenAvailable"
- Start a new Thread (for the purpose of these explanation say name = ThreadOcAgent) that tries to connect to the oc-agent every 1 minute.
- If not able to connect try again in 1 minute (or configurable time).
- If able to connect:
- Install the OcAgentSpanExportHandler to the SpanExporter. So spans will be exported.
- When disconnected uninstall the OcAgentSpanExportHandler from the SpanExporter, also ThreadOcAgent should again retry to connect in 1 minute.
- The config rpc handling can be done in the ThreadOcAgent as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we fail to connect we should retry every 1min (or configurable time).
Do you mean the Agent exporter should keep retrying forever if initiation did not succeed? I saw in Go we only attempted to connect 10 times with exponential backoff, that's also what I did here.
/cc @odeke-em
In the application (or other library) we will have something like "registerOcAgentExporterWhenAvailable"
There's already an API OcAgentTraceExporter.createAndRegister()
. It registers the OcAgentTraceExporterHandler
to SpanExporter
. Before we managed to connect to Agent, the OcAgentTraceExporterHandler
will just do no-op.
Start a new Thread (for the purpose of these explanation say name = ThreadOcAgent) that tries to connect to the oc-agent every 1 minute.
If not able to connect try again in 1 minute (or configurable time).
If we need to keep retrying connection, I agree this should be done in another thread.
Install the OcAgentSpanExportHandler to the SpanExporter. So spans will be exported.
When disconnected uninstall the OcAgentSpanExportHandler from the SpanExporter, also ThreadOcAgent should again retry to connect in 1 minute.
My current design is to register OcAgentSpanExportHandler
first. If connection failed, it will just do no-op. I feel this is more straightforward and efficient than doing the register/unregister to SpanExporter
every now and then, and these two approaches should have the same behavior to the end users.
The config rpc handling can be done in the ThreadOcAgent as well.
Config
RPC has to be in its own thread, because Config
is a blocking RPC. Once the client (OC-Library) sends back the CurrentLibraryConfig
to Agent, that thread will be blocked waiting for the next UpdatedLibraryConfig
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, we think keep retrying is the right thing to do here. Will update the code based on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I hadn't seen this ping, I'll also create an issue for the Go agent exporter.
Please ping me when this is ready for review. |
Updated, PTAL. This works as follows:
|
78c3ce7
to
68141ab
Compare
Rebased, PTAL. |
return exportRequestObserverRef.get() != null; | ||
} | ||
|
||
private static boolean isConfigConnected() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have an option in the configuration of the ocagent exporter "enable_config" or something similar that allows users to configure if this exporter should also handle the config.
Also based on some comments on gitter we may have a configuration to that says keep connection open (for Android devices we may not want to always keep the RPC connection/request alive). We can address this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an option enableConfig
to OcAgentTraceExporterConfiguration
. Also left a TODO for adding an option on keeping the connection open.
Also renamed and re-organized a few helper methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments that I wrote early this morning. I will do a full review soon.
build.gradle
Outdated
@@ -26,6 +26,10 @@ def useCheckerFramework = rootProject.hasProperty('checkerFramework') | |||
def useErrorProne = !useCheckerFramework | |||
|
|||
subprojects { | |||
// OC-Agent exporter depends on grpc-core which depends on errorprone annotations, and it | |||
// doesn't wort with checker framework. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/wort/work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* Returns the retry time interval when trying to connect to Agent. | ||
* | ||
* @return the retry time interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"." at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @since 0.17 | ||
*/ | ||
@Nullable | ||
public abstract Boolean getEnableConfig(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use boolean and make the default value "true"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
As we discussed, I'll try to decouple the logic and send separate PRs. Closing this PR for now. |
Fixes #1408.
Equivalent to https://github.com/census-ecosystem/opencensus-go-exporter-ocagent/blob/master/ocagent.go.
This works similarly to Go Agent exporter:
Export
will be handled in the main thread. OC library (SpanExporter
) pushesSpanData
toExportClient
, thenExportClient
translates them toSpanProto
and pushes to Agent.Config
will be handled in a daemon thread, since waiting and applying incomingConfig
will block the thread.