
Make sure forceFlush / shutdown can have callers wait for them to be done by returning CompletableResultCode. #1571

Merged

Conversation

anuraaga
Contributor

@anuraaga anuraaga commented Aug 21, 2020

/cc @huntc

Fixes #1584

@codecov

codecov bot commented Aug 21, 2020

Codecov Report

Merging #1571 into master will increase coverage by 0.08%.
The diff coverage is 86.11%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1571      +/-   ##
============================================
+ Coverage     86.87%   86.96%   +0.08%     
- Complexity     1406     1408       +2     
============================================
  Files           163      163              
  Lines          5470     5469       -1     
  Branches        526      529       +3     
============================================
+ Hits           4752     4756       +4     
+ Misses          530      525       -5     
  Partials        188      188              
Impacted Files Coverage Δ Complexity Δ
...try/exporters/inmemory/InMemoryMetricExporter.java 100.00% <ø> (ø) 8.00 <0.00> (ø)
...metry/exporters/jaeger/JaegerGrpcSpanExporter.java 72.00% <0.00%> (-3.95%) 4.00 <0.00> (ø)
...metry/exporters/logging/LoggingMetricExporter.java 87.50% <ø> (ø) 7.00 <0.00> (ø)
...lemetry/exporters/otlp/OtlpGrpcMetricExporter.java 56.00% <ø> (ø) 6.00 <0.00> (ø)
...metry/sdk/metrics/export/IntervalMetricReader.java 87.69% <ø> (ø) 4.00 <0.00> (ø)
...elemetry/sdk/trace/export/SimpleSpanProcessor.java 97.29% <50.00%> (-0.08%) 10.00 <1.00> (ø)
...k/extensions/trace/export/DisruptorEventQueue.java 81.31% <77.77%> (-2.20%) 13.00 <2.00> (+1.00) ⬇️
...telemetry/sdk/trace/export/BatchSpanProcessor.java 90.20% <93.33%> (+1.00%) 8.00 <1.00> (ø)
...metry/exporters/inmemory/InMemorySpanExporter.java 95.45% <100.00%> (ø) 7.00 <0.00> (ø)
...lemetry/exporters/logging/LoggingSpanExporter.java 85.71% <100.00%> (-0.96%) 7.00 <1.00> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bac7193...6e3e2ba. Read the comment docs.

logger.log(Level.FINE, "Exporter busy. Dropping spans.");
}
}
private ScheduledFuture<?> scheduleNextExport() {
Contributor

Shouldn't this always update nextExport?

Contributor Author

Oops mixed myself up here

}
});

// Timeout if result doesn't complete soon enough.
Contributor

Not specific to this PR: Why this timeout is the configuration of BSP and not of the exporter?

Contributor Author

It's in the spec too ;)

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#batching-processor

I have this issue in case you'd like to add other potential improvements to the spec!

open-telemetry/opentelemetry-specification#849

@SuppressWarnings("FutureReturnValueIgnored")
private void exportBatch() {
List<ReadableSpan> batch = new ArrayList<>(Math.min(maxExportBatchSize, queue.size()));
if (queue.drainTo(batch, maxExportBatchSize) == 0) {
Contributor

Should benchmark this, but we can save on allocations if we take elements by one, convert them to SpanData and only then put to outgoing forExport. Then we don't need batch.

Contributor Author

I think poll has to take the queue's lock on every call, while drainTo only takes it once for the entire batch. We can only say for sure with a benchmark, but pending one I lean towards sticking with drainTo.
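A minimal sketch of the drainTo pattern under discussion, using Integer placeholders instead of ReadableSpan (class and method names here are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DrainDemo {
  // drainTo moves up to maxBatch queued elements in one call; for
  // ArrayBlockingQueue that means taking the internal lock once for the
  // whole batch instead of once per element (as repeated poll() would).
  static List<Integer> drainBatch(BlockingQueue<Integer> queue, int maxBatch) {
    List<Integer> batch = new ArrayList<>(Math.min(maxBatch, queue.size()));
    queue.drainTo(batch, maxBatch);
    return batch;
  }

  public static void main(String[] args) {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
    for (int i = 0; i < 10; i++) {
      queue.add(i);
    }
    // Drains the head-of-queue elements first, leaving the rest queued.
    System.out.println(drainBatch(queue, 4));
  }
}
```

Either way the real difference is lock traffic, which only a benchmark can settle, as noted above.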

monitor.notifyAll();
}
}
List<SpanData> forExport = new ArrayList<>(batch.size());
Contributor

Can we reuse this list and not re-allocate it on every export?

Contributor Author

This one we pass to the exporter so we can't reuse, I added a reused ReadableSpan buffer though.

if (queue.size() >= maxExportBatchSize) {
exportBatch();
} else {
scheduleNextExport();
Contributor

I am concerned that if any error situation prevents the runtime from reaching this line for one export attempt, the whole BSP will stop working. It seems that the current design (schedule the next export when this one finishes) tries to achieve "fixed delay". If we switched to "fixed rate" we could simplify it. I don't see why "fixed rate" is worse than "fixed delay".

Contributor Author

I agree with the sentiment, but I think we can be confident in our implementation of CompletableResultCode - so worst case should be the timeout below calling this callback no matter what.

I think fixed delay tends to be safer by easily preventing concurrent exports (avoid the need to keep track of availability) and is my reading of the spec anyways. Fixed rate would still be ok I think if we didn't have the eager export when queue still has items like on line 200, which makes the time spent on exporting very unpredictable - I noticed the other implementations have this behavior too so kept it here. Also if we went with fixed rate, I don't think we'd be able to do buffer reuse optimization as in https://github.com/open-telemetry/opentelemetry-java/pull/1571/files#r474439925

I've seen a lot of similar async + scheduled tasks in Armeria and we almost always use this delay pattern to avoid the concurrency issue, and are just confident that callbacks always get called - I'd like to keep it like this but let me know if it's a strong concern.
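The fixed-delay pattern described above - schedule the next export only after the current one finishes, so exports can never overlap - can be sketched roughly like this (class and field names are hypothetical, not the PR's actual code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayLoop {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  private final long delayMillis;
  private final Runnable exportOnce;

  FixedDelayLoop(long delayMillis, Runnable exportOnce) {
    this.delayMillis = delayMillis;
    this.exportOnce = exportOnce;
  }

  // "Fixed delay": the next export is scheduled only after the current one
  // completes, so two exports never run concurrently.
  void scheduleNextExport() {
    executor.schedule(
        () -> {
          try {
            exportOnce.run();
          } finally {
            // If this line were never reached, the loop would stop -- the
            // concern raised in the review. The finally block is one way to
            // make reaching it more robust.
            scheduleNextExport();
          }
        },
        delayMillis,
        TimeUnit.MILLISECONDS);
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

A "fixed rate" variant would instead be a single scheduleAtFixedRate call, at the cost of needing to guard against overlapping runs when an export outlasts the interval.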

Contributor Author

@anuraaga anuraaga left a comment

Thanks - also realized forceFlush / shutdown need to wait for the spans to be sent (forceFlush is for FaaS like lambda)

@anuraaga anuraaga force-pushed the simplify-batch-span-processor branch from ee18066 to e164075 Compare August 21, 2020 08:40
forceFlush = true;
// Cancel only returns true once.
if (!nextExport.cancel(false)) {
// Already exporting.
Contributor

Empty if?

exportBatch(result);
} else if (forceFlush) {
// When we're forcing a flush we want to export even when we don't have a full batch.
forceFlush = false;
Contributor

With these recent changes it became much more complicated :(

Contributor Author

Yeah - I noticed that forceFlush / shutdown weren't waiting for all the processing anymore in the current code either, though from what I can tell from their javadoc they have to. Not sure of a way around that without allowing concurrent exports.

timer.cancel();
spanExporter.shutdown();
});
final CountDownLatch latch = new CountDownLatch(1);
Contributor

It sure would be nice if we had CompletableFuture here, so we could just have exportBatch return one and we could wait for it to a complete with a timeout. :(

Contributor Author

I got caught up handling forceFlush in a way that tried to respect the delay as much as possible (e.g., canceling the schedule). I realized this is overkill; even if there's a bit of extra churn when flushing, it's OK given the use case for the function. Will rework to remove a lot of cruft.

Contributor Author

Needed to also add the waiting here - forceFlush, unlike normal export, does need to block on the export that's what it's designed for. I noticed our tests never actually verified this since they were blocking outside of forceFlush so I tweaked them.

@jkwatson Do you think I should move this into a join method on CompletableResultCode so it looks like a duck CompletableFuture?

Contributor

I don’t think it is a great concern as what you have reads fine to me in the absence of Java 8.

});

// Timeout if result doesn't complete soon enough.
executor.schedule(
Contributor

it hurts me that we have to do this to implement the timeout. Especially since it won't actually stop the export from running!

Contributor

@huntc huntc Aug 21, 2020

Nothing can stop an export from running. We’ve no idea what threads it runs on.

Contributor Author

If we need to add canceling of ongoing export, we can add a cancel method to CompletableResultCode which cancellable exporters would need to listen for. Can think about it for the future.

Contributor

You can already do it. Fail the result code from here. Exporters should also register for completion and handle accordingly.

Contributor

The previous code had this functionality given the need to shutdown exports taking too long.

Contributor Author

It's an interesting idea to have the exporter also listen on its result, don't think I've seen that pattern before but it sounds like it'd work! In that case yeah this should be ok since we're failing the export result like before in the timeout.

Contributor

@huntc huntc Aug 21, 2020

Exporters should subscribe to their completion status by convention. It is the only way they can properly handle their shutdown method.

We should probably add a sentence to the exporter shutdown method to highlight this, and also that there is only ever one completion in flight at any one time.
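The convention described here - the timeout task fails the result, and the exporter itself subscribes to its result's completion so it can abort - can be sketched with a minimal stand-in for CompletableResultCode (this is a simplified illustration, not the SDK's actual implementation; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal stand-in for the SDK's CompletableResultCode: completes once,
// notifies registered listeners on completion.
class ResultCode {
  private final List<Runnable> listeners = new ArrayList<>();
  private Boolean success; // null until completed

  synchronized void succeed() { complete(true); }

  synchronized void fail() { complete(false); }

  private void complete(boolean ok) {
    if (success != null) {
      return; // only the first completion wins
    }
    success = ok;
    for (Runnable listener : listeners) {
      listener.run();
    }
  }

  synchronized void whenComplete(Runnable listener) {
    if (success != null) {
      listener.run(); // already done: notify immediately
    } else {
      listeners.add(listener);
    }
  }

  synchronized boolean isSuccess() { return success != null && success; }
}

public class ExportTimeout {
  // Fails the result if the exporter hasn't completed it in time. This does
  // not stop the export thread; a cooperative exporter subscribes to the
  // result via whenComplete and bails out when it observes early failure.
  static void failAfterTimeout(
      ScheduledExecutorService executor, ResultCode result, long timeoutMillis) {
    executor.schedule(result::fail, timeoutMillis, TimeUnit.MILLISECONDS);
  }
}
```

Because completion is one-shot, a late succeed() from the exporter after the timeout fired is a no-op, which is what makes the pattern safe.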

@huntc
Contributor

huntc commented Aug 21, 2020

This PR has reduced code coverage and increased complexity:

- Coverage     86.85%   86.78%   -0.07%     
- Complexity     1392     1398       +6     

Contributor Author

@anuraaga anuraaga left a comment

The lower coverage seems to only be the lines for the interrupted exception from what I can read from the diff. Handling forceFlush has unfortunately affected that a bit.

}
});
try {
latch.await(exporterTimeoutMillis, TimeUnit.MILLISECONDS);
Contributor

It’s disappointing that we’ve resorted to blocking a thread again as a solution given the effort I went to removing it in the past. Blocking is generally avoidable.

Contributor

To add to this, passing in an executor or Timer (existing impl) that the BSP can share with other parts of an app is important. Thread and their pools are an expensive resource, particularly for embedded environments (like Android ;-) ).

Sharing thread resources does mean that blocking is to be avoided.

Contributor Author

I think forceFlush would have to return a CompletableResultCode to allow a caller to wait if it needs to - this would be a huge API change since span processors are defined in the spec itself, both it and the javadoc recommend blocking

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#shutdown

Of course since some languages can't block anyways this does seem like an impractical spec in some sense.

@jkwatson any appetite for changing forceFlush / shutdown to return completable?

Contributor

I think that's fine. Users who don't care about the result can continue to ignore it. :)

Contributor Author

Ah, also just to make sure, on the point of sharing threads: allowing a user to provide an executor is interesting - I think we can do it, but we would probably lose the buffer reuse optimization since it may not be a single thread.

But since the point is about blocking shared threads wanted to point out this is blocking the caller, not the processing thread. The method is currently intended to be used when the caller wants that.

Contributor Author

@jkwatson Cool - let me try changing the return type, it does also give the option of an async force flush which is nice.

Contributor Author

@anuraaga anuraaga left a comment

Made the change to make flush / shutdown async, it's a doozy. Since we return these from SpanProcessor now they're very close to users (e.g., they may be calling forceFlush) so I added .get() and .ofAll() to give a bit of convenience when using them.

throw new TimeoutException();
}
return this;
}
Contributor

@huntc huntc Aug 22, 2020

I’d really rather we don’t add methods that encourage blocking. My past colleague, Viktor Klang, once said that providing an await on Scala’s Future was a mistake. People WILL abuse this!!!

Please remove it.

Contributor Author

@anuraaga anuraaga Aug 22, 2020

Scala has a lot of patterns for asynchronous programming and I can understand that sentiment, but this is Java: it's used in a lot of codebases, and arguably the majority are standard Tomcat thread-per-request apps that block a lot. The CountDownLatch pattern is too tedious for these users if they need to wait for a flush - as a Java API, we need to make sure blocking users have an idiomatic API as well.

FWIW, I'm a huge proponent of asynchronous programming and it's one of the reasons I've been trying my best to push all the work on CompletableResultCode forward. Even in Armeria, it took me months to finally allow a blocking option for gRPC since async is awesome! But after users continue to have problems because of it, it became obvious that in Java it's not fair to the users to make blocking inconvenient since many need it, so finally changed my mind. I appreciate you pointing out that instead of blocking by default, we should return CompletableResultCode and I think the API is much more flexible thanks to it :) But can't ignore the use cases for blocking here too, hope that makes sense. (rereading that linked issue apparently I saved someone's life that's gotta be worth it? :P)

Contributor

@huntc huntc Aug 22, 2020

The Java vs Scala angle on async programming is inaccurate. The world for Java developers is no different from that of Scala developers when it comes to async.

This get method will encourage blocking where it is not required. Please reconsider.

Contributor

@huntc huntc Aug 22, 2020

Perhaps as a compromise, let's see if a problem surfaces where a user of Export related code demands the blocking scenario and remove the get for now. Less API surface area to maintain is always a good thing. WDYT?

Contributor Author

forceFlush exists in the first place mainly for FaaS like AWS Lambda. They require blocking processing or risk having the runtime freeze. Ideally the lambda runtime would have a callback to control freeze - I don't see any reason that won't happen someday but we're not there yet and this is just one runtime. Providing a way to wait for spans to process in FaaS is a well understood use case for this API.

Shutdown also, very few shutdown hooks would function without blocking, I believe this is still the case for e.g. Spring when it closes beans.

When CompletableResultCode was only at the exporter I think it was far from the user but the span processor is much closer, and we have well known use cases for these to allow blocking so let's provide support for this. FWIW we're only providing a version with a timeout here, and I don't see a reason to provide a no-timeout version which would be vastly easier to call in the wrong way.

Contributor

A force-flush will see that a span processor's buffers are flushed. This has no effect on any async activity being performed by an exporter at the time. Some exporters, like NR's, apparently process exports in the background having immediately returned from the export method.

Contributor

Perhaps a sync or pause style method is required to convey that the environment is about to suspend.

Contributor

@huntc huntc Aug 22, 2020

Your earlier suggestion of a join on CompletableResultCode is now quite appealing. If you have join then you shouldn't require get and join's semantics are nice here. Also, force-flush can be async for nearly all use-cases.

NR's exporter is a separate issue. Anything to be used in a FaaS or shutdown style scenario will need to utilise CompleteableResultCode. Joining force-flush's result code will then "just work".

Contributor Author

Thanks! I have always considered get and join to be the same, but the former has checked exceptions and allows a timeout. Are you suggesting we go with join with unchecked exceptions? Or maybe I'm missing a difference in semantics - I've never considered the two in much depth, so let me know if join seems better. I may have to define an exception type if we use unchecked exceptions.

Contributor

Are you suggesting we go with join with unchecked exceptions?

Yeah.

Using get returns a result that we aren't really interested in - it has no real meaning for the context where we're shutting down or flushing. Joining conveys more about what we're trying to do IMHO. The fact that join returns a value is weird to me, but we should be consistent with CompletableFuture I guess.

Exceptions are generally going to be non-recoverable too, so returning an unchecked exception makes sense to me.

I think the use of join should be rare (as it is with CompletableFuture). But the ideal for the use-cases you mentioned.
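The join semantics being converged on here - block until done or timeout, no checked exceptions, return this for chaining in the style of CompletableFuture.join - might look roughly like this sketch (not the actual CompletableResultCode implementation; names are illustrative):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class JoinableResult {
  private final CountDownLatch latch = new CountDownLatch(1);
  private volatile boolean success;

  public void succeed() {
    success = true;
    latch.countDown();
  }

  public void fail() {
    latch.countDown();
  }

  // Blocks the caller until completion or timeout. No checked exceptions:
  // on timeout we simply return with the result still incomplete, and an
  // interrupt just restores the thread's interrupted flag.
  public JoinableResult join(long timeout, TimeUnit unit) {
    try {
      latch.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return this;
  }

  public boolean isSuccess() {
    return latch.getCount() == 0 && success;
  }
}
```

Offering only the timeout variant, as discussed above, keeps the easy-to-misuse unbounded wait off the API surface.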

@anuraaga anuraaga changed the title Simplify BatchSpanProcessor by using a BlockingQueue / ScheduledExecu… Make sure forceFlush / shutdown can have callers wait for them to be done by returning CompletableResultCode. Aug 24, 2020
@anuraaga
Contributor Author

Went ahead and changed get to join, which doesn't throw exceptions. The majority of this PR is actually making sure forceFlush can actually be waited on. Without changing the API for SpanProcessor.forceFlush - if we were to stick with void - it's not possible to do an async flush, so increasing the flexibility does seem to make sense. But it's definitely a far-reaching change.

@anuraaga
Contributor Author

My current thinking after comparing with #1579 is that it could make sense to keep the default BatchSpanProcessor a simple implementation that blocks between exports. In my own app, I'd probably swap it out with a custom non-blocking span processor that uses a fastutil MpscQueue and a netty event loop for scheduling anyway (I've done this for my usage of brave for a long time).

One of the big problems I have with this PR is avoiding a stack overflow - in practice, even when directly exporting the next batch of spans, it would not result in stack overflows for properly asynchronous exporters. But there's no way to guarantee the exporter is asynchronous, and BatchSpanProcessor should be generic enough to work with synchronous exporters too. I think the workaround I did - using executor.submit without a schedule to immediately execute the next batch - probably has at least as much overhead as a normal blocking call within a loop.

I think we would need the API change to SpanProcessor regardless though since it's not possible to even swap in a non-blocking span processor right now where forceFlush can't be invoked asynchronously.

@iNikem
Contributor

iNikem commented Aug 24, 2020

I don't understand why flush should return a future (or any analogue). What are those 2 different use-cases where you want both blocking and non-blocking flush? The purpose of flush is to handle FaaS. I think that use-case requires blocking flush. What are others?

@anuraaga
Contributor Author

They're both FaaS :) But it depends on the implementation of the FaaS runtime: whether the only way to prevent freezing is blocking (the norm right now), or whether it exposes a callback to be notified when to freeze - the latter would allow asynchronous apps (e.g., something using just reactive streams), which isn't possible with a blocking void flush command.

@iNikem
Contributor

iNikem commented Aug 24, 2020

Are there such FaaS providers already? Or we just speculate and prepare for a future (pun intended) :)

@anuraaga
Contributor Author

I've heard such a future is coming ;) We could also add flushAsync later if we want to punt on it, but I think we may still at least need to speculatively move CompletableResultCode up to allow that.

Contributor

@huntc huntc left a comment

Looking good! I particularly like the async forceFlush now.

span6.toSpanData());

// Give time for the BatchSpanProcessor to attempt to export spans.
Thread.sleep(500);
Contributor

Using Thread.sleep is fragile in tests

Contributor Author

I agree - I couldn't find a way around it for this test though.

Contributor

@huntc huntc Aug 24, 2020

I agree - I couldn't find a way around it for this test though.

I had the same problem in my PR. The test should assert only what it can guarantee. If we aren’t happy with that then we should remove the test entirely, as the required outcome can’t be achieved.

Flaky tests will cause headaches for some unfortunate soul down the track.

this.spanExporter = spanExporter;
queue = new ArrayBlockingQueue<>(maxQueueSize);
executor =
Executors.newSingleThreadScheduledExecutor(new DaemonThreadFactory(WORKER_THREAD_NAME));
Contributor

Can we pass in the executor here so that the BSP can share resources and leave the executor as an outside concern?

Contributor Author

I like the idea of allowing to pass an executor but would like to keep it to a follow up if that's ok, it's a small change since we'd need a fallback anyways, and this PR has enough to digest.

Contributor

I don't think there should be a fallback executor. The executor should always be an external concern IMHO. Users should be encouraged to think about these things... :-)

new Runnable() {
@Override
public void run() {
executor.shutdown();
Contributor

The executor should be an outside concern

Contributor Author

Ditto, I don't think we should require a user to bring their executor. So if we support it, we would shutdown a fallback executor, or not shutdown otherwise, and I'd like to handle that separately.
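The follow-up being deferred here - accept an optional shared executor but keep an internally owned fallback, shutting down only the fallback - could look roughly like this (a sketch under that assumption; the class and its names are hypothetical):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Wraps either a caller-supplied shared executor or an owned fallback.
// Only the fallback is shut down by the processor; a shared executor's
// lifecycle remains the caller's concern.
public class ProcessorExecutor {
  private final ScheduledExecutorService executor;
  private final boolean ownsExecutor;

  ProcessorExecutor(ScheduledExecutorService shared) {
    if (shared != null) {
      this.executor = shared;
      this.ownsExecutor = false;
    } else {
      this.executor = Executors.newSingleThreadScheduledExecutor();
      this.ownsExecutor = true;
    }
  }

  ScheduledExecutorService executor() {
    return executor;
  }

  void shutdown() {
    if (ownsExecutor) {
      executor.shutdown(); // safe: nobody else is using it
    }
  }

  boolean ownsExecutor() {
    return ownsExecutor;
  }
}
```

Whether a fallback should exist at all is exactly the disagreement in this thread; the sketch just shows how both positions can coexist behind one constructor.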

final int limit = offset + num;
List<SpanData> forExport = new ArrayList<>(num);
for (int i = offset; i < limit; i++) {
forExport.add(spans.get(i).toSpanData());
}
Contributor

I'm thinking that a native array copy would be more efficient here... and then form an array list around that for calling the exporter. WDYT?

Contributor Author

TBH I've never compared the performance of a preallocated ArrayList filled with .add() vs. allocating an array and wrapping it with Arrays.asList. I'll need to write a JMH benchmark to confirm the difference, and since i doesn't start at 0 there is some extra cognitive load, so I'd like to try that only after confirming the perf difference.
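For illustration, the two copying strategies under discussion could look like the following (hypothetical helper names, with String standing in for SpanData; only a benchmark would settle which is faster on a given JVM):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative comparison of the two strategies from this thread.
final class CopyStrategies {
  // Current approach: preallocated ArrayList filled with add().
  static List<String> withArrayList(List<String> spans, int offset, int num) {
    int limit = offset + num;
    List<String> forExport = new ArrayList<>(num);
    for (int i = offset; i < limit; i++) {
      forExport.add(spans.get(i));
    }
    return forExport;
  }

  // Suggested approach: native array copy, then a List view for the exporter.
  // subList(...).toArray(...) bottoms out in System.arraycopy for ArrayList.
  static List<String> withArrayCopy(List<String> spans, int offset, int num) {
    String[] copy = spans.subList(offset, offset + num).toArray(new String[0]);
    return Arrays.asList(copy);
  }
}
```

Both produce the same batch; the second also sidesteps the "i doesn't start at 0" cognitive load mentioned above.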

@huntc
Contributor

huntc commented Aug 24, 2020

> I don't understand why flush should return a future (or any analogue). What are those 2 different use-cases where you want both blocking and non-blocking flush? The purpose of flush is to handle FaaS. I think that use-case requires blocking flush. What are others?

FaaS isn't the only reason for flushing. Shutting down causes an implicit flush and then a join. Keep the joins (blocking) to a minimum, otherwise they will be abused.
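To illustrate both call styles, here is a minimal sketch using CompletableFuture as a stand-in for the SDK's CompletableResultCode; the flush body and method names are placeholders, not the real exporter logic.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the two usage styles a future-like flush result enables.
final class FlushStyles {
  private final AtomicBoolean flushed = new AtomicBoolean();

  CompletableFuture<Void> forceFlush() {
    // Pretend the exporter finishes asynchronously.
    return CompletableFuture.runAsync(() -> flushed.set(true));
  }

  // Non-blocking: register a callback and carry on (e.g. a shutdown path
  // that only needs to be notified when the flush completes).
  void flushInBackground(Runnable onDone) {
    forceFlush().thenRun(onDone);
  }

  // Blocking: wait with a timeout, as a FaaS handler would before returning.
  void flushBeforeExit() {
    try {
      forceFlush().get(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (ExecutionException | TimeoutException e) {
      throw new IllegalStateException("flush did not complete", e);
    }
  }

  boolean isFlushed() {
    return flushed.get();
  }
}
```

Returning the future supports both callers: the blocking FaaS case simply waits on it, while shutdown paths can compose it without joining.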

nextExport.cancel(false);

final CompletableResultCode result = new CompletableResultCode();
final CompletableResultCode flushResult = forceFlush();
Contributor

It feels like there should be a way to compose the things in here more elegantly, but it could be done as a follow-on PR.

Contributor

Lambdas ;-)
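A hedged sketch of what lambda-based composition of the shutdown steps could look like, again with CompletableFuture standing in for CompletableResultCode and placeholder step bodies:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: compose "flush, then stop the worker, then surface failures"
// with lambdas rather than hand-wired callbacks. Step bodies are
// placeholders tracked by a counter so the sequencing is visible.
final class ShutdownComposition {
  private final AtomicInteger step = new AtomicInteger();

  CompletableFuture<Void> forceFlush() {
    // Step 1: drain the span queue.
    return CompletableFuture.runAsync(() -> step.compareAndSet(0, 1));
  }

  CompletableFuture<Void> shutdown() {
    return forceFlush()
        // Step 2: cancel the scheduled export and stop the worker.
        .thenRun(() -> step.compareAndSet(1, 2))
        .whenComplete((ignored, error) -> {
          if (error != null) {
            step.set(-1); // surface the failure instead of swallowing it
          }
        });
  }

  int step() {
    return step.get();
  }
}
```

The same shape would apply to the PR's shutdown path once the result type grows combinators, which is why this reads naturally as a follow-on.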

@SuppressWarnings("FutureReturnValueIgnored")
private void exportNextBatch(
final List<ReadableSpan> spans, final int offset, final CompletableResultCode result) {
int num = Math.min(maxExportBatchSize, spans.size() - offset);
Contributor

maybe numberToExport, rather than just num?

}
exportResult.whenComplete(
Contributor

What do you think about putting this inside the try block? Seems like it might read a little more cleanly not to have to pre-declare the exportResult.

@jkwatson
Contributor

This is more complex than I hoped, but I think it looks pretty solid. Needs a rebase, after the addition of the ReadWriteSpan.

@huntc
Contributor

huntc commented Aug 24, 2020

> This is more complex than I hoped, but I think it looks pretty solid. Needs a rebase, after the addition of the ReadWriteSpan.

FWIW I don't use the BatchSpanProcessor. For me, batching and other concerns such as thread usage belong within the Exporter. I therefore use the SimpleSpanProcessor. I suspect that as other exporters become more sophisticated, they will also encourage the use of SimpleSpanProcessor.

FYI I've got a replacement for the OTel exporter that uses Akka Streams. Akka Streams makes it trivial to batch, scan, slide and so much more. I expose the stream so that a user can easily add their own batch stages, and even choose how they want to communicate with a collector (or even whether they do!). Streams in general are a great programming abstraction here. Composition is key.

@jkwatson
Contributor

> This is more complex than I hoped, but I think it looks pretty solid. Needs a rebase, after the addition of the ReadWriteSpan.

> FWIW I don't use the BatchSpanProcessor. For me, batching and other concerns such as thread usage belong within the Exporter. I therefore use the SimpleSpanProcessor. I suspect that as other exporters become more sophisticated, they will also encourage the use of SimpleSpanProcessor.

> FYI I've got a replacement for the OTel exporter that uses Akka Streams. Akka Streams makes it trivial to batch, scan, slide and so much more. I expose the stream so that a user can easily add their own batch stages, and even choose how they want to communicate with a collector (or even whether they do!). Streams in general are a great programming abstraction here. Composition is key.

Absolutely. A java 8 Stream-based processor/exporter would also be a great thing to have.
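As a rough illustration of the idea, batching with plain java.util.stream could be sketched like this. Note that the stream API has no built-in windowing stage (unlike Akka Streams' grouped), so batches are formed by index; the names here are illustrative, not part of the SDK.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: split a buffered list of spans into export-sized
// batches using Java 8 streams. Each batch is a subList view, so no
// per-element copying happens until an exporter consumes it.
final class StreamBatcher {
  static <T> List<List<T>> batches(List<T> items, int batchSize) {
    int batchCount = (items.size() + batchSize - 1) / batchSize; // ceiling division
    return IntStream.range(0, batchCount)
        .mapToObj(i ->
            items.subList(i * batchSize, Math.min((i + 1) * batchSize, items.size())))
        .collect(Collectors.toList());
  }
}
```

A real stream-based processor would also need the time-based trigger and backpressure that Akka Streams provides out of the box, which is where the composition argument above has the most force.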

iNikem added a commit to iNikem/opentelemetry-java that referenced this pull request Aug 25, 2020
iNikem added a commit to iNikem/opentelemetry-java that referenced this pull request Aug 26, 2020
jkwatson pushed a commit that referenced this pull request Aug 26, 2020
* More aggressive version of BatchSpanProcessor

* Add benchmark

* Polish

* Polish

* Incorporated some changes from #1571

* Rollback one test change

* Polish

* Polish
@jkwatson
Contributor

@anuraaga Looks like this needs to be updated after #1579 got merged.

Anuraag Agrawal added 3 commits August 27, 2020 15:28
@anuraaga anuraaga force-pushed the simplify-batch-span-processor branch from 141c05e to 555df8b on August 27, 2020 06:53
@anuraaga
Contributor Author

@jkwatson Updated

Contributor

@jkwatson jkwatson left a comment

Thanks!

@jkwatson jkwatson merged commit c6c179c into open-telemetry:master Aug 28, 2020
Successfully merging this pull request may close these issues: Non-blocking SpanProcessor interface.