
fix(sdk-trace-base): eager exporting for batch span processor #3458

Closed
wants to merge 32 commits

Conversation

@seemk (Contributor) commented Nov 30, 2022

Which problem is this PR solving?

Fixes the BatchSpanProcessor (BSP) silently dropping spans.

Fixes #3094

Short description of the changes

Add eager exporting to BSP:

  1. Start an export as soon as the max batch size is reached after a span has been added to the queue.
  2. The usual export loop runs as-is based on the BSP delay interval.

Additionally:

  • Unref the timeout timer when exporting.
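
For illustration, here is a rough, self-contained sketch of the intended behaviour (names and structure are simplified and not taken verbatim from this PR):

```ts
// Simplified sketch of eager exporting; illustrative only, not the actual
// BatchSpanProcessorBase implementation.
type ReadableSpan = { name: string };

class EagerBatchProcessorSketch {
  private buffer: ReadableSpan[] = [];
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly maxExportBatchSize = 512,
    private readonly scheduledDelayMillis = 5000,
    private readonly exportBatch: (spans: ReadableSpan[]) => Promise<void> = async () => {}
  ) {}

  onEnd(span: ReadableSpan): void {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxExportBatchSize) {
      // Eager path: the batch is full, so export on the next tick instead of
      // waiting up to scheduledDelayMillis.
      this.resetTimer(0);
    } else if (this.timer === undefined) {
      // Regular path: the usual delayed export loop.
      this.resetTimer(this.scheduledDelayMillis);
    }
  }

  private resetTimer(delay: number): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.timer = undefined;
      void this.exportBatch(this.buffer.splice(0, this.maxExportBatchSize));
      // Keep draining if spans are still queued.
      if (this.buffer.length > 0) this.resetTimer(0);
    }, delay);
    // Unref so the timer cannot keep a Node.js process alive (no-op elsewhere).
    (this.timer as any).unref?.();
  }
}
```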

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • Followed the style guidelines of this project
  • Unit tests have been added

@seemk requested a review from a team on November 30, 2022 18:08
@codecov (bot) commented Nov 30, 2022

Codecov Report

Merging #3458 (f2b8d51) into main (2499708) will increase coverage by 1.90%.
The diff coverage is 100.00%.

❗ Current head f2b8d51 differs from pull request most recent head 68247b3. Consider uploading reports for the commit 68247b3 to get more accurate results

@@            Coverage Diff             @@
##             main    #3458      +/-   ##
==========================================
+ Coverage   90.52%   92.42%   +1.90%     
==========================================
  Files         159      326     +167     
  Lines        3757     9284    +5527     
  Branches      835     1967    +1132     
==========================================
+ Hits         3401     8581    +5180     
- Misses        356      703     +347     
Files | Coverage Δ
--- | ---
...dk-trace-base/src/export/BatchSpanProcessorBase.ts | 94.16% <100.00%> (+1.24%) ⬆️

... and 178 files with indirect coverage changes

@seemk (Contributor, Author) commented Nov 30, 2022

Reopening after dealing with browser issues

@seemk closed this Nov 30, 2022
@seemk reopened this Dec 7, 2022
@dyladan (Member) commented Dec 7, 2022

I remember there was some discussion about this in the spec. Do you know if that was ever resolved or what the current state of that is?

@dyladan (Member) commented Dec 7, 2022

Here is the spec issue open-telemetry/opentelemetry-specification#849

@dyladan (Member) commented Dec 7, 2022

https://github.com/open-telemetry/opentelemetry-specification/pull/3024/files

Review thread on this diff hunk:

return;
}

if (this._nextExport === 0) {
Contributor

Personally, I'm not a fan of this pattern of "detecting" whether we should start a timer or not -- I prefer more explicit start/reset descriptions.

It took me a while to understand that this is the "flag" you are effectively using to determine whether there is "more" data (or not) and then whether to "reset" the timer.

I also don't like that you can "reset" an already running timer, as it could be very easy for someone to come along later and cause an endless (until it got full) delay if something is getting batched at a regular interval.

Contributor Author

_nextExport is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle. Think about appending thousands of spans consecutively in the same event loop cycle.

What would your alternative be to resetting an already running timer? Starting an export in a new promise right away when the buffer reaches the threshold can't be done, as it would both cause too many concurrent exports and nullify the concept of max queue size.
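
For illustration, a minimal sketch (with assumed names, not the exact code from this PR) of how such a flag avoids clearing and recreating the timer for every span appended in the same event-loop cycle:

```ts
// Sketch only: `nextExportDelay` plays the role described above for _nextExport.
class TimerGuardSketch {
  private nextExportDelay = -1; // -1: nothing special queued, 0: eager export already queued
  private timer?: ReturnType<typeof setTimeout>;

  scheduleEagerExport(doExport: () => void): void {
    if (this.nextExportDelay === 0) {
      // An immediate export is already queued for the next cycle;
      // appending thousands more spans will not touch the timer again.
      return;
    }
    this.nextExportDelay = 0;
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.nextExportDelay = -1;
      this.timer = undefined;
      doExport();
    }, 0);
  }
}
```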

Contributor

_nextExport is only used to avoid needlessly resetting the timer, i.e. it means that an export is already queued next cycle.

this._timer will (should) be undefined when no timer is already running 😄 (as long as you also set the value to undefined (or null) within the setTimeout() implementation as well).

Contributor

Which is basically what the previous implementation was doing with if (this._timer !== undefined) return; in the _maybeSet function (although I'd also prefer to see a check that nothing is currently batched as well -- to avoid creating the timer in the first place).

Contributor Author

The timer is now always running; basically, this new implementation sets the timeout to 0 once the batch size is exceeded.

Contributor

But if there is nothing in the batch we should not have any timer running, i.e. the timer should only be running if there is something waiting to be sent. Otherwise, if an application is sitting doing nothing (because the user walked away), a running timer can cause the device (client) to never go to sleep and therefore use more resources (power / battery) than necessary.
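
For illustration, a minimal sketch of the pattern described here (names are assumed): the timer handle itself is the flag, it is only started when something is batched, and it is cleared inside the callback so an idle application has no timer running at all.

```ts
// Illustrative sketch; not the processor's actual code.
class IdleFriendlyBatcherSketch<T> {
  private buffer: T[] = [];
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly delayMillis: number,
    private readonly flush: (items: T[]) => void
  ) {}

  add(item: T): void {
    this.buffer.push(item);
    this.maybeStartTimer();
  }

  private maybeStartTimer(): void {
    if (this.timer !== undefined) return; // a timer is already pending
    if (this.buffer.length === 0) return; // nothing to send, so stay completely idle
    this.timer = setTimeout(() => {
      this.timer = undefined; // mark "no timer running"
      this.flush(this.buffer.splice(0));
      this.maybeStartTimer(); // restarts only if something new was batched meanwhile
    }, this.delayMillis);
  }
}
```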

Member

Given the spec (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#batching-processor) we could probably simplify this whole processor to just be a setInterval and a list of spans. During the most recent spec iteration it was made clear that there is no need to wait for the previous export to complete before starting another one. The word "returns" is used very explicitly in the spec and refers to thread safety, not to the actual async task of the export.
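
For illustration, a minimal sketch of that simplification (illustrative names; whether it is desirable is debated below): a fixed interval drains whatever is queued without waiting for the previous export promise to settle.

```ts
// Sketch of the setInterval-based simplification; illustrative only.
class IntervalProcessorSketch<T> {
  private queue: T[] = [];
  private readonly handle: ReturnType<typeof setInterval>;

  constructor(delayMillis: number, exportBatch: (items: T[]) => Promise<void>) {
    this.handle = setInterval(() => {
      if (this.queue.length === 0) return;
      // Fire and forget: no waiting for the previous export to complete.
      void exportBatch(this.queue.splice(0));
    }, delayMillis);
  }

  add(item: T): void {
    this.queue.push(item);
  }

  shutdown(): void {
    clearInterval(this.handle);
  }
}
```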

@seemk (Contributor, Author) commented Feb 6, 2023

Wouldn't not needing to wait for the previous export to complete invalidate the maxQueueSize parameter in this case, e.g. when starting an export as soon as the batch size has been reached?

Contributor

> simplify this whole processor to just be a setInterval

Noooo! Intervals are worse than timeout management, as intervals get left behind and cause the app/CPU to always be busy (at regular intervals).

@arbiv commented Jan 16, 2023

We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged?
@TimurMisharin FYI

@seemk (Contributor, Author) commented Jan 16, 2023

> We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI

Not that I know of, just needs reviews

@dyladan (Member) commented Jan 16, 2023

Nothing blocking for me, I'm just on vacation. The spec merged last week.

@cftechwiz

@dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

@dyladan (Member) commented Feb 3, 2023

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.

I had been waiting for a resolution of the conversation between you and @MSNev

@MSNev (Contributor) commented Feb 3, 2023

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.
>
> I had been waiting for a resolution of the conversation between you and @MSNev

And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause issues relating to performance etc.

@cftechwiz

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.
>
> I had been waiting for a resolution of the conversation between you and @MSNev
>
> And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause issues relating to performance etc.

This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?

@seemk (Contributor, Author) commented Feb 6, 2023

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue this fixes.
>
> I had been waiting for a resolution of the conversation between you and @MSNev
>
> And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause issues relating to performance etc.
>
> This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?

I can update my PR either tomorrow or on Wednesday; however, getting rid of a free-running timer on a 5-second interval seems like a micro-optimisation. Is there even any data to back these performance concerns?

@MSNev (Contributor) commented Feb 17, 2023

> I can update my PR either tomorrow or on Wednesday; however, getting rid of a free-running timer on a 5-second interval seems like a micro-optimisation. Is there even any data to back these performance concerns?

I don't have any public data that I can provide, but when a client has a running timer it restricts the ability of that process to be put into a sleep / reduced power usage state, and therefore to reduce the amount of energy being burnt.

This issue mostly affects clients (browser / mobile devices), but it will also indirectly affect hosted Node instances that will be constantly burning CPU even when they have no events/traffic -- thus costing users $$$.

@cftechwiz commented Mar 30, 2023

I am bumping this thread as it seems to have gone a bit dormant. We know of production impact in Node.js/GraphQL that this PR resolves. In order to accelerate the customer and get them unblocked from leveraging the batch processor, we had them fork the repo, rebase the branch, and build it. They did this, and I can report that this has resolved the issue and they are no longer dropping spans. Also, we have not identified any of the issues related to the performance concerns being discussed here. We're in agreement that the timer is a micro-optimization and do not believe it should hold this up any longer. If we want to continue debating the timer, I propose we raise a new issue and let this one pass.

If anyone would like more information, let me know what you want and I will try to provide it. We'd be really grateful if we could go ahead and see this make its way to being merged soon so we can pick up the fix and get them back into the main branch...

Cheers,
Colin

@dyladan (Member) commented Apr 24, 2023

@cfallwell my apologies, I seem to have let this fall through the cracks. In the absence of public data from @MSNev I think I agree we should move forward. Alternative implementation suggestions are welcome, as is concrete data to quantify the performance impact.

I'll put this on my schedule to review tomorrow.

@MSNev (Contributor) commented Apr 24, 2023

My original comments still stand; in fact, I have had to fix issues related to "background" timers this week for Application Insights, because they kept causing the App to "wake up" and burn CPU for no reason (as 99% of the time the queue was empty). This issue was raised by one of our internal partners, who have an extremely robust perf framework and runtime measurements.

Also by NOT having a running timer in the background (and assuming that the timeout period is not too long), the need to unref the timer diminishes as there is no running timer to stop the app from closing.

Using standard setInterval / clearInterval (I could not change to setTimeout in our master repo as we return the interval id as part of the API): microsoft/ApplicationInsights-JS@0d970c7

But in the current release, we use a wrapper which works for any runtime (Node, browser, worker): microsoft/ApplicationInsights-JS@d14c15b. While this does use unref, that is because this background timer is generally only used for internal non-critical log reporting.

Both of these just pump the "log" event into our export pipeline, which has its own timer that is only running when there is something waiting.
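
For illustration, a small hedged sketch (not the Application Insights code) of a timeout helper that works across Node, browsers, and workers, and unrefs the timer when the runtime supports it so it cannot keep a Node.js process alive:

```ts
// Illustrative helper only; unref() exists on Node.js Timeout objects and the
// optional call makes it a no-op elsewhere.
function scheduleBackgroundTask(fn: () => void, delayMillis: number): () => void {
  const handle = setTimeout(fn, delayMillis);
  (handle as any).unref?.();
  // Return a cancel function so callers can clear the timer when it is no longer needed.
  return () => clearTimeout(handle);
}
```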

@Flarna mentioned this pull request Jul 12, 2023
@dyladan (Member) commented Oct 6, 2023

#3958 solved this

@dyladan closed this Oct 6, 2023