-
Notifications
You must be signed in to change notification settings - Fork 805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(tracing): make spans resilient to performance clock drift #3434
Conversation
abef3d2
to
5d8dd46
Compare
5d8dd46
to
7083983
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #3434 +/- ##
==========================================
+ Coverage 93.78% 93.80% +0.02%
==========================================
Files 249 249
Lines 7621 7637 +16
Branches 1588 1589 +1
==========================================
+ Hits 7147 7164 +17
+ Misses 474 473 -1
|
return epochMillisToHrTime(inp.valueOf()); | ||
} | ||
|
||
if (isTimeInputHrTime(inp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version seems to be susceptible to drift in fetch instrumentation (undefined start + hrTime end):
opentelemetry-js/experimental/packages/opentelemetry-instrumentation-fetch/src/fetch.ts
Line 281 in 3290b25
const endTime = core.hrTime(); |
Because hrTime() does getTimeOrigin() + performance.now()
it would generate drifted HrTime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because an HrTime
formatted timestamp is not corrected in any case. Currently, the start time also uses the performance timing API indirectly because a start time is not provided and the performance clock is used. This means that in the current state the whole span is shifted anyway.
My recommendation would be to:
- Change core.hrTime to return an HrTime generated using
Date.now
- Update fetch instrumentation to not pass a time manually
Either of these changes would fix the issue here, but I actually recommend we do both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative to (2) would be to update fetch instrumentation to get start and end times from the performance timing API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@t2t2 does the updated version address your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fetch and xhr have both been updated. I think this is resolved but I'll wait for @t2t2 to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
like previous pr, codesandbox with the changes, artificial drift and most common instrumentations (with already fixed the |
…try-js into simple-fix-timestamps
The test (Span # should have an entered time for event) is unstable with the change. |
Yeah I see that. I'm trying to work on it now. edit: I see the test relates to events not end. Still looking into it |
Looks like the test was providing time |
@legendecas should be stable now |
@t2t2 I updated the web instrumentations to use Date instead of performance.now to get span end times which I think should address your concerns. I would appreciate your review because you've been very helpful. Hopefully we can get this merged soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic LGTM. Added a few non-blocking comments.
It seems there are so many logical branching here and was wondering if the test coverage is sufficient and covers all the potential permutations.
Each time
can be of 5 types: undefined
, Date.now()
, number epoch
, number performance now
and hrTime
. A span receives times via span constructor, event, and end. I think some of the common combinations or those that are specifically addressed by code can benefit from a dedicated unit test.
@@ -203,15 +227,41 @@ export class Span implements api.Span, ReadableSpan { | |||
this._spanProcessor.onEnd(this); | |||
} | |||
|
|||
private _getTime(inp?: TimeInput): HrTime { | |||
if (typeof inp === 'number' && inp < otperformance.now()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in core we have similar logic where we use origin
for this check, and not now
// Must be a performance.now() if it's smaller than process start time.
if (time < getTimeOrigin()) {
return hrTime(time);
I believe they both work fine, but better to have consistency in the project, even maybe extract this logic into a function isTimeInputPerformanceNow(time: TimeInput)
similar to the existing function isTimeInputHrTime
, this is also self documenting and will make the comment below redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is actually very important to the functioning of the PR. I think the function in core should be deprecated.
The reason to use performance.now
is that we can be much more sure that a given number is a performance timestamp. A number before timeOrigin may be ingested from historical logs or come from an inconsistent time source which has been corrected and timeOrigin is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks for the response.
My point is that we are doing the same computation in 2 different places in 2 different ways. This comment is a suggestion to unify it (and maybe extract the logic into a function while we are at it). Could be done in this PR, later one, or not at all
This is a simpler alternative to #3384
Described by the comment here: #3279 (comment)
Fixes #3355