
[processor/spanmetrics] Clamp negative duration times to 0 bucket. #9891

Merged
4 commits merged into open-telemetry:main on Aug 3, 2022

Conversation

crobertson-conga
Contributor

Description:
Modifies the spanmetrics processor to handle negative durations by clamping them to a 0 duration. The original ticket described the issue as being caused by a huge time delay, but the duration would have to be over 290 years, which doesn't make sense. The bug is consistently reproduced when the start time is after the end time. While ideally this would be fixed at the source, this change adds protection in case the problem is missed upstream. See open-telemetry/opentelemetry-js-contrib#1013 for the related upstream issue.

Link to tracking Issue: #7250

Testing: Added a new span where the start and end timestamps were swapped.

Documentation: None.

@crobertson-conga crobertson-conga requested review from a team and bogdandrutu May 10, 2022 15:46
@crobertson-conga crobertson-conga changed the title processor/spanmetrics Clamp negative duration times to 0 bucket. [processor/spanmetrics] Clamp negative duration times to 0 bucket. May 10, 2022
@TylerHelmuth
Member

What is the implication of clamping the data point to the 0 bucket? Does it significantly skew the data (is this issue happening a lot)? Would it be better to drop the data point?

@crobertson-conga
Contributor Author

crobertson-conga commented May 10, 2022

It doesn't happen a lot, but when it does it crashes the collector.
In my experience with the browser SDK, the timestamps are swapped by a few milliseconds. I'm guessing a rounding error in the browser is causing it, but I'm not sure. There don't seem to be guards in place anywhere to prevent this from happening.
Capturing the call count is probably more important than dropping the data point entirely; a zero in place of a negative duration seems sane there.

@@ -381,7 +381,13 @@ func (p *processorImp) aggregateMetricsForServiceSpans(rspans ptrace.ResourceSpa
 }

 func (p *processorImp) aggregateMetricsForSpan(serviceName string, span ptrace.Span, resourceAttr pcommon.Map) {
-	latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())
+	// Protect against end timestamps before start timestamps. Assume 0 duration.
+	latencyInMilliseconds := float64(0)
Contributor

It seems like doing a math.Max with 0 and the calculated value would be more straightforward & easier to understand for future maintainers.

Contributor Author

@dehaansa Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.

Contributor

Makes sense!

Member

Part of the issue is that the timestamps are uint64, so the result of subtraction is always going to be a positive integer. math.Max will not solve the issue in that case.

This doesn't make sense.

This calculation:

latencyInMilliseconds := float64(span.EndTimestamp()-span.StartTimestamp()) / float64(time.Millisecond.Nanoseconds())

will result in a float64 that is >0 if the end time is after the start time and <0 if it is before the start time. So, in that case can't we follow up with math.Max() since we're now in float64 land?

latencyInMilliseconds = math.Max(0, latencyInMilliseconds)

We could rewrite this alternately:

max := func(a, b int) int {
  if a > b {
    return a
  }
  return b
}

intDelta := span.EndTimestamp()-span.StartTimestamp()
clampedDelta := max(0, intDelta)
latencyInMs := float64(clampedDelta) / float64(time.Millisecond)
// time.Millisecond is already an integer number of nanoseconds
// same as clampedDelta

Contributor Author

@Aneurysm9 Let me clarify.

If the subtraction would produce a negative result, uint64 arithmetic wraps around (there is no sign bit to go negative), so you get a very large positive number instead of a negative one. By the time you attempt the clamp, the value is already positive.

You can see this with: https://go.dev/play/p/mHQ3_zjFzlq

This is why we need to validate that end time is after start time instead.
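
To make the wraparound concrete, here is a small self-contained sketch of the behavior in that playground link, with plain uint64 values standing in for the pdata timestamps (the numbers are illustrative):

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// The end timestamp is 5ms *before* the start timestamp, in nanoseconds.
	start := uint64(1_000_000_000)
	end := uint64(995_000_000)

	// uint64 subtraction wraps around instead of going negative.
	delta := end - start
	fmt.Println(delta) // 18446744073704551616, not -5000000

	// Converting to float64 afterwards keeps the huge positive value,
	// so math.Max(0, ...) never sees anything to clamp.
	latencyMs := float64(delta) / float64(time.Millisecond.Nanoseconds())
	fmt.Println(math.Max(0, latencyMs)) // still astronomically large
}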

Contributor Author

@Aneurysm9 @dehaansa @bogdandrutu I would really like to get this in; it's been over a month and a half since I provided this solution. The current implementation crashes the collector any time problematic spans are provided.

Member

Treating timestamps as uint64 is a mistake and can't be assumed to be honored by all languages. The spec anticipates that timestamps will be represented using language-preferred mechanisms, which in Go is time.Time, aka int64 nanoseconds from the UNIX epoch. The proto uses fixed64, which is presumably why pdata uses uint64, but even pdata assumes that values can be cast to int64 sanely.

If you use span.EndTimestamp().AsTime() you'll get a time.Time value which will work correctly.
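
For reference, a minimal sketch of that suggestion (not the code in this PR; the latencyMillis helper name is made up, and imports of time and pdata's ptrace package are assumed):

// latencyMillis returns the span latency in milliseconds, clamping
// malformed end-before-start spans to 0. Sketch only.
func latencyMillis(span ptrace.Span) float64 {
	// AsTime() yields time.Time, and Sub returns a signed time.Duration,
	// so a negative duration is actually observable here.
	d := span.EndTimestamp().AsTime().Sub(span.StartTimestamp().AsTime())
	if d < 0 {
		d = 0
	}
	return float64(d) / float64(time.Millisecond)
}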

Contributor Author

I agree that treating the timestamps as uint64 is a mistake in general; however, this is Go and the current underlying implementation for this data is uint64. The calculation is only in this specific scope, geared towards incrementing a specific bucket, and doesn't bleed out of this function.

Both methods will result in the same effect. Converting to time.Time and using the Sub function has additional overhead that a plain subtraction and if statement doesn't.

I'm not against making the switch, and it's pretty simple, but is it worth the additional overhead just so a throwaway calculation uses the correct data type?
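
If someone wanted to put a number on that overhead, a rough standalone benchmark might look like the following (illustrative only, not part of this PR; plain uint64 and time.Time values stand in for the pdata types, and the package name is made up):

package latency_test

import (
	"testing"
	"time"
)

var sink float64 // discourages the compiler from optimizing the work away

// Baseline: uint64 comparison plus subtraction, as in this PR.
func BenchmarkUintClamp(b *testing.B) {
	start := uint64(time.Now().UnixNano())
	end := start + uint64(5*time.Millisecond)
	for i := 0; i < b.N; i++ {
		latency := float64(0)
		if end > start {
			latency = float64(end-start) / float64(time.Millisecond.Nanoseconds())
		}
		sink = latency
	}
}

// Alternative: convert to time.Time and use Sub, as suggested in review.
func BenchmarkTimeSub(b *testing.B) {
	start := time.Now()
	end := start.Add(5 * time.Millisecond)
	for i := 0; i < b.N; i++ {
		d := end.Sub(start)
		if d < 0 {
			d = 0
		}
		sink = float64(d) / float64(time.Millisecond)
	}
}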

Member

pinging @Aneurysm9

Member

If I'm understanding things correctly:

  1. This PR resolves a user-facing correctness problem.
  2. Whether or not pdata.Timestamp should be uint64 is a question of code quality / quality of life. It applies to a larger context than this PR.

If so, I suggest this PR should be merged now, and a separate proposal can be made to discuss the uint64 question.

@github-actions
Contributor

github-actions bot commented Jun 2, 2022

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 2, 2022
@crobertson-conga
Contributor Author

This is not stale

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 25, 2022
@crobertson-conga
Contributor Author

As a note, it looks like accepting negative-duration spans is semi-intentional in the JS SDK: it warns about them but doesn't discard them. See https://github.com/open-telemetry/opentelemetry-js/blob/6eca6d4e4c3cf63a2b80ab0b95e4292f916d0437/packages/opentelemetry-sdk-trace-base/src/Span.ts#L183

@github-actions github-actions bot removed the Stale label Jul 3, 2022
@TylerHelmuth
Member

@crobertson-conga please add a new changelog entry following these new instructions

@TylerHelmuth
Member

pinging @albertteoh as code owner

@crobertson-conga
Contributor Author

@Aneurysm9 @TylerHelmuth @albertteoh

I would like to get this merged. Please let me know if the change to using time.Time is still required, given the considerations above about the additional overhead.

Contributor

@albertteoh albertteoh left a comment

LGTM

@crobertson-conga
Contributor Author

@Aneurysm9 @TylerHelmuth still looking to get this merged

Comment on lines +388 to +389
if endTime > startTime {
latencyInMilliseconds = float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())
Member

Is the correct behavior to treat a malformed span (end time before start time) as a latency of 0? I think we should actually skip these spans.

Contributor Author

I'm conflicted here. I think the count of those spans is useful. Ideally this gets fixed upstream before it happens, though.
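
For comparison, the skip-the-span alternative raised above might look roughly like this inside the same function; a hypothetical sketch, not what was merged:

endTime, startTime := span.EndTimestamp(), span.StartTimestamp()
if endTime < startTime {
	// Drop the malformed span entirely: it would contribute neither
	// to the call count nor to the latency histogram.
	return
}
latencyInMilliseconds := float64(endTime-startTime) / float64(time.Millisecond.Nanoseconds())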

@bogdandrutu bogdandrutu merged commit facab6e into open-telemetry:main Aug 3, 2022
@crobertson-conga
Copy link
Contributor Author

🎉 🎉
