Un-actionable Issues in Sentry #41089
Replies: 10 comments 12 replies
-
Love that breakdown. I had a customer bring up a subtype of 2. They were unsure of the origin of those errors and whether they should ignore them: 2.1 Client-side Mobile Connection errors.
The best suggestion I could come up with at the time was:
-
Deployment-related issues are a mixed bag. Sometimes things will spike or fail, but only briefly during a deploy. These are generally unhelpful because they are expected.
-
Excellent breakdown. We especially have issues with high volumes of
Most of these are some form of connection error - either:
We mentally classify these errors by downstream impact, mostly:
One other type of un-actionable issue we have (sort of a subtype) is high-volume un-actionable issues. These errors are expected to happen in high volumes, so we have to pre-batch them before sending them to Sentry to avoid blowing through our quota.

For example, we have a caching service that may fail to read or write keys. This isn't a hard error (it's just a cache, who cares), unless the volumes get really high. Currently, if there is a temporary issue, we might generate a synthetic error to send to Sentry, like CacheError("Cache key failure: failed to read 627 keys, write 329 keys, in the last minute"). We did this because managing single error instances through the UI (and our quota) was untenable.

If there were some hook we could plug into when initializing the Sentry SDK that allowed us to identify and batch error cases together, that would be a cool feature, or some other tooling to make bulk error spikes like this feasible to manage. As it happens, we've written our own layer, but it would be cool to see this natively.
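For concreteness, here is a minimal sketch of the kind of batching layer described above, using the Python SDK. CacheError, the counters, and the one-minute flush interval are illustrative, not part of any Sentry API; the SDK's `before_send` hook (configurable at init time) can drop or modify individual events, but it doesn't aggregate across events, which is why a separate layer like this is still needed.

```python
import threading

import sentry_sdk


class CacheError(Exception):
    """Synthetic error representing many individual cache failures."""


class CacheErrorBatcher:
    """Counts cache failures locally and sends one aggregated event per interval."""

    def __init__(self, interval_seconds=60):
        self.interval = interval_seconds
        self.lock = threading.Lock()
        self.read_failures = 0
        self.write_failures = 0
        self._arm_timer()

    def _arm_timer(self):
        timer = threading.Timer(self.interval, self.flush)
        timer.daemon = True
        timer.start()

    def record(self, kind):
        # Called wherever a cache read/write fails; nothing is sent to Sentry here.
        with self.lock:
            if kind == "read":
                self.read_failures += 1
            else:
                self.write_failures += 1

    def flush(self):
        with self.lock:
            reads, writes = self.read_failures, self.write_failures
            self.read_failures = self.write_failures = 0
        if reads or writes:
            # One synthetic event stands in for all failures in the interval.
            sentry_sdk.capture_exception(CacheError(
                f"Cache key failure: failed to read {reads} keys, "
                f"write {writes} keys, in the last minute"
            ))
        self._arm_timer()
```

Usage would look roughly like calling `batcher.record("read")` in the cache client's exception handler, with `sentry_sdk.init()` configured as usual elsewhere.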
-
Excellent breakdown, which helped me think more deeply about issues in Sentry.
If a large number of events of this type are generated by the same user, it's very likely a script kiddie, so this type of issue is essentially noise. However, if this type of error is triggered by many people, even if only once per user, then I think Sentry should create an issue for it because it has a large impact.
-
[Web only][may be off-topic] Sometimes what is essentially the same error gets split into 3 issues because of legitimately different stack traces (or the lack thereof) between browsers. Technically they are duplicates, but:
-
This is great.
-
I'm also interested in this topic. Quite often a large service with many external integrations will produce errors you cannot act on. We have sometimes removed them from Sentry altogether, but it would be better to keep them there, just flagged as non-actionable issues.

I'm wondering if categorising issues as actionable and non-actionable would be the right way to go. A TypeError would be an actionable issue; a TimedOutError typically wouldn't be. So what should happen when a TimedOutError is first seen by Sentry? You would check it in the UI, click "not an actionable issue", and perhaps be asked to configure "raise as an issue if more than N occur in 5 minutes / AI autodetects spikes". Now, if you get an additional TimedOutError, it would still be recorded, but Sentry wouldn't show it as actionable. If you get a lot of TimedOutErrors, Sentry would switch it back to an actionable issue.

After you systematically triage all incoming issues in the UI by checking whether they are actionable, you are left with a list of actionable issues you should fix. Anything new you would check immediately to determine whether it is trash (non-actionable), an operational event (non-actionable with a trigger on spikes), or actionable. The goal for the team would be to squash all actionable Sentry issues.
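One way to roughly approximate the "non-actionable unless it spikes" behaviour client-side today is the SDK's `before_send` hook: suppress individual events of known non-actionable types unless they spike within a window. A rough sketch, assuming the Python SDK; the exception names, the five-minute window, and the threshold of 50 are made up for illustration:

```python
import time
from collections import defaultdict, deque

import sentry_sdk

# Exception class names we treat as non-actionable unless they spike.
NON_ACTIONABLE = {"TimedOutError", "ConnectionResetError"}
WINDOW_SECONDS = 300   # "more than N in 5 minutes"
THRESHOLD = 50         # N

_recent = defaultdict(deque)  # exception type name -> recent event timestamps


def before_send(event, hint):
    exc_info = hint.get("exc_info")
    if not exc_info:
        return event  # not an exception event; pass it through unchanged

    exc_name = exc_info[0].__name__
    if exc_name not in NON_ACTIONABLE:
        return event  # actionable by default

    now = time.time()
    window = _recent[exc_name]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # forget events older than the window

    if len(window) > THRESHOLD:
        # Spiking: let the event through so it surfaces as an issue.
        event.setdefault("tags", {})["spiking"] = "true"
        return event
    return None  # below the threshold: drop the individual event


sentry_sdk.init(
    dsn="https://public@example.ingest.sentry.io/0",  # placeholder DSN
    before_send=before_send,
)
```

The actual feature request is richer than this, of course: events dropped client-side are invisible to Sentry, whereas the suggestion above would keep recording them and only change how they are surfaced.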
-
Sentry groups your errors into Issues so you can find actual problems in your app. Issues offer a deeper understanding of the problem with error metadata such as stack traces, breadcrumbs, tags, etc.
However, some issues are ones you don’t care about and are often just an annoyance. We’re trying to characterize such Issues and use this to improve your Sentry experience by pruning low-quality issues from your Issue Stream and notifications.
We could categorize Sentry Issues into one of three buckets:
Help Sentry identify low-quality issues
We’d love to hear your thoughts about the third category, Un-actionable Issues. Some initial questions to help us on our way:
We’re sure this isn’t an exhaustive list of questions, so definitely tell us anything else we’re missing!