
Conversation

@untitaker untitaker commented Apr 2, 2020

TSDB is currently broken because we've been using the redissnuba backend without running the consumers.

Add outcomes consumers with an odd value for auto-offset-reset to attempt to recover TSDB data from outcomes.
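For reference, a minimal sketch of what such a service definition could look like in the onpremise docker-compose.yml. The `snuba_defaults` anchor and the `outcomes_raw` storage name are assumptions here; the exact Snuba CLI flags depend on the Snuba version in use:

```yaml
  # Hypothetical service definition; storage name and flags may differ per Snuba release.
  snuba-outcomes-consumer:
    <<: *snuba_defaults
    command: consumer --storage outcomes_raw --auto-offset-reset=earliest
```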

@untitaker untitaker changed the title feat: Add snuba outcomes consumers to setup fix: Add snuba outcomes consumers to setup Apr 2, 2020

@BYK BYK left a comment

Yay!

@untitaker
Member Author

@fpacifici @tkaemming can one of you review what I did with the --auto-offset-reset? This is a recovery effort since we have no opportunity for a proper migration

@fpacifici
Contributor

Could you expand on the reasoning for using earliest? I am not sure I understand exactly what is broken, so I cannot say whether using earliest is the right solution.

@tkaemming

Yeah, I think this is the best option in the absence of any real migration strategy. The biggest downsides that I can see here would be:

  1. In the event of an error that causes the consumer to read off the tail end (newer side) of a partition (unlikely but not impossible), the consumer will reset to the earliest point in the log for that partition, reprocessing everything between the earliest message and the last processed message and inflating outcomes counters.
  2. If the retention of the consumer offsets topic is shorter than the retention of the outcomes topic, it's conceivable that during a period of extended downtime the consumer offsets will be evicted while the outcomes data remains, leading to a scenario like the previous one where the consumer defaults to the earliest offset in the partition and consumes messages that were already consumed.

Both of these are pretty unlikely (but not impossible), and the second one really comes down to a decision between potential data duplication with resetting to earliest or potential data loss with resetting to latest, so I'm not sure there's an ideal solution to that anyway.
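For context, the two retention windows involved in the second scenario above are broker-level settings. A hedged sketch of how they could be pinned on the Kafka service of a docker-compose file, using the Confluent image's environment-variable convention (the image tag and values here are illustrative, not taken from this repository):

```yaml
  kafka:
    image: confluentinc/cp-kafka:5.5.0
    environment:
      # Retention of committed consumer-group offsets, in minutes (7 days here).
      KAFKA_OFFSETS_RETENTION_MINUTES: "10080"
      # Retention of topic data such as outcomes, in hours (7 days here, the broker default).
      KAFKA_LOG_RETENTION_HOURS: "168"
```

Keeping the offsets retention at least as long as the topic retention avoids the second scenario.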

Thanks for bringing this to our attention. 👍

@untitaker
Member Author

@fpacifici we were aiming to do the same TSDB migration for the onpremise setup that we did in prod. We found that onpremise was already using the redissnuba backend, which means we had stopped writing some keys to Redis while the Snuba consumers were not set up. That means the TSDB data (for the models that can be derived from outcomes) is gone.

However, the default topic retention is 7 days, so by adding the Snuba consumer now and setting its auto-offset-reset to "earliest" we may be able to recover some data in the onpremise setup.

@untitaker
Member Author

in the event of an error that causes the consumer to read off the tail end (newer side) of a partition (which is unlikely but not impossible)

The second case makes sense, but why would this one happen?

@tkaemming

The second case makes sense, but why would this one happen?

An unexpected implementation error. It shouldn't happen, but Kafka does (or at least used to) allow you to set and commit invalid offsets locally. I think it's also theoretically possible on an unclean failover where an out-of-sync replica gets promoted to leader for a partition.
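As a side note, the unclean-failover case can be guarded against at the broker level. A sketch using the same Confluent environment-variable convention as above; this is the broker default in modern Kafka, so it is shown only to make the setting explicit:

```yaml
  kafka:
    environment:
      # Never promote an out-of-sync replica to partition leader, which avoids
      # the offset-truncation scenario described above.
      KAFKA_UNCLEAN_LEADER_ELECTION_ENABLE: "false"
```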

@@ -15,6 +15,7 @@ x-sentry-defaults: &sentry_defaults
     - smtp
     - snuba-api
     - snuba-consumer
+    - snuba-outcomes-consumer
Contributor

@RaduW RaduW Apr 3, 2020

Do we really need this?
This will force the snuba-outcomes-consumer service to be running for any of the web, cron, worker, event-consumer, post-process-forwarder, and sentry-cleanup services.

@untitaker
Member Author

Generally speaking we should run it. I am not sure how our docker-compose is structured, but it seems like we only have two dependency lists: one for any Snuba container and one for any Sentry container.

Member

@untitaker we can have a dependency list for any service, so if this should be a dependency of something else, like relay, we can put it there too.

@untitaker
Member Author

It's up to you how granular you want this to be, but I can't imagine this consumer being optional in any setup.
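For illustration, one way to make this more granular later would be a separate YAML anchor for Snuba-related dependencies that individual services merge in as needed. The anchor and the relay service below are hypothetical, and an explicit depends_on on a service replaces the list merged in from *sentry_defaults rather than extending it:

```yaml
x-snuba-depends-on: &snuba_depends_on
  - snuba-api
  - snuba-consumer
  - snuba-outcomes-consumer

services:
  relay:
    <<: *sentry_defaults
    depends_on: *snuba_depends_on
```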


@BYK BYK left a comment

I think using earliest makes sense here as:

  1. It would allow us to recover the last 7 days' worth of data for already-broken Sentry 10 installations.
  2. Double counting is better than no counting, since double-counted data leaves a possibility of recovering the underlying actual data; when you lose data, you have nothing to work from.

Question: should we do the same for the events consumer too, since the above two points apply there as well? I don't think double-processing events would create more records, just more work during recovery.

@untitaker
Member Author

Potentially, but I would only cross that bridge once we know that we had a bug that caused us to drop events.

@untitaker untitaker merged commit 8899158 into master Apr 3, 2020
@untitaker untitaker deleted the feat/snuba-outcomes-consumers branch April 3, 2020 13:16
@fpacifici
Contributor

should we do the same for events consumer too as the above 2 applies there too

This is more complex. There are additional consequences of processing a large number of events twice:

  • replacers: we would be reprocessing events for groups we deleted. Those groups would reappear until we encounter the replacement message again.
  • post-processing: assuming the synchronized consumer is restarted as well and starts from earliest, we would be re-running plugins, alerts, etc. It is hard to gauge the exact product impact.

Now, in the case of events, if the consumer ends up going to the earliest available offset, it means we hit one of the two cases Ted mentioned above, which imply an extended downtime of Sentry's main feature; during that downtime pretty much nothing works. At that point I am not sure whether recovering from latest or earliest makes a big difference.

@BYK BYK mentioned this pull request May 19, 2020
BYK pushed a commit that referenced this pull request May 23, 2020
* feat: Add snuba outcomes consumers to setup

* fix: Rename all references of snuba-consumer

* ref: Rename back to snuba-consumer

* fix: Change auto-offset-reset

* fix: Attempt to fix CI
MaicolBen pushed a commit to hinthealth/onpremise that referenced this pull request Jul 20, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2020