
Conversation

@untitaker untitaker commented Apr 2, 2020

TSDB is currently broken because we've been using the redissnuba backend without running the consumers.

Add outcomes consumers with an odd value for auto-offset-reset to attempt to recover TSDB data from outcomes.
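For reference, a minimal sketch of what such a service definition could look like in the onpremise docker-compose.yml. The `snuba_defaults` anchor and the `outcomes_raw` storage name are assumptions here; the exact Snuba CLI flags depend on the Snuba version in use:

```yaml
  # Hypothetical service definition; storage name and flags may differ per Snuba release.
  snuba-outcomes-consumer:
    <<: *snuba_defaults
    command: consumer --storage outcomes_raw --auto-offset-reset=earliest
```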

@untitaker untitaker changed the title feat: Add snuba outcomes consumers to setup fix: Add snuba outcomes consumers to setup Apr 2, 2020

@BYK BYK left a comment

Yay!

@untitaker
Member Author

@fpacifici @tkaemming can one of you review what I did with the --auto-offset-reset? This is a recovery effort since we have no opportunity for a proper migration

@fpacifici
Contributor

Could you expand on the reasoning for using earliest? I am not sure I understand exactly what is broken, so I cannot say whether using earliest is the right solution.

@tkaemming

Yeah, I think this is the best option in the absence of any real migration strategy. The biggest downsides that I can see here would be:

  1. In the event of an error that causes the consumer to read off the tail end (newer side) of a partition (unlikely but not impossible), the consumer will reset to the earliest point in the log for that partition, reprocessing everything between the earliest message and the last processed message and inflating outcomes counters.
  2. If the retention of the consumer offsets topic is shorter than the retention of the outcomes topic, it's conceivable that during a period of extended downtime the consumer offsets will be evicted while the outcomes data remains, leading to a scenario like the previous one where the consumer defaults to the earliest offset in the partition and consumes messages that were already consumed.

Both of these are pretty unlikely (but not impossible), and the second one really comes down to a decision between potential data duplication with resetting to earliest or potential data loss with resetting to latest, so I'm not sure there's an ideal solution to that anyway.
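For context, the two retention windows involved in the second scenario above are broker-level settings. A hedged sketch of how they could be pinned on the Kafka service of a docker-compose file, using the Confluent image's environment-variable convention (the image tag and values here are illustrative, not taken from this repository):

```yaml
  kafka:
    image: confluentinc/cp-kafka:5.5.0
    environment:
      # Retention of committed consumer-group offsets, in minutes (7 days here).
      KAFKA_OFFSETS_RETENTION_MINUTES: "10080"
      # Retention of topic data such as outcomes, in hours (7 days here, the broker default).
      KAFKA_LOG_RETENTION_HOURS: "168"
```

Keeping the offsets retention at least as long as the topic retention avoids the second scenario.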

Thanks for bringing this to our attention. 👍

@untitaker
Member Author

@fpacifici we were aiming to do the same TSDB migration for the onpremise setup that we did in prod. We found that onpremise was already using the redissnuba backend, which means we had stopped writing some keys to Redis while the Snuba consumers were not set up. That means the TSDB data (for the models that can be derived from outcomes) is gone.

However, the default topic retention is 7 days, so by adding the Snuba consumer now and setting its auto-offset-reset to "earliest" we may be able to recover some data in the onpremise setup.

@untitaker
Member Author

in the event of an error that causes the consumer to read off the tail end (newer side) of a partition (which is unlikely but not impossible)

The second case makes sense, but why would this one happen?

@tkaemming

The second case makes sense, but why would this one happen?

An unexpected implementation error. It shouldn't happen, but Kafka does (or at least used to) allow you to set and commit invalid offsets locally. I think it's also theoretically possible on an unclean failover where an out-of-sync replica gets promoted to leader for a partition.
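As a side note, the unclean-failover case can be guarded against at the broker level. A sketch using the same Confluent environment-variable convention as above; this is the broker default in modern Kafka, so it is shown only to make the setting explicit:

```yaml
  kafka:
    environment:
      # Never promote an out-of-sync replica to partition leader, which avoids
      # the offset-truncation scenario described above.
      KAFKA_UNCLEAN_LEADER_ELECTION_ENABLE: "false"
```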

@@ -15,6 +15,7 @@ x-sentry-defaults: &sentry_defaults
     - smtp
     - snuba-api
     - snuba-consumer
+    - snuba-outcomes-consumer
Contributor

@RaduW RaduW Apr 3, 2020

Do we really need this?
This will force the snuba-outcomes-consumer service to be running for any of the web, cron, worker, event-consumer, post-process-forwarder, and sentry-cleanup services.

@untitaker
Member Author

Generally speaking we should run it. I am not sure how our docker-compose is structured, but it seems like we only have two dependency lists: one for any Snuba container and one for any Sentry container.

Member

@untitaker we can have a dependency list for any service, so if this should be a dependency of something else, like relay, we can put it there too.

@untitaker
Member Author

It's up to you how granular you want this to be, but I can't imagine this consumer being optional in any setup.
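For illustration, one way to make this more granular later would be a separate YAML anchor for Snuba-related dependencies that individual services merge in as needed. The anchor and the relay service below are hypothetical, and an explicit depends_on on a service replaces the list merged in from *sentry_defaults rather than extending it:

```yaml
x-snuba-depends-on: &snuba_depends_on
  - snuba-api
  - snuba-consumer
  - snuba-outcomes-consumer

services:
  relay:
    <<: *sentry_defaults
    depends_on: *snuba_depends_on
```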


@BYK BYK left a comment

I think using earliest makes sense here as:

  1. It would allow us to recover the last 7 days' worth of data for already-broken Sentry 10 installations.
  2. Double counting is better than no counting, since double-counted data leaves a possibility of recovering the underlying actual data; when you lose data, you have nothing to work from.

Question: should we do the same for the events consumer too, since the above two points apply there as well? I don't think double-processing events would create more records, just more work during recovery.

@untitaker
Member Author

Potentially, but I would only cross that bridge once we know that we had a bug that caused us to drop events.

@untitaker untitaker merged commit 8899158 into master Apr 3, 2020
@untitaker untitaker deleted the feat/snuba-outcomes-consumers branch April 3, 2020 13:16
@fpacifici
Contributor

should we do the same for events consumer too as the above 2 applies there too

This is more complex. There are additional consequences of processing a large number of events twice:

  • replacers: we would be reprocessing events for groups we deleted. Those groups would reappear until we encounter the replacement message again.
  • post-processing: assuming the synchronized consumer is restarted as well and starts from earliest, we would be re-running plugins, alerts, etc. It is hard to gauge the exact product impact.

Now, in the case of events, if the consumer ends up going to the earliest available offset, it means we hit one of the two cases Ted mentioned above, which imply an extended downtime of Sentry's main feature; during that downtime pretty much nothing works. At that point I am not sure whether recovering from latest or earliest makes a big difference.

@BYK BYK mentioned this pull request May 19, 2020
BYK pushed a commit that referenced this pull request May 23, 2020
* feat: Add snuba outcomes consumers to setup

* fix: Rename all references of snuba-consumer

* ref: Rename back to snuba-consumer

* fix: Change auto-offset-reset

* fix: Attempt to fix CI
MaicolBen pushed a commit to hinthealth/onpremise that referenced this pull request Jul 20, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2020