Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: convert some vasserts to exceptions #22822

Merged
merged 1 commit into from
Aug 9, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Aug 9, 2024

We've seen at least one instance where empty batches are pushed to the log. Rather than bringing down the process, this updates a couple vasserts to throw instead (converted a couple that seemed related to data inputs, and left in a vassert that seems much more problematic related to being on the correct shard).

Fix for the codepath in offsets commit that caused the vassert to fire is here #22824

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • Avoids a crash in Redpanda's layer that could be triggered by empty offset commits.

@andrwng andrwng requested review from dotnwat, WillemKauf and bharathv and removed request for dotnwat August 9, 2024 19:27
WillemKauf
WillemKauf previously approved these changes Aug 9, 2024
Copy link
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, seems that we need to be a bit more conservative with our uses of vassert() versus exceptions/error codes in the codebase.

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: I'll post a separate fix for the codepath in offsets commit that caused the vassert to fire.

shouldn't we fix this, and leave the asserts in place?

@andrwng
Copy link
Contributor Author

andrwng commented Aug 9, 2024

NOTE: I'll post a separate fix for the codepath in offsets commit that caused the vassert to fire.

shouldn't we fix this, and leave the asserts in place?

Until we make it significantly more difficult to build empty batches, the burden of proof of ensuring storage applications never write empty batches is too high. Ideally we could point to something in our code that guarantees this to never fire, rather than proof by audit. Short of that, throwing at least lets us debug more easily in a pinch.

FWIW in the cluster we saw this fire, all nodes went down and it made chasing this much more difficult.

bharathv
bharathv previously approved these changes Aug 9, 2024
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are places where empty batches are ok (for better user experience). For example the RPC in question here offset_commit_request, (I think) it makes sense to return success if there is nothing to commit but that decision should probably be done on a case by case basis via code audit. Until then exception instead of assert makes sense to me.

We've seen at least one instance where empty batches are pushed to the
log. Rather than bringing down the process, this updates a couple
vasserts to throw instead (converted a couple that seemed related to
data inputs, and left in a vassert that seems much more problematic
related to being on the correct shard).
@andrwng andrwng dismissed stale reviews from bharathv and WillemKauf via 5838337 August 9, 2024 20:03
@andrwng andrwng force-pushed the storage-empty-batch-exception branch from 5868f7b to 5838337 Compare August 9, 2024 20:03
@andrwng
Copy link
Contributor Author

andrwng commented Aug 9, 2024

Tweaked the test to assert on the error types (for my own curiosity)

@andrwng
Copy link
Contributor Author

andrwng commented Aug 9, 2024

I think there are places where empty batches are ok (for better user experience). For example the RPC in question here offset_commit_request, (I think) it makes sense to return success if there is nothing to commit but that decision should probably be done on a case by case basis via code audit. Until then exception instead of assert makes sense to me.

Yea maybe we should make the batch builder return a result<offset, error_code> and make callers handle some empty code.

@bharathv
Copy link
Contributor

bharathv commented Aug 9, 2024

Yea maybe we should make the batch builder return a result<offset, error_code> and make callers handle some empty code.

ya that sounds like a good idea.

@vbotbuildovich
Copy link
Collaborator

@andrwng andrwng merged commit 1cf6b8b into redpanda-data:dev Aug 9, 2024
20 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants