-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
persist: run the heartbeat_read task on two tokio runtimes #21873
Conversation
In response to a production incident, this runs the heartbeat task on both the in-context tokio runtime and persist's isolated runtime. We think we were seeing tasks (including this one) get stuck indefinitely in tokio while waiting for a runtime worker. This could happen if some other task in that runtime never yields. It's possible that one of the two runtimes is healthy while the other isn't (this was inconclusive in the incident debugging), and the heartbeat task is fairly lightweight, so run a copy in each in case that helps. The real fix here is to find the misbehaving task and fix it. Remove this duplication when that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets try it!
Hmm, cargo test has failed (flaked?) twice now. Neither error seems directly related to this change, but they both have to do with postgres, which is maybe suspicious |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd endorse running a subset of nightlies on this to see if we can trigger something under load!
There's two tests left still running in nightlies, but everything else has passed so far (mod some "invalid syntax" failures in the checks stuff that are also present on main) |
Yeah - the completed tests include the most interesting ones for this particular change, so IMO this is good to go! |
TFTRs! |
In response to a production incident, this runs the heartbeat task on both the in-context tokio runtime and persist's isolated runtime. We think we were seeing tasks (including this one) get stuck indefinitely in tokio while waiting for a runtime worker. This could happen if some other task in that runtime never yields. It's possible that one of the two runtimes is healthy while the other isn't (this was inconclusive in the incident debugging), and the heartbeat task is fairly lightweight, so run a copy in each in case that helps.
The real fix here is to find the misbehaving task and fix it. Remove this duplication when that happens.
Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.