Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: run the heartbeat_read task on two tokio runtimes #21873

Merged
merged 1 commit into from
Sep 21, 2023

Conversation

danhhz
Copy link
Contributor

@danhhz danhhz commented Sep 21, 2023

In response to a production incident, this runs the heartbeat task on both the in-context tokio runtime and persist's isolated runtime. We think we were seeing tasks (including this one) get stuck indefinitely in tokio while waiting for a runtime worker. This could happen if some other task in that runtime never yields. It's possible that one of the two runtimes is healthy while the other isn't (this was inconclusive in the incident debugging), and the heartbeat task is fairly lightweight, so run a copy in each in case that helps.

The real fix here is to find the misbehaving task and fix it. Remove this duplication when that happens.

Motivation

  • This PR fixes a previously unreported bug.

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

In response to a production incident, this runs the heartbeat task on
both the in-context tokio runtime and persist's isolated runtime. We
think we were seeing tasks (including this one) get stuck indefinitely
in tokio while waiting for a runtime worker. This could happen if some
other task in that runtime never yields. It's possible that one of the
two runtimes is healthy while the other isn't (this was inconclusive in
the incident debugging), and the heartbeat task is fairly lightweight,
so run a copy in each in case that helps.

The real fix here is to find the misbehaving task and fix it. Remove
this duplication when that happens.
@danhhz danhhz requested a review from a team as a code owner September 21, 2023 16:31
Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets try it!

@danhhz
Copy link
Contributor Author

danhhz commented Sep 21, 2023

Hmm, cargo test has failed (flaked?) twice now. Neither error seems directly related to this change, but they both have to do with postgres, which is maybe suspicious

Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd endorse running a subset of nightlies on this to see if we can trigger something under load!

@danhhz
Copy link
Contributor Author

danhhz commented Sep 21, 2023

@danhhz
Copy link
Contributor Author

danhhz commented Sep 21, 2023

There's two tests left still running in nightlies, but everything else has passed so far (mod some "invalid syntax" failures in the checks stuff that are also present on main)

@bkirwi
Copy link
Contributor

bkirwi commented Sep 21, 2023

Yeah - the completed tests include the most interesting ones for this particular change, so IMO this is good to go!

@danhhz
Copy link
Contributor Author

danhhz commented Sep 21, 2023

TFTRs!

@danhhz danhhz merged commit 2359551 into MaterializeInc:main Sep 21, 2023
@danhhz danhhz deleted the persist_heartbeat_runtime branch September 21, 2023 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants