persist: run the heartbeat_read task on two tokio runtimes #21873

danhhz · 2023-09-21T16:31:12Z

In response to a production incident, this runs the heartbeat task on both the in-context tokio runtime and persist's isolated runtime. We think we were seeing tasks (including this one) get stuck indefinitely in tokio while waiting for a runtime worker. This could happen if some other task in that runtime never yields. It's possible that one of the two runtimes is healthy while the other isn't (this was inconclusive in the incident debugging), and the heartbeat task is fairly lightweight, so run a copy in each in case that helps.

The real fix here is to find the misbehaving task and fix it. Remove this duplication when that happens.
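To make the failure mode above concrete, here is a minimal std-only sketch. It is not the code from this PR and it does not use tokio: `ToyRuntime` and `heartbeats_within` are hypothetical names, and each "runtime" is modeled as a single worker thread draining a job queue. One job that hogs the worker stands in for a task that never yields, and the heartbeat is scheduled once on each toy runtime.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for an async runtime: one worker thread, one job queue.
struct ToyRuntime {
    queue: Sender<Job>,
}

impl ToyRuntime {
    fn new() -> Self {
        let (queue, jobs) = channel::<Job>();
        // The single worker runs jobs strictly one at a time, in order.
        thread::spawn(move || {
            for job in jobs {
                job();
            }
        });
        ToyRuntime { queue }
    }

    fn spawn(&self, job: impl FnOnce() + Send + 'static) {
        // Sending only fails if the worker has exited; ignored in this sketch.
        let _ = self.queue.send(Box::new(job));
    }
}

/// Schedules one heartbeat per toy runtime and counts how many ran within `wait`.
/// If `stall_first` is set, the first runtime's worker is hogged beforehand.
fn heartbeats_within(wait: Duration, stall_first: bool) -> usize {
    let in_context = ToyRuntime::new();
    let isolated = ToyRuntime::new();
    let beats = Arc::new(AtomicUsize::new(0));

    if stall_first {
        // Models a task that never yields: the only worker is busy for 5s.
        in_context.spawn(|| thread::sleep(Duration::from_secs(5)));
    }

    // Run a copy of the (cheap) heartbeat on each runtime.
    for rt in [&in_context, &isolated] {
        let beats = Arc::clone(&beats);
        rt.spawn(move || {
            beats.fetch_add(1, Ordering::SeqCst);
        });
    }

    thread::sleep(wait);
    beats.load(Ordering::SeqCst)
}

fn main() {
    let n = heartbeats_within(Duration::from_millis(500), true);
    println!("heartbeats delivered with one runtime stalled: {}", n);
}
```

In this model the heartbeat queued behind the hogging job cannot run until that job finishes, while the copy on the other toy runtime is unaffected. That is the bet the PR makes: if only one of the two runtimes is unhealthy, the duplicate keeps the heartbeat alive.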

Motivation

This PR fixes a previously unreported bug.

Tips for reviewer

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:

guswynn

Let's try it!

danhhz · 2023-09-21T17:33:41Z

Hmm, cargo test has failed (flaked?) twice now. Neither error seems directly related to this change, but both have to do with postgres, which is maybe suspicious.

bkirwi

I'd endorse running a subset of nightlies on this to see if we can trigger something under load!

danhhz · 2023-09-21T17:56:10Z

Good idea: https://buildkite.com/materialize/nightlies/builds/4194

danhhz · 2023-09-21T20:01:58Z

There are two tests still running in nightlies, but everything else has passed so far (mod some "invalid syntax" failures in the checks stuff that are also present on main).

bkirwi · 2023-09-21T20:17:19Z

Yeah - the completed tests include the most interesting ones for this particular change, so IMO this is good to go!

danhhz · 2023-09-21T20:43:17Z

TFTRs!

danhhz requested a review from a team as a code owner September 21, 2023 16:31

danhhz requested review from bkirwi and guswynn September 21, 2023 17:04

guswynn approved these changes Sep 21, 2023

bkirwi approved these changes Sep 21, 2023

danhhz merged commit 2359551 into MaterializeInc:main Sep 21, 2023

danhhz deleted the persist_heartbeat_runtime branch September 21, 2023 20:43
