Add more docs on WAL failover #19052


Merged: rmloveland merged 31 commits into main on Nov 15, 2024

Conversation

Contributor

@rmloveland commented on Oct 25, 2024

Fixes DOC-11199

Summary of changes:

  • Add a new 'WAL Failover' page to the 'Self-Hosted Deployments' section

  • Update cockroach start docs to note that it has the basic info, but
    to see the 'WAL Failover Playbook' for more detailed instructions

  • Mark WAL failover as GA (aka no longer in Preview)

@rmloveland marked this pull request as a draft on October 25, 2024 18:50

github-actions bot commented Oct 25, 2024

Files changed:


netlify bot commented Oct 25, 2024

Deploy Preview for cockroachdb-interactivetutorials-docs ready!

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/6737ce8b6aa520000852e4e3
- 😎 Deploy Preview: https://deploy-preview-19052--cockroachdb-interactivetutorials-docs.netlify.app


netlify bot commented Oct 25, 2024

Deploy Preview for cockroachdb-api-docs canceled.

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-api-docs/deploys/6737ce8b8d63cc00086c4282


netlify bot commented Oct 25, 2024

Netlify Preview

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/6737ce8b10b28e0008d6895e
- 😎 Deploy Preview: https://deploy-preview-19052--cockroachdb-docs.netlify.app

@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch 4 times, most recently from e574a36 to f251c97, on November 11, 2024 19:05
@rmloveland marked this pull request as ready for review on November 11, 2024 19:06
@rmloveland
Contributor Author

hi @dshjoshi and @jbowens!

this PR adds a new docs page, 'WAL failover', that is content-wise about 90% the same as the 'WAL failover playbook' on the wiki, with some edits to make it appropriate for our public-facing docs: notably no use of roachprod, a bit of copy editing, and added links to related pages in our docs

once we are happy with the contents of this PR, I plan to backport it to the v24.1 and v24.2 docs as well, assuming WAL failover is going GA for those versions too (please let me know if not)

@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch from f251c97 to b608970 on November 11, 2024 19:23
- `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.

The `storage.wal.failover.secondary.duration` metric is the primary one to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.

> you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.

This metric is a counter, so once non-zero it will remain non-zero until the process restarts. The operator probably cares about the rate that it increases. Any increase indicates a failover event.

Contributor Author


thanks for the correction - updated to the following - PTAL

> If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
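
For reference, here is a minimal sketch of how an operator might watch that rate from the command line. It assumes the node's DB Console HTTP endpoint is reachable at `localhost:8080` and that the Prometheus-format endpoint reports these metrics with underscores in place of dots; neither detail comes from this PR, so verify both against the docs page.

```shell
# Sample the WAL failover metrics twice, 60 seconds apart. Any growth in
# storage_wal_failover_secondary_duration between the samples means time
# was spent writing the WAL to the secondary store during that window.
curl -s http://localhost:8080/_status/vars | grep 'storage_wal_failover'
sleep 60
curl -s http://localhost:8080/_status/vars | grep 'storage_wal_failover'
```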


## Why WAL Failover?

In cloud environments, transient [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) are common, often lasting on the order of several seconds. These stalls negatively impact latency for the user-facing foreground workload. In the field, we have observed that stalls occur most frequently while writing to the WAL. While we cannot prevent disk stalls, we can minimize their impact on CockroachDB. That is where WAL failover comes into play.

> In the field, we have observed that stalls occur most frequently while writing to the WAL

This sentence isn't quite right. Stalls apply to the various files of the storage engine with equal frequency, but a stall of the write-ahead log is the most impactful to foreground latencies. (Because most other writes, like those of flushes and compactions happen asynchronously in the background and foreground operations do not need to wait for them.)

Contributor Author


Updated to the following based on your comment - PTAL

> In the field, we have observed that stalls while writing to the WAL are the most impactful to foreground latencies. Most other writes, such as flushes and compactions, happen asynchronously in the background, and foreground operations do not need to wait for them.
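
For context on the feature being documented, here is a hedged sketch of what enabling WAL failover on a multi-store node might look like. The store paths, addresses, and certificate directory are placeholders, and the exact flag syntax should be confirmed against the new page rather than taken from this comment.

```shell
# Start a node with two stores; if the disk backing one store stalls,
# WAL writes for that store fail over to the other store.
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
# For a single-store node, a separate disk can be named instead, e.g.
# --wal-failover=path=/mnt/wal-failover (syntax assumed; check the docs).
```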


My preference would be to exclude this graph, because I think it's more likely to confuse than anything else. It's confusing to show an instance of the node crashing because a stall was too prolonged.

Contributor Author


i've removed the image and also updated the text above it to rephrase so as not to reference the image, PTAL

> When the disk continues to be stalled for longer than the duration of `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`, the node goes down, and no more metrics data comes from that node.
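
To make that interaction concrete, here is a sketch of pairing WAL failover with a longer stall tolerance via the environment variable named above. The `40s` value is purely illustrative, not a recommendation from this PR, and the placeholders match the earlier sketch.

```shell
# Let the storage engine tolerate longer disk stalls before exiting, so a
# failed-over WAL can carry the node through a transient stall on the
# primary store. The 40s value is illustrative only.
COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT=40s \
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
```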


### 14. If there are more than 2 stores, will the WAL failover cascade from store A to B to C?

Yes, for example if store _A_'s disk stalls for `30s`, the WAL will failover to store _B_ after `100ms`. While store _A_ is still stalled at the `20s` mark, if store _B_'s disk fails, store _B_ will failover to store _C_. When _B_ fails, only _B_'s WAL write will failover to _C_; if _A_ is still down, _A_'s write will not failover to _C_.

This is confusing. I would not say this is a "yes" to the question. Store A will failover to store B, store B to store C, and store C to store A, but store A will never failover to store C.

Contributor Author


i replaced this sentence with what you wrote and the section now reads as follows, PTAL

> Store A will failover to store B, store B will failover to store C, and store C will failover to store A, but store A will never failover to store C.
>
> However, the WAL failback operation will not cascade back until all drives are available; that is, if store A's disk unstalls while store B is still stalled, store C will not fail back to store A until B also becomes available again. In other words, C must fail back to B, which must then fail back to A.
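
To ground the cascade in a concrete setup, here is a hypothetical three-store node. Assuming the stores map to A, B, and C in the order they are declared (an assumption, not something stated in this PR), a stall on the first store fails its WAL over to the second, a later stall on the second fails its WAL over to the third, and the first store's WAL never skips ahead to the third.

```shell
# Three stores on one node with WAL failover among them (paths are
# placeholders). Per the discussion above, failover forms a ring:
# A -> B, B -> C, C -> A, and failback unwinds in reverse once the
# stalled disks recover.
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --store=path=/mnt/data3 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
```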

@rmloveland requested a review from jbowens on November 14, 2024 19:52
@rmloveland
Contributor Author

@jbowens thanks for the review, i've made updates that hopefully address your feedback - PTAL!

@rmloveland requested a review from taroface on November 15, 2024 19:31
Contributor

@taroface left a comment


LGTM! Epic PR, just have some style suggestions and I think there's one draft line that needs to be removed.

rmloveland and others added 9 commits November 15, 2024 15:36
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
rmloveland and others added 10 commits November 15, 2024 15:44
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
@rmloveland
Contributor Author

> LGTM! Epic PR, just have some style suggestions and I think there's one draft line that needs to be removed.

thanks @taroface - the credit really goes to @dshjoshi for writing this in our wiki, I just ported it here and edited some stuff out b/c we don't support roachprod

in process of applying your edits now

rmloveland and others added 9 commits November 15, 2024 15:48
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch from 34a0128 to 9f09362 on November 15, 2024 22:42
@rmloveland enabled auto-merge (squash) on November 15, 2024 22:46
@rmloveland merged commit 0d4e85a into main on Nov 15, 2024
7 checks passed
@rmloveland deleted the 20241025-DOC-11199-wal-failover-playbook branch on November 15, 2024 22:56