Add more docs on WAL failover #19052


Merged: rmloveland merged 31 commits into main on Nov 15, 2024

Conversation

Contributor

@rmloveland commented on Oct 25, 2024

Fixes DOC-11199

Summary of changes:

  • Add a new 'WAL Failover' page to the 'Self-Hosted Deployments' section

  • Update cockroach start docs to note that it has the basic info, but
    to see the 'WAL Failover Playbook' for more detailed instructions

  • Mark WAL failover as GA (aka no longer in Preview)

@rmloveland marked this pull request as a draft on October 25, 2024 18:50

github-actions bot commented Oct 25, 2024

Files changed:


netlify bot commented Oct 25, 2024

Deploy Preview for cockroachdb-interactivetutorials-docs ready!

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/6737ce8b6aa520000852e4e3
- 😎 Deploy Preview: https://deploy-preview-19052--cockroachdb-interactivetutorials-docs.netlify.app


netlify bot commented Oct 25, 2024

Deploy Preview for cockroachdb-api-docs canceled.

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-api-docs/deploys/6737ce8b8d63cc00086c4282


netlify bot commented Oct 25, 2024

Netlify Preview

- 🔨 Latest commit: 0914cae
- 🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/6737ce8b10b28e0008d6895e
- 😎 Deploy Preview: https://deploy-preview-19052--cockroachdb-docs.netlify.app

@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch 4 times, most recently from e574a36 to f251c97, on November 11, 2024 19:05
@rmloveland marked this pull request as ready for review on November 11, 2024 19:06
@rmloveland
Contributor Author

hi @dshjoshi and @jbowens!

this PR adds a new docs page, 'WAL failover', that is content-wise about 90% the same as the 'WAL failover playbook' on the wiki, with some edits to make it appropriate for our public-facing docs: notably no use of roachprod, a bit of copy editing, and added links to related pages in our docs

once we are happy with the contents of this PR, I plan to backport it to the v24.1 and v24.2 docs as well, assuming WAL failover is going GA for those versions too (please let me know if not)

@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch from f251c97 to b608970 on November 11, 2024 19:23
- `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.

The `storage.wal.failover.secondary.duration` metric is the primary one to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.

> you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.

This metric is a counter, so once non-zero it will remain non-zero until the process restarts. The operator probably cares about the rate that it increases. Any increase indicates a failover event.

Contributor Author


thanks for the correction - updated to the following - PTAL

> If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
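
For reference, here is a minimal sketch of how an operator might watch that rate from the command line. It assumes the node's DB Console HTTP endpoint is reachable at `localhost:8080` and that the Prometheus-format endpoint reports these metrics with underscores in place of dots; neither detail comes from this PR, so verify both against the docs page.

```shell
# Sample the WAL failover metrics twice, 60 seconds apart. Any growth in
# storage_wal_failover_secondary_duration between the samples means time
# was spent writing the WAL to the secondary store during that window.
curl -s http://localhost:8080/_status/vars | grep 'storage_wal_failover'
sleep 60
curl -s http://localhost:8080/_status/vars | grep 'storage_wal_failover'
```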


## Why WAL Failover?

In cloud environments, transient [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) are common, often lasting on the order of several seconds. These stalls negatively impact latency for the user-facing foreground workload. In the field, we have observed that stalls occur most frequently while writing to the WAL. While we cannot prevent disk stalls, we can minimize their impact on CockroachDB. That is where WAL failover comes into play.

> In the field, we have observed that stalls occur most frequently while writing to the WAL

This sentence isn't quite right. Stalls apply to the various files of the storage engine with equal frequency, but a stall of the write-ahead log is the most impactful to foreground latencies. (Because most other writes, like those of flushes and compactions happen asynchronously in the background and foreground operations do not need to wait for them.)

Contributor Author


Updated to the following based on your comment - PTAL

> In the field, we have observed that stalls while writing to the WAL are the most impactful to foreground latencies. Most other writes, such as flushes and compactions, happen asynchronously in the background, and foreground operations do not need to wait for them.
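
For context on the feature being documented, here is a hedged sketch of what enabling WAL failover on a multi-store node might look like. The store paths, addresses, and certificate directory are placeholders, and the exact flag syntax should be confirmed against the new page rather than taken from this comment.

```shell
# Start a node with two stores; if the disk backing one store stalls,
# WAL writes for that store fail over to the other store.
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
# For a single-store node, a separate disk can be named instead, e.g.
# --wal-failover=path=/mnt/wal-failover (syntax assumed; check the docs).
```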


My preference would be to exclude this graph, because I think it's more likely to confuse than anything else. It's confusing to show an instance of the node crashing because a stall was too prolonged.

Contributor Author


i've removed the image and also updated the text above it to rephrase so as not to reference the image, PTAL

> When the disk continues to be stalled for longer than the duration of `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`, the node goes down, and no more metrics data comes from that node.
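
To make that interaction concrete, here is a sketch of pairing WAL failover with a longer stall tolerance via the environment variable named above. The `40s` value is purely illustrative, not a recommendation from this PR, and the placeholders match the earlier sketch.

```shell
# Let the storage engine tolerate longer disk stalls before exiting, so a
# failed-over WAL can carry the node through a transient stall on the
# primary store. The 40s value is illustrative only.
COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT=40s \
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
```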


### 14. If there are more than 2 stores, will the WAL failover cascade from store A to B to C?

Yes, for example if store _A_'s disk stalls for `30s`, the WAL will failover to store _B_ after `100ms`. While store _A_ is still stalled at the `20s` mark, if store _B_'s disk fails, store _B_ will failover to store _C_. When _B_ fails, only _B_'s WAL write will failover to _C_; if _A_ is still down, _A_'s write will not failover to _C_.

This is confusing. I would not say this is a "yes" to the question. Store A will failover to store B, store B to store C, and store C to store A, but store A will never failover to store C.

Contributor Author


i replaced this sentence with what you wrote and the section now reads as follows, PTAL

> Store A will failover to store B, store B will failover to store C, and store C will failover to store A, but store A will never failover to store C.
>
> However, the WAL failback operation will not cascade back until all drives are available; that is, if store A's disk unstalls while store B is still stalled, store C will not fail back to store A until B also becomes available again. In other words, C must fail back to B, which must then fail back to A.
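
To ground the cascade in a concrete setup, here is a hypothetical three-store node. Assuming the stores map to A, B, and C in the order they are declared (an assumption, not something stated in this PR), a stall on the first store fails its WAL over to the second, a later stall on the second fails its WAL over to the third, and the first store's WAL never skips ahead to the third.

```shell
# Three stores on one node with WAL failover among them (paths are
# placeholders). Per the discussion above, failover forms a ring:
# A -> B, B -> C, C -> A, and failback unwinds in reverse once the
# stalled disks recover.
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --store=path=/mnt/data3 \
  --wal-failover=among-stores \
  --advertise-addr=<node address> \
  --join=<join list> \
  --certs-dir=certs
```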

@rmloveland requested a review from jbowens on November 14, 2024 19:52
@rmloveland
Contributor Author

@jbowens thanks for the review, i've made updates that hopefully address your feedback - PTAL!

@rmloveland requested a review from taroface on November 15, 2024 19:31
Contributor

@taroface left a comment


LGTM! Epic PR, just have some style suggestions and I think there's one draft line that needs to be removed.

rmloveland and others added 9 commits November 15, 2024 15:36
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
rmloveland and others added 10 commits November 15, 2024 15:44
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
@rmloveland
Contributor Author

> LGTM! Epic PR, just have some style suggestions and I think there's one draft line that needs to be removed.

thanks @taroface - the credit really goes to @dshjoshi for writing this in our wiki, I just ported it here and edited some stuff out b/c we don't support roachprod

in process of applying your edits now

rmloveland and others added 9 commits November 15, 2024 15:48
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
@rmloveland force-pushed the 20241025-DOC-11199-wal-failover-playbook branch from 34a0128 to 9f09362 on November 15, 2024 22:42
@rmloveland enabled auto-merge (squash) on November 15, 2024 22:46
@rmloveland merged commit 0d4e85a into main on Nov 15, 2024
7 checks passed
@rmloveland deleted the 20241025-DOC-11199-wal-failover-playbook branch on November 15, 2024 22:56