Add more docs on WAL failover #19052
this PR adds a new docs page 'WAL failover' that content-wise is 90% the same as the 'WAL failover playbook' on the wiki, but with some edits to make it appropriate for our public-facing docs - notably no use of . once we are happy with the contents of this PR i plan to backport it to v24.1 and v24.2 docs as well, assuming WAL failover is going GA for those versions too (please let me know if not)
Fixes DOC-11199

Summary of changes:

- Add a new 'WAL Failover' page to the 'Self-Hosted Deployments' section
- Update `cockroach start` docs to note that it has the basic info, but to see the 'WAL Failover Playbook' for more detailed instructions
- Mark WAL failover as GA (aka no longer in Preview)
- `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.

`storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, you probably care about how long it remains non-zero, because it provides an indication of the health of the primary store.
you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.
This metric is a counter, so once non-zero it will remain non-zero until the process restarts. The operator probably cares about the rate that it increases. Any increase indicates a failover event.
thanks for the correction - updated to the following - PTAL
If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
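Since the counter never resets while the process is up, the signal is the delta between successive scrapes, not the value itself. A minimal sketch of that idea follows; `failover_intervals` and the `samples` list are hypothetical names for illustration, not CockroachDB or Prometheus APIs:

```python
# Illustrative sketch (not CockroachDB code): detecting WAL failover
# activity from a cumulative counter by watching its rate of increase.
# `samples` stands in for successive scrapes of the cumulative metric
# storage.wal.failover.secondary.duration (nanoseconds).

def failover_intervals(samples):
    """Return (scrape_index, delta) pairs for intervals in which the
    cumulative counter increased, i.e. time was spent on the secondary."""
    events = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta > 0:
            events.append((i, delta))
    return events

# A non-zero value alone does not mean an ongoing failover; once the
# counter becomes non-zero it stays non-zero. Only an increase matters.
scrapes = [0, 0, 0, 5_000_000_000, 5_000_000_000, 5_000_000_000]
print(failover_intervals(scrapes))  # one event, between scrapes 2 and 3
```

In a real monitoring setup this is the same reasoning a rate-over-time query on the counter would perform.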
src/current/v24.3/wal-failover.md
Outdated
## Why WAL Failover?

In cloud environments, transient [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) are common, often lasting on the order of several seconds. This will negatively impact latency for the user-facing foreground workload. In the field, we have observed that stalls occur most frequently while writing to the WAL. While we cannot prevent disk stalls, we can minimize their impact on CockroachDB. That is where WAL failover comes into play.
In the field, we have observed that stalls occur most frequently while writing to the WAL
This sentence isn't quite right. Stalls apply to the various files of the storage engine with equal frequency, but a stall of the write-ahead log is the most impactful to foreground latencies. (Because most other writes, like those of flushes and compactions happen asynchronously in the background and foreground operations do not need to wait for them.)
Updated to the following based on your comment - PTAL
In the field, we have observed that stalls while writing to the WAL are the most impactful to foreground latencies. Most other writes, such as flushes and compactions, happen asynchronously in the background, and foreground operations do not need to wait for them.
My preference would be to exclude this graph, because I think it's more likely to confuse than anything else. It's confusing to show an instance of the node crashing because a stall was too prolonged.
i've removed the image and also updated the text above it to rephrase so as not to reference the image, PTAL
When the disk continues to be stalled for longer than the duration of `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`, the node goes down, and there is no more metrics data coming from that node.
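A toy model of that mechanism may make the behavior easier to picture. This is an illustrative sketch only, not CockroachDB's implementation; `sync_with_watchdog`, `DiskStallFatalError`, and the `2.0` second bound are all invented for the example (the real limit is the env-tunable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`):

```python
# Illustrative sketch: a sync that stalls past an upper bound is treated
# as fatal, which is why a node whose disk stays stalled past the limit
# goes down and stops reporting metrics entirely.

MAX_SYNC_DURATION = 2.0  # seconds; hypothetical value for the sketch

class DiskStallFatalError(Exception):
    """Stands in for the node crashing on a prolonged disk stall."""

def sync_with_watchdog(sync_duration, max_duration=MAX_SYNC_DURATION):
    """Simulate a WAL sync: succeed if it finishes within the bound,
    otherwise raise, standing in for the process terminating."""
    if sync_duration > max_duration:
        raise DiskStallFatalError(
            f"sync stalled for {sync_duration}s > {max_duration}s limit")
    return "synced"
```

WAL failover reduces how often this limit is ever reached, because WAL writes move to the secondary store long before the bound expires.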
src/current/v24.3/wal-failover.md
Outdated
### 14. If there are more than 2 stores, will the WAL failover cascade from store A to B to C?

Yes, for example if store _A_'s disk stalls for `30s`, the WAL will failover to store _B_ after `100ms`. While store _A_ is still stalled at the `20s` mark, if store _B_'s disk stalls, store _B_ will failover to store _C_. When _B_ fails, only _B_'s WAL writes will failover to _C_; if _A_ is still down, _A_'s writes will not failover to _C_.
This is confusing. I would not say this is a "yes" to the question. Store A will failover to store B, store B to store C, and store C to store A, but store A will never failover to store C.
i replaced this sentence with what you wrote and the section now reads as follows, PTAL
Store A will failover to store B, store B will failover to store C, and store C will failover to store A, but store A will never failover to store C.
However, the WAL failback operation will not cascade back until all drives are available - that is, if store A's disk unstalls while store B is still stalled, store C will not failback to store A until B also becomes available again. In other words, C must failback to B, which must then failback to A.
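The one-hop ring pairing described above can be sketched as a toy model. The semantics are assumed from this thread, and `RING`, `secondary`, and `wal_target` are hypothetical names invented for the example, not CockroachDB APIs:

```python
# Illustrative sketch of the among-stores failover ring: each store has
# exactly one designated secondary (the next store in the ring), and a
# single store's WAL writes move at most one hop; they never cascade.

RING = ["A", "B", "C"]

def secondary(store):
    """The next store in the ring is the only failover target."""
    return RING[(RING.index(store) + 1) % len(RING)]

def wal_target(store, stalled):
    """Where `store`'s WAL writes land, given the set of stalled disks."""
    if store not in stalled:
        return store                 # healthy: write to the local disk
    if secondary(store) not in stalled:
        return secondary(store)      # one hop to the designated secondary
    return None                      # both stalled: no second hop occurs

print(wal_target("A", {"A"}))        # B
print(wal_target("C", {"C"}))        # A  (the ring wraps around)
print(wal_target("A", {"A", "B"}))   # None: A never fails over to C
```

The last case is the point of the correction: store _A_'s writes go to _B_ or nowhere, even while _B_'s own writes are landing on _C_.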
@jbowens thanks for the review, i've made updates that hopefully address your feedback - PTAL!
LGTM! Epic PR, just have some style suggestions and I think there's one draft line that needs to be removed.
Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com>
thanks @taroface - the credit really goes to @dshjoshi for writing this in our wiki, I just ported it here and edited some stuff out b/c we don't support . in process of applying your edits now