-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: exercise follower replica IO overload #81834
Labels
A-admission-control
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Comments
tbg
added
the
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
label
May 25, 2022
This was referenced May 31, 2022
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Aug 11, 2022
This is is a less ad-hoc version of the experiment in cockroachdb#81289, where I messed with the EBS configuration. This can't be done programmatically, and so here we use an IO nemesis on n3 instead. Part of cockroachdb#79215. Closes cockroachdb#81834. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Aug 18, 2022
This is is a less ad-hoc version of the experiment in cockroachdb#81289, where I messed with the EBS configuration. This can't be done programmatically, and so here we use an IO nemesis on n3 instead. Part of cockroachdb#79215. Closes cockroachdb#81834. Release note: None
5 tasks
In #98576 we're seeing that the mentioned 8tb restore test with a disk throttle on one node reliably causes severe memory pressure, if not outright OOMs. (At the time of writing even without a disk throttle the test OOMs a fair amount of the time). |
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Mar 16, 2023
**This is for 23.2 only** We shouldn't artificially delay SST ingestion below raft because this exacerbates memory pressure (cockroachdb#81834). Anecdotally I see in my "typical experiment" (restores under I/O bandwidth constraints) that `PreIngestDelay` is mostly fairly small compared to the slowness that comes from write bandwidth overload itself, so at least in those experiments removing this has little to no effect. As we are also working on replication admission control[^1] and are looking into improving the memory footprint under unchecked overload[^2] now's a good time to rip this out as we'll be in a good place to address any detrimental fallout from doing so. [^1]: cockroachdb#95563 [^2]: cockroachdb#98576 Touches cockroachdb#81834. Touches cockroachdb#57247. Epic: none Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Apr 14, 2023
We shouldn't artificially delay SST ingestion below raft because this exacerbates memory pressure (cockroachdb#81834). Anecdotally I see in my "typical experiment" (restores under I/O bandwidth constraints) that `PreIngestDelay` is mostly fairly small compared to the slowness that comes from write bandwidth overload itself, so at least in those experiments removing this has little to no effect. As we are also working on replication admission control[^1] and are looking into improving the memory footprint under unchecked overload[^2] now's a good time to rip this out as we'll be in a good place to address any detrimental fallout from doing so. [^1]: cockroachdb#95563 [^2]: cockroachdb#98576 Touches cockroachdb#81834. Touches cockroachdb#57247. Epic: none Release note: None
craig bot
pushed a commit
that referenced
this issue
Apr 18, 2023
98762: kvserver: remove below-raft PreIngestDelay during SST application r=erikgrinaker a=tbg **This is for 23.2 only** We shouldn't artificially delay SST ingestion below raft because this exacerbates memory pressure (#81834). Anecdotally I see in my "typical experiment" (restores under I/O bandwidth constraints) that `PreIngestDelay` is mostly fairly small compared to the slowness that comes from write bandwidth overload itself, so at least in those experiments removing this has little to no effect. As we are also working on replication admission control[^1] and are looking into improving the memory footprint under unchecked overload[^2] now's a good time to rip this out as we'll be in a good place to address any detrimental fallout from doing so. [^1]: #95563 [^2]: #98576 Touches #81834. Fixes #57247. Epic: CRDB-25503 Release note: None 101381: ui: add trace rate option to stmt diagnostics r=maryliag a=maryliag Fixes #92415 Add option to select the trace rate for statement diagnostics collection directly on the Console. https://www.loom.com/share/beaf1ce16f7d4efc845056e33cb3bee1 <img width="587" alt="Screenshot 2023-04-12 at 4 09 50 PM" src="https://user-images.githubusercontent.com/1017486/231573175-e05ea6dd-d03d-4044-ae53-bb4cba55bb77.png"> <img width="591" alt="Screenshot 2023-04-12 at 4 10 00 PM" src="https://user-images.githubusercontent.com/1017486/231573195-5189fee3-8af2-4ada-af1b-8cc6fde5ceb2.png"> Release note (ui change): Add option to select the trace rate for statement diagnostics collection. 101650: tenantcostclient: mark test as being `no-remote` r=rail a=rickystewart This test is broken under remote execution. Epic: CRDB-17165 Release note: None 101706: ui: update side nav and titles to match r=maryliag a=maryliag Previously the values for Advanced Debug (side nav) and Advanced Debugging (page title) were not matching. This commit uses the name "`Advanced Debug` for both. Similarly, we were using Network Latency on the side nav and Network Diagnostics on the page title. Since we might want to show more than just latency, this commit updates the title to the more generic `Network`, to match how we name other pages (e.g. SQL Activity, Database, etc). This commit also removed an extra space on the filter on the Network page. Before <img width="968" alt="Screenshot 2023-04-17 at 6 02 56 PM" src="https://user-images.githubusercontent.com/1017486/232657702-c399f8b0-9755-4aff-8043-c1753c7ad913.png"> After <img width="967" alt="Screenshot 2023-04-17 at 10 43 18 PM" src="https://user-images.githubusercontent.com/1017486/232657728-7b907868-abc4-4926-80c8-5cf616ee2d58.png"> Epic: none Release note (ui change): Update Network Latency side nav name and Network Diagnostics page title to `Network`. Update the Advanced Debugging page title to `Advanced Debug`. 101750: keyvisjob: mark the keyvis job as idle when it is not doing useful work r=zachlite a=zachlite Resolves #101539 Epic: none Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Ricky Stewart <rickybstewart@gmail.com> Co-authored-by: Zach Lite <zach@cockroachlabs.com>
Let's narrow this to overload on a follower caused by regular traffic. Since that is not currently protected by replication admission control. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-admission-control
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
See #79215.
We need a good reproduction for a situation in which a follower is overloaded via incoming raft messages. We need to set this up in a way that cleanly separates the effect of the overload on indirect foreground traffic (i.e. writes reaching the follower as part of quorum, but with the lease elsewhere) and direct overload (i.e. reads/writes for which the overloaded node holds the lease).
This can then be used to prototype improvements to the current behavior.
Jira issue: CRDB-16089
Epic CRDB-39898
The text was updated successfully, but these errors were encountered: