Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: exercise follower replica IO overload #81834

Open
tbg opened this issue May 25, 2022 · 2 comments
Open

roachtest: exercise follower replica IO overload #81834

tbg opened this issue May 25, 2022 · 2 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented May 25, 2022

See #79215.

We need a good reproduction for a situation in which a follower is overloaded via incoming raft messages. We need to set this up in a way that cleanly separates the effect of the overload on indirect foreground traffic (i.e. writes reaching the follower as part of quorum, but with the lease elsewhere) and direct overload (i.e. reads/writes for which the overloaded node holds the lease).

This can then be used to prototype improvements to the current behavior.

Jira issue: CRDB-16089

Epic CRDB-39898

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label May 25, 2022
@tbg tbg self-assigned this May 25, 2022
tbg added a commit to tbg/cockroach that referenced this issue Aug 11, 2022
This is is a less ad-hoc version of the experiment in cockroachdb#81289, where I
messed with the EBS configuration. This can't be done programmatically,
and so here we use an IO nemesis on n3 instead.

Part of cockroachdb#79215.
Closes cockroachdb#81834.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Aug 18, 2022
This is is a less ad-hoc version of the experiment in cockroachdb#81289, where I
messed with the EBS configuration. This can't be done programmatically,
and so here we use an IO nemesis on n3 instead.

Part of cockroachdb#79215.
Closes cockroachdb#81834.

Release note: None
@tbg
Copy link
Member Author

tbg commented Mar 16, 2023

In #98576 we're seeing that the mentioned 8tb restore test with a disk throttle on one node reliably causes severe memory pressure, if not outright OOMs. (At the time of writing even without a disk throttle the test OOMs a fair amount of the time).

tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2023
**This is for 23.2 only**

We shouldn't artificially delay SST ingestion below raft because this
exacerbates memory pressure (cockroachdb#81834).

Anecdotally I see in my "typical experiment" (restores under I/O
bandwidth constraints) that `PreIngestDelay` is mostly fairly small
compared to the slowness that comes from write bandwidth overload
itself, so at least in those experiments removing this has little
to no effect.

As we are also working on replication admission control[^1] and are
looking into improving the memory footprint under unchecked overload[^2]
now's a good time to rip this out as we'll be in a good place to address
any detrimental fallout from doing so.

[^1]: cockroachdb#95563
[^2]: cockroachdb#98576

Touches cockroachdb#81834.
Touches cockroachdb#57247.

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 14, 2023
We shouldn't artificially delay SST ingestion below raft because this
exacerbates memory pressure (cockroachdb#81834).

Anecdotally I see in my "typical experiment" (restores under I/O
bandwidth constraints) that `PreIngestDelay` is mostly fairly small
compared to the slowness that comes from write bandwidth overload
itself, so at least in those experiments removing this has little
to no effect.

As we are also working on replication admission control[^1] and are
looking into improving the memory footprint under unchecked overload[^2]
now's a good time to rip this out as we'll be in a good place to address
any detrimental fallout from doing so.

[^1]: cockroachdb#95563
[^2]: cockroachdb#98576

Touches cockroachdb#81834.
Touches cockroachdb#57247.

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Apr 18, 2023
98762: kvserver: remove below-raft PreIngestDelay during SST application r=erikgrinaker a=tbg

**This is for 23.2 only**

We shouldn't artificially delay SST ingestion below raft because this
exacerbates memory pressure (#81834).

Anecdotally I see in my "typical experiment" (restores under I/O
bandwidth constraints) that `PreIngestDelay` is mostly fairly small
compared to the slowness that comes from write bandwidth overload
itself, so at least in those experiments removing this has little
to no effect.

As we are also working on replication admission control[^1] and are
looking into improving the memory footprint under unchecked overload[^2]
now's a good time to rip this out as we'll be in a good place to address
any detrimental fallout from doing so.

[^1]: #95563
[^2]: #98576

Touches #81834.
Fixes #57247.

Epic: CRDB-25503
Release note: None

101381: ui: add trace rate option to stmt diagnostics r=maryliag a=maryliag

Fixes #92415

Add option to select the trace rate for statement
diagnostics collection directly on the Console.

https://www.loom.com/share/beaf1ce16f7d4efc845056e33cb3bee1

<img width="587" alt="Screenshot 2023-04-12 at 4 09 50 PM" src="https://user-images.githubusercontent.com/1017486/231573175-e05ea6dd-d03d-4044-ae53-bb4cba55bb77.png">

<img width="591" alt="Screenshot 2023-04-12 at 4 10 00 PM" src="https://user-images.githubusercontent.com/1017486/231573195-5189fee3-8af2-4ada-af1b-8cc6fde5ceb2.png">


Release note (ui change): Add option to select the trace rate for statement diagnostics collection.

101650: tenantcostclient: mark test as being `no-remote` r=rail a=rickystewart

This test is broken under remote execution.

Epic: CRDB-17165
Release note: None

101706: ui: update side nav and titles to match r=maryliag a=maryliag

Previously the values for Advanced Debug (side nav) and Advanced Debugging (page title) were not matching. This commit uses the name "`Advanced Debug` for both.

Similarly, we were using Network Latency on the side nav and Network Diagnostics on the page title. Since we might want to show more than just latency, this commit updates the title to the more generic `Network`, to match how we name other pages (e.g. SQL Activity, Database, etc).

This commit also removed an extra space on the filter on the Network page.

Before
<img width="968" alt="Screenshot 2023-04-17 at 6 02 56 PM" src="https://user-images.githubusercontent.com/1017486/232657702-c399f8b0-9755-4aff-8043-c1753c7ad913.png">


After
<img width="967" alt="Screenshot 2023-04-17 at 10 43 18 PM" src="https://user-images.githubusercontent.com/1017486/232657728-7b907868-abc4-4926-80c8-5cf616ee2d58.png">


Epic: none

Release note (ui change): Update Network Latency side nav name and Network Diagnostics page title to `Network`. Update the Advanced Debugging page title to `Advanced Debug`.

101750: keyvisjob: mark the keyvis job as idle when it is not doing useful work r=zachlite a=zachlite

Resolves #101539
Epic: none
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Ricky Stewart <rickybstewart@gmail.com>
Co-authored-by: Zach Lite <zach@cockroachlabs.com>
@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Oct 2, 2023

Let's narrow this to overload on a follower caused by regular traffic. Since that is not currently protected by replication admission control.

@nicktrav nicktrav added the T-admission-control Admission Control label Dec 21, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-admission-control Admission Control label Mar 26, 2024
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
No open projects
Status: Incoming
Development

Successfully merging a pull request may close this issue.

5 participants