Fix commandlog large-reply when using reply copy avoidance#2652
Fix commandlog large-reply when using reply copy avoidance#2652enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
Conversation
In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #2652 +/- ##
============================================
- Coverage 72.23% 72.12% -0.12%
============================================
Files 127 127
Lines 70935 70942 +7
============================================
- Hits 51239 51164 -75
- Misses 19696 19778 +82
🚀 New features to boost your workflow:
|
|
@xbasel Will you get time to take a look at this? |
|
@enjoy-binbin Shall we merge this and include it in 9.0.1? I can approve it fast if we need it. |
Yes, we can do it. Although it is still lacking review, the code seems to be safe. |
…#2652) In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>
In #2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>
|
I noticed that this commit is effecting the performance Get commands from 2.7M to 2.0M for 96 bytes data size, 9 IO-Threads and 10 Pipelining. this is about 25% drop
You can see more here @enjoy-binbin Please can you take a look here |
|
odd, the code is quite simple, thanks for the report, i will take a look later. |
Isn;t this an issue since we access the string checking the length of it? |
Yeah, this is why I had asked Basel to take a look to see if there was anything we can do about this PR. We are paying a large performance penalty by doing the memory dereference here. |
|
Btw, Basel has since left Amazon so he probably won't follow up. This was a sticky issue we never really knew what to do about. The issue is we can't know the length of the SDS without de-referencing it at this point. We discussed doing it in the IO thread itself, but that would be a lot more work and require some mechanism to track the actual bytes written to the COB. |
|
ok, i see, sorry i didn't notice it before. Should we revert it before we have a solution?
Yes, i also tried doing it in the io thread before, it is indeed hard to do it. |
I don't think so. Correctness is more important than performance IMHO and I have already released this fix in 9.0.1. When we have a fix, we can easily release another 9.0.2. Idea: Do this in the IO thread using an atomic variable or mutex. |
I don't agree with this. If someone was relying on the performance, it's effectively going to break their application. I think we shouldn't have released this 9.0.1. We might be able to prefetch it when it's given to the io thread as well. If we use a mutex there will more likely be contention. |
Fair enough. Yeah, if I'd known about the performance regression I had waited. But now that it's done, let's prioritize a fix and release it in 9.0.2 ASAP?
You mean prefetch earlier so when we access the length here, it's already prefetched? We already do that afaict: https://github.com/valkey-io/valkey/blob/9.0.1/src/memory_prefetch.c#L141 |
|
The |
Let's see if we can fix it first! We still have a problem. If we can't fix it we need to decide what to do now.
So the specific case that is failing is the 96 byte case. I wonder if that specific size the length is on a different cache line as the value ptr? |
You mean we prefetch The sds pointer is pointing a few bytes into the allocation. Do you think it would make a difference to round the address down before prefetching, to prefetch the actual sds allocation's address? Like |
…#2652) In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>
…#2652) In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>
…copy-avoidance In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652. Signed-off-by: Binbin <binloveplay1314@qq.com>
…copy-avoidance (#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since #2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes #3043. Bug was introduced in #2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…3086) Currently after #2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before #2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…copy-avoidance (#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since #2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes #3043. Bug was introduced in #2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…3086) Currently after #2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before #2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

In #2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.