Fix commandlog large-reply when using reply copy avoidance by enjoy-binbin · Pull Request #2652 · valkey-io/valkey

enjoy-binbin · 2025-09-26T08:01:49Z

In #2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-09-26T08:21:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (ffdf222) to head (bf86728).
⚠️ Report is 117 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2652      +/-   ##
============================================
- Coverage     72.23%   72.12%   -0.12%     
============================================
  Files           127      127              
  Lines         70935    70942       +7     
============================================
- Hits          51239    51164      -75     
- Misses        19696    19778      +82

Files with missing lines	Coverage Δ
src/networking.c	`88.31% <100.00%> (+0.12%)`	⬆️

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson · 2025-10-20T14:09:51Z

@xbasel Will you get time to take a look at this?

zuiderkwast · 2025-12-04T18:50:08Z

@enjoy-binbin Shall we merge this and include it in 9.0.1? I can approve it fast if we need it.

enjoy-binbin · 2025-12-04T23:25:14Z

Shall we merge this and include it in 9.0.1? I can approve it fast if we need it.

Yes, we can do it. Although it is still lacking review, the code seems to be safe.

tests/unit/commandlog.tcl

…#2652) In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>

In #2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>

roshkhatri · 2025-12-10T22:28:42Z

I noticed that this commit is effecting the performance Get commands from 2.7M to 2.0M for 96 bytes data size, 9 IO-Threads and 10 Pipelining. this is about 25% drop

You can see more here

@enjoy-binbin Please can you take a look here

enjoy-binbin · 2025-12-11T02:23:21Z

odd, the code is quite simple, thanks for the report, i will take a look later.

ranshid · 2025-12-11T07:34:21Z

odd, the code is quite simple, thanks for the report, i will take a look later.

Isn;t this an issue since we access the string checking the length of it?

madolson · 2025-12-11T07:35:30Z

Isn;t this an issue since we access the string checking the length of it?

Yeah, this is why I had asked Basel to take a look to see if there was anything we can do about this PR. We are paying a large performance penalty by doing the memory dereference here.

madolson · 2025-12-11T07:41:58Z

Btw, Basel has since left Amazon so he probably won't follow up. This was a sticky issue we never really knew what to do about. The issue is we can't know the length of the SDS without de-referencing it at this point. We discussed doing it in the IO thread itself, but that would be a lot more work and require some mechanism to track the actual bytes written to the COB.

enjoy-binbin · 2025-12-11T09:48:13Z

ok, i see, sorry i didn't notice it before. Should we revert it before we have a solution?

This was a sticky issue we never really knew what to do about. The issue is we can't know the length of the SDS without de-referencing it at this point. We discussed doing it in the IO thread itself, but that would be a lot more work and require some mechanism to track the actual bytes written to the COB.

Yes, i also tried doing it in the io thread before, it is indeed hard to do it.

zuiderkwast · 2025-12-11T11:31:32Z

ok, i see, sorry i didn't notice it before. Should we revert it before we have a solution?

I don't think so. Correctness is more important than performance IMHO and I have already released this fix in 9.0.1.

When we have a fix, we can easily release another 9.0.2. Idea: Do this in the IO thread using an atomic variable or mutex.

madolson · 2025-12-11T14:44:42Z

I don't think so. Correctness is more important than performance IMHO and I have already released this fix in 9.0.1.

I don't agree with this. If someone was relying on the performance, it's effectively going to break their application. I think we shouldn't have released this 9.0.1.

We might be able to prefetch it when it's given to the io thread as well. If we use a mutex there will more likely be contention.

zuiderkwast · 2025-12-11T15:08:28Z

we shouldn't have released this 9.0.1

Fair enough. Yeah, if I'd known about the performance regression I had waited. But now that it's done, let's prioritize a fix and release it in 9.0.2 ASAP?

We might be able to prefetch it when it's given to the io thread as well

You mean prefetch earlier so when we access the length here, it's already prefetched? We already do that afaict: https://github.com/valkey-io/valkey/blob/9.0.1/src/memory_prefetch.c#L141

zuiderkwast · 2025-12-11T15:55:11Z

The c->net_output_bytes_curr_cmd accounting also affects CLUSTER SLOT-STATS, so that was probably another bug that was fixed by this PR.

madolson · 2025-12-12T12:39:54Z

Fair enough. Yeah, if I'd known about the performance regression I had waited. But now that it's done, let's prioritize a fix and release it in 9.0.2 ASAP?

Let's see if we can fix it first! We still have a problem. If we can't fix it we need to decide what to do now.

You mean prefetch earlier so when we access the length here, it's already prefetched? We already do that afaict: https://github.com/valkey-io/valkey/blob/9.0.1/src/memory_prefetch.c#L141

So the specific case that is failing is the 96 byte case. I wonder if that specific size the length is on a different cache line as the value ptr?

zuiderkwast · 2025-12-12T13:08:18Z

So the specific case that is failing is the 96 byte case. I wonder if that specific size the length is on a different cache line as the value ptr?

You mean we prefetch val->ptr but the size is located at 1-2 bytes before that? It seems unlinkely that the cache line boundary would be exactly there, because allocations of this size are normally 16B aligned and the sds header is only 3 bytes.

The sds pointer is pointing a few bytes into the allocation. Do you think it would make a difference to round the address down before prefetching, to prefetch the actual sds allocation's address? Like valkey_prefetch((void *)((uintptr_t)(val->ptr) & ~7))).

…#2652) In valkey-io#2078, we did not report large reply when copy avoidance is allowed. This results in replies larger than 16384 not being recorded in the commandlog large-reply. This 16384 is controlled by the hidden config min-string-size-avoid-copy-reply. Signed-off-by: Binbin <binloveplay1314@qq.com>

…copy-avoidance In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652. Signed-off-by: Binbin <binloveplay1314@qq.com>

…copy-avoidance (#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since #2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes #3043. Bug was introduced in #2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…3086) Currently after #2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before #2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

…copy-avoidance (#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since #2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes #3043. Bug was introduced in #2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…3086) Currently after #2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before #2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

…copy-avoidance (valkey-io#3046) In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()). This means the output bytes are recorded twice in the cluster slot stats. Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1). Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

enjoy-binbin requested a review from xbasel September 26, 2025 08:01

github-actions bot assigned enjoy-binbin Sep 26, 2025

enjoy-binbin added this to Valkey 9.0 Sep 26, 2025

enjoy-binbin requested a review from madolson September 26, 2025 08:02

Fix test

bf86728

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 26, 2025

zuiderkwast approved these changes Dec 5, 2025

View reviewed changes

tests/unit/commandlog.tcl Show resolved Hide resolved

zuiderkwast moved this to In Progress in Valkey 9.0 Dec 8, 2025

enjoy-binbin moved this from In Progress to To be backported in Valkey 9.0 Dec 9, 2025

enjoy-binbin moved this from To be backported to 9.0.1 (WIP) in Valkey 9.0 Dec 9, 2025

enjoy-binbin merged commit 48e0cbb into valkey-io:unstable Dec 9, 2025
58 of 61 checks passed

github-project-automation bot moved this from 9.0.1 (WIP) to To be backported in Valkey 9.0 Dec 9, 2025

enjoy-binbin deleted the fix_large_reply branch December 9, 2025 02:20

enjoy-binbin mentioned this pull request Dec 9, 2025

Backport fixes and add release notes for 9.0.1 #2906

Merged

zuiderkwast moved this from To be backported to 9.0.1 (WIP) in Valkey 9.0 Dec 9, 2025

roshkhatri mentioned this pull request Dec 10, 2025

[Performance] Potential regression of 25% from commandlog large-reply when using reply copy avoidance #2926

Open

enjoy-binbin mentioned this pull request Dec 11, 2025

Allow partial sync after loading AOF with preamble #2366

Merged

dvkashapov mentioned this pull request Jan 9, 2026

Fix perf regression from #2652 #3035

Closed

zuiderkwast mentioned this pull request Jan 12, 2026

[BUG] Slot stats is recorded twice with copy-avoidance #3043

Closed

enjoy-binbin mentioned this pull request Jan 12, 2026

Avoid duplicate calculations of network-bytes-out in slot stats with copy-avoidance #3046

Merged

dvkashapov mentioned this pull request Jan 21, 2026

Perf: Track net bytes only if commandlog-request-larger-than != -1 #3086

Merged

Comments

Conversation

enjoy-binbin commented Sep 26, 2025

Uh oh!

codecov bot commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

madolson commented Oct 20, 2025

Uh oh!

zuiderkwast commented Dec 4, 2025

Uh oh!

enjoy-binbin commented Dec 4, 2025

Uh oh!

Uh oh!

Uh oh!

roshkhatri commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Dec 11, 2025

Uh oh!

ranshid commented Dec 11, 2025

Uh oh!

madolson commented Dec 11, 2025

Uh oh!

madolson commented Dec 11, 2025

Uh oh!

enjoy-binbin commented Dec 11, 2025

Uh oh!

zuiderkwast commented Dec 11, 2025

Uh oh!

madolson commented Dec 11, 2025

Uh oh!

zuiderkwast commented Dec 11, 2025

Uh oh!

zuiderkwast commented Dec 11, 2025

Uh oh!

madolson commented Dec 12, 2025

Uh oh!

zuiderkwast commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Sep 26, 2025 •

edited

Loading

roshkhatri commented Dec 10, 2025 •

edited

Loading