Skip to content

Comments

Fix commandlog large-reply when using reply copy avoidance#2652

Merged
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
enjoy-binbin:fix_large_reply
Dec 9, 2025
Merged

Fix commandlog large-reply when using reply copy avoidance#2652
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
enjoy-binbin:fix_large_reply

Conversation

@enjoy-binbin
Copy link
Member

In #2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

In valkey-io#2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@codecov
Copy link

codecov bot commented Sep 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (ffdf222) to head (bf86728).
⚠️ Report is 117 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2652      +/-   ##
============================================
- Coverage     72.23%   72.12%   -0.12%     
============================================
  Files           127      127              
  Lines         70935    70942       +7     
============================================
- Hits          51239    51164      -75     
- Misses        19696    19778      +82     
Files with missing lines Coverage Δ
src/networking.c 88.31% <100.00%> (+0.12%) ⬆️

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 26, 2025
@madolson
Copy link
Member

@xbasel Will you get time to take a look at this?

@zuiderkwast
Copy link
Contributor

@enjoy-binbin Shall we merge this and include it in 9.0.1? I can approve it fast if we need it.

@enjoy-binbin
Copy link
Member Author

Shall we merge this and include it in 9.0.1? I can approve it fast if we need it.

Yes, we can do it. Although it is still lacking review, the code seems to be safe.

@zuiderkwast zuiderkwast moved this to In Progress in Valkey 9.0 Dec 8, 2025
@enjoy-binbin enjoy-binbin moved this from In Progress to To be backported in Valkey 9.0 Dec 9, 2025
@enjoy-binbin enjoy-binbin moved this from To be backported to 9.0.1 (WIP) in Valkey 9.0 Dec 9, 2025
@enjoy-binbin enjoy-binbin merged commit 48e0cbb into valkey-io:unstable Dec 9, 2025
58 of 61 checks passed
@github-project-automation github-project-automation bot moved this from 9.0.1 (WIP) to To be backported in Valkey 9.0 Dec 9, 2025
@enjoy-binbin enjoy-binbin deleted the fix_large_reply branch December 9, 2025 02:20
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Dec 9, 2025
…#2652)

In valkey-io#2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@zuiderkwast zuiderkwast moved this from To be backported to 9.0.1 (WIP) in Valkey 9.0 Dec 9, 2025
zuiderkwast pushed a commit that referenced this pull request Dec 9, 2025
In #2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@roshkhatri
Copy link
Member

roshkhatri commented Dec 10, 2025

I noticed that this commit is effecting the performance Get commands from 2.7M to 2.0M for 96 bytes data size, 9 IO-Threads and 10 Pipelining. this is about 25% drop

Screenshot 2025-12-10 at 2 25 38 PM

You can see more here

@enjoy-binbin Please can you take a look here

@enjoy-binbin
Copy link
Member Author

odd, the code is quite simple, thanks for the report, i will take a look later.

@ranshid
Copy link
Member

ranshid commented Dec 11, 2025

odd, the code is quite simple, thanks for the report, i will take a look later.

Isn;t this an issue since we access the string checking the length of it?

@madolson
Copy link
Member

Isn;t this an issue since we access the string checking the length of it?

Yeah, this is why I had asked Basel to take a look to see if there was anything we can do about this PR. We are paying a large performance penalty by doing the memory dereference here.

@madolson
Copy link
Member

Btw, Basel has since left Amazon so he probably won't follow up. This was a sticky issue we never really knew what to do about. The issue is we can't know the length of the SDS without de-referencing it at this point. We discussed doing it in the IO thread itself, but that would be a lot more work and require some mechanism to track the actual bytes written to the COB.

@enjoy-binbin
Copy link
Member Author

ok, i see, sorry i didn't notice it before. Should we revert it before we have a solution?

This was a sticky issue we never really knew what to do about. The issue is we can't know the length of the SDS without de-referencing it at this point. We discussed doing it in the IO thread itself, but that would be a lot more work and require some mechanism to track the actual bytes written to the COB.

Yes, i also tried doing it in the io thread before, it is indeed hard to do it.

@zuiderkwast
Copy link
Contributor

ok, i see, sorry i didn't notice it before. Should we revert it before we have a solution?

I don't think so. Correctness is more important than performance IMHO and I have already released this fix in 9.0.1.

When we have a fix, we can easily release another 9.0.2. Idea: Do this in the IO thread using an atomic variable or mutex.

@madolson
Copy link
Member

I don't think so. Correctness is more important than performance IMHO and I have already released this fix in 9.0.1.

I don't agree with this. If someone was relying on the performance, it's effectively going to break their application. I think we shouldn't have released this 9.0.1.

We might be able to prefetch it when it's given to the io thread as well. If we use a mutex there will more likely be contention.

@zuiderkwast
Copy link
Contributor

we shouldn't have released this 9.0.1

Fair enough. Yeah, if I'd known about the performance regression I had waited. But now that it's done, let's prioritize a fix and release it in 9.0.2 ASAP?

We might be able to prefetch it when it's given to the io thread as well

You mean prefetch earlier so when we access the length here, it's already prefetched? We already do that afaict: https://github.com/valkey-io/valkey/blob/9.0.1/src/memory_prefetch.c#L141

@zuiderkwast
Copy link
Contributor

The c->net_output_bytes_curr_cmd accounting also affects CLUSTER SLOT-STATS, so that was probably another bug that was fixed by this PR.

@madolson
Copy link
Member

Fair enough. Yeah, if I'd known about the performance regression I had waited. But now that it's done, let's prioritize a fix and release it in 9.0.2 ASAP?

Let's see if we can fix it first! We still have a problem. If we can't fix it we need to decide what to do now.

You mean prefetch earlier so when we access the length here, it's already prefetched? We already do that afaict: https://github.com/valkey-io/valkey/blob/9.0.1/src/memory_prefetch.c#L141

So the specific case that is failing is the 96 byte case. I wonder if that specific size the length is on a different cache line as the value ptr?

@zuiderkwast
Copy link
Contributor

So the specific case that is failing is the 96 byte case. I wonder if that specific size the length is on a different cache line as the value ptr?

You mean we prefetch val->ptr but the size is located at 1-2 bytes before that? It seems unlinkely that the cache line boundary would be exactly there, because allocations of this size are normally 16B aligned and the sds header is only 3 bytes.

The sds pointer is pointing a few bytes into the allocation. Do you think it would make a difference to round the address down before prefetching, to prefetch the actual sds allocation's address? Like valkey_prefetch((void *)((uintptr_t)(val->ptr) & ~7))).

aradz44 pushed a commit to aradz44/valkey that referenced this pull request Dec 23, 2025
…#2652)

In valkey-io#2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
jdheyburn pushed a commit to jdheyburn/valkey that referenced this pull request Jan 8, 2026
…#2652)

In valkey-io#2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Jan 12, 2026
…copy-avoidance

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes valkey-io#3043. Bug was introduced in valkey-io#2652.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Jan 14, 2026
…copy-avoidance (#3046)

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since #2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes #3043. Bug was introduced in #2652 (9.0.1).

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request Jan 26, 2026
…3086)

Currently after #2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before #2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable: 
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
zuiderkwast added a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 29, 2026
…copy-avoidance (valkey-io#3046)

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1).

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 29, 2026
…alkey-io#3086)

Currently after valkey-io#2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before valkey-io#2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable:
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
zuiderkwast added a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 30, 2026
…copy-avoidance (valkey-io#3046)

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1).

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 30, 2026
…alkey-io#3086)

Currently after valkey-io#2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before valkey-io#2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable:
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
zuiderkwast added a commit that referenced this pull request Feb 3, 2026
…copy-avoidance (#3046)

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since #2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes #3043. Bug was introduced in #2652 (9.0.1).

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request Feb 3, 2026
…3086)

Currently after #2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before #2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable:
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
…copy-avoidance (valkey-io#3046)

In releaseBufReferences(), clusterSlotStatsAddNetworkBytesOutForSlot() is called in the IO
thread after writing the reply to the client, and (since valkey-io#2652) it's also done in the normal
code path in the main thread (afterCommand() -> clusterSlotStatsAddNetworkBytesOutForUserClient()).
This means the output bytes are recorded twice in the cluster slot stats.

Fixes valkey-io#3043. Bug was introduced in valkey-io#2652 (9.0.1).

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
…alkey-io#3086)

Currently after valkey-io#2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before valkey-io#2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable: 
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: 9.0.1

Development

Successfully merging this pull request may close these issues.

5 participants