Reduce redundant call of prepareClientToWrite when call addReply* continuously. #670
Conversation
Force-pushed from 3ec1437 to 4549e46 (compare)
Codecov Report: all modified and coverable lines are covered by tests ✅

Coverage Diff (unstable → #670):
  Coverage: 70.17% → 70.20% (+0.02%)
  Files: 110 → 110
  Lines: 60077 → 60100 (+23)
  Hits: 42160 → 42194 (+34)
  Misses: 17917 → 17906 (-11)
Ping @valkey-io/core-team, could you take a look at whether this optimization is sensible?
Having a couple of unusual optimizations that are dangerous to get wrong (we won't install the correct handler) makes me uneasy. Do we know why that specific function is taking 7% of the CPU? Is there some cache thrashing going on that maybe we can fix more generically? I read through the code in prepareClientToWrite and it seemed fairly straightforward.
Is this just an estimate, or measured using the same tests? Can you share the raw test numbers before and after the change?
+1.
@madolson @PingXie @zuiderkwast Thanks for your review on this.
@madolson I did some checking; the write handler should only be installed on the first call.
The extra cost may be caused by L1 cache misses (see the profiling results below).
@PingXie The 7% improvement is based on this test-suite https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-set-1K-elements-smembers.yml.
I did some profiling of cache misses; the results below show where they occur.
Your code looks correct. However, there are other cases, such as hitting the CoB limit, that will cause your implementation to behave slightly differently.
This is interesting. I don't have a dedicated bare metal server to test on, but we are just calling this function a lot.
Some other observations while looking at the code more:
I wonder if we should try to optimize the codepath first?
Force-pushed from 9b5d3a8 to b7b83bd (compare)
…tinuously. Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com> Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
@madolson Thanks for your suggestions.
You mean we could move the logic below outside: if (!clientHasPendingReplies(c) && io_threads_op == IO_THREADS_OP_IDLE) putClientInPendingWriteQueue(c);
Could you be more specific?
Do you want to combine them with this patch, or keep them separate?
I assume this is the data cache misses?
+1 on 1) and 3) but not sure about 2).
LGTM. Thanks @lipzhu!
Can you confirm that you get the expected performance boost with these changes, and how much is it?
Yes.
@PingXie I tested 4 test suites that may benefit from this patch; they are all from the https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites repo.
This change is much simpler, so I'm OK merging this if we are still seeing a good performance boost.
Yes. I disassembled it for my x86 machine, and there were a lot of conditional jumps which might have been causing some issues on the instruction cache. I'm surprised there wasn't a better way to optimize it though.
You're right, I glanced at it briefly and did the math wrong. This shouldn't be a concern.
@madolson Thanks for your help. I've updated the performance boost info in the top comment. For the suggestion listed in #687, I'd like to find a benchmark test suite later that can help us prove it really works. What do you think?
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com> Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
I'm honestly okay just dropping that issue. I prefer an approach where we start with a benchmark we think is useful, and then try to optimize that case, like you did here. My suggestions were more to see if there was a way to have a more general optimization that might help the case you brought up.
Worth mentioning in the release notes as one of the performance optimizations?
There are always going to be a lot of little performance optimizations. In Redis we usually only listed large ones (maybe this counts as large?). I guess it's probably better to mark it and then we can decide later to remove them. |
…tinuously (#13412) This PR is based on the commits from PR valkey-io/valkey#670. ## Description While exploring hotspots by profiling some benchmark workloads, we noticed the high cycles ratio of `prepareClientToWrite`, taking about 9% of the CPU of the `smembers` and `lrange` commands. After a deep dive into the code logic, we found we can gain performance by reducing redundant calls of `prepareClientToWrite` when addReply* is called continuously. For example: in https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L1080-L1082, `prepareClientToWrite` is called three times in a row. --------- Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com> Co-authored-by: Lipeng Zhu <lipeng.zhu@intel.com> Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Description

While exploring hotspots by profiling some benchmark workloads, we noticed the high cycles ratio of `prepareClientToWrite`, taking about 9% of the CPU of the `smembers` and `lrange` commands. After a deep dive into the code logic, we found we can gain performance by reducing redundant calls of `prepareClientToWrite` when addReply* is called continuously. For example: in https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L1080-L1082, `prepareClientToWrite` is called three times in a row.

Profiling of Hotspots
Test Environment
Performance Result
Perf Boost
Test Scenario
1. Start server
2. Prepare test data
3. Run benchmark
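A sketch of what the "Run benchmark" step might look like (host, port, key names, and the memtier_benchmark flags are assumptions, not taken from this PR). To stay self-contained it only prints the invocations that would exercise the multi-bulk reply paths this patch optimizes:

```shell
# Illustrative only: print memtier_benchmark commands for the two
# reply-heavy workloads discussed in this PR (SMEMBERS, LRANGE).
HOST=127.0.0.1
PORT=6379
for CMD in "SMEMBERS myset" "LRANGE mylist 0 -1"; do
    echo memtier_benchmark -s "$HOST" -p "$PORT" \
        -c 50 -t 4 --test-time=60 --hide-histogram "--command=$CMD"
done
```

Dropping the leading `echo` would run the benchmarks against a live server; the actual test suites in the redis-benchmarks-specification repo define their own client counts and durations.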