-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConcurrentReadTxNoCopy share read tx buffer #13243
base: main
Are you sure you want to change the base?
Conversation
Thanks, really impressive results. My main worry about this changes is code readability, creating yet another read mode makes it even harder to understand. It's my personal opinion, but I would want to see how we can simply transaction code before we complicate it with this. Would be happy to discuss what can be done here. |
Thanks, I agree with your about readability. But, I think buffer copy is too heavy for rang interface, because it is unnecessary in many time (count only range、keys not exist range、keys not in buffer) |
Did you try mixed workload and see the performance? We need to make sure it works for different ratios of r/w operations. |
btw, i usually try it on machines with nvme ssd because it is a common configuration for a large cluster. |
|
This looks good. But I think we may need to reorganize the patch to make it more clear. Maybe we can have a discussion on this and write down a change proposal for this patch. |
From what I understand from the patch, the logic seems like to be: When we create the txReadBuffer, we don't copy the content. Instead, we hold a revision and uses this revision to get into the shared txReadBuffer to search. Since we have a revision saved when creating the txn, we can ignore the new revisions while going through the txReadBuffer, so that the txn will not get the new changes. I think this part of the logic is valid. But I guess Yang needs to make some changes to the patch to make it more clear. We also need to double-check to make sure the logic is correct. |
Yes, the logic which you understand is right. New revisions can be ignored, because index tree will not return newer revision than storeTxnRead created:
|
looks good, lets make the code more clear and do another round of review with more ppl involved. |
@gyuho Hi Gyuho, I think this patch looks good to me. But before we merge, we definitely need to clean it up. Can you take a look and see if there are any potential issues? |
@ptabor made some good suggestions today. For normal cases it is fine, but we need to go though the corner cases such as when compaction is in progress. Lets have a discussion on this later. |
Thanks. |
f24799c
to
ffb4631
Compare
d8cd859
to
2e6a943
Compare
2e6a943
to
ba3ca2f
Compare
ba3ca2f
to
c9ae54c
Compare
Can you guys take a look at this? @gyuho @ptabor @serathius From my side, I think this patch can further improve etcd performance and we can see if we can put it into the next release. Can we put this bug on our release plan? Btw, I will later double-check this logic again. Wilson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LeoYang90 can you please fix conflicts? Thanks!
c9ae54c
to
6b902eb
Compare
Already done. @spzala |
// t.backend.readTx.buf is used for ConcurrentReadTxNoCopyMode | ||
// t.backend.readTx.buf can not be reset simply | ||
// create a new one, old one can still use | ||
t.backend.readTx.buf = &txReadBuffer{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the line rt.buf.reset()
is removed in readTx.reset()
, instead you create a new txReadBuffer
here for safety. The logic looks good to me, but the comment above isn't clear enough, and other developers may get confused when they read the source code.
So suggest to reword the above comment to something like below:
The buf isn't reset in readTx.reset() because it may still be used by another range
sessions at the moment. But it's safe to create a new txReadBuffer, because the old
one keeps unchanged.
@@ -74,7 +76,15 @@ func (baseReadTx *baseReadTx) UnsafeForEach(bucket Bucket, visitor func(k, v []b | |||
return baseReadTx.buf.ForEach(bucket, visitor) | |||
} | |||
|
|||
func bytesToRev(bytes []byte) int64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is a little confusing, because it has the same name but different logic as other implementations, such as revision.go#L56.
So suggest to rename it to something like getMainRevFromBytes
.
if len(k2) != 0 { | ||
// If rate of buffRangeHitCounter/rangeTotalCounter is high, we should not use ConcurrentReadTxNoCopyMode | ||
buffRangeHitCounter.Inc() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see any need to handle single key range separately, and we should also try to get the k/v from the buf firstly.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
SharedBufReadTxMode make read faster because there is no copy of read buffer.
But, I think this can be done further. For expensive read request, we can use ConcurrentReadTx to copy read buffer. For regular read request,we can just share read buffer with less lock time:
Sharing read buffer means read buffer can not simply reset in committing, we need to create a new one:
For txn rangeKeys, we can determine use ConcurrentReadTx or ConcurrentReadTxNoCopy on the basis of limit:
Benchmark Result:
release-3.5.0:
This Version: