ConcurrentReadTxNoCopy share read tx buffer #13243

Open
wants to merge 1 commit into base: main

Conversation


@LeoYang90 LeoYang90 commented Jul 26, 2021

SharedBufReadTxMode makes reads faster because there is no copy of the read buffer.

But I think this can be taken further. For expensive read requests, we can use ConcurrentReadTx to copy the read buffer. For regular read requests, we can simply share the read buffer, with less lock time:

func (b *backend) ConcurrentReadTxNoCopy() ReadTx {
	b.readTx.RLock()
	defer b.readTx.RUnlock()

	b.readTx.txWg.Add(1)

	// share b.readTx.buf instead of copying it
	return &concurrentReadTx{
		baseReadTx: baseReadTx{
			buf:          b.readTx.buf,
			bufferCopied: false,
			mu:           b.readTx.mu,
			txMu:         b.readTx.txMu,
			tx:           b.readTx.tx,
			buckets:      b.readTx.buckets,
			txWg:         b.readTx.txWg,
		},
	}
}
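
For context, a minimal usage sketch of the proposed API; the noCopyBackend interface and rangeNoCopy helper below are illustrative assumptions, not part of the patch:

package mvcc

import (
	"go.etcd.io/etcd/server/v3/storage/backend"
	"go.etcd.io/etcd/server/v3/storage/schema"
)

// noCopyBackend is an assumed extension of backend.Backend that carries the
// method proposed in this PR; the upstream Backend interface does not declare it.
type noCopyBackend interface {
	backend.Backend
	ConcurrentReadTxNoCopy() backend.ReadTx
}

// rangeNoCopy sketches the intended lifecycle of a shared-buffer read tx:
// RLock is a no-op for concurrent read txs, and RUnlock marks the reader as
// done on txWg so the next buffered commit is not blocked by it.
func rangeNoCopy(b noCopyBackend, startRev, endRev []byte, limit int64) ([][]byte, [][]byte) {
	tx := b.ConcurrentReadTxNoCopy() // shares readTx.buf instead of copying it
	tx.RLock()
	defer tx.RUnlock()
	return tx.UnsafeRange(schema.Key, startRev, endRev, limit)
}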

func (baseReadTx *baseReadTx) UnsafeRangeWithLock(bucketType Bucket, key, endKey []byte, limit int64, b Backend) ([][]byte, [][]byte) {
	...
	var keys [][]byte
	var vals [][]byte
	baseReadTx.mu.RLock()
	lockHeld := true
	readBuff := b.ReadTx().UnsafeGetBuffer().(*txReadBuffer)
	if baseReadTx.buf != readBuff {
		// backend.readTx.buf has already been replaced with a new one by unsafeCommit;
		// batchTxBuffered will write back to the new readTx.buf instead of this baseReadTx.buf,
		// so baseReadTx.buf does not need to be protected by readTx.mu.
		lockHeld = false
		baseReadTx.mu.RUnlock()
	}
	if keyRev >= baseReadTx.buf.bufMinRev || endKeyRev >= baseReadTx.buf.bufMinRev {
		keys, vals = baseReadTx.buf.Range(bucketType, key, endKey, limit)
		if int64(len(keys)) == limit {
			if lockHeld {
				baseReadTx.mu.RUnlock()
			}
			return keys, vals
		}
	}
	if lockHeld {
		baseReadTx.mu.RUnlock()
	}

	...
}
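
The UnsafeGetBuffer call above is not part of the upstream ReadTx interface, so the patch has to add it; a minimal sketch of the assumed accessor (with readTx.buf changed to a *txReadBuffer, as this PR does) could look like:

// UnsafeGetBuffer (assumed accessor, would also need to be added to the
// ReadTx interface) exposes the backend's current shared read buffer so a
// no-copy reader can detect, by pointer comparison, that unsafeCommit has
// already swapped in a fresh buffer. The caller is expected to hold
// readTx.mu, as UnsafeRangeWithLock does above.
func (rt *readTx) UnsafeGetBuffer() interface{} {
	return rt.buf
}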

Sharing the read buffer means it cannot simply be reset during commit; we need to create a new one instead:

func (t *batchTxBuffered) unsafeCommit(stop bool) {
	if t.backend.readTx.tx != nil {
		...
		t.backend.readTx.reset()
		// t.backend.readTx.buf is used for ConcurrentReadTxNoCopyMode
		// and cannot simply be reset here;
		// create a new one, the old one can still be used
		t.backend.readTx.buf = &txReadBuffer{
			txBuffer:   txBuffer{make(map[BucketID]*bucketBuffer)},
			bufVersion: 0,
			bufMinRev:  0,
		}
	}

	t.batchTx.commit(stop)

	if !stop {
		t.backend.readTx.tx = t.backend.begin(false)
	}
}
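
For reference, a sketch of the buffer shape this change assumes: bufVersion already exists upstream, while bufMinRev is a field introduced by this PR (names as used in the snippets above):

// txReadBuffer accesses buffered updates (sketch of the assumed layout).
type txReadBuffer struct {
	txBuffer
	// bufVersion is used to check whether the buffer was modified recently
	// (already present upstream).
	bufVersion uint64
	// bufMinRev (added by this PR) records the smallest main revision still
	// held in the buffer; requests that only need older revisions can skip
	// the shared buffer, since those keys can only live in boltdb.
	bufMinRev int64
}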

For txn rangeKeys, we can choose between ConcurrentReadTx and ConcurrentReadTxNoCopy based on the limit:

func (tr *storeTxnRead) rangeKeys(ctx context.Context, key, end []byte, curRev int64, ro RangeOptions) (*RangeResult, error) {
    ...
    limit := int(ro.Limit)
	if limit <= 0 || limit > len(revpairs) {
		limit = len(revpairs)
	}

	if tr.readMode == ConcurrentReadTxNoCopyMode && tr.tx == nil { // ConcurrentReadTxNoCopyMode
		var tx backend.ReadTx
		if limit >= ExpensiveReadLimit { // expensive request
			tx = tr.s.b.ConcurrentReadTx()
		} else {
			tx = tr.s.b.ConcurrentReadTxNoCopy()
		}

		tx.RLock() // RLock is no-op. concurrentReadTx does not need to be locked after it is created.
		tr.tx = tx
	}
    ...
}
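
ExpensiveReadLimit is referenced above but not defined in the snippet; a sketch of how the threshold might be declared, with an illustrative value rather than one taken from the patch:

// ExpensiveReadLimit (assumed constant, illustrative value): ranges expected
// to return at least this many keys pay for a private buffer copy via
// ConcurrentReadTx, while smaller ranges share the buffer via
// ConcurrentReadTxNoCopy.
const ExpensiveReadLimit = 1000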

Benchmark Result:

./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4300000000 --key-size=128 --val-size=10240  --total=1500000 --rate=40000

release-3.5.0:

Summary:
  Total:  98.8474 secs.
  Slowest:  0.0901 secs.
  Fastest:  0.0003 secs.
  Average:  0.0131 secs.
  Stddev: 0.0079 secs.
  Requests/sec: 15174.9014

Response time histogram:
  0.0003 [1]  |
  0.0093 [515654] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0183 [810071] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0272 [80097]  |∎∎∎
  0.0362 [57508]  |∎∎
  0.0452 [23321]  |∎
  0.0542 [9591] |
  0.0632 [2473] |
  0.0722 [823]  |
  0.0811 [179]  |
  0.0901 [282]  |

Latency distribution:
  10% in 0.0059 secs.
  25% in 0.0063 secs.
  50% in 0.0136 secs.
  75% in 0.0144 secs.
  90% in 0.0207 secs.
  95% in 0.0296 secs.
  99% in 0.0444 secs.
  99.9% in 0.0627 secs.

This Version:

Summary:
  Total:  60.6494 secs.
  Slowest:  0.1726 secs.
  Fastest:  0.0004 secs.
  Average:  0.0079 secs.
  Stddev: 0.0069 secs.
  Requests/sec: 24732.2982

Response time histogram:
  0.0004 [1]  |
  0.0176 [1438834]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0348 [53124]  |∎
  0.0520 [4684] |
  0.0693 [355]  |
  0.0865 [200]  |
  0.1037 [202]  |
  0.1209 [995]  |
  0.1382 [633]  |
  0.1554 [652]  |
  0.1726 [320]  |

Latency distribution:
  10% in 0.0054 secs.
  25% in 0.0058 secs.
  50% in 0.0063 secs.
  75% in 0.0076 secs.
  90% in 0.0120 secs.
  95% in 0.0161 secs.
  99% in 0.0286 secs.
  99.9% in 0.1288 secs.

@serathius
Member

Thanks, really impressive results. My main worry about this change is code readability: creating yet another read mode makes it even harder to understand. It's my personal opinion, but I would want to see how we can simplify the transaction code before we complicate it with this. Would be happy to discuss what can be done here.

@LeoYang90
Author

Thanks, really impressive results. My main worry about this change is code readability: creating yet another read mode makes it even harder to understand. It's my personal opinion, but I would want to see how we can simplify the transaction code before we complicate it with this. Would be happy to discuss what can be done here.

Thanks, I agree with you about readability. But I think the buffer copy is too heavy for the range interface, because it is unnecessary much of the time (count-only ranges, ranges over keys that do not exist, keys not in the buffer).
It also makes put operations wait a long time for the readTx lock at the end of a txn.

@wilsonwang371
Contributor

Did you try a mixed workload and measure the performance? We need to make sure it works for different ratios of r/w operations.

@wilsonwang371
Contributor

Btw, I usually try it on machines with NVMe SSDs because that is a common configuration for a large cluster.

@LeoYang90
Author

LeoYang90 commented Aug 26, 2021

$cat /proc/cpuinfo | grep "physical id" | uniq | wc -l
4
$cat /proc/cpuinfo | grep "cpu cores" | uniq
cpu cores	: 24
$cat /proc/cpuinfo | grep 'model name' |uniq
model name	: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
$cat /proc/meminfo | grep MemTotal
MemTotal:       527541692 kB

$sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : S2UDNX0J500042
mn        : SAMSUNG MZQLW3T8HMLP-00003
fr        : CXV8701Q

[benchmark screenshots: master-53d234f1f vs. this version]
@wilsonwang371 @serathius

@wilsonwang371
Contributor

This looks good, but I think we may need to reorganize the patch to make it clearer. Maybe we can have a discussion on this and write down a change proposal for this patch.

@wilsonwang371
Contributor

wilsonwang371 commented Sep 1, 2021

From what I understand of the patch, the logic seems to be:

When we create the txReadBuffer, we don't copy the content. Instead, we hold a revision and use it when searching the shared txReadBuffer. Since the revision is saved when the txn is created, we can ignore newer revisions while going through the txReadBuffer, so the txn will not see the new changes.

I think this part of the logic is valid, but I guess Yang needs to make some changes to the patch to make it clearer. We also need to double-check that the logic is correct.

FYI @gyuho @ptabor @lilic
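
A standalone sketch of the filtering idea described above, with assumed names; the patch itself relies on the index only handing out revisions at or below the txn's revision, as the reply below shows:

package sketch

// bufEntry is an assumed, simplified view of one buffered key/value pair.
type bufEntry struct {
	mainRev int64
	value   []byte
}

// rangeSharedBuf keeps only the entries visible to a txn created at txnRev:
// anything written back with a newer main revision is skipped, so later
// commits stay invisible to a reader that shares the buffer.
func rangeSharedBuf(entries []bufEntry, txnRev int64) [][]byte {
	var vals [][]byte
	for _, e := range entries {
		if e.mainRev > txnRev {
			continue
		}
		vals = append(vals, e.value)
	}
	return vals
}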

@LeoYang90
Author

From what I understand of the patch, the logic seems to be:

When we create the txReadBuffer, we don't copy the content. Instead, we hold a revision and use it when searching the shared txReadBuffer. Since the revision is saved when the txn is created, we can ignore newer revisions while going through the txReadBuffer, so the txn will not see the new changes.

I think this part of the logic is valid, but I guess Yang needs to make some changes to the patch to make it clearer. We also need to double-check that the logic is correct.

FYI @gyuho @ptabor @lilic

Yes, your understanding of the logic is right. New revisions can be ignored because the index tree will never return a revision newer than the one captured when storeTxnRead was created:

func (s *store) Read(mode ReadTxMode, trace *traceutil.Trace) TxnRead {
	s.revMu.RLock()
	...

	firstRev, rev := s.compactMainRev, s.currentRev
	return newMetricsTxnRead(&storeTxnRead{s, tx, firstRev, rev, trace})
}

func (tr *storeTxnRead) rangeKeys(ctx context.Context, key, end []byte, curRev int64, ro RangeOptions) (*RangeResult, error) {
    ...
    revpairs, total := tr.s.kvindex.Revisions(key, end, rev, int(ro.Limit))
    ...
    for i, revpair := range revpairs[:len(kvs)] {
        ...
        revToBytes(revpair, revBytes)
        _, vs := tr.tx.UnsafeRange(schema.Key, revBytes, nil, 0)
        ...
    }
}

@LeoYang90
Author

LeoYang90 commented Sep 3, 2021

[benchmark screenshots]

@wilsonwang371
Contributor

Looks good. Let's make the code clearer and do another round of review with more people involved.

@wilsonwang371
Contributor

@gyuho Hi Gyuho, this patch looks good to me, but before we merge we definitely need to clean it up. Can you take a look and see if there are any potential issues?

@wilsonwang371
Contributor

@LeoYang90

@ptabor made some good suggestions today. For normal cases it is fine, but we need to go through the corner cases, such as when compaction is in progress. Let's have a discussion on this later.

@LeoYang90
Author

Thanks.
The performance test already runs with compaction enabled (ETCD_AUTO_COMPACTION_RETENTION=5m).
@wilsonwang371

@LeoYang90
Author

[screenshot]

@LeoYang90
Author

[screenshot]
@wilsonwang371 @ptabor

@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch 2 times, most recently from f24799c to ffb4631 on October 28, 2021 07:39
@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch 2 times, most recently from d8cd859 to 2e6a943 on December 1, 2021 02:24
@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch from ba3ca2f to c9ae54c on December 23, 2021 03:06
@wilsonwang371
Contributor

So far, things look fine to me. I think we may also get a few others to review this.

@gyuho @ptabor

@wilsonwang371
Contributor

Can you guys take a look at this? @gyuho @ptabor @serathius

From my side, I think this patch can further improve etcd performance, and we can see whether we can put it into the next release. Can we put this PR on our release plan?

Btw, I will double-check this logic again later.

Wilson

@serathius serathius added this to the etcd-v3.6 milestone Feb 9, 2022
Member

@spzala spzala left a comment

@LeoYang90 can you please fix conflicts? Thanks!

@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch from c9ae54c to 6b902eb on February 25, 2022 10:01
@LeoYang90
Author

LeoYang90 commented Feb 25, 2022

@LeoYang90 can you please fix conflicts? Thanks!

Already done. @spzala

// t.backend.readTx.buf is used for ConcurrentReadTxNoCopyMode
// and cannot simply be reset here;
// create a new one, the old one can still be used
t.backend.readTx.buf = &txReadBuffer{
Member

I see that the line rt.buf.reset() is removed from readTx.reset(); instead you create a new txReadBuffer here for safety. The logic looks good to me, but the comment above isn't clear enough, and other developers may get confused when they read the source code.

So I suggest rewording the above comment to something like this:

The buf isn't reset in readTx.reset() because it may still be used by other range
sessions at the moment. But it's safe to create a new txReadBuffer, because the old
one remains unchanged.

@@ -74,7 +76,15 @@ func (baseReadTx *baseReadTx) UnsafeForEach(bucket Bucket, visitor func(k, v []b
return baseReadTx.buf.ForEach(bucket, visitor)
}

func bytesToRev(bytes []byte) int64 {
Member

The name is a little confusing, because it has the same name as, but different logic from, other implementations, such as revision.go#L56.

So I suggest renaming it to something like getMainRevFromBytes.
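
For illustration, a sketch of the renamed helper under the assumption that it keeps the original's behavior of reading only the main revision, unlike mvcc's bytesToRev, which decodes both main and sub revisions:

package backend

import "encoding/binary"

// getMainRevFromBytes returns only the main revision encoded in a key from
// the "key" bucket: 8 bytes of big-endian main revision, an '_' separator,
// and 8 bytes of sub revision.
func getMainRevFromBytes(b []byte) int64 {
	return int64(binary.BigEndian.Uint64(b[0:8]))
}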

if len(k2) != 0 {
// If rate of buffRangeHitCounter/rangeTotalCounter is high, we should not use ConcurrentReadTxNoCopyMode
buffRangeHitCounter.Inc()
}
Member

I do not see any need to handle single-key ranges separately, and we should also try to get the k/v from the buf first.

@stale

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
