ConcurrentReadTxNoCopy share read tx buffer #13243

Open
wants to merge 1 commit into base: main

Conversation


@LeoYang90 LeoYang90 commented Jul 26, 2021

SharedBufReadTxMode makes reads faster because there is no copy of the read buffer.

But I think this can be taken further. For expensive read requests, we can use ConcurrentReadTx to copy the read buffer. For regular read requests, we can simply share the read buffer, with less lock time:

func (b *backend) ConcurrentReadTxNoCopy() ReadTx {
	b.readTx.RLock()
	defer b.readTx.RUnlock()

	b.readTx.txWg.Add(1)

	// share b.readTx.buf instead of copying it
	return &concurrentReadTx{
		baseReadTx: baseReadTx{
			buf:          b.readTx.buf,
			bufferCopied: false,
			mu:           b.readTx.mu,
			txMu:         b.readTx.txMu,
			tx:           b.readTx.tx,
			buckets:      b.readTx.buckets,
			txWg:         b.readTx.txWg,
		},
	}
}
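
For context, a minimal usage sketch of the proposed API; the noCopyBackend interface and rangeNoCopy helper below are illustrative assumptions, not part of the patch:

package mvcc

import (
	"go.etcd.io/etcd/server/v3/storage/backend"
	"go.etcd.io/etcd/server/v3/storage/schema"
)

// noCopyBackend is an assumed extension of backend.Backend that carries the
// method proposed in this PR; the upstream Backend interface does not declare it.
type noCopyBackend interface {
	backend.Backend
	ConcurrentReadTxNoCopy() backend.ReadTx
}

// rangeNoCopy sketches the intended lifecycle of a shared-buffer read tx:
// RLock is a no-op for concurrent read txs, and RUnlock marks the reader as
// done on txWg so the next buffered commit is not blocked by it.
func rangeNoCopy(b noCopyBackend, startRev, endRev []byte, limit int64) ([][]byte, [][]byte) {
	tx := b.ConcurrentReadTxNoCopy() // shares readTx.buf instead of copying it
	tx.RLock()
	defer tx.RUnlock()
	return tx.UnsafeRange(schema.Key, startRev, endRev, limit)
}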

func (baseReadTx *baseReadTx) UnsafeRangeWithLock(bucketType Bucket, key, endKey []byte, limit int64, b Backend) ([][]byte, [][]byte) {
	...
	var keys [][]byte
	var vals [][]byte
	baseReadTx.mu.RLock()
	lockHeld := true
	readBuff := b.ReadTx().UnsafeGetBuffer().(*txReadBuffer)
	if baseReadTx.buf != readBuff {
		// backend.readTx.buf has already been replaced with a new one by unsafeCommit;
		// batchTxBuffered will write back to the new readTx.buf instead of this baseReadTx.buf,
		// so baseReadTx.buf does not need to be protected by readTx.mu.
		lockHeld = false
		baseReadTx.mu.RUnlock()
	}
	if keyRev >= baseReadTx.buf.bufMinRev || endKeyRev >= baseReadTx.buf.bufMinRev {
		keys, vals = baseReadTx.buf.Range(bucketType, key, endKey, limit)
		if int64(len(keys)) == limit {
			if lockHeld {
				baseReadTx.mu.RUnlock()
			}
			return keys, vals
		}
	}
	if lockHeld {
		baseReadTx.mu.RUnlock()
	}

	...
}
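
The UnsafeGetBuffer call above is not part of the upstream ReadTx interface, so the patch has to add it; a minimal sketch of the assumed accessor (with readTx.buf changed to a *txReadBuffer, as this PR does) could look like:

// UnsafeGetBuffer (assumed accessor, would also need to be added to the
// ReadTx interface) exposes the backend's current shared read buffer so a
// no-copy reader can detect, by pointer comparison, that unsafeCommit has
// already swapped in a fresh buffer. The caller is expected to hold
// readTx.mu, as UnsafeRangeWithLock does above.
func (rt *readTx) UnsafeGetBuffer() interface{} {
	return rt.buf
}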

Sharing the read buffer means it cannot simply be reset during commit; we need to create a new one instead:

func (t *batchTxBuffered) unsafeCommit(stop bool) {
	if t.backend.readTx.tx != nil {
		...
		t.backend.readTx.reset()
		// t.backend.readTx.buf is used for ConcurrentReadTxNoCopyMode
		// and cannot simply be reset here;
		// create a new one, the old one can still be used
		t.backend.readTx.buf = &txReadBuffer{
			txBuffer:   txBuffer{make(map[BucketID]*bucketBuffer)},
			bufVersion: 0,
			bufMinRev:  0,
		}
	}

	t.batchTx.commit(stop)

	if !stop {
		t.backend.readTx.tx = t.backend.begin(false)
	}
}
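
For reference, a sketch of the buffer shape this change assumes: bufVersion already exists upstream, while bufMinRev is a field introduced by this PR (names as used in the snippets above):

// txReadBuffer accesses buffered updates (sketch of the assumed layout).
type txReadBuffer struct {
	txBuffer
	// bufVersion is used to check whether the buffer was modified recently
	// (already present upstream).
	bufVersion uint64
	// bufMinRev (added by this PR) records the smallest main revision still
	// held in the buffer; requests that only need older revisions can skip
	// the shared buffer, since those keys can only live in boltdb.
	bufMinRev int64
}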

For txn rangeKeys, we can choose between ConcurrentReadTx and ConcurrentReadTxNoCopy based on the limit:

func (tr *storeTxnRead) rangeKeys(ctx context.Context, key, end []byte, curRev int64, ro RangeOptions) (*RangeResult, error) {
    ...
    limit := int(ro.Limit)
	if limit <= 0 || limit > len(revpairs) {
		limit = len(revpairs)
	}

	if tr.readMode == ConcurrentReadTxNoCopyMode && tr.tx == nil { // ConcurrentReadTxNoCopyMode
		var tx backend.ReadTx
		if limit >= ExpensiveReadLimit { // expensive request
			tx = tr.s.b.ConcurrentReadTx()
		} else {
			tx = tr.s.b.ConcurrentReadTxNoCopy()
		}

		tx.RLock() // RLock is no-op. concurrentReadTx does not need to be locked after it is created.
		tr.tx = tx
	}
    ...
}
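
ExpensiveReadLimit is referenced above but not defined in the snippet; a sketch of how the threshold might be declared, with an illustrative value rather than one taken from the patch:

// ExpensiveReadLimit (assumed constant, illustrative value): ranges expected
// to return at least this many keys pay for a private buffer copy via
// ConcurrentReadTx, while smaller ranges share the buffer via
// ConcurrentReadTxNoCopy.
const ExpensiveReadLimit = 1000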

Benchmark Result:

./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4300000000 --key-size=128 --val-size=10240  --total=1500000 --rate=40000

release-3.5.0:

Summary:
  Total:  98.8474 secs.
  Slowest:  0.0901 secs.
  Fastest:  0.0003 secs.
  Average:  0.0131 secs.
  Stddev: 0.0079 secs.
  Requests/sec: 15174.9014

Response time histogram:
  0.0003 [1]  |
  0.0093 [515654] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0183 [810071] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0272 [80097]  |∎∎∎
  0.0362 [57508]  |∎∎
  0.0452 [23321]  |∎
  0.0542 [9591] |
  0.0632 [2473] |
  0.0722 [823]  |
  0.0811 [179]  |
  0.0901 [282]  |

Latency distribution:
  10% in 0.0059 secs.
  25% in 0.0063 secs.
  50% in 0.0136 secs.
  75% in 0.0144 secs.
  90% in 0.0207 secs.
  95% in 0.0296 secs.
  99% in 0.0444 secs.
  99.9% in 0.0627 secs.

This Version:

Summary:
  Total:  60.6494 secs.
  Slowest:  0.1726 secs.
  Fastest:  0.0004 secs.
  Average:  0.0079 secs.
  Stddev: 0.0069 secs.
  Requests/sec: 24732.2982

Response time histogram:
  0.0004 [1]  |
  0.0176 [1438834]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0348 [53124]  |∎
  0.0520 [4684] |
  0.0693 [355]  |
  0.0865 [200]  |
  0.1037 [202]  |
  0.1209 [995]  |
  0.1382 [633]  |
  0.1554 [652]  |
  0.1726 [320]  |

Latency distribution:
  10% in 0.0054 secs.
  25% in 0.0058 secs.
  50% in 0.0063 secs.
  75% in 0.0076 secs.
  90% in 0.0120 secs.
  95% in 0.0161 secs.
  99% in 0.0286 secs.
  99.9% in 0.1288 secs.

@serathius
Member

Thanks, really impressive results. My main worry about this change is code readability: creating yet another read mode makes it even harder to understand. It's my personal opinion, but I would want to see how we can simplify the transaction code before we complicate it with this. Would be happy to discuss what can be done here.

@LeoYang90
Author

Thanks, really impressive results. My main worry about this change is code readability: creating yet another read mode makes it even harder to understand. It's my personal opinion, but I would want to see how we can simplify the transaction code before we complicate it with this. Would be happy to discuss what can be done here.

Thanks, I agree with you about readability. But I think the buffer copy is too heavy for the range interface, because it is unnecessary much of the time (count-only ranges, ranges over keys that do not exist, keys not in the buffer).
It also makes put operations wait a long time for the readTx lock at the end of a txn.

@wilsonwang371
Contributor

Did you try a mixed workload and measure the performance? We need to make sure it works for different ratios of r/w operations.

@wilsonwang371
Contributor

Btw, I usually try it on machines with NVMe SSDs because that is a common configuration for a large cluster.

@LeoYang90
Author

LeoYang90 commented Aug 26, 2021

$cat /proc/cpuinfo | grep "physical id" | uniq | wc -l
4
$cat /proc/cpuinfo | grep "cpu cores" | uniq
cpu cores	: 24
$cat /proc/cpuinfo | grep 'model name' |uniq
model name	: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
$cat /proc/meminfo | grep MemTotal
MemTotal:       527541692 kB

$sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : S2UDNX0J500042
mn        : SAMSUNG MZQLW3T8HMLP-00003
fr        : CXV8701Q

[benchmark screenshots: master-53d234f1f vs. this version]
@wilsonwang371 @serathius

@wilsonwang371
Contributor

This looks good, but I think we may need to reorganize the patch to make it clearer. Maybe we can have a discussion on this and write down a change proposal for this patch.

@wilsonwang371
Contributor

wilsonwang371 commented Sep 1, 2021

From what I understand of the patch, the logic seems to be:

When we create the txReadBuffer, we don't copy the content. Instead, we hold a revision and use it when searching the shared txReadBuffer. Since the revision is saved when the txn is created, we can ignore newer revisions while going through the txReadBuffer, so the txn will not see the new changes.

I think this part of the logic is valid, but I guess Yang needs to make some changes to the patch to make it clearer. We also need to double-check that the logic is correct.

FYI @gyuho @ptabor @lilic
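
A standalone sketch of the filtering idea described above, with assumed names; the patch itself relies on the index only handing out revisions at or below the txn's revision, as the reply below shows:

package sketch

// bufEntry is an assumed, simplified view of one buffered key/value pair.
type bufEntry struct {
	mainRev int64
	value   []byte
}

// rangeSharedBuf keeps only the entries visible to a txn created at txnRev:
// anything written back with a newer main revision is skipped, so later
// commits stay invisible to a reader that shares the buffer.
func rangeSharedBuf(entries []bufEntry, txnRev int64) [][]byte {
	var vals [][]byte
	for _, e := range entries {
		if e.mainRev > txnRev {
			continue
		}
		vals = append(vals, e.value)
	}
	return vals
}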

@LeoYang90
Author

From what I understand of the patch, the logic seems to be:

When we create the txReadBuffer, we don't copy the content. Instead, we hold a revision and use it when searching the shared txReadBuffer. Since the revision is saved when the txn is created, we can ignore newer revisions while going through the txReadBuffer, so the txn will not see the new changes.

I think this part of the logic is valid, but I guess Yang needs to make some changes to the patch to make it clearer. We also need to double-check that the logic is correct.

FYI @gyuho @ptabor @lilic

Yes, your understanding of the logic is right. New revisions can be ignored because the index tree will never return a revision newer than the one captured when storeTxnRead was created:

func (s *store) Read(mode ReadTxMode, trace *traceutil.Trace) TxnRead {
	s.revMu.RLock()
	...

	firstRev, rev := s.compactMainRev, s.currentRev
	return newMetricsTxnRead(&storeTxnRead{s, tx, firstRev, rev, trace})
}

func (tr *storeTxnRead) rangeKeys(ctx context.Context, key, end []byte, curRev int64, ro RangeOptions) (*RangeResult, error) {
    ...
    revpairs, total := tr.s.kvindex.Revisions(key, end, rev, int(ro.Limit))
    ...
    for i, revpair := range revpairs[:len(kvs)] {
        ...
        revToBytes(revpair, revBytes)
        _, vs := tr.tx.UnsafeRange(schema.Key, revBytes, nil, 0)
        ...
    }
}

@LeoYang90
Author

LeoYang90 commented Sep 3, 2021

[benchmark screenshots]

@wilsonwang371
Contributor

Looks good. Let's make the code clearer and do another round of review with more people involved.

@wilsonwang371
Contributor

@gyuho Hi Gyuho, this patch looks good to me, but before we merge we definitely need to clean it up. Can you take a look and see if there are any potential issues?

@wilsonwang371
Contributor

@LeoYang90

@ptabor made some good suggestions today. For normal cases it is fine, but we need to go through the corner cases, such as when compaction is in progress. Let's have a discussion on this later.

@LeoYang90
Author

Thanks.
The performance test already runs with compaction enabled (ETCD_AUTO_COMPACTION_RETENTION=5m).
@wilsonwang371

@LeoYang90
Author

[screenshot]

@LeoYang90
Author

[screenshot]
@wilsonwang371 @ptabor

@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch 2 times, most recently from f24799c to ffb4631 on October 28, 2021 07:39
@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch 2 times, most recently from d8cd859 to 2e6a943 on December 1, 2021 02:24
@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch from ba3ca2f to c9ae54c on December 23, 2021 03:06
@wilsonwang371
Contributor

So far, things look fine to me. I think we may also get a few others to review this.

@gyuho @ptabor

@wilsonwang371
Contributor

Can you guys take a look at this? @gyuho @ptabor @serathius

From my side, I think this patch can further improve etcd performance, and we can see whether we can put it into the next release. Can we put this PR on our release plan?

Btw, I will double-check this logic again later.

Wilson

@serathius serathius added this to the etcd-v3.6 milestone Feb 9, 2022
Member

@spzala spzala left a comment

@LeoYang90 can you please fix conflicts? Thanks!

@LeoYang90 LeoYang90 force-pushed the txbuffer_with_revision branch from c9ae54c to 6b902eb on February 25, 2022 10:01
@LeoYang90
Author

LeoYang90 commented Feb 25, 2022

@LeoYang90 can you please fix conflicts? Thanks!

Already done. @spzala

// t.backend.readTx.buf is used for ConcurrentReadTxNoCopyMode
// and cannot simply be reset here;
// create a new one, the old one can still be used
t.backend.readTx.buf = &txReadBuffer{
Member

I see that the line rt.buf.reset() is removed from readTx.reset(); instead you create a new txReadBuffer here for safety. The logic looks good to me, but the comment above isn't clear enough, and other developers may get confused when they read the source code.

So I suggest rewording the above comment to something like this:

The buf isn't reset in readTx.reset() because it may still be used by other range
sessions at the moment. But it's safe to create a new txReadBuffer, because the old
one remains unchanged.

@@ -74,7 +76,15 @@ func (baseReadTx *baseReadTx) UnsafeForEach(bucket Bucket, visitor func(k, v []b
return baseReadTx.buf.ForEach(bucket, visitor)
}

func bytesToRev(bytes []byte) int64 {
Member

The name is a little confusing, because it has the same name as, but different logic from, other implementations, such as revision.go#L56.

So I suggest renaming it to something like getMainRevFromBytes.
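
For illustration, a sketch of the renamed helper under the assumption that it keeps the original's behavior of reading only the main revision, unlike mvcc's bytesToRev, which decodes both main and sub revisions:

package backend

import "encoding/binary"

// getMainRevFromBytes returns only the main revision encoded in a key from
// the "key" bucket: 8 bytes of big-endian main revision, an '_' separator,
// and 8 bytes of sub revision.
func getMainRevFromBytes(b []byte) int64 {
	return int64(binary.BigEndian.Uint64(b[0:8]))
}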

if len(k2) != 0 {
// If rate of buffRangeHitCounter/rangeTotalCounter is high, we should not use ConcurrentReadTxNoCopyMode
buffRangeHitCounter.Inc()
}
Member

I do not see any need to handle single-key ranges separately, and we should also try to get the k/v from the buf first.

@stale

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
