Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lessor: fix le.itemMap leak after lease id revoked #16035

Closed
wants to merge 0 commits into from

Conversation

qsyqian
Copy link
Contributor

@qsyqian qsyqian commented Jun 8, 2023

When lease is revoked, we should delete item in le.itemMap.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

server/lease/lessor.go Outdated Show resolved Hide resolved
@tjungblu
Copy link
Contributor

tjungblu commented Jun 9, 2023

I've been trying to repro this a bit further with a testcase.

It seems when you do a grant -> attach -> revoke, the keys attached are not freed from the itemMap. Is this the same leakage you've observed?

just for reference on further discussion:

func TestLessorGrantAttachRevokeDetach(t *testing.T) {
	lg := zap.NewNop()
	dir, be := NewTestBackend(t)
	defer os.RemoveAll(dir)
	defer be.Close()

	le := newLessor(lg, be, clusterLatest(), LessorConfig{MinLeaseTTL: minLeaseTTL})
	defer le.Stop()
	le.SetRangeDeleter(func() TxnDelete { return newFakeDeleter(be) })

	// grant a lease with long term (100 seconds) to
	// avoid early termination during the test.
	l, err := le.Grant(1, 100)
	if err != nil {
		t.Fatalf("could not grant lease for 100s ttl (%v)", err)
	}

	items := []LeaseItem{
		{"foo"},
		{"bar"},
	}

	if err := le.Attach(l.ID, items); err != nil {
		t.Fatalf("failed to attach items to the lease: %v", err)
	}

	if err := le.Revoke(l.ID); err != nil {
		t.Fatalf("failed to revoke lease: %v", err)
	}

	err = le.Detach(l.ID, items[0:1])
	assert.Equal(t, ErrLeaseNotFound, err)
	assert.Equal(t, 0, len(le.leaseMap))
	assert.Equal(t, 2, len(le.itemMap)) <-- leak
}

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 9, 2023

@tjungblu almost yes, when lease is revoked, the keys attached with it will delete forever, and the lessor.ItemMap will never delete them.
But i have little confused because i am not similar to the ut code. The ut you provide has something different with real env.
In real env, the RangeDeleter.DeleteRange will call le.Detach. so when revoked, all the keyItem will be called with le.Detach. No need to call err = le.Detach(l.ID, items[0:1]).
In ut, it seems that should call le.Detach after le.Revoke.

Use my code, it will PASS just like:
err = le.Detach(l.ID, items)
assert.Equal(t, ErrLeaseNotFound, err)
assert.Equal(t, 0, len(le.leaseMap))
assert.Equal(t, 0, len(le.itemMap))

@tjungblu
Copy link
Contributor

tjungblu commented Jun 9, 2023

I assume the rangeDeleter uses this code below?

https://github.com/etcd-io/etcd/blob/main/server/storage/mvcc/kvstore_txn.go#L304-L315

Looking from the other angle of the revoke, the lease in the rangeDeleter won't find the lease anymore because it was already deleted earlier?

https://github.com/etcd-io/etcd/blob/main/server/lease/lessor.go#L330-L344

edit: looks like this ordering was changed in #14080

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 9, 2023

Looking from the other angle of the revoke, the lease in the rangeDeleter won't find the lease anymore because it was already deleted earlier?

it seems not. in Revoke it delete leaseMap, but in the rangeDeleter, it call GetLease with LeaseItem. In GetLease, it find lease from itemMap.

@ahrtr
Copy link
Member

ahrtr commented Jun 12, 2023

@qsyqian
Good catch! It will cause memory leak. When a lease is revoked, the lease can be removed from leaseMap, and keys attached to the lease can also be removed. But the leaseItem in itemSet and itemMap will not be cleared due to ErrLeaseNotFound.

@qsyqian @tjungblu Please try to add a test case to intentionally reproduce the issue; and the apply the patch in this PR to fix the issue.

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 12, 2023

@ahrtr thx for review. @tjungblu i will add the ut code you provide above if you don't mind. :)

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 12, 2023

@ahrtr @tjungblu updated, PTAL.

func TestLessorGrantAttachRevokeDetach(t *testing.T) {
lg := zap.NewNop()
dir, be := NewTestBackend(t)
defer os.RemoveAll(dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is unnecessary, because the data directory is created via t.TempDir(), and it will be cleanuped automatically when the test finishes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ahrtr
Copy link
Member

ahrtr commented Jun 12, 2023

This fix needs to be backported to 3.4 and 3.5.

Could you raise an issue to track this bug fix? We keep the issue open until all backport and changelog are done. We see many times that bug fixes were forgotten to be backported.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you @qsyqian

server/lease/lessor.go Outdated Show resolved Hide resolved
@chaochn47
Copy link
Member

chaochn47 commented Jun 15, 2023

FWIW, if you run the e2e test, it should reproduce the issue with the following error message. It does not fail the test and we could possibly use log assertion to guarantee detach in RangeDeleter succeeds with no error.

cd tests/common && EXPECT_DEBUG=true go test -v -run "^TestLeaseGrantTimeToLiveExpired/NoTLS$" --tags e2e

"msg":"failed to detach old lease from a key","error":"lease not found"

Support for assigning time to live to a key during a write - used for master leases and limiting growth of Kubernetes Events

Trying to understand deeply about this issue in k8s etcd usage, why I don't see those lease not found error messages in production etcd clusters log scanning by past 1 day? == (We are using etcd 3.5.7). I expect event TTL expired will trigger this issue.

Okay, release-3.5 lessor and release-3.4 is confirmed not impacted. Only main branch is impacted introduced by #14080.

I can remove the backport tag if we are on the same page :p

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 16, 2023

FWIW, if you run the e2e test, it should reproduce the issue with the following error message. It does not fail the test and we could possibly use log assertion to guarantee detach in RangeDeleter succeeds with no error.

cd tests/common && EXPECT_DEBUG=true go test -v -run "^TestLeaseGrantTimeToLiveExpired/NoTLS$" --tags e2e

"msg":"failed to detach old lease from a key","error":"lease not found"

Support for assigning time to live to a key during a write - used for master leases and limiting growth of Kubernetes Events

Trying to understand deeply about this issue in k8s etcd usage, why I don't see those lease not found error messages in production etcd clusters log scanning by past 1 day? == (We are using etcd 3.5.7). I expect event TTL expired will trigger this issue.

Okay, release-3.5 lessor and release-3.4 is confirmed not impacted. Only main branch is impacted introduced by #14080.

I can remove the backport tag if we are on the same page :p

yes, I think you are right. In Revoke(), if delete(le.leaseMap, l.ID) is behind txn.DeleteRange([]byte(key), nil), i will not has memory leak. I think we can remove the backport tag.

@qsyqian
Copy link
Contributor Author

qsyqian commented Jun 16, 2023

@serathius added two ut for kv store. Also add one Items() function for Lessor interface. because it seems there is no way to get ltemMap out of package lease. PTAL

@@ -572,6 +572,9 @@ func (le *lessor) Detach(id LeaseID, items []LeaseItem) error {

l := le.leaseMap[id]
if l == nil {
for _, it := range items {
delete(le.itemMap, it)
}
return ErrLeaseNotFound
Copy link
Member

@chaochn47 chaochn47 Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right fix. lessor.Detach should not throw error ErrLeaseNotFound. I view the proposed change as a band-aid of the issue.

I recommend us to rethink #14080 that introduces this memory leak. I have proposed my idea in the introducing PR comment. PTAL, thanks!

Copy link
Contributor Author

@qsyqian qsyqian Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for review. With your suggested change if the lease which attched a lot keys and spend a lot of time for

for _, key := range keys {
		txn.DeleteRange([]byte(key), nil)
	}

And in this time, le.Lookup() or le.Leases() will return the lease which just revoked.

I don't know whether it's right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in this time, le.Lookup() or le.Leases() will return the lease which just revoked.

Yeah, that's a valid concern..

Maybe acquiring le.Lock() and defer the Unlock and then txn.DeleteRange() is the right choice. It could be slow when the lease is attached a lot keys but correct.

Copy link
Member

@ahrtr ahrtr Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chaochn47 . Indeed #14080 introduced the leak. The lease logic is tightly coupled with the mvcc (key space operation), it's really spaghetti code. When mvcc deletes a key, it will automatically detach the lease attached to the key. We should execute delete(le.leaseMap, id) after removing all the keys.

Probably we should think about how to decouple them, and improve the readability and maintainability. This can be discussed separately.

Will send a PR to rollback the change.

Actually the deadlock fixed in #14080 doesn't likely happen because all lease operations (grant, revoke, checkpoint) are executed sequentially in the applying workflow. If anyone still thinks it's possible to run into a deadlock, please create a test to reproduce it firstly.

@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 17, 2023
@ahrtr ahrtr added stage/tracked and removed stale labels Sep 18, 2023
@serathius
Copy link
Member

ping @qsyqian

@qsyqian qsyqian force-pushed the main branch 4 times, most recently from 2d1aaa9 to d619fd9 Compare November 28, 2023 06:49
@qsyqian
Copy link
Contributor Author

qsyqian commented Nov 28, 2023

@serathius @chaochn47 fix it in another way. add step to delete le.itemMap in le.Revoke() after txn.End().
please take a look.

@qsyqian qsyqian force-pushed the main branch 4 times, most recently from 9aefb51 to 563662f Compare November 28, 2023 08:06
@serathius
Copy link
Member

cc @ahrtr @chaochn47 for re-review.

@ahrtr
Copy link
Member

ahrtr commented Dec 5, 2023

@qsyqian The le.itemMap leak issue has been resolved in #17058. Please rebase this PR, and remove the change in lessor.go. The test cases should still pass. Thanks.

@qsyqian qsyqian force-pushed the main branch 2 times, most recently from e5e6d19 to 7453f3f Compare December 6, 2023 01:12
@qsyqian
Copy link
Contributor Author

qsyqian commented Dec 6, 2023

@ahrtr updated, remove changes in lessor.go and also change the commit message.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ahrtr, qsyqian

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants