Skip to content

Conversation

@mattisonchao
Copy link
Member

@mattisonchao mattisonchao commented Apr 20, 2022

Motivation

In #4455 and #10740, The OpAddEntry try to using opAddEntry.addOpCount to ensure the opAddEntry we got is what we want.(avoid recycle cause race condition)

It seems exists a case causing another race condition. consider the case as follow:

Precondition

  • We have an OpAddEntry that name is A.
  • We got two threads: a timeout check thread and another is a write thread that wants to complete and recycle this OpAddEntry.
  • Relative code is as follow(timeout check logic):

private void checkAddTimeout() {
long timeoutSec = config.getAddEntryTimeoutSeconds();
if (timeoutSec < 1) {
return;
}
OpAddEntry opAddEntry = pendingAddEntries.peek();
if (opAddEntry != null) {
boolean isTimedOut = opAddEntry.lastInitTime != -1
&& TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - opAddEntry.lastInitTime) >= timeoutSec;
if (isTimedOut) {
log.error("Failed to add entry for ledger {} in time-out {} sec",
(opAddEntry.ledger != null ? opAddEntry.ledger.getId() : -1), timeoutSec);
opAddEntry.handleAddTimeoutFailure(opAddEntry.ledger, opAddEntry.addOpCount);
}
}
}

Race condition process

  1. When the timeout checker thread gets OpAddEntry A and continues to run until line 3835.
  2. At the same time, the writing thread also gets OpAddEntryA, completes and recycles.
  3. After step 2, OpAddEntryA is taken out of the recycling pool by other threads as OpAddEntryB. But the timeout thread has passed the timeout check(line 3834), and addOpCount is also obtained from OpAddEntry B, which will cause the compareAndSet check to pass.

Affect

  • The timeout checker will make a new OpAddEntry complete timeout.

Modifications

  • Record opAddCount before timeout judgment.

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • no-need-doc

@lhotari
Copy link
Member

lhotari commented Apr 20, 2022

Good catch!

btw. previous revisit to the solution was in #10740

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and thank you for the great analysis @mattisonchao!

@lhotari lhotari merged commit b083e9a into apache:master Apr 21, 2022
lhotari pushed a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
lhotari pushed a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
lhotari pushed a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
@nicoloboschi
Copy link
Contributor

@lhotari this pull must be ported to branch-2.10 also, thanks

codelipenghui pushed a commit that referenced this pull request Apr 28, 2022
@codelipenghui codelipenghui added this to the 2.11.0 milestone Apr 28, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Apr 28, 2022
codelipenghui pushed a commit that referenced this pull request Apr 28, 2022
@codelipenghui codelipenghui added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Apr 28, 2022
codelipenghui pushed a commit that referenced this pull request Apr 29, 2022
@codelipenghui codelipenghui added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants