-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace read/write lock in JarResource to avoid virtual threads pinning #40942
Conversation
Thanks for your pull request! The title of your pull request does not follow our editorial rules. Could you have a look?
This message is automatically generated by a bot. |
} | ||
Thread.yield(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still blocks, it just doesn't look like blocking (the thread can loop for an arbitrarily long time in this case, keeping the thread pinned the whole time because the pinning native call
is still on the stack). I suspect that the only way around this is to not publish the class loader object at all, until the zip file is actually opened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my doubt too. I can simply remove this statement (and the other yield one on the close), but I'm just afraid that this could make the potential monopolization problem even worse. This implementation however should prevent the deadlock problem reported in the original issue (and that in all honesty I've never been able to reproduce anyway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah AFAICT that is possible. Or it might turn the deadlock into a livelock, but that seems less likely I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true but the livelock could be especially bad with virtual threads since they are not preemptive. Unfortunately the only way to make them preemptive at that point is performing a blocking operation which is exactly the thing that we want to avoid. In a way or another this is probably still a sub-optimal solution, but I think better than the existing one. I'm not strongly opinionated in keeping or removing the yield
s, please advice on what I should do about them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to keep them. Better would be to park/unpark but I don't see how that would be feasible, since the one part of the problem that this solves is avoiding a dependency on the owner thread to release waiter thread(s).
This comment has been minimized.
This comment has been minimized.
I tried to format the code using that maven command, but when I run it, it doesn't change anything in the java source (or in any other file). Am I missing something? |
I would make a separate RefCntResourceFile class which contains a volatile field pointing to the file to create and an atomic ref count to handle ownership and eventually the close (the release which put the ref can't back to zero got the rights to close the file) In the JarResource I would have an The algorithm can be made more cooperative and progressive by allowing contended operations to "help" eg concurrent opens can compete to populate the volatile field in the currently set RefCntResourceFile, avoiding to spin wait, while allocating more memory. |
https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/source.20code.20not.20formatted seems to be relevant |
} | ||
zipFileReadersCounter.incrementAndGet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here a close can happen (and a subsequent open too!!!) which means that we can have the zip file in 2 possible wrong states:
- already closed
- a different one(!)
I suggest to move the ref cnt into a atomic ref < ref cnt obj > which keep the 2 states dependent and tied together
I reworked this pull request following @franz1981's suggestions. Please review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than a minor comment.
...nt-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarFileReference.java
Outdated
Show resolved
Hide resolved
|
||
private final AtomicInteger readersCounter = new AtomicInteger(0); | ||
|
||
private volatile boolean releasing = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this: you need to use compareAndSet loops on both retain/release to make sure that you cannot "resurrect" a released resource
eg
- ref count start at 1 i.e. alive
- a release put it to 0 (and could have rights to close the file)
- an acquire just find it at 0, returning
false
because the resource is already released
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using this boolean
for a twofold reason:
- checking that a release hasn't been already requested when
readLockAcquire
is called because in that case I cannot return a reference to thejarFile
that could have been closed in the meanwhile. - checking if a release has been already requested when
readLockRelease
is called: in this case if the readers counter reaches to 0 I'm actually allowed to close thejarFile
.
I don't think that I can reuse the readersCounter
for this same purpose, for instance setting it to 0 when a release is requested, because doing so I would lose track of how many readers are still currently accessing the jarFile
. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still introducing a level of freedom which can break things: we should always relies on a single source of truth related the state of the file ownership; and the math which bring 1 -> 0 as a final state is very easy to enforce via compareAndSet loops.
The boolean hint could be re-implemented with an isReleased() { return rf.cnt() == 0; }
still granting a nice external API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The boolean hint could be re-implemented with an
isReleased() { return rf.cnt() == 0; }
still granting a nice external API.
The 2 things are unrelated though: the fact that the counter is zero (there are no current readers) doesn't imply in any way that somebody call a close on the JarResource
thus actually asking a release of the jarFile
.
I understand that ideally it would be much better to have the state of the JarFileReference
, but still I don't see how to achieve it in this case. Moreover I tried to (mentally) double check all the possible interactions and situations and I believe that it shouldn't be any gap. Please review it if you have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the happens before relation twice I cannot see evident problems (so this reduce my concerns), but i personally prefer to reduce the level of freedoms to make more easy to reason of things, but is personal, really.
// and then it is necessary to avoid blocking on it. | ||
private final JarFile jarFile; | ||
|
||
private final AtomicInteger readersCounter = new AtomicInteger(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses an AtomicIntegerFieldUpdater with a volatile field to reduce the footprint of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This obviously makes sense, but I don't think it will make much of a difference since you are not likely to have many more than 100 jar files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @geoand that this will introduce further complexity (and also the need of accessing the field through reflection) for a very limited memory gain, but I'm open to change this if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same trick also applies here, that you can use AtomicInteger()
to make it slightly faster
// To enable virtual threads compatibility and avoid pinning it is not possible to use an explicit read/write lock | ||
// because the jarFile access may happen inside a native call (for example triggered by the RunnerClassLoader) | ||
// and then it is necessary to avoid blocking on it. | ||
private final JarFile jarFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be better a volatile field with a AtomicReferenceFiledUpdater to allow concurrent threads to "help" (or not - is up to the algorithm): it means that a reader can successfully create and set and new JarFileReference yet empty, because in the process of being read from the disk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the same logic that you're suggesting has been already implement in the JarResource::jarFileRef
method here. I think that it is more readable/maintainable if this JarFileReference
keeps a stable and final reference to the JarFile
it is referring to.
//To close the jar, the exclusive lock must be owned, and reference will be set to null before releasing it. | ||
//Likewise, opening a JarFile requires the exclusive lock. | ||
private volatile JarFile zipFile; | ||
private final AtomicReference<JarFileReference> jarFileReference = new AtomicReference<>(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@franz1981 not that it makes any difference here, but it isn't a tiny bit faster to do new AtomicReference<>()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks for suggestion.
@@ -69,13 +62,22 @@ public void init() { | |||
|
|||
@Override | |||
public byte[] getResourceData(String resource) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can become a method on JarFileReference, actually, hiding the ownershipt logic within it.
It allows reading a reasouce (with its own retain/release pairs, to be able to close the file, if the release grant rights to do it...)
In order to achieve it we need to pass the AtomicReference to this method so it can "optmistiacally" attempts a compareAndSet(this, null) to it.
If it fails is not a problem, because it means another thread has succeeded to create a new one OR nulling it out
@franz1981 I tried to follow all your suggestions and I believe I reached a far better level of encapsulation in the |
Also, when all else is done, we'll need to squash the commits |
readersCounter.incrementAndGet(); | ||
if (releasing) { | ||
if (readersCounter.decrementAndGet() == 0) { | ||
close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This the part which can break:
- T1 on release: set
release = true
and arrive tillreadersCounter.get() == 0
check, not yet happening - T2 has been able to read
releasing == true
and has decremented the read counter till 0 - T1 now check
readersCounter.get() == 0
as successfull too - T2 and T1 both compete to perform the
close
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true and I also found this possibility, anyway this shouldn't be a problem: at worst 2 threads will both call close()
which however is idempotent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK, the javadoc doesn't mention the concurrent behaviour of such method and I see that the zip file close will issue a close to https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/zip/ZipFile.java#L457 which seems thread-safe but only because the implementation looks like that, undocumented.
Said that I believe the state machine could be simplified here; I have proposed a possible way in a slack chat, reusing the same logic Netty uses for its reference counted objects; let's see if it helps to simplify the logic, removing racing states.
@dmlloyd too
Furthermore, I am a bit surprised to find synchronized blocks there...and it is likely there are others we are using that we haven't yet be warned as pinning ones, but will likely do :"(
e.g. https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/zip/ZipFile.java#L348 and https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/zip/ZipFile.java#L385
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling close multiple times happens all the time at the HTTP layer for example and it's not a problem because as Mario mentions, it's idempotent.
All that to say that I don't believe this is an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but the "problem" i see is that is not a degree of freedom which give use any practical advantage and still force to have 2 separate variables which have happens-before relationship between them, making the logic more complex to maintain and document (but as said, that's my brain...so it's personal).
I still suggest to reduce further the mutability surface and states in order to make it easier to maintain..
* accessible. | ||
*/ | ||
private JarFile readLockAcquire() { | ||
readersCounter.incrementAndGet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I'm preventively increasing the counter here is that this ensure that the counter is at least 1 before checking the releasing
variable thus preventing the release()
method to actually close the jarFile
. Then if (even in the meanwhile) somebody called the release, I check it here: in this case I cannot return the jarFile
but I decrement again the counter that I increased just before, thus preventing it to go overboard, and if it reaches to 0 I also close the jarFile
.
...pendent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarResource.java
Outdated
Show resolved
Hide resolved
@franz1981 I think I've finally understood what you meant (I'm getting old) and implemented the reference counter accordingly. Please review again. |
Given the amount of work and discussions related to this issue, could we have a simple revert PR targeting 3.11? Whatever gets into 3.12 wouldn't be backportable. Thanks. |
} | ||
if (count == 0) { | ||
try { | ||
jarFile.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pretend the compare and set before closing this and don't only if the previous value was 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use compare and set because 2 close can make the counter to go less then 0 breaking the acquire which return null
We want the close to still keep the counter correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I change this, but please give another look, since I'm not 100% sure and feel free to ping me in case you have any question or doubt.
+1 |
Do you mean we will have a 3.11.1 patch release where we simply revert the use of |
Bumping this so it does not get lost :) |
For what I have experimented these days, I don't think that a full revert will be necessary, but it would be sufficient to just remove the statement calling
That's ok, I didn't forget it, I just wanted to finish discussing all the details with @franz1981 before squashing :) |
In the comment you pasted I can see a reference to "main" without a commit ID and I don't know if it is related the parallel or not version... I can see that you modified the order of eviction of jars which give some additional benefit on this pr compared to master which doesn't contain it - and this could be a separate PR/issue per se really. In short I would like to see how this version compare against:
To understand the isolated impacts of the changes here |
//To close the jar, the exclusive lock must be owned, and reference will be set to null before releasing it. | ||
//Likewise, opening a JarFile requires the exclusive lock. | ||
private volatile JarFile zipFile; | ||
final AtomicReference<JarFileReference> jarFileReference = new AtomicReference<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a single state here. The completable future is good for both cases because allow to be created already completed or can be shared completing it only if we win the race to share it, making easier to follow what's going on.
In the case of virtual threads, it means that virtual threads could spin while waiting for it (because it will be eventually populated although it can later fail to acquire it) - or, in the case of platform threads, just wait.
The waiting part requires some attention for interruption, I think, but is a separate problem; i.e. if who is supposed to populate fail because of I/O or interruption (or OOM), we shouldn't let the others to wait for nothing, unblocking them, by completing it with the thrown exception or some other mechanism.
@franz1981 I implemented your latest suggestion on unifying all the concurrent state in one single Regarding the performance I reran the StartStop fullmicroprofile benchmark against main obtaining these results:
while running it on this pull request I got:
So it seems that this pull request doesn't introduce any relevant penalty in terms of performance and memory occupation, while implementing a non-blocking parallelism that is compatible with virtual threads. Regarding them I also ran the benchmark forcing the isVirtualThread() method to always return
where the (expected) slightly higher memory occupation is very likely due to the fact that in this case the algorithm is allowed to eventually load the same jar in parallel from multiple threads. |
writeLock.unlock(); | ||
var ref = jarFileReference.get(); | ||
if (ref != null) { | ||
ref.join().close(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory join shouldn't wait here: you could use ref.getNow() and adding an assert that you always expect some value here, because the file should be already populated due to the ref count logic (which should close only if there are no acquisition pending)
// Happy path: the jar reference already exists and it's ready to be used | ||
final var localJarFileRefFuture = jarResource.jarFileReference.get(); | ||
if (localJarFileRefFuture != null) { | ||
if (localJarFileRefFuture.join().acquire()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to await till the jar is loaded - or we assume it to always be populated here?
If is it the case we should use getNow
and put an assert that we assume is going to always be true
} | ||
|
||
// Virtual threads needs to load the jarfile synchronously to avoid blocking. This means that eventually | ||
// multiple threads could load the same jarfile in parallel and this duplication has to be reconciled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK for now; hopefully this is all temporary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
private static CompletableFuture<JarFileReference> getJarFileReferenceFuture(JarResource jarResource) { | ||
CompletableFuture future = new CompletableFuture(); | ||
if (jarResource.jarFileReference.compareAndSet(null, future)) { | ||
future.completeAsync(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not "complete" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to use the "common FJ pool" I don't think is what we want here, especially because this one is lazily allocated and usually people access to it via parallel
on the Stream API, which is not that common.
The fact that it didn't popup in the RSS data is puzzling, because FJ thread should impact RSS due to the dedicated OS space for the stack size (which is 512K per thread by default? I don't remember by heart...)
@franz1981 I fixed all your suggestions with my latest commit, thanks for your feedback, please give another look. I also added a few comments to further clarify the code and ran one more time the usual benchmark without noticing any relevant difference. |
This comment has been minimized.
This comment has been minimized.
@franz1981 @geoand @dmlloyd @Sanne As agreed I rebased this pull request against the main branch, reenabled the parallel capabilities of the |
Any news on this? |
I'll leave it to @franz1981 since he is the one that had the latest comments |
// else, we drop one element from the cache, | ||
// and inserting the latest resource on the tail: | ||
toEvict = currentlyBufferedResources[currentlyBufferedResources.length - 1]; | ||
currentlyBufferedResources[currentlyBufferedResources.length - 1] = resource; | ||
for (int j = currentlyBufferedResources.length - 1; j > 0; j--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code path is the same as the "bubble up" case, just with a different starting point: you could create a method out of it to make it easier to remember what it does - or changing it in a single place.
CompletableFuture future = new CompletableFuture(); | ||
if (jarResource.jarFileReference.compareAndSet(null, future)) { | ||
try { | ||
future.complete(new JarFileReference(JarFiles.create(jarResource.jarPath.toFile()), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not creating this as already acquired? we want to make sure that if the thread which win the race to move the state from null -> future has its own chance to read it or we risk that another thread will acquire it as soon as it is published, release it and making it to close.
CompletableFuture<JarFileReference> newJarFileRef; | ||
do { | ||
final var jarFileRefFuture = getJarFileReferenceFuture(jarResource); | ||
newJarFileRef = jarFileRefFuture == null ? jarResource.jarFileReference.get() : jarFileRefFuture; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newJarFileRef
can be null if jarResource.jarFileReference.get
return null
?
This can happen if there is a racy cleanup of the current held future to a closed jar file
I've added a couple of comment where one is related a potential NPE (apparent to my monday brain) and another is to make it more progressive (who perform some work has to be rewarded by using its own resource at least once) |
@franz1981 I reworked this pull request following your latest suggestions (thanks again for this), please review it again when you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done @mariofusco !!
This comment has been minimized.
This comment has been minimized.
Status for workflow
|
Status | Name | Step | Failures | Logs | Raw logs | Build scan |
---|---|---|---|---|---|---|
✔️ | JVM Tests - JDK 17 | Logs | Raw logs | 🚧 | ||
✔️ | JVM Tests - JDK 21 | Logs | Raw logs | 🚧 | ||
✔️ | JVM Tests - JDK 17 Windows | Failures | Logs | Raw logs | 🚧 |
Full information is available in the Build summary check run.
Failures
⚙️ JVM Tests - JDK 17 Windows #
- Failing: extensions/amazon-lambda/deployment
! Skipped: extensions/amazon-lambda-http/deployment extensions/amazon-lambda-rest/deployment integration-tests/amazon-lambda and 9 more
📦 extensions/amazon-lambda/deployment
✖ io.quarkus.amazon.lambda.deployment.testing.InputCollectionOutputCollectionLambdaTest.requestHandler_InputCollectionInputPerson_OutputCollectionOutputPerson
line 40
- History - More details - Source on GitHub
java.lang.AssertionError:
1 expectation failed.
JSON path doesn't match.
Expected: a collection containing map containing ["outputname"->"Chris"]
Actual: <[{outputname=Fred}]>
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
Flaky tests - Develocity
⚙️ JVM Tests - JDK 17
📦 extensions/smallrye-reactive-messaging/deployment
✖ io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector
- History
Expecting actual: ["-4","-5","-6","-7","-8","-9","-10","-11"] to start with: ["-3", "-4", "-5", "-6"]
-java.lang.AssertionError
java.lang.AssertionError:
Expecting actual:
["-4","-5","-6","-7","-8","-9","-10","-11"]
to start with:
["-3", "-4", "-5", "-6"]
at io.quarkus.smallrye.reactivemessaging.hotreload.ConnectorChangeTest.testUpdatingConnector(ConnectorChangeTest.java:36)
@dmlloyd any objections to merging? |
No objections here. |
👍🏼 Let's get this in then and keep a close eye on any suspicious failures from now on |
Thanks a lot for this work @mariofusco!!! |
FYI #42067 |
This is a proposed solution for the problem reported here.
I'm avoiding blocking on the
ReadWriteLock
and emulating its behavior with a state machine + an atomic readers counter because the zipFile access may happen inside a native call (for example triggered by the RunnerClassLoader) and then it is necessary to avoid blocking on it.Note that while avoid blocking and thus preventing pinning, I'm afraid that this solution could suffer of monopolization. I tried to mitigate the problem with the yields at lines 160 and 183 but I'm not sure if this is a good idea, especially with virtual threads. Any feedback or suggestion to improve is welcome.
/cc @franz1981 @geoand @Sanne @dmlloyd @cescoffier