-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VM build failure - VMState: 0x00050080 or 0x000501ff #9416
Comments
Also seen here #9291 |
@andrewcraik @gita-omr Can someone take a look at this? It seems to be a newish assert failure. This one occurred on osx, #9291 on aix ~10 days ago. |
@cathyzhyi might this be related to the AOT issues we have been looking at? |
This failure seems very likely a |
assigned to @Leonardo2718 |
@Leonardo2718 any update on this? We're coming up on the v0.21.0 code split on June 7 |
So, I wasn't able to get very far because I couldn't get the build that the assert was triggered in. There's really not a lot that can be done with the available information, so the best thing to do is to try to capture more information the next time this failure is seen. I've been trying to think of a way of improving the assert message but haven't found one yet. |
#9291 seems very similar. Is that easier to reproduce to debug? |
No, #9291 has the same problem, unfortunately. |
The same problem = can't get the build? Are you able to run the test locally or on a cloud machine to reproduce the issue? We have the nightly builds and debug images that can be downloaded from Adopt, especially for recent builds so reproducing this - locally or on a build farm - seems like the next step to me. |
So, the problem is that the failure is happening during the build, which makes reproducing the issue and extracting useful information very time-consuming. I'm going to try requesting access to the machine at Adopt to see if I have better luck reproducing the failure. It's probably still going to take a lot of time to figure this out, though. |
Any update on this @Leonardo2718? Do you expect it to make the 0.22.0 code split on Aug 9? |
Unfortunately, I haven't been able to reproduce this so there's no chance of it making the 0.22 code split. |
I've moved it to the 0.23 milestone |
From #10313 (comment)
|
I think this has to move to the backlog until the adopt issue is fixed |
@M-Davies do we have any corefiles to debug as yet? If we don't have them then this will need to move to backlog until we can get diagnostic files since attempts at reproduction have not been successful to the best of my understaning. |
It's hard to pick up diagnostic files from nightly builds at the best of times and with adoptium/temurin-build#1778, it makes it near impossible. |
@DanHeidinga FYI this one is missing diagnostics and so cannot be progressed - should we backlog it or close it pending diagnostics? |
`findOrCreateShadowSymbol()` tries to reuse symrefs produced by `findOrFabricateShadowSymbol()` when the shadow corresponds to the same field. Detecting that the fields are in fact the same requires checking that the defining class is the same. Normally, the expectation is that if a cp field ref is resolved, then the cp ref pointing to the defining class must also be resolved. However, in eclipse-openj9#9416, it appears as though this may not always be the case, which triggers an assert. As a temporary work around, the assert is changed to an abort compilation to avoid taking the JVM. Because the assert is useful for detecting such cases, this patch is only submitted for the release branch so that further investigation can continue on the main development branch. The mainline code uses a different assert when compiling relocatable code to add extra debug information to the assert message in those cases. Since this commit effectively hides any occurrences of the issue reported in eclipse-openj9#9416, I don't believe there is any value in keeping the extra debug information. So, I've simplified the two asserts back into a single check that throws an exception of failure. Signed-off-by: Leonardo Banderali <leonardo2718@protonmail.com>
Moving this out to 0.26. A failure has not been seen since adding the new instrumentation to help us narrow this down, and the workaround has been committed to the 0.25 branch already. |
Saw this on a PR build:
Unfortunately, the PR that added this tracing missed a change that meant we didn't trace anything in The tracing output that did show tells us is that we got the class chain the first time, and failed to get the class chain the second time. Therefore, either the ROMClass stopped being in the the SCC or
is returning NULL in one instance and non-NULL in another. Will have to enable the tracing to know for sure. That said... as I mentioned above, with HCR it is actually possible for us to get non-NULL the first time and NULL the second time. It's possible for the class of the method we're compiling to get redefined and the compilation wouldn't be interrupted till we explicitly checked, which would not happen between the two queries. So, we're legitimately going to need to handle the possibility a null class chain. |
`findOrCreateShadowSymbol()` tries to reuse symrefs produced by `findOrFabricateShadowSymbol()` when the shadow corresponds to the same field. Detecting that the fields are in fact the same requires checking that the defining class is the same. Normally, the expectation is that if a cp field ref is resolved, then the cp ref pointing to the defining class must also be resolved. However, in eclipse-openj9#9416, it appears as though this may not always be the case, which triggers an assert. As a temporary work around, the assert is changed to an abort compilation to avoid taking the JVM. Because the assert is useful for detecting such cases, this patch is only submitted for the release branch so that further investigation can continue on the main development branch. The mainline code uses a different assert when compiling relocatable code to add extra debug information to the assert message in those cases. Since this commit effectively hides any occurrences of the issue reported in eclipse-openj9#9416, I don't believe there is any value in keeping the extra debug information. So, I've simplified the two asserts back into a single check that throws an exception of failure. Signed-off-by: Leonardo Banderali <leonardo2718@protonmail.com>
I found a failure in an internal build. No diagnostic files though.
|
The place where The
where the last value The second time
notice that the last value is As such, I don't think this is a JIT issue. fyi @hangshao0 |
The only possible cases that
|
The documentation of It won't be able to return the record in an error case. |
Thanks @hangshao0 @0xdaryl @Leonardo2718 I think the assert should just be removed:
|
If the assert is removed (implying it is invalid), will the subsequent code work as expected assuming that |
I don't think simply removing the assert is valid. If However, I believe there's a deeper issue as well.
So, to me, it seems pretty reasonable to assume that once something is resolved it stays resolved (ignoring the HCR issue for a minute). If
Redefining VM structures like this from under the JIT's nose seems rather dangerous. I would think that we would probably want to abort all current compilations if a class gets redefined (at least all the compilations that might try to look at the class that got redefined). Otherwise, for similar reasons as above, assumptions made in the JIT about the stability of the data structures could be violated and lead to a crash (even if the compiled body is thrown away in the end, the JIT could crash before compilation completes). |
A way to deal with the HCR problem, in this specific case anyway, is to check whether the comp should be interrupted before the assert - If it does then we abort the compilation; if it doesn't then the assert will only fire when we hit an inconsistent state. One way for us to deal with the fact that the SCC API could return NULL even if it returned non-NULL previously is to cache the class chain for each J9Class. There exists some (currently inactive) infrastructure to cache the validation result of a class chain for each J9Class - this is only used in the AOT load run. I believe we can piggy-back off this infra to also just cache the class chain per J9Class. That said, the infra uses the CH Table; this is because we already have a whole bunch of code that ensures that the CH Table maintains coherence through all of the runtime events (such as class loading/unloading, redefinition, etc.). Therefore, if we were to piggy back off the aforementioned infra, we would depend on the existence of the CHTable. Currently we disable the maintenance of the table if we disable CHOpts for whatever reason. However, there's no reason why we can't maintain the table even if we're not going to make use of it. Therefore, if we can ensure that the CHTable will always be coherent, then we can cache the classchain and not have to worry about whether or not we're going to incorrectly fail an AOT validation we passed earlier. However, I don't have the time to do this right now; I can open an issue to describe exactly what has to be done, but it's very unlikely I'll get to it in the next month or so. |
So, given there doesn't appear to be an easy way to fix the underlying issue, I've opened #12259 to change the assert into a |
`findOrCreateShadowSymbol()` tries to reuse symrefs produced by `findOrFabricateShadowSymbol()` when the shadow corresponds to the same field. Detecting that the fields are in fact the same requires checking that the defining class is the same. Normally, the expectation is that if a cp field ref is resolved, then the cp ref pointing to the defining class must also be resolved. However, in eclipse-openj9#9416, it was discovered that this assumption does not hold in some cases. Under AOT, loading class data from the SCC using `findSharedData()` may fail, which is propagated to the JIT as a NULL defining class. In addition, if a class is redefined between the point where a field is looked up and where the defining class is looked, the latter may become NULL. Since the code in question will not work correctly if the defining class is null, this commit tactically replaces the assert with a call to `failCompilation()`. Fixes eclipse-openj9#9416 Signed-off-by: Leonardo Banderali <leonardo2718@protonmail.com>
Got this on https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-mac-x64-openj9/573 (
build-macstadium-macos1014-1
)Access to the machine can be granted by opening an issue at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues
The text was updated successfully, but these errors were encountered: