Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC crashed with an assertion failure in Scavenger.cpp #17052

Closed
connglli opened this issue Mar 29, 2023 · 26 comments
Closed

GC crashed with an assertion failure in Scavenger.cpp #17052

connglli opened this issue Mar 29, 2023 · 26 comments

Comments

@connglli
Copy link

connglli commented Mar 29, 2023

Java version

openjdk version "11.0.19-internal" 2023-04-18
OpenJDK Runtime Environment (build 11.0.19-internal+0-adhoc..openj9-openjdk-jdk11)
Eclipse OpenJ9 VM (build master-f14707f, JRE 11 Linux amd64-64-Bit Compressed References 20230303_000000 (JIT enabled, AOT enabled)
OpenJ9   - f14707f
OMR      - 342b647
JCL      - a8cc681 based on jdk-11.0.19+4)

Javac version

javac 11.0.19-internal

Code and summary of the problem

Code and logs: see issue17052.tar.gz. The code Test.java is a bit long because I ain't able to reduce it.

This seems to be the most reproducible problem out of the 12 as mentioned in our previous chat. I can reproduce exactly the same assertion failure (Scavenger.cpp:1669) 3 times out of 4 runs. I put logs of all these 4 runs in the link above, each in a directory named log.XXXXX. Hope the same frequency applies to you too. It makes OpenJ9 to crash in GC and should be a JIT bug since there aren't any crash if -Xint is enabled.

The test exited with code 255 and the assertion log (log.bOTRV, log.D1Uma, log.RtuV0) looks like

Corruption in Evacuate at 00000000FFD00048: calculated object size 140448003833796 larger than available 1572792, Forwarded Header at 00007FBC752B7300
21:12:28.372 0x25a500    j9mm.141    *   ** ASSERTION FAILED ** at /root/hostdir/openj9-openjdk-jdk11/omr/gc/base/standard/Scavenger.cpp:1669: ((false))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/03/15 21:12:28 - please wait.
JVMDUMP032I JVM requested System dump using '/home/cli/ax-exp/openj9/Artemis/core.20230315.211228.900368.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.900864.  Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.

JVMPORT049I The core file created by child process with pid = 900864 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.

JVMDUMP012E Error in System dump: /home/cli/ax-exp/openj9/Artemis/core.20230315.211228.900368.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/cli/ax-exp/openj9/Artemis/javacore.20230315.211228.900368.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/cli/ax-exp/openj9/Artemis/javacore.20230315.211228.900368.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/cli/ax-exp/openj9/Artemis/Snap.20230315.211228.900368.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/cli/ax-exp/openj9/Artemis/Snap.20230315.211228.900368.0003.trc
JVMDUMP013I Processed dump event "traceassert", detail "".

The last run (log.d3ZAp) crashed with a different assertion failure:

07:29:17.836 0x1aa00    j9mm.107    *   ** ASSERTION FAILED ** at /root/hostdir/openj9-openjdk-jdk11/omr/gc/base/MemoryPoolAddressOrderedList.cpp:1428: ((false && ((__null == nextFreeEntry) || (addrTop <= nextFreeEntry))))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/03/28 15:29:17 - please wait.
JVMDUMP032I JVM requested System dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/core.20230328.152917.2166141.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.2166212.  Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.

JVMPORT049I The core file created by child process with pid = 2166212 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.

JVMDUMP012E Error in System dump: /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/core.20230328.152917.2166141.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/javacore.20230328.152917.2166141.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/javacore.20230328.152917.2166141.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/Snap.20230328.152917.2166141.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/Snap.20230328.152917.2166141.0003.trc
JVMDUMP013I Processed dump event "traceassert", detail "".
@connglli
Copy link
Author

connglli commented Mar 29, 2023

For the rest 11, see issue17052rest11.tar.gz

For convenience, I summarize them in the following table. I hope this can help you, simplifying your work (I know it's somewhat a headache, so I'm sorry). I put each test per directory named X.openj9, each following the same directory structure as the main issue mentioned above. In the table, the key perhaps_unique is the unique symptom (if it is) that I wanted to display to you. But as you may noticed, for some of them, I'm not able to reproduce for a second time. The key log.XXXXX is a summary of the failure I have encountered when trying to reproduce the corresponding test.

In addition, besides the (perhaps) unique symptom, there're chances that different tests can lead to the same crash symptom because of their non-determinism (and some tests can even trigger crashes that I've reported before like #15476). This indicate they are maybe, not unique?

3.openj9
- perhaps_unique: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.1B22R: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.DZo00: assertion failed Scavenger.cpp:1669: ((false))
- log.xhKX6: segmentation error ZN12MM_Scavenger14copyForVariantILb0EEEP8J9ObjectP22MM_EnvironmentStandardP18MM_ForwardedHeader+0x77e

5.openj9 
- perhaps_unique: assertion failed ScavengerRootScanner.hpp:105: ((MM_StackSlotValidator(MM_StackSlotValidator::COULD_BE_FORWARDED, *slotPtr, stackLocation, walkState).validate(_env)))
- log.vfCGQ: assertion failed ScavengerRootScanner.hpp:105: ((MM_StackSlotValidator(MM_StackSlotValidator::COULD_BE_FORWARDED, *slotPtr, stackLocation, walkState).validate(_env)))
- log.3u8ai: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))
- log.QDWaj: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))

6.openj9
- perhaps_unique: assertion failed Scavenger.cpp:1634: ((false))
- log.ES0IO: assertion failed Scavenger.cpp:1669: ((false))
- log.DHfO3: segmentation error _ZN18MM_ForwardedHeader17copyOrWaitOutlineEP8J9Object+0x30
- log.eJxJG: assertion failed ScavengerDelegate.cpp:469: ((false))

7.openj9
- perhaps_unique: segmentation error old_fast_jitLookupInterfaceMethod+0x4d
- log.wER9m: segmentation error old_fast_jitLookupInterfaceMethod+0x4d
- log.g4c0e: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.AYVgj: assertion failed Scavenger.cpp:2731: ((false))

8.openj9
- perhaps_unique: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.jYkx3: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.26AEM: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.PFDIo: segme error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34

9.openj9
- perhaps_unique: segmentation error _ZN23MM_TLHAllocationSupport7refreshEP18MM_EnvironmentBaseP22MM_AllocateDescriptionb+0x338 
- log.0gxUk: segmentation error _ZN23MM_TLHAllocationSupport7refreshEP18MM_EnvironmentBaseP22MM_AllocateDescriptionb+0x338 
- log.4f6XI: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34
- log.nBVOa: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34

11.openj9
- perhaps_unique: segmentation error findFieldInClass+0x99
- log.R1thu: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))
- log.COHB9: segmentation error _ZN23MM_ScavengerRootClearer26scavengeUnfinalizedObjectsEP22MM_EnvironmentStandard+0x300
- log.E7Gfs: assertion failed ScavengerRootClearer.cpp:198: ((false && (__null != forwardedPtr)))

13.openj9
- perhaps_unique: segmentation error haltThreadForInspection+0xa3 
- log.nUWNA: segmentation error _ZN31MM_MemoryPoolAddressOrderedList11allocateTLHEP18MM_EnvironmentBaseP22MM_AllocateDescriptionmRPvS5_+0x7f
- log.cfpHi segmentation error protectedBacktrace+0x12
- log.QUpX3 segmentation error protectedBacktrace+0x12

14.openj9
- perhaps_unique: assertion failed MemorySubSpace.cpp:185: ((false && (reason <= LOA_EXPAND_LAST_RESIZE_REASON)))
- log.9CGLb: assertion failed MemorySubSpace.cpp:185: ((false && (reason <= LOA_EXPAND_LAST_RESIZE_REASON)))

15.openj9
- perhaps_unique: assertion failed Scavenger.cpp:2731: ((false))
- log.cYvMU: assertion failed Scavenger.cpp:2731: ((false))
- log.TJ8np: assertion failed Scavenger.cpp:2731: ((false))
- log.3FZkI: assertion failed ConcurrentMarkingDelegate.cpp:58: ((MM_StackSlotValidator(0, object, stackLocation, walkState).validate(data->env)))
- log.khhfj: segmentation error romFieldSize+0xc3
- log.tD88r: assertion failed ScavengerDelegate.cpp:469: ((false))

16.openj9
- perhaps_unique: assertion failed MarkingSchemeRootMarker.cpp:48: ((MM_StackSlotValidator(0, object, stackLocation, walkState).validate(_env)))
- log.0xAEW: assertion failed Scavenger.cpp:1669: ((false))
- log.N97lH: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34 
- log.vqj3w: segmentation error _ZN12MM_Scavenger16rescanThreadSlotEP22MM_EnvironmentStandardPP8J9Object+0xea 
- log.yTdHX: segmentation error _ZN18MM_ForwardedHeader17copyOrWaitOutlineEP8J9Object+0x30

@pshipton
Copy link
Member

@0xdaryl @hzongaro

@dmitripivkine
Copy link
Contributor

Most of these assertions in GC might be triggered by heap corruption. There are a very few of them I need to double check, the connection with possible heap corruption is not obvious.

@dmitripivkine
Copy link
Contributor

I walked this list of assertions/crashes in GC. There is only one of them can not be explained by heap corruption directly:

  • MemorySubSpace.cpp:185: ((false && (reason <= LOA_EXPAND_LAST_RESIZE_REASON))) this assertion means heap expansion is requested but heap contraction has been performed previous GC. It might be side effect of bad allocation request (to large object?) potentially. We observed this error in the past but can not reproduce. The heuristics responsible for expand/contract logic have been reviewed and adjusted but we did not have a chance to verify correctness.

If somebody needs triaging of failures in GC listed above please let me know, I will help with this. Of course, system core is necessary.

@connglli
Copy link
Author

connglli commented Mar 29, 2023

Oops. That happens to be the test I was unable to reproduce (and my procedure deleted the log periodically)..... Still hope that test could be helpful, to some extend. I'll try again to see if I can reproduce it in more runs.

@connglli
Copy link
Author

I have reproduced the failure mentioned by @dmitripivkine. See logs: issue17052rest11.14.tar.gz.

I edited and also updated the issue17052rest11.tar.gz in my previous comment.

@dmitripivkine
Copy link
Contributor

The problem starts at attempt to allocate object with 1836453136 bytes size. With -Xmx1G specified it is not possible obviously:

14:52:37.237385000                *0x0       j9mm.133  Event       Allocation failure start: newspace=8080/2097152 oldspace=5649544/6291456 loa=0/0 requestedbytes=1836453136

@connglli Do you know is allocation of this array of such size is intentional for this test load?

@dmitripivkine
Copy link
Contributor

@connglli The rational for question above:

  • If allocation of the object size larger than heap is intentional the only problem is GC should not assert but throw OOM.
  • However is allocation of so large object is not intentional there is another problem in JVM somewhere failed to calculate required object size properly. If it is the case we should look to this problem at the first place

@connglli
Copy link
Author

@dmitripivkine The code is generated by our fuzzers. In our fuzzer implementation, we have restricted the size of any array allocation. I have checked the code again, and didn't notice any explicit array allocation of which the size is larger than 524,288(=256x256x8) bytes. But we did generate some library method calls like those of Double, Float, BigDecimal, BigInteger, etc. I suppose they don't have some implicit large array allocations, either (even though I'm not quite sure).

@connglli
Copy link
Author

Could you please check the stack trace such that we can get some glue on which array allocation caused this assertion failure? By the way, I'm sorry I'm on my vacation so I might not be that on call.

@dmitripivkine
Copy link
Contributor

Could you please check the stack trace such that we can get some glue on which array allocation caused this assertion failure? By the way, I'm sorry I'm on my vacation so I might not be that on call.

3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/nio/HeapByteBuffer.<init>(HeapByteBuffer.java:61)
4XESTACKTRACE                at java/nio/ByteBuffer.allocate(ByteBuffer.java:348)
4XESTACKTRACE                at Test.mainTest(Test.java:600(Compiled Code))
4XESTACKTRACE                at Test.main(Test.java:726)

@connglli
Copy link
Author

Oh sure yes, it is a library call ax$45.write(ByteBuffer.allocate(i22)). So for this test an OOM should be thrown instead of an assertion failure. I forgot to check the size of the argument i22 in my fuzzer implementation. But seems that it is a good design to abandon all size checks so that we can find some corner-cases like this :-).

How about the others, I didn't notice ByteBuffer.allocate() in the test like the main comment.

@dmitripivkine
Copy link
Contributor

An assertion in MemorySubSpace.cpp:185 is an obvious problem in GC code (and as a GC Team member I have a priority). The rest of listed assertions are triggered by external to GC reasons (bad O-slot in java stack or heap corruption). I can triage such assertion/crash problems and assign to right component (JIT most likely) but I need system dump to look at. And I don't see any at the moment. Do you have any system dumps left around?

@hzongaro
Copy link
Member

hzongaro commented Jul 10, 2023

I was able to reproduce this failure with this option:

-Xjit:limit={Test.fMeth*},optLevel=hot,lastOptIndex=84

but not with lastOptIndex=83. In this case, opt index 84 was Loop Versioner. The transformations listed for that method in Loop Versioner were

[  3719]  84.1    O^O LOOP VERSIONER: Versioning natural loop 326
[  3720]  84.2    O^O LOOP VERSIONER:  Creating test outside loop for deciding if async check is required
[  3721]  84.3    O^O LOOP VERSIONER:  Creating test outside loop for checking if n4045n [00007F7EA875E010] exceeds bounds

This bears some resemblance to the situation described in issue #17249, so it could be related. I'll assign it to myself for now, in case this is a duplicate.

@hzongaro
Copy link
Member

I haven't been able to make progress on this one for 0.41. I will move it to the 0.43 release.

@hzongaro
Copy link
Member

I think this is likely a duplicate of #17249, so I'll assign it to @a7ehuo.

@a7ehuo
Copy link
Contributor

a7ehuo commented Oct 30, 2023

I can still reproduce the GC assert with the fix for #17249. This issue is also reproducible with disableAsyncCheckVersioning.

-Xjit:limit={Test.fMeth*},{*}(disableAsyncCheckVersioning,optLevel=hot,lastOptIndex=84,optDetails,traceFull,traceILGen,traceCG,traceLoopVersioner,log=logs/jitlog)' Test
Bad scan type for object pointer 00000000A3617EA8
14:37:34.934 0x1aa00    j9mm.141    *   ** ASSERTION FAILED ** at /root/home/ahuo/src/openj9-openjdk-jdk17/openj9/runtime/gc_glue_java/ScavengerDelegate.cpp:480: ((false))

The GC assert happens with lastOptSubIndex=2 but not lastOptSubIndex=1. This issue seems related to versioning bound check. Meanwhile neither _conditionalTree nor _duplicatedContionalTree is set, which is related to the root cause of #17249.

[  1974]  84.1    O^O LOOP VERSIONER: Versioning natural loop 14
versionNaturalLoop: DEBUG calling buildBoundCheckComparisonsTree. _conditionalTree 0000000000000000 n-1n _duplicatedContionalTree 0000000000000000 n-1n
[  1975]  84.2    O^O LOOP VERSIONER:  Creating test outside loop for checking if n824n [00007F723047F140] exceeds bounds

@hzongaro
Copy link
Member

Thanks for checking into that, Annabelle @a7ehuo. Brad @BradleyWood, following that investigation I'm now wondering whether this might be a duplicate of issue #17045, which you've been investigating.

@BradleyWood
Copy link
Member

Yes, I think its the same problem.

I see the same kind of snippet.

try {
    for (int ax$16 = 0; ax$16 <= 3; ax$16++) {
        for (int ax$17 = 0; ax$17 <= 3; ax$17++) {
            ax$15[ax$16 * ax$17] += 1;
        }
    }
} catch (Throwable ax$18) {
} finally {
}

I'm guessing the loop limit transformation is off by one. Does changing ax$17 <= 3 to ax$17 < 4 fix it?

@a7ehuo
Copy link
Contributor

a7ehuo commented Oct 31, 2023

Does changing ax$17 <= 3 to ax$17 < 4 fix it?

Yes. The test passes with the change

                        for (int ax$16 = 0; ax$16 <= 3; ax$16++) {
                            for (int ax$17 = 0; ax$17 < 4; ax$17++) { // Pass
                                ax$15[ax$16 * ax$17] += 1;
                            }

@hzongaro
Copy link
Member

I'll reassign this to Brad @BradleyWood, as it looks like this is likely a duplicate of #17045

@hzongaro hzongaro assigned BradleyWood and unassigned a7ehuo Oct 31, 2023
@a7ehuo
Copy link
Contributor

a7ehuo commented Oct 31, 2023

Just post what I have observed on the failure from this test. Hope it helps. Brad will continue the investigation.

I'm guessing the loop limit transformation is off by one

I noticed the difference between ax$17 <= 3 and ax$17 < 4 is adjustMaxValue, which is calculated from incrementJ based on isIndexChildMultiplied.

ax$17 <= 3 case:

  • incrementJ is 1; isIndexChildMultiplied is 1
  • maxValue = maxValue - adjustMaxValue = maxValue - (incrementJ+1) = 4 - (1+1) = 2

ax$17 < 4 case:

  • incrementJ is 1; isIndexChildMultiplied is 0
  • maxValue = maxValue - adjustMaxValue = maxValue - incrementJ = 4 - 1 = 3

isIndexChildMultiplied is different for these two cases from what I observed. It results in the final maxValue for ax$17 as 2 when ax$17 <= 3. I haven't got to trace why isIndexChildMultiplied is different.

@a7ehuo
Copy link
Contributor

a7ehuo commented Nov 1, 2023

isIndexChildMultiplied is different for these two cases from what I observed. It results in the final maxValue for ax$17 as 2 when ax$17 <= 3.

Sorry I realized my previous logging is incorrect. It is the incrementJ that's different for these two cases.

ax$17 <= 3 case:

  • incrementJ is 1; isIndexChildMultiplied is 1
  • maxValue = maxValue - adjustMaxValue = maxValue - (incrementJ+1) = 4 - (1+1) = 2

ax$17 < 4 case:

  • incrementJ is 0; isIndexChildMultiplied is 1
  • maxValue = maxValue - adjustMaxValue = maxValue - (incrementJ+1) = 4 - (0+1) = 3

@BradleyWood
Copy link
Member

Those values are determined by the type of compare operator. I guess the logic doesn't hold when the indices are product of two induction variables?

@hzongaro
Copy link
Member

@BradleyWood, may I ask you to check whether this problem was fixed by either or both the OMR pull requests eclipse-omr/omr#7237 and eclipse-omr/omr#7238?

@BradleyWood
Copy link
Member

@hzongaro Its fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants