Description
We've been seeing ILC crashes in the CI for a while. This may or may not be related to #109800.
Symptom of this one is just "exited with code 57005". E.g. here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=898321&view=logs&j=6a7e26fa-36e7-5a45-28af-dc6c8e6724e6&t=089ef86b-599a-543a-2c8c-82b31601e3ec
I left the compilation of Microsoft.Extensions.FileProviders.Composite.Tests (-rc Checked -lc Release) running in a loop overnight and out of 2000 iterations, I got 5 crashes with dumps. So at least I can repro it. All the dumps are from the scanning phase, so we could possibly speed up the repro further by exiting once the scanner is done (we won't run into the bug after that point).
I tried to make sense of the dump, but I'm at a loss. I'm not good with GC and this is some sort of corruption.
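For anyone who wants to try the same kind of overnight loop, here is a minimal sketch. It is not the script used for the 2000 iterations above; the function name and the way the command is passed in are hypothetical, and the actual ILC command line has to be supplied by the caller. The only value taken from this issue is the exit code 57005 (0xDEAD in hex). Note that on POSIX systems exit codes are truncated to 8 bits, so the comparison only works as written on Windows.

```python
import subprocess
import sys

# Exit code seen in the CI log; 57005 == 0xDEAD.
CRASH_EXIT_CODE = 57005


def count_crashes(command, iterations, crash_code=CRASH_EXIT_CODE):
    """Run `command` repeatedly and return the iteration indexes that crashed.

    `command` is an argument list as accepted by subprocess.run. It is a
    placeholder for whatever compilation command is being stress-tested.
    """
    crashed = []
    for i in range(iterations):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == crash_code:
            crashed.append(i)
    return crashed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): python loop.py <command> [args...]
    hits = count_crashes(sys.argv[1:], iterations=2000)
    print(f"{len(hits)} crashes at iterations {hits}")
```

At roughly 5 crashes per 2000 runs, a loop like this needs on the order of hundreds of iterations per expected hit, which is why exiting right after the scanning phase would help.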
Crash is here because MethodTable was null:
> [Inline Frame] ilc.exe!MethodTable::HasComponentSize() Line 228 C++
[Inline Frame] ilc.exe!SVR::my_get_size(Object *) Line 11673 C++
ilc.exe!SVR::gc_heap::mark_object_simple(unsigned char * * po, int thread) Line 28160 C++
[Inline Frame] ilc.exe!SVR::gc_heap::mark_through_cards_helper(unsigned char * *) Line 41722 C++
ilc.exe!SVR::gc_heap::mark_through_cards_for_uoh_objects(void(SVR::gc_heap::*)(unsigned char * *, int) fn, int gen_num, int relocating, SVR::gc_heap * hpt) Line 47244 C++
ilc.exe!SVR::gc_heap::mark_phase(int condemned_gen_number) Line 30150 C++
ilc.exe!SVR::gc_heap::gc1() Line 22500 C++
ilc.exe!SVR::gc_heap::garbage_collect(int n) Line 24679 C++
ilc.exe!SVR::gc_heap::gc_thread_function() Line 7293 C++
ilc.exe!SVR::gc_heap::gc_thread_stub(void * arg) Line 37778 C++
ilc.exe!CreateNonSuspendableThread::__l2::<lambda>(void * argument) Line 592 C++
We read the MethodTable out of a presumed object at 0x00000235a567a7f8, however the bytes are all zeros (except for the first byte that is 1, presumably we just marked it), so MethodTable is null.
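A small sketch of why a first word of 1 ends up as a null MethodTable. This assumes the GC keeps its mark flag in the lowest bit of the object's first pointer-sized word and masks it off before using the pointer; the constant and function names are hypothetical and are not the runtime's own.

```python
import struct

# Assumption: the mark flag lives in the lowest bit of the first word.
MARK_BIT = 0x1
POINTER_MASK = 0xFFFFFFFFFFFFFFFF


def decode_first_word(raw):
    """Split the first 8 bytes of an object into (method_table, is_marked)."""
    (word,) = struct.unpack("<Q", raw)  # little-endian 64-bit value
    method_table = word & ~MARK_BIT & POINTER_MASK
    is_marked = bool(word & MARK_BIT)
    return method_table, is_marked


if __name__ == "__main__":
    # Bytes as described above: first byte is 1, the rest are zero.
    mt, marked = decode_first_word(bytes([1, 0, 0, 0, 0, 0, 0, 0]))
    print(f"MethodTable={mt:#x} marked={marked}")
```

Under that assumption, the memory at 0x00000235a567a7f8 decodes to a marked object whose MethodTable pointer is zero, which matches the crash in HasComponentSize().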
Searching through the memory for references to this address finds a couple hits:
0:005> s -q 0x00000000 L?0xffffffffffffffff 235A567A7F8
000000fd`136fe3b8 00000235`a567a7f8 00000000`11ad8690
000000fd`136fe608 00000235`a567a7f8 00000000`11ad8690
000000fd`136fe728 00000235`a567a7f8 00000235`b0d1fab8
00000235`847a48f8 00000235`a567a7f8 00000000`00000000
00000235`a5400068 00000235`a567a7f8 00000000`00000000
00000235`a54000f8 00000235`a567a7f8 00000235`a50e4950
Since g_lowest_address is 0x0000023584c00000, the first 3 hits are not in GC range and I'm going to ignore them.
I don't know what the reference at 847a48f8 is, so I'm going to ignore it for now (there's tons of other GC-like references around it, but no MT pointer, likely some queue within the GC - do they get put in the heap range?).
Reference at a5400068 is the _source field on an XNodeNavigator instance. The object is marked. The _parent field is null and _nameTable points to some bogus value in the middle of another object. The object looks to be regurgitated.
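The range check on the hits can be written down as a quick sketch. It only compares against g_lowest_address, since that is the only bound quoted here; the upper bound of the GC range is not given, so "at-or-above" does not by itself mean "inside the heap". The function names are hypothetical.

```python
# Value of g_lowest_address quoted in this issue.
G_LOWEST_ADDRESS = 0x0000023584C00000

# Locations of the six hits from the search for 235A567A7F8.
HIT_LOCATIONS = [
    0x000000FD136FE3B8,
    0x000000FD136FE608,
    0x000000FD136FE728,
    0x00000235847A48F8,
    0x00000235A5400068,
    0x00000235A54000F8,
]


def classify(locations, lowest=G_LOWEST_ADDRESS):
    """Label each location as 'below' or 'at-or-above' the lowest GC address."""
    return {
        loc: ("below" if loc < lowest else "at-or-above")
        for loc in locations
    }


if __name__ == "__main__":
    for loc, where in classify(HIT_LOCATIONS).items():
        print(f"{loc:#018x} is {where} g_lowest_address")
```

By this comparison the hit at 00000235`847a48f8 also falls below g_lowest_address (0x847a48f8 is less than 0x84c00000 in the low 32 bits), so it may be sitting just outside the GC range rather than inside it.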
Looking for who references it:
0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400060
00000235`847a48f0 00000235`a5400060 00000235`a567a7f8
00000235`a5400038 00000235`a5400060 00007ff7`6c1dfef8
The first hit is in the 847a4 range again, going to ignore it. The second is from an XPathChildIterator instance. The name field is intact and says "assembly". The nav field points to the XNodeNavigator.
Looking further for the references to the XPathChildIterator:
0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400028
00000235`847a48c0 00000235`a5400028 00000235`a57e51d8
00000235`84838a30 00000235`a5400028 00000275`98914bd8
00000235`a50e9a10 00000235`a5400028 00000000`00000000
00000275`989149d0 00000235`a5400028 00000000`00000000
00000275`98914a00 00000235`a5400028 00000235`a5400028
00000275`98914a08 00000235`a5400028 00000000`00000000
235A50E9A08 is an XPathNodeIterator.Enumerator; it is, however, not marked. The 275* addresses are out of heap range.
This is basically all I have. I don't know how the GC was made to look at this dead object.
The XPath references make it sound like it could be related to #108743 which was another mystery bug.
I can share the dumps I have or would appreciate any advice on how to root cause this further. I heard of "stress log" before, not sure if that could be helpful here (and if/how that works on native AOT).
Cc @VSadov