-
Notifications
You must be signed in to change notification settings - Fork 5k
Fix an edge case in the Tarjan GC bridge that leads to losing xref information #112825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Example code that got broken by the bug: // Run this on UI thread (eg. in MainActivity.OnCreate)
Task.Run(GCLoop); // <== Repeat this line up to 4 times depending on HW/emulator to make the repro more reliable
_ = AsyncStreamWriter();
public static async Task GCLoop()
{
while (true)
{
GC.Collect();
await Task.Delay(10);
}
}
public static async Task AsyncStreamWriter()
{
var bs = new ByteArrayOutputStream();
var osi = new Android.Runtime.OutputStreamInvoker(bs);
try
{
while (true)
await osi.WriteAsync(new byte[2]);
}
catch (ObjectDisposedException ex)
{
// <== Fails here because ByteArrayOutputStream got finalized/disposed
System.Environment.FailFast(ex.ToString());
}
} The task machinery of If a GC occurs at this point and starts the GC bridge machinery we end up with both To understand why the GC bridge reported it incorrectly, we must look at the log (produced with custom
The key part of the log is that there are several cycles in the object graph (or strongly connected component / SCC in terms of how the graph is processed). The inner SCC is processed correctly and you can see that |
For those who are more visual and can bear my handscribbling: Yellow are bridge objects. Blue is the first SCC in the graph, it itself points to a bridge object. Green is the second SCC graph. It points to no bridge objects by itself but it does contain the blue SCC which does. When processing the green SCC it erroneously decided that since there's no bridge objects it doesn't need to be colored and threw away the XREFs calculated from the blue SCC. |
/cc @BrzVlad |
@filipnavara I'm still going through this and trying to understand both the current SCC algorithm and the impact of the above. I'm also going to defer to @BrzVlad for the final sign off as he has more experience with this area than me. |
I'm more than happy to answer any questions in case it helps with this PR or the work on CoreCLR bridge.
Sure. I think this bug was essentially cropping up many times over the years but no one really tracked it down... As a follow-up we can try to go down the history lane to see if it fixes the cases where people previously had to switch to the "new" bridge. |
Ah, that history is interesting. If you could do that or get some indication that the other algorithms handle this correctly and now the tarjan one does too, that would help give us more confidence in this change. |
The sample code above runs correctly under .NET 8/9 when switching to the "new" bridge. Likewise, it runs correctly after the fix but crashes reliably without it. There's a mode of the runtime that runs both bridges at the same time and compares the results. I can re-run the sample under that mode (unless the code bit rotted too much since the mono/mono times). |
Oh wow. I didn't know that. If you could check that out please do. I'm also happy to review a PR if it has bit-rot and a few fixes are needed. |
The good news is that the bridge comparison works in released MonoVM builds, so you can run .NET 8:
.NET 9:
However, I would expect it to bail with different values. I need to figure out the details. The Tarjan bridge needs be setup to run in precise mode for this to work, which is not the default. UPD: With this patch applied it no longer crashes even with the |
With
(top being the primary Tarjan bridge, bottom being the New bridge used for comparison) There's a cut & paste error in the |
What does the DUMP_GRAPH look like with the fix, interested to compare before and after. |
Note that the order of the objects is not stable. Sometimes |
Thanks @filipnavara - your help here is appreciated more than you know. From my perspective, this assuages any concerns I have with this change. I'm still going to defer to someone with more experience in this code base though. If you'd like to put up the fixed up code above - still commented out, feel free. I'm confident enough to sign off on that. |
I am playing with the idea of running the test cases as a standalone executable that includes unmodified sgen-tarjan-bridge.c: The branch has a minimal example with the broken object graph scenario. It can run it on desktop and report back the output of the bridge code. With a bit of love we could probably clean it up and pair it with a fuzzer to track down other failures like #106410. |
@filipnavara I agree this is a good opportunity to start testing the |
@filipnavara Sorry for adding confusion. |
Feel free to ask questions in case you hit something non-obvious. I'll be on Discord and GitHub. Notably, neither this PR nor #113044 requires understanding all the details. Both are relatively isolated issues in the bridge code that don't modify the core of the algorithm. I highly recommend running the standalone sample in https://github.com/filipnavara/runtime/tree/standalone-gc-bridge-test under debugger to see the values that trigger the erroneous conditions. |
@@ -748,7 +748,7 @@ create_scc (ScanData *data) | |||
|
|||
for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) { | |||
ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i); | |||
found_bridge |= other->is_bridge; | |||
found_bridge |= other->is_bridge || dyn_array_ptr_size (&other->xrefs) > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking something like this would be more clear with the intention of the change.
@@ -745,10 +745,16 @@ create_scc (ScanData *data)
gboolean found = FALSE;
gboolean found_bridge = FALSE;
ColorData *color_data = NULL;
+ gboolean can_reduce_color = TRUE;
for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) {
ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i);
found_bridge |= other->is_bridge;
+ if (dyn_array_ptr_size (&other->xrefs) > 0) {
+ // This scc will have more xrefs than the ones from the color_merge_array,
+ // we will need to create a new color to store this information.
+ can_reduce_color = FALSE;
+ }
if (found_bridge || other == data)
break;
}
@@ -756,9 +762,12 @@ create_scc (ScanData *data)
if (found_bridge) {
color_data = new_color (TRUE);
++num_colors_with_bridges;
- } else {
+ } else if (can_reduce_color) {
color_data = reduce_color ();
+ } else {
+ color_data = new_color (FALSE);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should work too and it seems reasonable.
* (ie. the task continuation is hooked through the SynchronizationContext | ||
* implentation and rooted only by Android bridge objects). | ||
*/ | ||
static void NestedCycles () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to have a starting point for some passing bridge tests (comparing between new and tarjan). I will add them in a separate PR, maybe also including this test, so we can start incrementally fixing and testing the tarjan bridge.
In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost.
why not backport to release/9.0? |
I can repro this with a much simpler graph like this -
this produces 0 xrefs. the bug seems to manifest as long as in the NonBridge SCC, the object B1 points to is different from the object that points to B2. the sequence is -
|
Amazing find! I started reducing the test case from a real-world example, but this simple one is much nicer for some synthetic testing. |
thank you @filipnavara for bringing this up and fixing it! |
Really great this is fixed! Thanks a lot! |
…formation (dotnet#112825) * Fix an edge case in the Tarjan SCC that lead to losing xref information In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost. * Add test case to sgen-bridge-pathologies and add an assert to catch the original bug * Add review --------- Co-authored-by: Vlad Brezae <brezaevlad@gmail.com>
* Fix dump_processor_state debug code to compile and work on Android (#112970) Fix typo in GC bridge comparison message (SCCS -> XREFS) * [mono] Add a few bridge tests (#113703) * [mono][sgen] Fix DUMP_GRAPH debug option build for tarjan bridge * [mono][sgen] Don't create ScanData* during debug dumping of SCCs It serves no purpose and it would later crash the runtime since we didn't patch the lockword back in place. * [mono][sgen] Fix some null deref crashes in DUMP_GRAPH debug option * [mono][tests] Add bridge tests These are ported from some of the bridge tests we had on mono/mono. In order to test them we compare between the output of the new and the tarjan bridge. * Fix an edge case in the Tarjan GC bridge that leads to losing xref information (#112825) * Fix an edge case in the Tarjan SCC that lead to losing xref information In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost. * Add test case to sgen-bridge-pathologies and add an assert to catch the original bug * Add review --------- Co-authored-by: Vlad Brezae <brezaevlad@gmail.com> * [SGen/Tarjan] Handle edge case with node heaviness changing due to deduplication (#113044) * [SGen/Tarjan] Handle edge case with node heaviness changing due to deduplication Do early deduplication Fix Windows build Add test cases to sgen-bridge-pathologies * Move test code * Remove old code * Add extra check (no change to functionality) * Disable test on wasm --------- Co-authored-by: Vlad Brezae <brezaevlad@gmail.com> Co-authored-by: Alexander Köplinger <alex.koeplinger@outlook.com>
…formation (dotnet#112825) * Fix an edge case in the Tarjan SCC that lead to losing xref information In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost. * Add test case to sgen-bridge-pathologies and add an assert to catch the original bug * Add review --------- Co-authored-by: Vlad Brezae <brezaevlad@gmail.com>
We're seeing a recurrent crash on CI, which is still inexplicable: E droid.NET_Test: JNI ERROR (app bug): accessed deleted Global 0x3056 F droid.NET_Test: java_vm_ext.cc:570] JNI DETECTED ERROR IN APPLICATION: use of deleted global reference 0x3056 The obvious thing to do to track this down is to enable GREF logging, which makes the error disappear. (Figures.) However, attempting to enable GREF logging found *other* issues: The GREF log couldn't be created (!): W monodroid: Failed to create directory '/data/user/0/Mono.Android.NET_Tests/files/.__override__/arm64-v8a'. No such file or directory … E monodroid: fopen failed for file /data/user/0/Mono.Android.NET_Tests/files/.__override__/arm64-v8a/grefs.txt: No such file or directory The apparent cause for this is that the ABI is now part of the path, `…/.__override__/arm64-v8a/grefs.txt` and not `…/.__override__/grefs.txt`. Additionally, `create_public_directory()` was a straight `mkdir()` call, which *does not* create intermediate directories. Update `create_public_directory()` to use `create_directory()` instead, which *does* create intermediate directories. This allows `grefs.txt` to be created. *Next*, I started getting a *bizarre* failure within `MoarThreadingTests()`: I NUnit : 1) Java.InteropTests.JnienvTest.MoarThreadingTests (Mono.Android.NET-Tests) I NUnit : No exception should be thrown [t2]! Got: System.ObjectDisposedException: Cannot access disposed object with JniIdentityHashCode=158880748. I NUnit : Object name: 'Java.Lang.Integer'. I NUnit : at Java.Interop.JniPeerMembers.AssertSelf(IJavaPeerable self) I NUnit : at Java.Interop.JniPeerMembers.JniInstanceMethods.InvokeAbstractInt32Method(String encodedMember, IJavaPeerable self, JniArgumentValue* parameters) I NUnit : at Java.Lang.Integer.IntValue() I NUnit : at Java.Lang.Integer.System.IConvertible.ToInt32(IFormatProvider provider) I NUnit : at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider) I NUnit : at Android.Runtime.JNIEnv.GetObjectArray(IntPtr array_ptr, Type[] element_types) I NUnit : at Java.InteropTests.JnienvTest.<>c__DisplayClass26_0.<MoarThreadingTests>b__1() I NUnit : Expected: null I NUnit : But was: <System.ObjectDisposedException: Cannot access disposed object with JniIdentityHashCode=158880748. Relevant code from `JNIEnv.GetObjectArray()`: for (int i = 0; i < cnt; i++) { Type? targetType = (element_types != null && i < element_types.Length) ? element_types [i] : null; object? value = converter ((targetType == null || targetType.IsValueType) ? null : targetType, array_ptr, i); ret [i] = value; ret [i] = targetType == null || targetType.IsInstanceOfType (value) ? value : Convert.ChangeType (value, targetType, CultureInfo.InvariantCulture); } It's *conceivable* that `value` is finalized *during* the `Convert.ChangeType()` call (…really?! I'm grasping at straws here!), which could explain the `ObjectDisposedException` being thrown. (This doesn't *actually* make sense, but it's the best I can come up with.) Add a `GC.KeepAlive()` call to keep the object alive until after `Convert.ChangeType()` is called: ret [i] = targetType == null || targetType.IsInstanceOfType (value) ? value : Convert.ChangeType (value, targetType, CultureInfo.InvariantCulture); GC.KeepAlive (value); This "fixed" the `ObjectDisposedException` issue, only to raise a new, different, and *not* fixed issue: I NUnit : 1) Java.InteropTests.JnienvTest.MoarThreadingTests (Mono.Android.NET-Tests) I NUnit : No exception should be thrown [t2]! Got: System.ArgumentException: Handle must be valid. (Parameter 'instance') I NUnit : at Java.Interop.JniEnvironment.InstanceMethods.CallIntMethod(JniObjectReference instance, JniMethodInfo method, JniArgumentValue* args) in /Volumes/Xamarin-Work/src/dotnet/android/external/Java.Interop/src/Java.Interop/obj/Debug/net9.0/JniEnvironment.g.cs:line 20191 I NUnit : at Java.Interop.JniPeerMembers.JniInstanceMethods.InvokeAbstractInt32Method(String encodedMember, IJavaPeerable self, JniArgumentValue* parameters) in /Volumes/Xamarin-Work/src/dotnet/android/external/Java.Interop/src/Java.Interop/Java.Interop/JniPeerMembers.JniInstanceMethods_Invoke.cs:line 492 I NUnit : at Java.Lang.Integer.IntValue() in /Volumes/Xamarin-Work/src/dotnet/android/src/Mono.Android/obj/Debug/net10.0/android-36/mcw/Java.Lang.Integer.cs:line 354 I NUnit : at Java.Lang.Integer.System.IConvertible.ToInt32(IFormatProvider provider) in /Volumes/Xamarin-Work/src/dotnet/android/src/Mono.Android/Java.Lang/Integer.cs:line 58 I NUnit : at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider) I NUnit : at Android.Runtime.JNIEnv.GetObjectArray(IntPtr array_ptr, Type[] element_types) in /Volumes/Xamarin-Work/src/dotnet/android/src/Mono.Android/Android.Runtime/JNIEnv.cs:line 1070 I NUnit : at Java.InteropTests.JnienvTest.<>c__DisplayClass26_0.<MoarThreadingTests>b__1() in /Volumes/Xamarin-Work/src/dotnet/android/tests/Mono.Android-Tests/Mono.Android-Tests/Java.Interop/JnienvTest.cs:line 390 which makes *no* sense, and I'm starting to have flashbacks to issue #9039, which was a GC bug fixed in dotnet/runtime#112825. (Don't we have that fix?! Perhaps not.) Next, the contents of `grefs.txt` occasionally looked "wrong". Turns out, `WriteGlobalReferenceLine()` doesn't consistently append a newline to the message, which impacts e.g. the `Created …` message: Created PeerReference=0x3c06/G IdentityHashCode=0x2e29bd Instance=0xbe0aab36 Instance.Type=Android.OS.Bundle, Java.Type=android/os/Bundle+g+ grefc 19 gwrefc 0 obj-handle 0x706f711035/L -> new-handle 0x3d06/G from thread '<null>'(1) Update `WriteGlobalReferenceLine()` to always append a newline to the message. Finally, update `env.txt` to always set debug.mono.debug=1. Filenames and line numbers in stack traces are handy!
In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost.
Consider the following object graph:
RunnableImplementor
andByteArrayOutputStream
are bridge objects that are eligible for collection on the .NET side.ByteArrayOutputStream
is assigned a color in the color graph. We start traversing the graph through depth first search fromRunnableImplementor
to find the strongly connected components using the Tarjan algorithm. Once we discover the link from<>c__DisplayClass2_0
toByteArrayOutputStream
we make a note of the color that needs to be recorded in the color graph to produce an xref. We then exit the recursion and find thatInner SCC
forms a strongly connected component and we merge all the recorded colors into[AsyncStateMachineBox'1]->xref
array. Then we proceed to continue with the Tarjan algorithm which eventually finds theOuter SCC
. Once again we want to merge all the colors of the nodes in the outer SCC. The implementation assumes that there could be any links in the color graph only if any objects in the SCC point to a bridge object. However, it failed to consider thatInner SCC
's nodes were already processed and contain the summary information innode->xref
field. If there were any other links to bridge objects it would still allocate a color for theOuter SCC
and behave correctly. However, if there are no other links to bridge objects in the reminder ofOuter SCC
it would fail to allocate a color and silently drop thenode->xref
list on any contained node.Fixes #115611
Ref dotnet/android#9789 for discussion on the root cause