Conversation

@sharwell (Contributor) commented Jun 12, 2017

Fixes #19761

Ask Mode

Customer scenario

The specific steps to encounter this issue are not completely clear. It appears to involve working in a file which has a deeply nested structure. When the bug arises, Visual Studio may read from a pointer which has been freed, resulting in corruption of internal data structures or an access violation.

Bugs this fixes:

Fixes #19761
Fixes #20170

Workarounds, if any

None.

Risk

The change made here is theoretically capable of producing either a deadlock or a performance regression. Neither of these seems likely from the particular change, and the situations we are avoiding are certainly worse (stack overflow or heap corruption). @CyrusNajmabadi Can you provide an evaluation of the risk for this change?

Performance impact

This change disables a cancellation point in the code, which could in some cases cause a regression related to serialization performance. However, this cancellation point was implemented as a recursion guard rather than a proper cancellation point, so it is unlikely that we were relying on it as the latter for performance.

Is this a regression from a previous update?

Yes, this bug was introduced in #16798 (specifically 6427a2b) for release 15.1.

Root cause analysis

The code accessing unmanaged memory is extremely performance sensitive, so most checks capable of detecting this concurrency bug early have too much overhead for deployment.

How was the bug found?

Internal customer report.

Member

Can we explicitly mention that waiting with cancellation would be a problem if the underlying task doesn't support the cancellation?
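For illustration only (not code from this PR), a minimal sketch of the hazard being raised here, assuming a worker that ignores the token: cancelling the wait merely abandons the blocking call, while the task keeps running and races with whatever the caller does next.

using System;
using System.Threading;
using System.Threading.Tasks;

class CancelledWaitRace
{
    static void Main()
    {
        var sharedBuffer = new byte[256];
        using (var cts = new CancellationTokenSource())
        {
            // This worker never observes the token; it keeps writing to shared state.
            Task worker = Task.Run(() =>
            {
                for (int i = 0; i < sharedBuffer.Length; i++)
                {
                    sharedBuffer[i] = 0xFF;
                    Thread.Sleep(10);
                }
            });

            cts.CancelAfter(50);

            try
            {
                // Cancelling the *wait* only abandons this blocking call.
                worker.Wait(cts.Token);
            }
            catch (OperationCanceledException)
            {
                // Control returns here while the worker may still be mutating sharedBuffer;
                // freeing or reusing that buffer now would race with the worker.
                Console.WriteLine($"Wait abandoned; worker completed: {worker.IsCompleted}");
            }
        }
    }
}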

Contributor Author

➡️ Updated the comment

@Pilchie (Member) commented Jun 12, 2017

Thanks for the comment.

// CANCELLATION: If cancellation is required, DO NOT attempt to cancel the operation by cancelling this
// wait. Cancellation must only be implemented by modifying 'task' to cancel itself in a timely manner
// so the wait can complete.
value = task.GetAwaiter().GetResult();
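For context, a hedged sketch of the pattern this hunk sits in; ReadValueWorker, _cancellationToken, the LongRunning StartNew call, and GetAwaiter().GetResult() are named in the discussion below, while the surrounding class shape is assumed and is not the actual Roslyn source.

using System;
using System.Threading;
using System.Threading.Tasks;

// Hedged sketch only; the real ObjectReader code differs.
internal sealed class ObjectReaderSketch
{
    private readonly CancellationToken _cancellationToken;

    public ObjectReaderSketch(CancellationToken cancellationToken)
    {
        _cancellationToken = cancellationToken;
    }

    public object ReadValueOnNewThread()
    {
        // The token is observed only by StartNew: if cancellation is requested before
        // the delegate runs, the task becomes Canceled instead of executing at all.
        Task<object> task = Task.Factory.StartNew(
            ReadValueWorker,
            _cancellationToken,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);

        // CANCELLATION: do NOT cancel this wait (the old task.Wait(_cancellationToken));
        // cancellation must come from ReadValueWorker itself so the wait always
        // observes completion.
        return task.GetAwaiter().GetResult();
    }

    private object ReadValueWorker()
    {
        // Placeholder for the real deserialization work.
        return new object();
    }
}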
Member

This looks OK to me, though I would like @stephentoub to weigh in.

Member

@stephentoub is this the appropriately idiomatic way to say "I would like to wait on this task, and cancel early if it hasn't started running, but block on it until its IsCompleted becomes true"?

@stephentoub (Member) Jun 12, 2017

This looks right to me, in that it's effectively the same as the else clause below (well, slightly closer would be if the _cancellationToken passed to StartNew were instead changed to be CancellationToken.None, but what you have looks fine). As it's now written, either the operation will be canceled before it's ever started (due to the token having cancellation requested prior to the StartNew delegate getting invoked, although because it's LongRunning, a new thread is going to be spawned and ReadValueWorker will likely be invoked very quickly as it won't be waiting in any thread pool queues), or any cancellation will be up to the ReadValueWorker implementation itself, which is what you want. The previous version would have allowed code to continue running even as ReadValueWorker is still executing (since the wait was being canceled rather than ReadValueWorker itself), and thus if what comes after this might attempt to access the same state as ReadValueWorker potentially touches, you end up with race conditions. Even if ReadValueWorker itself pays attention to / polls for cancellation, there are still potentially races, depending on the implementation.

There is one difference functionally, though: if ReadValueWorker throws an exception, Wait will wrap it in an AggregateException, whereas GetAwaiter().GetResult() will propagate it directly. If you wanted this change to simply remove the cancellation of the wait, you could just remove the whole task.Wait(_cancellationToken) line, as task.Result itself will block until the task completes and will similarly wrap the exception.
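A small standalone demo (not from the PR) of the wrapping difference described above:

using System;
using System.Threading.Tasks;

class ExceptionPropagationDemo
{
    static void Main()
    {
        Task<int> failing = Task.Run<int>(() => throw new InvalidOperationException("boom"));

        try
        {
            failing.Wait(); // accessing failing.Result behaves the same way
        }
        catch (AggregateException ex)
        {
            // Wait and Result wrap the failure in an AggregateException.
            Console.WriteLine(ex.InnerException?.Message);
        }

        try
        {
            failing.GetAwaiter().GetResult();
        }
        catch (InvalidOperationException ex)
        {
            // GetAwaiter().GetResult() rethrows the original exception directly.
            Console.WriteLine(ex.Message);
        }
    }
}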

Member

Thanks, Stephen. Sam's code looks to be exactly what we want then (i.e. throw the exception directly, not an aggregate exception). Preemptively cancel, but only have it respected on the next potential thread launch. Thanks!

@CyrusNajmabadi (Member) left a comment

I'm good if Stephen is good.

@CyrusNajmabadi (Member)

Hrmm... the failure of all tests is concerning though. What's up with that?

@CyrusNajmabadi (Member)

Also, we have this thread spawning in two other places (in ObjectWriter). We probably want to have this fix applied there as well, right?

@sharwell (Contributor Author)

@MattGertz this issue fixes a possible access violation (memory corruption) when editing files with deeply nested structures (but not so deep that such files don't exist; both @Pilchie and I have hit this in practice).

@MattGertz (Contributor)

@sharwell Is there any telemetry we can add to help pin down the repro steps, which will help us evaluate if there might be other effects lurking behind this one?

@Pilchie (Member) commented Jun 13, 2017

Just hit this again in my service hub process :-/

@sharwell (Contributor Author)

@MattGertz We talked about this and we're not quite sure how to reproduce it specifically. We could probably trigger a deterministic failure by wrapping the stream passed to the ObjectReader and forcing it to trigger a cancellation when it starts getting read requests from a different thread than the one it started on (see the sketch below), but that could still be quite a bit of effort. My hope is that the "obnoxious comment" provides a sufficient deterrent against regressions in the future.

Between the two of us, @Pilchie and I have seen this specific crash a total of three times over the past several weeks.
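For illustration, a hedged sketch of the wrapping-stream idea above; the class and member names are hypothetical and no such test exists in the repo.

using System;
using System.IO;
using System.Threading;

// Hypothetical wrapper: remembers the thread that created it and requests cancellation
// the first time a read arrives from a different thread, mimicking the suspected repro.
internal sealed class CrossThreadCancellingStream : Stream
{
    private readonly Stream _inner;
    private readonly CancellationTokenSource _cts;
    private readonly int _creationThreadId = Environment.CurrentManagedThreadId;

    public CrossThreadCancellingStream(Stream inner, CancellationTokenSource cts)
    {
        _inner = inner;
        _cts = cts;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (Environment.CurrentManagedThreadId != _creationThreadId)
        {
            // Simulate the customer scenario: cancellation requested while another
            // thread is in the middle of deserializing from this stream.
            _cts.Cancel();
        }

        return _inner.Read(buffer, offset, count);
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => false;
    public override long Length => _inner.Length;
    public override long Position
    {
        get => _inner.Position;
        set => _inner.Position = value;
    }
    public override void Flush() => _inner.Flush();
    public override long Seek(long offset, SeekOrigin origin) => _inner.Seek(offset, origin);
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}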

@dpoeschl (Contributor) commented Jun 16, 2017

retest windows_debug_vs-integration_prtest please
Some issues fixed by #20260

@dpoeschl (Contributor) commented Jun 16, 2017

retest windows_release_vs-integration_prtest please
Some issues fixed by #20260

@sharwell sharwell merged commit bcfaa14 into dotnet:master Jun 16, 2017
@sharwell sharwell deleted the wait-for-read branch June 16, 2017 11:24
@sharwell sharwell added this to the 15.3 milestone Jun 27, 2017


Development

Successfully merging this pull request may close these issues.

Dogfooding crash in memory mapped files
Access violation in RoslynCodeAnalysisService32 reading from memory mapped file

7 participants