Do not allow ReadValue to return while a stream is still being read #20171
Conversation
Can we explicitly mention that waiting with cancellation would be a problem if the underlying task doesn't support cancellation?
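For illustration, a minimal standalone sketch (not from the PR) of the hazard being described: cancelling the wait on a task that does not itself observe the token leaves the worker running after the caller has moved on.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AbandonedWaitDemo
{
    static void Main()
    {
        // Token fires after ~100ms; the worker ignores it entirely.
        using var cts = new CancellationTokenSource(millisecondsDelay: 100);
        Task worker = Task.Run(() => Thread.Sleep(1000));

        try
        {
            // Only the *wait* is cancellable here, not the worker itself.
            worker.Wait(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // We reach this point while the worker is still running and possibly
            // still touching shared state; this is the race the PR eliminates.
            Console.WriteLine($"Wait cancelled; worker completed: {worker.IsCompleted}");
        }
    }
}
```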
➡️ Updated the comment
Thanks for the comment.
// CANCELLATION: If cancellation is required, DO NOT attempt to cancel the operation by cancelling this
// wait. Cancellation must only be implemented by modifying 'task' to cancel itself in a timely manner
// so the wait can complete.
value = task.GetAwaiter().GetResult();
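For context, a hedged standalone reconstruction of the surrounding pattern. The names ReadValueWorker and _cancellationToken come from the conversation below; everything else is assumed for illustration and is not the actual Roslyn source.

```csharp
using System.Threading;
using System.Threading.Tasks;

sealed class ReaderSketch
{
    private readonly CancellationToken _cancellationToken;

    public ReaderSketch(CancellationToken cancellationToken)
        => _cancellationToken = cancellationToken;

    // Stands in for the deeply recursive reader discussed in the PR.
    private object ReadValueWorker() => new object();

    public object ReadValue()
    {
        // LongRunning spawns a dedicated thread so the recursive worker gets a fresh
        // stack; the token can only cancel the task before the delegate starts running.
        Task<object> task = Task.Factory.StartNew(
            () => ReadValueWorker(),
            _cancellationToken,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);

        // Do not cancel this wait; block until the worker has fully completed so no
        // caller code runs concurrently with it.
        return task.GetAwaiter().GetResult();
    }
}
```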
This looks OK to me, though I would like @stephentoub to weigh in.
@stephentoub is this the appropriately idiomatic way to say "I would like to wait on this task, and cancel early if it hasn't started running, but block on it until its IsCompleted becomes true"?
This looks right to me, in that it's effectively the same as the else clause below (well, slightly closer would be if the _cancellationToken passed to StartNew were instead changed to CancellationToken.None, but what you have looks fine).

As it's now written, either the operation will be canceled before it's ever started (due to the token having cancellation requested prior to the StartNew delegate getting invoked, although because it's LongRunning a new thread is going to be spawned, and ReadValueWorker will likely be invoked very quickly since it won't be waiting in any thread pool queues), or any cancellation will be up to the ReadValueWorker implementation itself, which is what you want. The previous version would have allowed code to continue running even while ReadValueWorker was still executing (since the wait was being canceled rather than ReadValueWorker itself), so if whatever comes after this attempts to access the same state ReadValueWorker potentially touches, you end up with race conditions. Even if ReadValueWorker itself pays attention to / polls for cancellation, there are still potential races, depending on the implementation.

There is one functional difference, though: if ReadValueWorker throws an exception, Wait will wrap it in an AggregateException, whereas GetAwaiter().GetResult() will propagate it directly. If you wanted this change to simply remove the cancellation of the wait, you could just remove the whole task.Wait(_cancellationToken) line, as task.Result will itself block until the task completes and will similarly wrap the exception.
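A small standalone demo (illustrative, not from the PR) of that functional difference:

```csharp
using System;
using System.Threading.Tasks;

class ExceptionPropagationDemo
{
    static void Main()
    {
        // A worker that always fails.
        Task<int> task = Task.Run<int>(() => throw new InvalidOperationException("worker failed"));

        try
        {
            task.Wait(); // task.Result behaves the same way
        }
        catch (AggregateException ex)
        {
            Console.WriteLine($"Wait wrapped it: {ex.InnerException?.GetType().Name}");
        }

        try
        {
            _ = task.GetAwaiter().GetResult();
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"GetResult rethrew it directly: {ex.Message}");
        }
    }
}
```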
Thanks, Stephen. Sam's code looks to be exactly what we want then (i.e. throw the exception directly, not an aggregate exception): preemptively cancel, but only have it respected at the next potential thread launch. Thanks!
CyrusNajmabadi left a comment
I'm good if Stephen is good.
Hrmm... the failure of all tests is concerning though. What's up with that?
Also, we have this thread spawning in two other places (in ObjectWriter). We probably want to have this fix applied there as well, right?
@MattGertz this issue fixes a possible access violation (memory corruption) when editing files with deeply nested structures (not so deep as to be unrealistic; both @Pilchie and I have hit this in practice).

@sharwell Is there any telemetry we can add to help pin down the repro steps, which would help us evaluate whether there might be other effects lurking behind this one?

Just hit this again in my service hub process :-/

@MattGertz We talked about this and we're not quite sure how to reproduce it specifically. We could probably trigger a deterministic failure by wrapping the stream passed to the ObjectReader and forcing it to trigger a cancellation when it starts getting read requests from a different thread than it started on, but it could still be quite a bit of effort. My hope is the "obnoxious comment" provides a sufficient deterrent for regressions in the future. Between @Pilchie and me, we've seen this specific crash a total of 3 times now over the past several weeks.
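A hedged sketch of that deterministic-repro idea: a hypothetical Stream wrapper (not part of this PR) that fails fast the moment a read arrives on a thread other than the one that created it, instead of waiting for a real access violation.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical test helper: wrap the stream handed to ObjectReader with this to turn
// the cross-thread read race into a deterministic, immediately visible failure.
sealed class ThreadAffinityCheckingStream : Stream
{
    private readonly Stream _inner;
    private readonly int _ownerThreadId = Thread.CurrentThread.ManagedThreadId;

    public ThreadAffinityCheckingStream(Stream inner) => _inner = inner;

    public override int Read(byte[] buffer, int offset, int count)
    {
        // The moment the reader's worker thread touches the stream, fail loudly
        // rather than silently racing with the original thread.
        if (Thread.CurrentThread.ManagedThreadId != _ownerThreadId)
            throw new OperationCanceledException("Read from unexpected thread");
        return _inner.Read(buffer, offset, count);
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => false;
    public override long Length => _inner.Length;
    public override long Position
    {
        get => _inner.Position;
        set => _inner.Position = value;
    }
    public override void Flush() => _inner.Flush();
    public override long Seek(long offset, SeekOrigin origin) => _inner.Seek(offset, origin);
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```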
retest windows_debug_vs-integration_prtest please

retest windows_release_vs-integration_prtest please
Fixes #19761
Customer scenario
The specific steps to encounter this issue are not completely clear. It appears to involve working in a file which has a deeply nested structure. When the bug arises, Visual Studio may read from a pointer which has been freed, resulting in corruption of internal data structures or an access violation.
Bugs this fixes:
Fixes #19761
Fixes #20170
Workarounds, if any
None.
Risk
The change made here is theoretically capable of producing either a deadlock or a performance regression. Neither of these seems likely from the particular change, and the situations we are avoiding are certainly worse (stack overflow or heap corruption). @CyrusNajmabadi Can you provide an evaluation of the risk for this change?
Performance impact
This change disables a cancellation point in the code, which could in some cases cause a serialization performance regression. However, this cancellation point was implemented as a recursion guard rather than a true cancellation point, so it is unlikely we were relying on it for performance.
Is this a regression from a previous update?
Yes, this bug was introduced in #16798 (specifically 6427a2b) for release 15.1.
Root cause analysis:
The code accessing unmanaged memory is extremely performance sensitive, so most checks capable of detecting this concurrency bug early have too much overhead for deployment.
How was the bug found?
Internal customer report.