AggregateTrainingStopManager is trying to cancel disposed tokens #6416

Closed
ericstj opened this issue Oct 28, 2022 · 1 comment
ericstj commented Oct 28, 2022

This failure is occurring in multiple PRs during the AutoMLExperiment_throw_timeout_exception_when_ct_is_canceled_and_no_trial_completed_Async test:

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6415-merge-bd52c1f30a6d4e1990/Microsoft.ML.AutoML.Tests/1/console.cd8a8fcf.log?helixlogtype=result

#6412 (comment)

Starting test: Microsoft.ML.AutoML.Test.AutoMLExperimentTests.AutoMLExperiment_throw_timeout_exception_when_ct_is_canceled_and_no_trial_completed_Async
Unhandled exception: System.AggregateException: One or more errors occurred. (The CancellationTokenSource has been disposed.)
 ---> System.ObjectDisposedException: The CancellationTokenSource has been disposed.
   at System.Threading.CancellationTokenSource.Cancel()
   at Microsoft.ML.AutoML.AutoMLExperiment.<>c__DisplayClass26_1.<RunAsync>g__handler|4(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:line 270
   at Microsoft.ML.AutoML.AggregateTrainingStopManager.<.ctor>b__4_0(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 129
   at Microsoft.ML.AutoML.TimeoutTrainingStopManager.<.ctor>b__5_0(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 72
   at Microsoft.ML.AutoML.CancellationTokenStopTrainingManager.<.ctor>b__5_0() in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 38
   at System.Threading.CancellationToken.<>c.<Register>b__12_0(Object obj)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
Unhandled exception. System.AggregateException: One or more errors occurred. (The CancellationTokenSource has been disposed.)
 ---> System.ObjectDisposedException: The CancellationTokenSource has been disposed.
   at System.Threading.CancellationTokenSource.Cancel()
   at Microsoft.ML.AutoML.AutoMLExperiment.<>c__DisplayClass26_1.<RunAsync>g__handler|4(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:line 270
   at Microsoft.ML.AutoML.AggregateTrainingStopManager.<.ctor>b__4_0(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 129
   at Microsoft.ML.AutoML.TimeoutTrainingStopManager.<.ctor>b__5_0(Object o, EventArgs e) in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 72
   at Microsoft.ML.AutoML.CancellationTokenStopTrainingManager.<.ctor>b__5_0() in /__w/1/s/src/Microsoft.ML.AutoML/AutoMLExperiment/IStopTrainingManager.cs:line 38
   at System.Threading.CancellationToken.<>c.<Register>b__12_0(Object obj)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
Aborted (core dumped)

Bug appears to be here:

// only force-canceling running trials when there's completed trials.
// otherwise, wait for the current running trial to be completed.
if (_bestTrialResult != null)
    trialCancellationTokenSource.Cancel();

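I see that the handler is detached later in a finally block. Perhaps there is a race condition: the stop event could fire after trialCancellationTokenSource has been disposed but before the handler is detached?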
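
To make the suspected race concrete, here is a minimal sketch. It is not the ML.NET code: `DemoStopManager`, `DisposedTokenRaceDemo`, and both method names are hypothetical stand-ins, and the guard shown (a lock plus a disposed flag) is only one possible mitigation, not necessarily the fix the maintainers would choose. The first method shows the failure shape from the stack trace above, where a stop event reaches a handler whose CancellationTokenSource is already disposed. The second detaches the handler first and guards `Cancel()` so that a late or in-flight invocation becomes a no-op.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a stop manager that raises a "stop training" event.
public sealed class DemoStopManager
{
    // Initialized with an empty delegate so the event can always be invoked.
    public event EventHandler OnStopTraining = delegate { };

    public void RaiseStop() => OnStopTraining.Invoke(this, EventArgs.Empty);
}

public static class DisposedTokenRaceDemo
{
    // Failure shape: the handler is still attached when the source is disposed,
    // so a late stop event calls Cancel() on a disposed CancellationTokenSource.
    public static bool UnsafeHandlerThrows()
    {
        var manager = new DemoStopManager();
        var cts = new CancellationTokenSource();
        manager.OnStopTraining += (o, e) => cts.Cancel();

        cts.Dispose();
        try
        {
            manager.RaiseStop();
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }

    // One possible guard: detach first, then dispose under a lock that the
    // handler also takes, so Cancel() can never run after Dispose().
    public static bool GuardedHandlerIsSafe()
    {
        var manager = new DemoStopManager();
        var cts = new CancellationTokenSource();
        var gate = new object();
        var disposed = false;

        EventHandler handler = (o, e) =>
        {
            lock (gate)
            {
                if (!disposed)
                {
                    cts.Cancel();
                }
            }
        };

        manager.OnStopTraining += handler;
        try
        {
            // A trial would run here, observing cts.Token.
        }
        finally
        {
            manager.OnStopTraining -= handler;
            lock (gate)
            {
                disposed = true;
                cts.Dispose();
            }
        }

        // Simulate an invocation that had already captured the handler
        // before it was detached; with the guard this is a no-op.
        handler(manager, EventArgs.Empty);
        manager.RaiseStop();
        return true;
    }
}
```

Calling `DisposedTokenRaceDemo.UnsafeHandlerThrows()` should observe the same ObjectDisposedException as in the log, while `DisposedTokenRaceDemo.GuardedHandlerIsSafe()` should complete without throwing. Note that in this sketch any callbacks registered on the token run while the lock is held, so a real fix would need to make sure those callbacks never try to take the same lock.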

@ghost ghost added the untriaged New issue has not been triaged label Oct 28, 2022
@LittleLittleCloud LittleLittleCloud self-assigned this Oct 31, 2022
@michaelgsharp michaelgsharp added this to the ML.NET 3.0 milestone Nov 28, 2022
@ghost ghost removed the untriaged New issue has not been triaged label Nov 28, 2022

ericstj commented Mar 28, 2023

Just noticed this was hit again here: #6607

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6607-merge-91a906864d64404587/Microsoft.ML.AutoML.Tests/1/console.bc7a67f7.log?helixlogtype=result

6.00
Starting test: Microsoft.ML.AutoML.Test.AutoMLExperimentTests.AutoMLExperiment_cancel_trial_when_exceeds_memory_limit_Async
Unhandled exception: System.AggregateException: One or more errors occurred. (The CancellationTokenSource has been disposed.)
 ---> System.ObjectDisposedException: The CancellationTokenSource has been disposed.
   at System.Threading.CancellationTokenSource.Cancel()
   at Microsoft.ML.AutoML.AutoMLExperiment.<>c__DisplayClass24_1.<RunAsync>g__handler|4(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\AutoMLExperiment.cs:line 247
   at Microsoft.ML.AutoML.AggregateTrainingStopManager.<.ctor>b__4_0(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 129
   at Microsoft.ML.AutoML.TimeoutTrainingStopManager.<.ctor>b__5_0(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 72
   at Microsoft.ML.AutoML.CancellationTokenStopTrainingManager.<.ctor>b__5_0() in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 38
   at System.Threading.CancellationToken.<>c.<Register>b__12_0(Object obj)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)
Unhandled exception. System.AggregateException: One or more errors occurred. (The CancellationTokenSource has been disposed.)
 ---> System.ObjectDisposedException: The CancellationTokenSource has been disposed.
   at System.Threading.CancellationTokenSource.Cancel()
   at Microsoft.ML.AutoML.AutoMLExperiment.<>c__DisplayClass24_1.<RunAsync>g__handler|4(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\AutoMLExperiment.cs:line 247
   at Microsoft.ML.AutoML.AggregateTrainingStopManager.<.ctor>b__4_0(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 129
   at Microsoft.ML.AutoML.TimeoutTrainingStopManager.<.ctor>b__5_0(Object o, EventArgs e) in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 72
   at Microsoft.ML.AutoML.CancellationTokenStopTrainingManager.<.ctor>b__5_0() in D:\a\_work\1\s\src\Microsoft.ML.AutoML\AutoMLExperiment\IStopTrainingManager.cs:line 38
   at System.Threading.CancellationToken.<>c.<Register>b__12_0(Object obj)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)
Finished test: Microsoft.ML.AutoML.Test.AutoMLExperimentTests.AutoMLExperiment_cancel_trial_when_exceeds_memory_limit_Async with memory usage 199,766,016.00

Here's a copy in case it goes away:
console.bc7a67f7.txt

@LittleLittleCloud LittleLittleCloud mentioned this issue Mar 29, 2023
4 tasks
@ghost ghost locked as resolved and limited conversation to collaborators May 1, 2023