Add continuous resource monitoring to AutoML.IMonitor #6520

Conversation

andrasfuchs
Contributor

Partially fixes #6320, #6426, and #6425, and helps investigate further problems with AutoML trials.

This PR lets the user cancel trials based on various performance metrics. It changed my experience with AutoML experiments significantly, because I regularly had crashes and failed trials when I tried to run experiments for a long time. With this modification I could implement my own IMonitor and react to changes in memory demand, virtual memory usage, and remaining disk space, and I could skip a trial that was running unexpectedly long without terminating the whole experiment.

Before the modifications in this PR my experiments usually stopped with an error within a few hours, but since these changes give me much more control over the experiment, I could run much longer experiments without any issues.

On the technical level, I moved the performance-related properties of the TrialSettings class into a separate TrialPerformanceMetrics subclass, added a timer to sample those CPU and memory metrics, and added a new ReportTrialResourceUsage event to the IMonitor class that is called periodically during the trial run. I also added the trial's CancellationTokenSource to TrialSettings so that the user can skip a trial if they wish.

You can also check my custom IMonitor implementation where the resource monitoring and cancellation logic is demonstrated.
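
Roughly, the idea is something like the following sketch (illustrative only; the member names PerformanceMetrics, StartedAtUtc, and the exact callback signature are assumptions based on the description above, not necessarily the final API):

// Illustrative sketch: cancel only the current trial when it gets too slow or too hungry.
// PerformanceMetrics, StartedAtUtc and CancellationTokenSource are assumed member names.
void OnTrialResourceUsage(TrialSettings settings)
{
    TimeSpan runtime = DateTime.UtcNow - settings.StartedAtUtc;
    double peakMemoryInMb = settings.PerformanceMetrics.PeakMemoryUsageInMegaByte;

    // Skipping a trial here does not terminate the experiment; the next trial starts normally.
    if (runtime > TimeSpan.FromMinutes(30) || peakMemoryInMb > 8_000)
    {
        settings.CancellationTokenSource.Cancel();
    }
}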

@andrasfuchs andrasfuchs changed the title Add cancellation and resource monitoring to imonitor Add continuous resource monitoring to AutoML.IMonitor Dec 6, 2022
@codecov

codecov bot commented Dec 6, 2022

Codecov Report

Merging #6520 (509f963) into main (8c0ceaf) will increase coverage by 0.03%.
The diff coverage is 46.15%.

❗ Current head 509f963 differs from pull request most recent head 5a27af4. Consider uploading reports for the commit 5a27af4 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6520      +/-   ##
==========================================
+ Coverage   68.40%   68.43%   +0.03%     
==========================================
  Files        1174     1174              
  Lines      248045   248047       +2     
  Branches    25909    25909              
==========================================
+ Hits       169670   169756      +86     
+ Misses      71604    71538      -66     
+ Partials     6771     6753      -18     
Flag Coverage Δ
Debug 68.43% <46.15%> (+0.03%) ⬆️
production 62.83% <ø> (+0.05%) ⬆️
test 88.88% <46.15%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/Microsoft.ML.SearchSpace/Parameter.cs 74.05% <ø> (ø)
...Microsoft.ML.AutoML.Tests/AutoMLExperimentTests.cs 89.34% <30.00%> (-7.19%) ⬇️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs 81.64% <100.00%> (ø)
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0.00%> (-20.52%) ⬇️
src/Microsoft.ML.Core/Data/ProgressReporter.cs 70.95% <0.00%> (-6.99%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 25.38% <0.00%> (-1.53%) ⬇️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 86.38% <0.00%> (-0.15%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.77% <0.00%> (+0.15%) ⬆️
src/Microsoft.ML.Data/Utils/LossFunctions.cs 67.35% <0.00%> (+0.51%) ⬆️
... and 5 more

@JakeRadMSFT
Contributor

@LittleLittleCloud can you help review this?

Thanks @andrasfuchs for contributing!

@LittleLittleCloud
Contributor

LittleLittleCloud commented Dec 21, 2022

@andrasfuchs I just came back from vacation; sorry for the late reply.

@JakeRadMSFT
Contributor

@LittleLittleCloud would we want to ask @andrasfuchs to bring over any of his resource-based early-quitting code?

@andrasfuchs
Contributor Author

@JakeRadMSFT
I have these 3 checks in place in my IMonitor class (a rough sketch follows below):
a) running time (for this I think you would need the trial start time, which I have just removed)
b) memory usage (you already have this with AutoMLExperiment.SetMaximumMemoryUsageInMegaByte)
c) virtual memory availability (this is Windows-only, at least until the Core team implements a more general approach)
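
Roughly, the three checks look like this (a hedged sketch; the thresholds are examples, and VirtualMemorySize64 is only a stand-in for the Windows-only available-virtual-memory query I actually use):

using System;
using System.Diagnostics;

static class TrialGuards
{
    public static bool ShouldSkipTrial(DateTime trialStartedUtc)
    {
        var process = Process.GetCurrentProcess();

        // a) running time
        bool tooSlow = DateTime.UtcNow - trialStartedUtc > TimeSpan.FromMinutes(30);

        // b) memory usage (the built-in alternative is AutoMLExperiment.SetMaximumMemoryUsageInMegaByte)
        bool tooMuchMemory = process.WorkingSet64 > 8L * 1024 * 1024 * 1024;                // 8 GB

        // c) virtual memory; a rough proxy for the Windows-only available-virtual-memory check
        bool tooMuchVirtualMemory = process.VirtualMemorySize64 > 32L * 1024 * 1024 * 1024; // 32 GB

        return tooSlow || tooMuchMemory || tooMuchVirtualMemory;
    }
}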

LittleLittleCloud and others added 3 commits January 3, 2023 11:58
…on-and-resource-monitoring-to-imonitor

move ReportTrialResourceUsage to IPerformanceMonitor and rename it to OnPerformanceMetricsUpdatedHandler
@andrasfuchs
Contributor Author

Thank you for the PR; I've already modified my own code to test it.

It looks good, I just have a few remarks:

  • I inherited my own IPerformanceMonitor implementation from DefaultPerformanceMonitor, because I wanted to keep the monitoring methods. It took me a while to work out how to get its constructor parameters, and I ended up copy-pasting code.
  • Having IMonitor and IPerformanceMonitor as separate classes isn't an issue, but since we no longer have the trial's starting time in TrialSettings, I had to measure the trial's runtime based on its first performance-metrics-update event handler call, which isn't precise, because that event isn't guaranteed to fire right when a new trial starts (a rough sketch of this workaround follows the list). It's probably alright; I don't think it has any practical drawbacks even if the update period is long.
  • The performance-metrics-updated event handler is called even after the TrialCompleted event and before the next trial's TrialRunning event. It might cause some strange states if the IPerformanceMonitor tries to cancel the trial during this period (due to a timeout, for example). This is more obvious if the update period is small.
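
As a sketch, the runtime workaround from the second point looks roughly like this (assuming the handler can get a per-trial id from TrialSettings):

using System;
using System.Collections.Concurrent;

// Approximates a trial's runtime from the first metrics-update call it receives,
// since TrialSettings no longer carries the trial's start time.
class TrialRuntimeTracker
{
    private readonly ConcurrentDictionary<int, DateTime> _firstUpdateUtc = new();

    // Call this from the performance-metrics-updated handler.
    public TimeSpan GetApproximateRuntime(int trialId)
    {
        DateTime start = _firstUpdateUtc.GetOrAdd(trialId, _ => DateTime.UtcNow);
        return DateTime.UtcNow - start;
    }
}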

@LittleLittleCloud
Contributor

Good to hear that the new change in IPerformanceMonitor works for you. Let me know if you run into any other issues, whether or not they're related to this one.

The performance-metrics-updated event handler is called even after the TrialCompleted event and before the next trial's TrialRunning event. It might cause some strange states if the IPerformanceMonitor tries to cancel the trial during this period (due to a timeout, for example). This is more obvious if the update period is small.

That's a good point, and it's a side effect of moving ReportTrialResourceUsage from IMonitor to IPerformanceMonitor. As a workaround, you might want to check whether taskCancellationSource is still cancellable before cancelling a trial. It would also be a good idea to stop firing events from IPerformanceMonitor once the training (pipeline.fit) has stopped.
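
In code, the suggested guard is roughly this (sketch only; the CancellationTokenSource property name follows this PR's description):

using System;
using Microsoft.ML.AutoML;

static class TrialCancellation
{
    // Only cancel when the trial's token source is still usable; a finished trial may already
    // have had its token source cancelled or disposed.
    public static void TryCancelTrial(TrialSettings settings)
    {
        var cts = settings.CancellationTokenSource;
        if (cts is { IsCancellationRequested: false })
        {
            try { cts.Cancel(); }
            catch (ObjectDisposedException) { /* trial already completed and cleaned up */ }
        }
    }
}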

I inherited my own IPerformanceMonitor implementation from DefaultPerformanceMonitor, because I wanted to keep the monitoring methods. It took me a while to work out how to get its constructor parameters, and I ended up copy-pasting code.

That's one way; you can also just list the parameters in the constructor and rely on dependency injection, as long as all of them are available in AutoMLExperiment.ServiceCollection.
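
For instance (sketch only; the constructor parameters below are placeholders, not DefaultPerformanceMonitor's actual signature):

// Sketch: declare whatever DefaultPerformanceMonitor needs as constructor parameters and let
// dependency injection resolve them from AutoMLExperiment.ServiceCollection, instead of
// constructing them by hand. IChannel and checkIntervalInMilliseconds are placeholder parameters.
public class MyPerformanceMonitor : DefaultPerformanceMonitor
{
    public MyPerformanceMonitor(IChannel logger, int checkIntervalInMilliseconds)
        : base(logger, checkIntervalInMilliseconds)
    {
    }

    // Override or extend the monitoring methods here instead of copy-pasting them.
}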

@andrasfuchs
Contributor Author

I added a small change to pause the performance monitor when the trial is completed, and I also trigger the PerformanceMetricsUpdated event just after the trial has started. These changes should solve the two issues I mentioned above.

I have a few more notes:
(1) I think that setting a custom IPerformanceMonitor could be simpler from the user's perspective: could we use a syntax similar to the one we have for IMonitor with AutoMLExperiment.SetMonitor(IMonitor)?

(2) The other thing I think would be useful in the TrialSettings object is a string array with the estimators of the pipeline. It's important information, but at the moment we can only get it from _pipeline.ToString(trialSettings.Parameter) in the IMonitor and IPerformanceMonitor event handlers. Since usually only the last estimator changes, I still need to do some string-splitting to get the useful, last one, which isn't ideal. Would you be open to adding a string[] Estimators to TrialSettings?

(3) I know that you preferred removing the StartedAtUtc property from TrialSettings before, but I still think it would be practical to include it there, because checking a trial's runtime in a performance monitor is a typical need, and it would make that a lot simpler for the end user. What do you think?

(4) The CPU measurement seems to be off a little; sometimes it goes above 100%:
[screenshot: CPU usage readings above 100%]

@LittleLittleCloud
Contributor

LittleLittleCloud commented Jan 6, 2023

Thanks for the new changes you made; they're really awesome.

(1) I think that setting a custom IPerformanceMonitor could be simpler from the user's perspective: could we use a syntax similar to the one we have for IMonitor with AutoMLExperiment.SetMonitor(IMonitor)?

IPerformanceMonitor has a different lifespan than IMonitor: IPerformanceMonitor is transient, and AutoMLExperiment will create a new instance of it every time it starts a new trial, while a single IMonitor lives for the entire training process. Because of that, we can't just pass in an instance of IPerformanceMonitor the way we do for IMonitor.
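
As a sketch (assuming the experiment's ServiceCollection is accessible, as mentioned earlier, and MyPerformanceMonitor is a hypothetical custom implementation):

using Microsoft.Extensions.DependencyInjection;

// Because IPerformanceMonitor is transient, register a type rather than an instance so the
// experiment can create a fresh monitor for every trial.
experiment.ServiceCollection.AddTransient<IPerformanceMonitor, MyPerformanceMonitor>();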

(2) The other thing I think would be useful in the TrialSettings object is a string array with the estimators of the pipeline. It's important information, but at the moment we can only get it from _pipeline.ToString(trialSettings.Parameter) in the IMonitor and IPerformanceMonitor event handlers. Since usually only the last estimator changes, I still need to do some string-splitting to get the useful, last one, which isn't ideal. Would you be open to adding a string[] Estimators to TrialSettings?

Would it be helpful if we provided a helper function for SweepablePipeline that extracts the list of Estimators from Parameter instead? I'm still hesitant to add a string[] Estimators to the TrialSettings class, since it would be duplicated information.

(3) I know that you preferred removing the StartedAtUtc property from TrialSettings before, but I still think it would be practical to include it there, because checking a trial's runtime in a performance monitor is a typical need, and it would make that a lot simpler for the end user. What do you think?

Sure, and maybe add an EndAtUtc as well

(4) The CPU measurement seems to be off a little; sometimes it goes above 100%:

Might this be a data race in SampleCpuAndMemoryUsage? Maybe we need to add a lock to that handler.

@andrasfuchs
Contributor Author

(1) I didn't know that, makes sense, thank you for the explanation!

(2) A helper function to get the estimator names from the Parameter would be perfect.

(3) I added StartedAtUtc and EndedAtUtc to TrialSettings.

(4) I tried to encapsulate the whole SampleCpuAndMemoryUsage() method in a lock {}, but it didn't solve the problem.
[screenshot: CPU usage still going above 100% after adding the lock]

(5) The cancellation logic stopped working somewhere along the way:
[screenshot: trial continuing to run after cancellation]
I call the trialCancellationTokenSource.Cancel(), but the trial keeps running. Do you know why?

@LittleLittleCloud
Contributor

(4) I tried to encapsulate the whole SampleCpuAndMemoryUsage() method in a lock {}, but it didn't solve the problem.

The CPU usage is calculated using the following formula:

cpuUsage = (Process.TotalProcessorTime - _totalCpuProcessorTime) / (Environment.ProcessorCount * _checkIntervalInMilliseconds)

where _totalCpuProcessorTime is the most recently sampled Process.TotalProcessorTime.

If cpuUsage is over 100%, one possibility is that (Environment.ProcessorCount * _checkIntervalInMilliseconds) is smaller than the actual sampling interval, which might be caused by waiting to acquire _lock?

To verify that hypothesis, maybe the formula needs to be updated to

cpuUsage = (Process.TotalProcessorTime - _totalCpuProcessorTime) / (Environment.ProcessorCount * (DateTime.UtcNow - _lastSamplingUtcTime))
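
As a sketch, the corrected sampling could look like this (field names follow the ones used in this thread; dividing by the actually elapsed wall-clock time keeps the result from exceeding 100% when a timer tick is delayed):

using System;
using System.Diagnostics;

class CpuUsageSampler
{
    private DateTime _lastSamplingUtcTime = DateTime.UtcNow;
    private TimeSpan _totalCpuProcessorTime = Process.GetCurrentProcess().TotalProcessorTime;

    // Returns the fraction of total machine CPU capacity used since the last sample, in [0, 1].
    public double SampleCpuUsage()
    {
        var process = Process.GetCurrentProcess();
        DateTime nowUtc = DateTime.UtcNow;
        TimeSpan cpuTime = process.TotalProcessorTime;

        double elapsedMs = (nowUtc - _lastSamplingUtcTime).TotalMilliseconds;
        double cpuUsedMs = (cpuTime - _totalCpuProcessorTime).TotalMilliseconds;

        _lastSamplingUtcTime = nowUtc;
        _totalCpuProcessorTime = cpuTime;

        return elapsedMs <= 0 ? 0 : cpuUsedMs / (Environment.ProcessorCount * elapsedMs);
    }
}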

@LittleLittleCloud LittleLittleCloud merged commit d239fda into dotnet:main Feb 9, 2023
@JakeRadMSFT
Contributor

@LittleLittleCloud @andrasfuchs thanks for getting this in!

@andrasfuchs
Contributor Author

Yes, thank you @LittleLittleCloud for finishing this up and thank you @JakeRadMSFT and @luisquintanilla for keeping an eye on this!

Successfully merging this pull request may close these issues.

Add resource (CPU,RAM,GPU,thread count) monitoring to AutoML experiments