Add continuous resource monitoring to AutoML.IMonitor #6520
…d ReportTrialResourceUsage event to IMonitor
This reverts commit 269b1bd.
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6520      +/-   ##
==========================================
+ Coverage   68.40%   68.43%   +0.03%
==========================================
  Files        1174     1174
  Lines      248045   248047       +2
  Branches    25909    25909
==========================================
+ Hits       169670   169756      +86
+ Misses      71604    71538      -66
+ Partials     6771     6753      -18
|
@LittleLittleCloud can you help review this? Thanks @andrasfuchs for contributing! |
@andrasfuchs I just came back from vacation, sorry for the late reply. |
@LittleLittleCloud would we want to ask @andrasfuchs to bring over any of his resource early quitting code? |
@JakeRadMSFT |
…on-and-resource-monitoring-to-imonitor
move ReportTrialResourceUsage to IPerformanceMonitor and rename it to OnPerformanceMetricsUpdatedHandler
Thank you for the PR, I already modified my own code to test it. It looks good, I just have a few remarks:
|
Good to hear that the new change in
That's a good point. And it's the side effect of moving
That's one way. You can also just list the parameters in the constructor and rely on dependency injection if all parameters are available in |
I added a small change to pause the performance monitor when the trial is completed and I also trigger the
I have a few more notes:
(2) The other thing I think would be useful in the
(3) I know that you preferred the removal of
(4) The CPU measurement seems to be off a little, sometimes it goes above 100%: |
Thanks for the new change that you made, it's really awesome.
Would it be helpful if we provide a helper function for
Sure, and maybe add an
There might be a data race in |
(1) I didn't know that, makes sense, thank you for the explanation!
(2) A helper function to get the estimator names from the
(3) I added
(4) I tried to encapsulate the whole SampleCpuAndMemoryUsage() method in a lock {}, but it didn't solve the problem.
(5) The cancellation logic stopped working somewhere along the way: |
The CPU usage is calculated using the following formula:
where if To verify that hypothesis, maybe the formula needs to be updated to
|
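The formula referred to in the comment above did not survive on this page, so the following is only a general illustration and not the code from this PR. A common way to express process CPU usage is the CPU time consumed during a sampling interval divided by the wall-clock time elapsed, normalized by the number of logical cores. Readings above 100% typically show up when the normalization is missing or when the two clocks are read at slightly different moments. The sketch is in Python for brevity; every name in it is hypothetical, and clamping the result is just one possible mitigation.

```python
import os
import time


def cpu_usage_percent(cpu_seconds_delta: float,
                      wall_seconds_delta: float,
                      core_count: int) -> float:
    """Share of total CPU capacity used over one sampling interval, in percent.

    Hypothetical helper: this is NOT the formula used in the PR.
    """
    if wall_seconds_delta <= 0 or core_count <= 0:
        return 0.0
    raw = cpu_seconds_delta / (wall_seconds_delta * core_count) * 100.0
    # Clamp the result: timer jitter can make the CPU-time delta slightly
    # larger than the wall-clock window it is compared against.
    return max(0.0, min(raw, 100.0))


def sample_cpu_usage(interval_seconds: float = 0.1) -> float:
    """Measure this process's CPU usage over a short interval."""
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    time.sleep(interval_seconds)
    cpu_end, wall_end = time.process_time(), time.perf_counter()
    return cpu_usage_percent(cpu_end - cpu_start,
                             wall_end - wall_start,
                             os.cpu_count() or 1)
```

For example, one second of CPU time over a one-second window on a four-core machine gives `cpu_usage_percent(1.0, 1.0, 4) == 25.0`.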
@LittleLittleCloud @andrasfuchs thanks for getting this in! |
Yes, thank you @LittleLittleCloud for finishing this up and thank you @JakeRadMSFT and @luisquintanilla for keeping an eye on this! |
Partially fixes #6320, #6426, #6425 and helps investigate further problems with AutoML trials.
This PR lets the user cancel trials based on various performance metrics. It changed my experience with AutoML experiments significantly, because I regularly had crashes and failed trials when I tried to run experiments for a long time. With this modification I could implement my own IMonitor and react to changes in memory demand, virtual memory usage and remaining disk space, and I could skip a trial that was running unexpectedly long without terminating the experiment.
Before the modifications in this PR my experiments usually stopped with an error within a few hours. Since these changes give me much more control over the experiment, I could run much longer experiments without any issues.
On the technical level I moved the performance-related properties of the TrialSettings class into a separate TrialPerformanceMetrics subclass, I added a timer to check for those CPU and memory metrics, and I added a new ReportTrialResourceUsage event to the IMonitor interface that is called periodically during the trial run. I also added the CancellationTokenSource of the trial to the TrialSettings so that the user can skip a trial if they wish.
You can also check my custom IMonitor implementation where the resource monitoring and cancellation logic is demonstrated.
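The pattern described above (a periodic sampler, a metrics callback, and a per-trial cancellation handle) can be sketched independently of ML.NET. The following Python sketch is an illustration under stated assumptions, not the PR's implementation: `TrialResourceWatcher`, `on_metrics_updated` and the metric field names are hypothetical, and a `threading.Event` stands in for the trial's CancellationTokenSource.

```python
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialPerformanceMetrics:
    # Field names are assumptions made for this sketch.
    cpu_percent: float
    memory_mb: float


class TrialResourceWatcher:
    """Polls a metrics source and requests cancellation above a memory limit."""

    def __init__(self,
                 read_metrics: Callable[[], TrialPerformanceMetrics],
                 memory_limit_mb: float,
                 interval_seconds: float = 0.01) -> None:
        self._read_metrics = read_metrics
        self._memory_limit_mb = memory_limit_mb
        self._interval = interval_seconds
        # Stand-in for the trial's cancellation token source.
        self.cancel_requested = threading.Event()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        """Stop sampling, e.g. when the trial completes. Call after start()."""
        self._stopped.set()
        self._thread.join()

    def _run(self) -> None:
        # wait() returns False on timeout, so the loop body runs once per interval.
        while not self._stopped.wait(self._interval):
            self.on_metrics_updated(self._read_metrics())
            if self.cancel_requested.is_set():
                return

    def on_metrics_updated(self, metrics: TrialPerformanceMetrics) -> None:
        # User-defined policy: skip the trial once memory exceeds the limit.
        if metrics.memory_mb > self._memory_limit_mb:
            self.cancel_requested.set()
```

A trial runner would call `start()` before training, check `cancel_requested` (or wait on it) while the trial runs, and call `stop()` when the trial finishes. Injecting `read_metrics` keeps the cancellation policy testable without real resource pressure.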