Add SetMaximumMemoryUsageInMegaByte in AutoMLExperiment #6305
Conversation
@LittleLittleCloud looks good to me. What happens when memory usage hits the specified limit? Is the entire experiment cancelled, or only the currently running trial? Does enforcing the memory limit mean that GC kicks in more frequently?
Only the current trial will be cancelled. GC is called every time a trial completes, to clean up and release memory, whether or not that trial was successful. So enforcing the memory limit won't affect how frequently GC kicks in.
Codecov Report
@@            Coverage Diff             @@
##             main    #6305      +/-   ##
==========================================
+ Coverage   68.56%   68.63%   +0.06%
==========================================
  Files        1170     1170
  Lines      247158   247310     +152
  Branches    25675    25683       +8
==========================================
+ Hits       169475   169732     +257
+ Misses      70940    70849      -91
+ Partials     6743     6729      -14
@@ -332,7 +338,7 @@ private SweepablePipeline CreateBinaryClassificationPipeline(IDataView trainData
internal class BinaryClassificationRunner : ITrialRunner
That's a great question. The IDisposable pattern is mainly because I want to make sure MLContext.CancelExecution gets called and set to null after the trial is finished, while I don't want to explicitly call its destructor or invoke GC. But I can go another route if you have any recommendation.
The Dispose pattern is not only for GC finalization or destruction. It is used to clean up any object when you are done using it. You already implemented the Dispose method; having IDisposable also enables the using () code pattern.
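To illustrate the point above, here is a minimal, self-contained sketch (the class and member names are hypothetical, not ML.NET's actual types): implementing IDisposable lets callers scope the runner's cleanup with `using`, with no finalizer and no explicit GC call involved.

```csharp
using System;

// Illustrative sketch only: a runner that clears its cancellation hook on Dispose.
internal sealed class TrialRunnerSketch : IDisposable
{
    private Action _cancelCallback;

    public bool Disposed { get; private set; }

    public TrialRunnerSketch(Action cancelCallback) => _cancelCallback = cancelCallback;

    public void Run()
    {
        // Run the trial; _cancelCallback would cancel it on a memory-limit hit.
    }

    public void Dispose()
    {
        // Deterministic cleanup: drop the cancellation hook when done.
        _cancelCallback = null;
        Disposed = true;
    }
}

internal static class TrialRunnerDemo
{
    public static void Main()
    {
        // The `using` block guarantees Dispose runs, even if Run throws.
        using (var runner = new TrialRunnerSketch(() => { }))
        {
            runner.Run();
        }
    }
}
```

The key benefit is determinism: cleanup happens at the end of the `using` scope rather than whenever the GC eventually finalizes the object.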
@@ -81,6 +108,12 @@ public AutoMLExperiment SetIsMaximize(bool isMaximize)
    return this;
}

public AutoMLExperiment SetMaximumMemoryUsageInMegaByte(double value = double.MaxValue)
I will add a check here to make sure it's neither NaN nor non-positive.
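A guard clause like the one described might look like this (a hypothetical sketch, not the PR's actual code; the helper name is illustrative):

```csharp
using System;

public static class MemoryLimitValidation
{
    // Reject NaN and non-positive values up front, as discussed above.
    public static double Validate(double megabytes)
    {
        if (double.IsNaN(megabytes) || megabytes <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(megabytes),
                "Maximum memory usage must be a positive number of megabytes.");
        }
        return megabytes;
    }
}
```

Note that `double.IsNaN` is required because `NaN <= 0` evaluates to false, so a plain range check alone would let NaN through.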
What happens if two threads call this at the same time? Or is this not a supported scenario?
It's not designed to support that.
Refers to: src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:234 in 16eeeae.
@@ -187,53 +236,93 @@ public async Task<TrialResult> RunAsync(CancellationToken ct = default)
    ValidateSettings();
    Initialize();
    _settings.CancellationToken = ct;
-   _cts.CancelAfter((int)_settings.MaxExperimentTimeInSeconds * 1000);
    _settings.CancellationToken.Register(() => _cts.Cancel());
+   _globalCancellationTokenSource.CancelAfter((int)_settings.MaxExperimentTimeInSeconds * 1000);
Thanks for catching that; switched to TimeSpan.FromSeconds to avoid int overflow.
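The overflow being discussed is easy to reproduce: multiplying a large second count by 1000 in int arithmetic wraps around, while TimeSpan.FromSeconds converts without that hazard. A standalone sketch (the value 2,200,000 seconds, about 25 days, is chosen only to trigger the wrap):

```csharp
using System;

public static class OverflowDemo
{
    // The old pattern: int arithmetic wraps for timeouts above ~24.8 days,
    // because seconds * 1000 can exceed int.MaxValue (2,147,483,647).
    public static int OldMilliseconds(uint maxExperimentTimeInSeconds)
        => (int)maxExperimentTimeInSeconds * 1000;

    // The fix: TimeSpan.FromSeconds does the conversion without int overflow,
    // and CancellationTokenSource.CancelAfter accepts a TimeSpan directly.
    public static TimeSpan SafeTimeout(uint maxExperimentTimeInSeconds)
        => TimeSpan.FromSeconds(maxExperimentTimeInSeconds);

    public static void Main()
    {
        uint seconds = 2_200_000; // ~25 days

        Console.WriteLine(OldMilliseconds(seconds));               // -2094967296 (wrapped)
        Console.WriteLine(SafeTimeout(seconds).TotalMilliseconds); // 2200000000
    }
}
```

Passing a negative value to CancelAfter throws ArgumentOutOfRangeException, so the wrapped result is not just wrong but fails at runtime.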
Resolved
Also, did we think about making _globalCancellationTokenSource not a class field and only having it as a local inside RunAsync? The current code causes problems in multi-threading scenarios.
Refers to: src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:234 in 16eeeae.
var currentCpuProcessorTime = Process.GetCurrentProcess().TotalProcessorTime;
var elapseCpuProcessorTime = currentCpuProcessorTime - _totalCpuProcessorTime;
var cpuUsedMs = elapseCpuProcessorTime.TotalMilliseconds;
var cpuUsageInTotal = cpuUsedMs / (Environment.ProcessorCount * _checkIntervalInMilliseconds);
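The sampling logic in the snippet above can be sketched as a standalone method (hypothetical helper name; only standard-library APIs are used): CPU usage over an interval is the CPU time consumed divided by cores times the wall-clock interval, so 1.0 means every core was busy for the whole interval.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class CpuSampler
{
    // Returns the fraction of total machine CPU capacity this process used
    // over the given interval (0.0 = idle, 1.0 = all cores fully busy).
    public static double SampleCpuUsage(int intervalMs)
    {
        var before = Process.GetCurrentProcess().TotalProcessorTime;
        Thread.Sleep(intervalMs);
        var after = Process.GetCurrentProcess().TotalProcessorTime;

        var cpuUsedMs = (after - before).TotalMilliseconds;
        return cpuUsedMs / (Environment.ProcessorCount * intervalMs);
    }
}
```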
The compiler should be able to optimize that.
Are you assuming that, or have you tried it?
Trying it quickly in csharplab, I see it is not optimized. Note I didn't use Environment.ProcessorCount, because on that machine it is just 1 and would be optimized away, so I used another constant.
Anyway, it is a minor point, and you can ignore it if you are OK with the current code.
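Since the JIT does not hoist the multiplication out of the sampling loop, one option is to precompute it once at construction time. A hypothetical sketch (the class and field names are illustrative, not the PR's actual code):

```csharp
using System;

internal sealed class PerformanceMonitorSketch
{
    private readonly double _cpuTimeDenominator;

    public PerformanceMonitorSketch(int checkIntervalInMilliseconds)
    {
        // Computed once here instead of on every sampling tick.
        _cpuTimeDenominator = Environment.ProcessorCount * (double)checkIntervalInMilliseconds;
    }

    // cpuUsedMs / (cores * interval): 1.0 means all cores busy for the full interval.
    public double CpuUsage(double cpuUsedMs) => cpuUsedMs / _cpuTimeDenominator;
}
```

Whether this matters in practice is doubtful at a 2-second sampling interval; it is the micro-optimization being debated above, nothing more.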
Yeah, indeed it's not; it turns out I was too optimistic about how much the compiler can optimize. Thanks for introducing csharplab, that's such a wonderful tool.
trialCancellationTokenSource.Cancel();

GC.AddMemoryPressure(Convert.ToInt64(m) * 1024 * 1024);
GC.Collect();
Why are we doing that?
You mean the GC.Collect or the GC.AddMemoryPressure part?
yes
This is because if we hit that code, it means the current trial used more memory than the limit. Therefore we want to run a GC to collect memory after that trial gets cancelled. GC.AddMemoryPressure is mostly for LightGbm, which might allocate large chunks of unmanaged memory.
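The rationale above can be sketched as a small helper (hypothetical name, not the PR's code): after cancelling an over-budget trial, force a collection so the freed memory does not count against the next trial, and tell the GC about unmanaged allocations (such as LightGbm's native buffers) so it treats a collection as worthwhile. Note that the documented contract of AddMemoryPressure is to pair it with a matching RemoveMemoryPressure once the pressure no longer applies; this sketch does that, even though the snippet quoted above does not show it.

```csharp
using System;

public static class TrialCleanupSketch
{
    public static void ReleaseAfterCancelledTrial(double maximumMemoryUsageInMegaByte)
    {
        long pressureBytes = Convert.ToInt64(maximumMemoryUsageInMegaByte) * 1024 * 1024;

        // Raise the GC's estimate of allocated unmanaged memory so the
        // forced collection is treated as productive.
        GC.AddMemoryPressure(pressureBytes);
        GC.Collect();

        // Balance the pressure once the cancelled trial's memory is reclaimed.
        GC.RemoveMemoryPressure(pressureBytes);
    }
}
```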
Resolved
Wouldn't the GC kick off automatically when it needs more memory, and collect any stale objects at that time? I am asking why you are forcing it to happen at this point.
But it wouldn't if the GC thinks it still has enough memory. And in that case, the next trial will likely be cancelled even if it's not using a lot of memory, because the memory used by the last trial has not been released.
"But it wouldn't if GC thinks it still has enough memory."
Could you please talk more about that? Why would the GC think there won't be enough memory? I expect the GC to run and collect any unneeded objects at that time, which should free a lot of memory, no?
I am not objecting here, I am trying to understand the case.
"I expect GC will run to collect any unneeded objects at that time which should free a lot of memory. no?"
Only if the GC thinks there's not enough memory and runs a collection, but that's not always the case when a trial gets cancelled, for example when the user sets a very small maximum memory usage.
I admit I am not fully understanding the whole picture :-) I'll leave it to you if you are sure about that. You may ignore my comment then.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
This PR adds SetMaximumMemoryUsageInMegaByte, which sets a limit on the maximum memory a trial can use and cancels the running trial if it exceeds that limit.
Fixes #6293
Example
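A hedged sketch of how the new setting might be used. SetMaximumMemoryUsageInMegaByte is the method this PR adds; the surrounding builder calls (`Auto().CreateExperiment()`) follow the AutoML v2 experiment API, but the exact shape of the rest of the pipeline setup may differ from the PR.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;

var context = new MLContext();
var experiment = context.Auto().CreateExperiment();

// Cancel any trial whose memory usage exceeds ~2 GB; the rest of the
// experiment keeps running and moves on to the next trial.
experiment.SetMaximumMemoryUsageInMegaByte(2048);
```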
How it works
AutoMLExperiment monitors the memory and CPU usage periodically (the default period is 2 seconds) using IPerformanceMonitor. Once a trial exceeds the boundary, AutoMLExperiment cancels that trial via MLContext.CancelExecution and does clean-up.
Known limitation
This requires the estimator to support cancellation during training by checking from time to time whether the context is still alive, which is not true for all estimators in ML.NET. For example, LGBM can't check whether the context is alive in its unmanaged code, so it can't be cancelled while it is calling into the native LightGbm dll. In that situation, trial memory usage might still exceed the maximum limit.