
Add SetMaximumMemoryUsageInMegaByte in AutoMLExperiment #6305

Merged

Conversation

LittleLittleCloud
Contributor

@LittleLittleCloud LittleLittleCloud commented Aug 26, 2022


This PR adds SetMaximumMemoryUsageInMegaByte, which sets a limit on the maximum memory a trial can use and cancels a running trial if it exceeds that limit.

#6293

Example

// set maximum memory usage to 8gb
experiment.SetMaximumMemoryUsageInMegaByte(8 * 1024);

How it works

AutoMLExperiment monitors memory and CPU usage periodically (the default period is 2 seconds) using IPerformanceMonitor. Once a trial exceeds the boundary, AutoMLExperiment cancels that trial via MLContext.CancelExecution and cleans up.
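The monitoring loop described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual IPerformanceMonitor implementation; the class name, the timer-based polling, and the working-set check are all assumptions:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Sketch of a periodic memory watchdog for a single trial.
// Names and shapes are illustrative, not the merged implementation.
class TrialMemoryWatchdog : IDisposable
{
    private readonly Timer _timer;
    private readonly double _maxMemoryInMb;
    private readonly CancellationTokenSource _trialCts;

    public TrialMemoryWatchdog(double maxMemoryInMb,
        CancellationTokenSource trialCts, int checkIntervalMs = 2000)
    {
        _maxMemoryInMb = maxMemoryInMb;
        _trialCts = trialCts;
        // Poll usage on a fixed period (default 2 seconds, as in the PR).
        _timer = new Timer(CheckUsage, null, checkIntervalMs, checkIntervalMs);
    }

    private void CheckUsage(object state)
    {
        // Working set of the current process, in megabytes.
        var usedMb = Process.GetCurrentProcess().WorkingSet64 / (1024.0 * 1024.0);
        if (usedMb > _maxMemoryInMb)
        {
            // Cancel only the running trial, not the whole experiment.
            _trialCts.Cancel();
        }
    }

    public void Dispose() => _timer.Dispose();
}
```

A cooperating trial observes `_trialCts.Token` and stops when cancellation is requested.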

Known limitation

This requires the estimator to support cancellation during training by periodically checking whether the context is still alive, which is not true for all estimators in ML.NET.
For example, LGBM can't check whether the context is alive from its unmanaged code, so it can't be cancelled while it is calling into the native LightGBM dll. In that situation, trial memory usage might still exceed the maximum limit.
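The cooperative cancellation described above can be illustrated with a plain CancellationToken. This is a hedged sketch of the pattern, not an ML.NET trainer; real trainers check host liveness rather than a raw token, but the shape is the same — cancellation is only observed between managed iterations, never inside a native call:

```csharp
using System;
using System.Threading;

static class TrainingSketch
{
    // Sketch of a cooperative training loop that polls for cancellation.
    public static void Train(int iterations, CancellationToken ct)
    {
        for (int i = 0; i < iterations; i++)
        {
            // A call into native code (e.g. the LightGBM dll) cannot be
            // interrupted here; cancellation is only seen between calls.
            DoOneIteration(i);

            // Cooperative check: throws OperationCanceledException if the
            // trial was cancelled, ending training promptly.
            ct.ThrowIfCancellationRequested();
        }
    }

    private static void DoOneIteration(int i)
    {
        // Placeholder for real per-iteration work.
    }
}
```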

@LittleLittleCloud LittleLittleCloud changed the title U/xiaoyun/memory limit Add SetMaximumMemoryUsageInMegaByte in AutoMLExperiment Aug 29, 2022
@luisquintanilla
Contributor

@LittleLittleCloud looks good to me.

What happens when memory usage hits the specified limit? Is the entire experiment cancelled or only that currently running trial? Does enforcing the memory limit mean that GC kicks in more frequently?

@LittleLittleCloud
Contributor Author

@LittleLittleCloud looks good to me.

What happens when memory usage hits the specified limit? Is the entire experiment cancelled or only that currently running trial? Does enforcing the memory limit mean that GC kicks in more frequently?

Only the currently running trial will be cancelled.

GC is invoked every time a trial completes, whether or not the trial was successful, to clean up and release memory. So enforcing the memory limit won't affect how frequently GC kicks in.

@codecov

codecov bot commented Aug 29, 2022

Codecov Report

Merging #6305 (5496403) into main (9652e59) will increase coverage by 0.06%.
The diff coverage is 92.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6305      +/-   ##
==========================================
+ Coverage   68.56%   68.63%   +0.06%     
==========================================
  Files        1170     1170              
  Lines      247158   247310     +152     
  Branches    25675    25683       +8     
==========================================
+ Hits       169475   169732     +257     
+ Misses      70940    70849      -91     
+ Partials     6743     6729      -14     
Flag Coverage Δ
Debug 68.63% <92.90%> (+0.06%) ⬆️
production 63.06% <100.00%> (+0.06%) ⬆️
test 89.05% <92.41%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...Microsoft.ML.AutoML.Tests/AutoMLExperimentTests.cs 92.38% <87.64%> (-7.62%) ⬇️
src/Microsoft.ML.FastTree/FastTree.cs 80.48% <100.00%> (+0.07%) ⬆️
src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs 79.77% <100.00%> (+0.30%) ⬆️
...c/Microsoft.ML.LightGbm/WrappedLightGbmTraining.cs 45.45% <100.00%> (+0.71%) ⬆️
...osoft.ML.Tests/TrainerEstimators/TreeEstimators.cs 97.80% <100.00%> (+0.19%) ⬆️
src/Microsoft.ML.Maml/MAML.cs 24.36% <0.00%> (-2.54%) ⬇️
src/Microsoft.ML.Data/Utils/LossFunctions.cs 67.35% <0.00%> (+0.51%) ⬆️
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 92.49% <0.00%> (+1.02%) ⬆️
... and 4 more

@@ -332,7 +338,7 @@ private SweepablePipeline CreateBinaryClassificationPipeline(IDataView trainData

internal class BinaryClassificationRunner : ITrialRunner
Member

BinaryClassificationRunner

Should we use IDisposable with this class?

Contributor Author

That's a great question. The IDisposable pattern is mainly because I want to make sure MLContext.CancelExecution gets called and the context is set to null after the trial is finished, without explicitly calling its finalizer or forcing a GC. But I can go another route if you have a recommendation.

Member

The Dispose pattern is not only for GC finalization. It is used to clean up any object when you are done using it. You have already implemented the Dispose method; implementing IDisposable additionally enables the using () code pattern.
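For instance, once the runner implements IDisposable, callers get deterministic cleanup through a using block. A minimal sketch (the runner body and the `_context` field are illustrative, not the merged code):

```csharp
using System;

// Illustrative runner implementing IDisposable so cleanup (e.g. dropping
// the MLContext reference after a trial) happens deterministically.
class BinaryClassificationRunnerSketch : IDisposable
{
    private object _context = new object(); // stands in for MLContext

    public bool Disposed { get; private set; }

    public void Run()
    {
        // Placeholder for running one trial.
    }

    public void Dispose()
    {
        // Release the context when the trial is done, success or failure.
        _context = null;
        Disposed = true;
    }
}

static class UsingDemo
{
    public static bool RunOnce()
    {
        // The using pattern guarantees Dispose runs even if Run throws.
        var runner = new BinaryClassificationRunnerSketch();
        using (runner)
        {
            runner.Run();
        }
        return runner.Disposed;
    }
}
```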

@@ -81,6 +108,12 @@ public AutoMLExperiment SetIsMaximize(bool isMaximize)
return this;
}

public AutoMLExperiment SetMaximumMemoryUsageInMegaByte(double value = double.MaxValue)
Member

@tarekgh tarekgh Aug 31, 2022


value

What happens if someone passes a NaN value? Can you try it and see what you get?

Contributor Author

I will add a check here to make sure it's neither NaN nor non-positive.
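A guard along these lines would reject both NaN and non-positive values. This is a sketch of the check being described, not the merged code:

```csharp
using System;

static class MemoryOptionGuard
{
    public static double ValidateMaxMemoryInMb(double value)
    {
        // The explicit double.IsNaN test is required: NaN fails every
        // comparison, so `value <= 0` alone would let NaN slip through.
        if (double.IsNaN(value) || value <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(value),
                "Maximum memory usage must be a positive number.");
        }
        return value;
    }
}
```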

@tarekgh
Member

tarekgh commented Aug 31, 2022

    public async Task<TrialResult> RunAsync(CancellationToken ct = default)

What happens if two threads call this at the same time? Or is this not a supported scenario?


This isn't designed to be supported at the AutoMLExperiment level. If users want to run several experiments on multiple threads, they need to create a new AutoMLExperiment for each one.


Refers to: src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:234 in 16eeeae. [](commit_id = 16eeeae, deletion_comment = False)

@@ -187,53 +236,93 @@ public async Task<TrialResult> RunAsync(CancellationToken ct = default)
ValidateSettings();
Initialize();
_settings.CancellationToken = ct;
_cts.CancelAfter((int)_settings.MaxExperimentTimeInSeconds * 1000);
_settings.CancellationToken.Register(() => _cts.Cancel());
_globalCancellationTokenSource.CancelAfter((int)_settings.MaxExperimentTimeInSeconds * 1000);
Member

)_settings.MaxExperimentTimeInSeconds * 1000

Any concern with overflowing this calculation?

Contributor Author

Thanks for catching that; switched to TimeSpan.FromSeconds to avoid the int overflow.
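The overflow being fixed: multiplying seconds by 1000 in int arithmetic wraps once the timeout exceeds int.MaxValue milliseconds (about 24.8 days), while TimeSpan stores ticks in a long and CancelAfter accepts a TimeSpan directly. A small demonstration (the variable name mirrors MaxExperimentTimeInSeconds; the exact field type in AutoMLExperiment is an assumption):

```csharp
using System;

static class TimeoutDemo
{
    public static (int overflowedMs, double safeMs) Compare()
    {
        uint maxExperimentTimeInSeconds = 3_000_000; // ~34.7 days

        // 3_000_000 * 1000 = 3e9 > int.MaxValue (~2.147e9),
        // so the cast to int wraps to a negative number.
        int overflowedMs = unchecked((int)(maxExperimentTimeInSeconds * 1000));

        // Safe: TimeSpan holds ticks in a long; no int math involved.
        // CancellationTokenSource.CancelAfter has a TimeSpan overload.
        TimeSpan timeout = TimeSpan.FromSeconds(maxExperimentTimeInSeconds);

        return (overflowedMs, timeout.TotalMilliseconds);
    }
}
```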

Contributor Author

Resolved

@tarekgh
Member

tarekgh commented Aug 31, 2022

    public async Task<TrialResult> RunAsync(CancellationToken ct = default)

Also, did we consider making _globalCancellationTokenSource a local inside RunAsync instead of a class field? The current code causes problems in multi-threading scenarios.


In reply to: 1233449717


Refers to: src/Microsoft.ML.AutoML/AutoMLExperiment/AutoMLExperiment.cs:234 in 16eeeae. [](commit_id = 16eeeae, deletion_comment = False)

var currentCpuProcessorTime = Process.GetCurrentProcess().TotalProcessorTime;
var elapseCpuProcessorTime = currentCpuProcessorTime - _totalCpuProcessorTime;
var cpuUsedMs = elapseCpuProcessorTime.TotalMilliseconds;
var cpuUsageInTotal = cpuUsedMs / (Environment.ProcessorCount * _checkIntervalInMilliseconds);
Member

Environment.ProcessorCount * _checkIntervalInMilliseconds

nit: This can be calculated once in the constructor.

Contributor Author

The compiler should be able to optimize that.

Member

Are you assuming that, or have you tried it?

Trying quickly in csharplab, I see it is not optimized. Note I didn't use Environment.ProcessorCount because it is just 1 there and would be constant-folded, so I used another constant.

Anyway, it is a minor point, and you can ignore it if you are ok with the current code.

Contributor Author

Yeah, indeed it's not; it turns out I was too optimistic about how much the compiler can optimize.

Thanks for introducing csharplab; it's such a wonderful tool.

trialCancellationTokenSource.Cancel();

GC.AddMemoryPressure(Convert.ToInt64(m) * 1024 * 1024);
GC.Collect();
Member

Why are we doing that?

Contributor Author

You mean the GC.Collect or the GC.AddMemoryPressure part?

Member

yes

Contributor Author

This is because if we hit that code, the current trial has used more memory than the limit allows. Therefore we want to run a GC to reclaim memory after that trial gets cancelled.

GC.AddMemoryPressure is mostly for LightGBM, which might allocate a large chunk of unmanaged memory.
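The cancel-then-collect sequence in the diff above, spelled out as a sketch: after the trial is cancelled, tell the GC about unmanaged memory it cannot see, then force a collection. The method name and the symmetric RemoveMemoryPressure call are my additions for illustration; `memoryUsageInMb` stands in for `m` in the snippet:

```csharp
using System;

static class TrialCleanupSketch
{
    // Sketch of reclaiming memory after an over-limit trial is cancelled.
    public static void ReclaimAfterCancelledTrial(double memoryUsageInMb)
    {
        long bytes = Convert.ToInt64(memoryUsageInMb) * 1024 * 1024;

        // Tell the GC about unmanaged memory (e.g. native LightGBM
        // buffers) that managed heap statistics would not account for.
        GC.AddMemoryPressure(bytes);
        try
        {
            // Force a full collection so the next trial starts from a
            // low-memory baseline instead of inheriting the old heap.
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        finally
        {
            // AddMemoryPressure must be paired with RemoveMemoryPressure
            // once the pressure no longer applies.
            GC.RemoveMemoryPressure(bytes);
        }
    }
}
```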

Contributor Author

Resolved

Member

@tarekgh tarekgh Sep 1, 2022

Wouldn't GC kick in automatically when it needs more memory, and collect any stale objects at that time? I am asking why you are forcing it to happen at this point.

Contributor Author

@LittleLittleCloud LittleLittleCloud Sep 1, 2022

But it wouldn't if GC thinks it still has enough memory. And in that case, the next trial will likely be canceled even if it isn't using a lot of memory, because the memory used by the last trial hasn't been released.

Member

@tarekgh tarekgh Sep 1, 2022

But it wouldn't if GC thinks it still has enough memory.

Could you please say more about that? Why would GC think there isn't enough memory? I expect GC to run and collect any unneeded objects at that time, which should free a lot of memory, no?

I am not objecting here, I am trying to understand the case.

Contributor Author

I expect GC will run to collect any unneeded objects at that time which should free a lot of memory. no?

Only if GC thinks there's not enough memory and runs a collection, but that's not always the case when a trial gets canceled. For example, if the user sets a very small number for maximum memory usage.

Member

I admit I don't fully understand the whole picture :-) I'll leave it to you if you are sure about it; you may ignore my comment then.

@LittleLittleCloud
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud LittleLittleCloud merged commit e99dfd4 into dotnet:main Sep 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022