Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add async GPU dependency Engine #20331

Merged
merged 8 commits into from
Oct 30, 2021
Merged

Add async GPU dependency Engine #20331

merged 8 commits into from
Oct 30, 2021

Conversation

Kh4L
Copy link
Contributor

@Kh4L Kh4L commented Jun 4, 2021

Description

Changes how the GPU operations are synced in the ThreadedEngine.

GPU operators needed to be completed so the output variables (either NDArray or Resource) could be tagged as readable. The engine workers needed to cudaStreamSynchronize on the GPU stream to know when the variable is ready to read. This prevented an optimal GPU kernels overlapping with CPU operations.

This PR introduces a sync mechanism which leverages cudaEvent to avoid host synchronization between GPU operations.
The GPU writing operators tag the variables with a cudaEvent based sync object, and the operators reading the variables sync or wait on these objects.

Illustration

Screenshot 2021-06-04 at 17 10 43

Note

Use the MXNET_ENGINE_TYPE environment variable to enable the new async engine feature, by adding Async at the end of the engine type:

ThreadedEngine -> ThreadedEngineAsync
ThreadedEnginePerDevice -> ThreadedEnginePerDeviceAsync

When not set, the GPU operations synchronization will be done as previously: with a host-device synchronization after the op function.

Detailed changes

The changes primarily concern the ThreadedEnginePerDevice and the ThreadedEnginePooled.

Removing synchronization

This PR removes all the stream synchronizations that used to be called after the operator function, such as rctx.get_stream<gpu>()->Wait();.

SyncObject

A new structure, SyncObject is added to the engine variable. It carries the CUDA events used to defer the data dependency to the CUDA Driver.
Therefore, each variable (NDArray or Resource) handled by the engine owns a SyncObject.

There are two categories of events, the reader events and writer events.

A sync object can have 0 or more reader events: several operators might be waiting to read this Variable.
However, it can carry only 0 or 1 writer event: only one writer can be writing on the variable at a time.

New functions: OnCompleteGPU, OnStartGPU and OnStartCPU

Previously, an OnComplete callback would be called by the engine worker right after executing the op function. Its main purpose was to set the data dependency. This PR adds a new OnComplete overrides and another callback type, the on
3 new hooks were added:

  • OnCompleteGPU. The existing OnComplete callback has new a new version, the OnCompleteGPU. This functions captures the kernels scheduled on the stream in an event, and assigns it to the SyncObject of each variable.

  • OnStartGPU: this hook is called before GPU op functions and will iterate through all the events of variables read and written by the function. It calls cudaStreamWaitEvent to guarantee that all the needed variables will be ready before the current function's kernels are scheduled.

  • OnStartCPU: this hook acts similarly as the previous one, but it is called before GPU op functions and it uses host-device synchronisation with cudaEventSynchronize.

These changes result in a new signature for the Engine's async function:
Before:

  typedef std::function<void(RunContext, CallbackOnStart, CallbackOnComplete)> AsyncFn;

After:

  typedef std::function<void(RunContext, CallbackOnStart)> AsyncFn;

CUDAEventPool

The events are managed by a CUDAEventPool that contains a counter. This is needed both for performance and to track the event order.
In the ThreadedEnginePerDevice, there is one pool per worker. In the ThreadedEnginePooled, there is a one ThreadedEnginePooled per stream, owned by the stream manager.

Pooled allocator

The PooledStorageManager is also aware of the sync objects in order to recycle them with the data Chunk. The free function now also recycles the SyncObject, which can be later synchronized at alloc time.

WaitAll, WaitToRead and WaitToWrite updated

The NDArray synchronisation method WaitAll, WaitToRead and WaitToWrite now call cudaEventSynchronize to guarantee that the GPU data is ready.

TODO:

  • Re-enable the byteps test, once byteps is updated

Co-author: @ptrendx

@Kh4L Kh4L requested a review from eric-haibin-lin as a code owner June 4, 2021 08:05
@mxnet-bot
Copy link

Hey @Kh4L , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-gpu, sanity, windows-cpu, miscellaneous, centos-cpu, unix-gpu, windows-gpu, edge, website, clang]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@Kh4L
Copy link
Contributor Author

Kh4L commented Jun 14, 2021

@mxnet-bot run ci [centos-cpu, centos-gpu, edge, unix-cpu, unix-gpu, windows-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu, windows-gpu, edge, unix-gpu, centos-gpu, centos-cpu]

@TristonC
Copy link
Contributor

@szha @sandeep-krishnamurthy Do you have someone in your team or elsewhere for a code review?

@szha
Copy link
Member

szha commented Jun 22, 2021

I plan to take a closer look this week. could @ptrendx also take a look?

@ptrendx ptrendx self-requested a review June 25, 2021 05:21
include/mxnet/base.h Outdated Show resolved Hide resolved
*/
bool is_bulk;
void *event_pool = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this comment here just as a way to explain why the is_bulk is removed. is_bulk was a way to mark operations as not needing synchronization when being a part of a single engine bulk. This PR moves the handling of the synchronization to the engine itself, so this is no longer useful.

@ptrendx
Copy link
Member

ptrendx commented Jun 25, 2021

I will review this PR next week.

@Kh4L Kh4L changed the title [WIP] Add async GPU depency Engine [WIP] Add async GPU dependency Engine Jun 29, 2021
@Kh4L Kh4L force-pushed the new_dep_engine branch from 31a7610 to 8c1fbd7 Compare June 30, 2021 01:11

void ThreadedEngine::OnStartCPU(Engine *engine, void *opr_block,
const dmlc::Error* error) {
static bool use_new_dep_engine = dmlc::GetEnv("MXNET_ASYNC_GPU_ENGINE", false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have MXNET_ENGINE_TYPE which decides whether NaiveEngine (i.e. synchronous scheduler) is used, so unless there's a strong reason I'd prefer not to introduce new environment variables here.

@sandeep-krishnamurthy
Copy link
Contributor

@apeforest @eric-haibin-lin can you please help review this PR?

@sandeep-krishnamurthy
Copy link
Contributor

@Zha0q1 @mseth10 can you please help review this PR?

@Kh4L
Copy link
Contributor Author

Kh4L commented Jul 21, 2021

@sandeep-krishnamurthy what's the review status?

Copy link
Contributor

@barry-jin barry-jin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! The PR looks good to me. The synchronization in new dlpack API #20546 will probably rely on the sync_obj introduced here. Looking forward to seeing this new feature to be merged.

Comment on lines +557 to +568
if ((*events_per_stream)[event_stream].pool_index < cuda_event.pool_index) {
(*events_per_stream)[event_stream] = cuda_event;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should the event be replaced by others with larger pool_index?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the pool is per stream so for 2 events from the same pool the one with the larger pool_index had to be recorded later (and synchronizing with that event will synchronize with both).

Comment on lines +679 to +697
if (sync_obj.writer_event[0].event.expired()) {
sync_obj.writer_event.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why should all the writer events be removed when the first writer event expires.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can only be 1 writer event active (vs multiple readers).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be only one event in the writer_event vector, clear here could be replaced by pop_back.

Comment on lines +736 to +755
sync_obj.reader_events.clear();
sync_obj.writer_event.clear();
sync_obj.writer_event.push_back({event, worker_stream->stream_, event_pool_idx});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably answers my last question. But what's the purpose of clearing all the reader and writer events? Is it because current event should synchronize with the cleared reader and writer events?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - since if you are the writer to the variable, you had to wait for every previous reader/writer already there is no reason to keep those events anymore.

Comment on lines 121 to 128
CallbackOnStart on_start = this->CreateOnStart(ThreadedEngine::OnStartStatic,
opr_block);
CallbackOnComplete callback = this->CreateCallback(ThreadedEngine::OnCompleteStatic,
opr_block);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use OnStartStatic and OnCompleteStatic here instead of creating GPU onstart and oncomplete for GPU case, creating static-onstart and static-oncomplete for CPU case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mechanism introduced in this PR is in addition to the dependency tracking that already existed, and the OnCompleteStatic contains dependency update code (so it has to be called from the GPU version as well).

@Kh4L Kh4L force-pushed the new_dep_engine branch 3 times, most recently from d863554 to ee5a6bb Compare September 8, 2021 16:14
@Kh4L Kh4L changed the title [WIP] Add async GPU dependency Engine Add async GPU dependency Engine Sep 13, 2021
@mseth10 mseth10 removed the pr-work-in-progress PR is still work in progress label Sep 13, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 13, 2021
@Kh4L
Copy link
Contributor Author

Kh4L commented Oct 13, 2021

@mxnet-bot run ci [sanity]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [sanity]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Oct 13, 2021
Signed-off-by: Serge Panev <spanev@nvidia.com>
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 13, 2021
Signed-off-by: Serge Panev <spanev@nvidia.com>
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 14, 2021
@Kh4L
Copy link
Contributor Author

Kh4L commented Oct 18, 2021

@mxnet-bot run ci [unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu]

@Kh4L
Copy link
Contributor Author

Kh4L commented Oct 20, 2021

@mxnet-bot run ci [centos-cpu, unix-cpu, windows-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-gpu, centos-cpu, unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 20, 2021
@szha
Copy link
Member

szha commented Oct 20, 2021

@Kh4L glad that the CI passed. @ptrendx @barry-jin could you take a look and see if there's any other concern to be addressed?

Copy link
Contributor

@barry-jin barry-jin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks! My only concern is that use_new_dep_engine is probably not tested in CI and new environment variable is not added into website. But, I think its OK to do these in a seperate pr.

Comment on lines +566 to +577
static inline bool IsEngineAsync() {
std::string type = dmlc::GetEnv("MXNET_ENGINE_TYPE", std::string(""));
std::string async_engine_tag("Async");
auto tag_pos = type.find(async_engine_tag);
return tag_pos != std::string::npos;
}

void ThreadedEngine::OnStartCPU(Engine* engine, void* opr_block, const dmlc::Error* error) {
static bool use_new_dep_engine = IsEngineAsync();
if (!use_new_dep_engine) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this part is not tested in CI because I didn't see any changes to use Async tag inside test functions. https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not test it in the CI now? @Kh4L

@@ -49,7 +49,8 @@ core_logic: {
custom_steps.test_unix_cpp_package_gpu('gpu'),
// TODO(szha): fix and reenable the hanging issue. tracked in #18098
// custom_steps.test_unix_distributed_kvstore_gpu('gpu'),
custom_steps.test_unix_byteps_gpu('gpu'),
// TODO(spanev): reenable when byteps is updated with the new dep engine API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an issue to track this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, tracked here #20697

Copy link
Member

@ptrendx ptrendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants