@hyunwoongko
Contributor

@hyunwoongko hyunwoongko commented Sep 25, 2021

What for?

Add flexibility of pipeline parallel module and engine (#1347)

Request review

@ShadenSmith @jeffra @sdtblck

@stas00
Collaborator

stas00 commented Sep 26, 2021

Heh, @thomasw21 was just dealing with it - bigscience-workshop/Megatron-DeepSpeed#107

You can see from that PR @ShadenSmith is looking at fixing bool tensor support and removing this hack altogether.

@thomasw21
Contributor

thomasw21 commented Sep 26, 2021

> Heh, @thomasw21 was just dealing with it - bigscience-workshop/Megatron-DeepSpeed#107
>
> You can see from that PR @ShadenSmith is looking at fixing bool tensor support and removing this hack altogether.

Hey, thanks @stas00. Yeah, I've been working to fix some issues I've been having with the current pipeline parallelism implementation. I've just opened a draft PR if you want to check it out: #1400. Note that it doesn't work yet, and I'm in the process of debugging a deadlock issue.

@tjruwase
Contributor

@hyunwoongko, thanks for your hard work on this PR. I will take a look today so we can wrap it up soon. Thanks for your patience.

@tjruwase
Contributor

@hyunwoongko, I left some comments. Thanks.

@hyunwoongko hyunwoongko changed the title from "Add flexibility of pipeline module and engine & fix contiguous checkpointing bugs" to "Add flexibility of pipeline module and engine" on Sep 27, 2021
@hyunwoongko
Contributor Author

I have split this into two separate PRs.

@hyunwoongko hyunwoongko changed the title from "Add flexibility of pipeline module and engine" to "Add flexibility of pipeline parallel module and engine" on Sep 28, 2021
@tjruwase tjruwase merged commit 30965ea into deepspeedai:master Oct 1, 2021
hyunwoongko referenced this pull request in EleutherAI/DeeperSpeed Oct 2, 2021
