
Cpu all reduce #5849

Merged: 12 commits merged into master, Sep 12, 2021
Conversation

@lixinqi (Contributor) commented Aug 12, 2021

CPU version of all_reduce, built on top of the transport layer.
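As background, the ring all-reduce this PR implements (a reduce-scatter phase followed by an all-gather phase around the rank ring) can be sketched as a single-process simulation. This is a hypothetical illustration using the textbook chunk-rotation scheme, not the PR's exact transport-based code; sends are modeled as direct buffer copies.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Single-process simulation of ring all-reduce: each "rank" owns one
// buffer; after the call, every buffer holds the element-wise sum.
std::vector<std::vector<float>> RingAllReduce(std::vector<std::vector<float>> bufs) {
  const std::size_t n = bufs.size();     // number of ranks
  const std::size_t len = bufs[0].size();
  const std::size_t part = len / n;      // assumes len is divisible by n
  // Phase 1: reduce-scatter. In step s, rank r receives one chunk from
  // rank (r - 1) and adds it into its own copy of that chunk.
  for (std::size_t s = 0; s + 1 < n; ++s) {
    std::vector<std::vector<float>> next = bufs;
    for (std::size_t r = 0; r < n; ++r) {
      std::size_t src = (r + n - 1) % n;
      std::size_t chunk = (src + n - s) % n;  // chunk traveling the ring
      for (std::size_t i = chunk * part; i < (chunk + 1) * part; ++i) {
        next[r][i] = bufs[r][i] + bufs[src][i];
      }
    }
    bufs = std::move(next);
  }
  // Phase 2: all-gather. Each fully reduced chunk circulates the ring,
  // overwriting the stale copies on the other ranks.
  for (std::size_t s = 0; s + 1 < n; ++s) {
    std::vector<std::vector<float>> next = bufs;
    for (std::size_t r = 0; r < n; ++r) {
      std::size_t src = (r + n - 1) % n;
      std::size_t chunk = (src + n - s + 1) % n;
      for (std::size_t i = chunk * part; i < (chunk + 1) * part; ++i) {
        next[r][i] = bufs[src][i];
      }
    }
    bufs = std::move(next);
  }
  return bufs;
}
```

Each rank sends and receives only len/n elements per step, which is why the PR splits the tensor with BalancedSplitter rather than shipping whole buffers.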

transport_token,
[&](void** buffer, std::size_t* size, std::function<void()>* Cb) -> Maybe<void> {
*buffer = const_cast<T*>(send_ptr);
*size = send_size;
Contributor: The unit here seems wrong.

Contributor (author): Right, it looks like this needs to be multiplied by GetSizeOfDataType.
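The fix discussed here can be sketched as follows: the transport callback reports a buffer size in bytes, while BalancedSplitter ranges count elements. In this hypothetical sketch, sizeof(T) stands in for OneFlow's GetSizeOfDataType.

```cpp
#include <cstddef>

// Sketch: convert an element count into the byte size the transport
// layer expects. sizeof(T) plays the role of GetSizeOfDataType here.
template<typename T>
std::size_t SendSizeInBytes(std::size_t elem_cnt) {
  return elem_cnt * sizeof(T);
}
```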

BalancedSplitter bs(size, thread_num);
MultiThreadLoop(thread_num, [&](size_t thread_idx) {
size_t end = bs.At(thread_idx).end();
for (size_t i = bs.At(thread_idx).begin(); i < end; ++i) { out[i] = in0[i] + in1[i]; }
Contributor:

MultiThreadLoop(size, [&](size_t i) {
  out[i] = in0[i] + in1[i];
});

Can this be written directly like this?

Contributor (author): It should work. At the time I just wanted more locality: written your way, the for loop inside MultiThreadLoop would repeatedly invoke a std::function, which is not efficient.

Contributor: Got it, understood.
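The locality argument above can be illustrated with a minimal stand-in for MultiThreadLoop (the real OneFlow helper differs): each worker receives one chunk callback, and the per-element work runs in a plain inner loop rather than through a std::function call per element.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for MultiThreadLoop: splits [0, n) into even
// chunks and invokes chunk_cb(begin, end) once per worker thread.
void MultiThreadLoopSketch(std::size_t n, std::size_t thread_num,
                           const std::function<void(std::size_t, std::size_t)>& chunk_cb) {
  std::vector<std::thread> workers;
  std::size_t chunk = (n + thread_num - 1) / thread_num;
  for (std::size_t t = 0; t < thread_num; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = std::min(n, begin + chunk);
    if (begin >= end) { continue; }
    workers.emplace_back([=] { chunk_cb(begin, end); });
  }
  for (auto& w : workers) { w.join(); }
}

// The PR's pattern: one indirect call per chunk, tight plain loop
// inside, so the hot path has no per-element std::function dispatch.
void AddVectors(const float* in0, const float* in1, float* out, std::size_t n) {
  MultiThreadLoopSketch(n, 4, [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) { out[i] = in0[i] + in1[i]; }
  });
}
```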

send_ptr = &in[bs.At(send_part_id).begin()];
} else {
send_ptr = &out[bs.At(send_part_id).begin()];
}
Contributor:

const T* send_ptr = &(i == 0 ? in : out)[bs.At(send_part_id).begin()]

When I write it this way and i != 0, no matter what bs.At(send_part_id).begin() is, it always ends up with the base address of out; I don't understand why.
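For reference on precedence only (this does not explain the behavior observed above, whose cause may lie elsewhere in the surrounding code): the subscript operator binds tighter than unary &, so &(cond ? in : out)[k] parses as &((cond ? in : out)[k]). The ternary selects a base pointer first, then the offset k is applied to whichever pointer was chosen. A minimal check with a hypothetical SelectPtr helper:

```cpp
#include <cstddef>

// &(cond ? in : out)[k] == (cond ? in : out) + k, because [] binds
// tighter than unary &. So the false branch should yield out + k,
// not the bare base address of out.
const float* SelectPtr(bool use_in, const float* in, const float* out, std::size_t k) {
  return &(use_in ? in : out)[k];
}
```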

JUST(TransportUtil::ReceiveFromPrevRankInRing(rank_group, transport_token, &ctx));
}
JUST(TransportUtil::WaitUntilDoneOrTimeout(ctx, TransportUtil::TimeoutSeconds()));
const T* cur_in = &in[bs.At(recv_part_id).begin()];
Contributor: The data here is never taken from out.

return JUST(one::functional::ConsistentAllReduce(tensor));
}

COMMAND(RegisterBoxingFunction("cpu-p-to-b", CheckCpuP2B, &CpuP2B));
Contributor: "cpu-p-to-b" could be renamed to "ccl-p-to-b", matching "nccl-p-to-b".

Contributor: How about also renaming identifiers like CheckCpuP2B to match?

Contributor: Done.

…, placement

Conflicts:
	oneflow/core/boxing/ccl_boxing_function.cpp
	oneflow/user/kernels/eager_nccl_kernels.cpp
@@ -95,7 +95,8 @@ Maybe<BoxingExprIf> RawMainBoxingExpr() {
          | JUST(BoxingExpr(JUST(InPlacementAndBroadcast()), JUST(BoxingExpr("nccl-s-to-b")),
                            JUST(BoxingExpr("naive-b-to-p"))))
          | JUST(BoxingExpr("asymmetric-x-to-b")) | JUST(OneToNBoxingExpr()) | JUST(NToOneBoxingExpr())
-         | JUST(BoxingExpr("naive-1-to-1")) | JUST(GenericBoxingExpr());
+         | JUST(BoxingExpr("naive-1-to-1")) | JUST(GenericBoxingExpr())
+         | JUST(BoxingExpr("ccl-p-to-b"));
Contributor: Put this after the nccl BoxingExpr.

Contributor: Changed.

@@ -13,6 +13,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
+#include <atomic>
Contributor (author): Let's revert this file.

Contributor: Reverted.

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 09:10
@oneflow-ci-bot oneflow-ci-bot self-requested a review September 12, 2021 09:11
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.6ms (= 6380.5ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.5ms (= 7023.4ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 140.5ms / 127.6ms)

OneFlow resnet50 time: 74.0ms (= 3699.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.4ms (= 4220.8ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 84.4ms / 74.0ms)

OneFlow resnet50 time: 46.8ms (= 2339.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.6ms (= 2932.4ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.25 (= 58.6ms / 46.8ms)

OneFlow resnet50 time: 43.5ms (= 2174.5ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.9ms (= 2447.4ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 48.9ms / 43.5ms)

OneFlow resnet50 time: 43.3ms (= 2162.8ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.7ms (= 2185.4ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.01 (= 43.7ms / 43.3ms)

OneFlow resnet50 time: 152.2ms (= 7611.7ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 8044.6ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.06 (= 160.9ms / 152.2ms)

OneFlow resnet50 time: 101.5ms (= 5074.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.4ms (= 5071.0ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.00 (= 101.4ms / 101.5ms)

OneFlow resnet50 time: 83.9ms (= 4194.9ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.6ms (= 3980.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.95 (= 79.6ms / 83.9ms)

OneFlow resnet50 time: 79.9ms (= 3996.2ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.0ms (= 3547.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.89 (= 71.0ms / 79.9ms)

OneFlow resnet50 time: 69.4ms (= 3470.2ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 58.7ms (= 2932.6ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.85 (= 58.7ms / 69.4ms)

@github-actions

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 09:59
@liufengwei0103

Port unavailable error, so I re-ran it manually.

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.5ms (= 6375.0ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.3ms (= 7066.5ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 141.3ms / 127.5ms)

OneFlow resnet50 time: 73.9ms (= 3694.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.1ms (= 4154.6ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 83.1ms / 73.9ms)

OneFlow resnet50 time: 47.4ms (= 2368.4ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.2ms (= 2959.5ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.25 (= 59.2ms / 47.4ms)

OneFlow resnet50 time: 39.8ms (= 1992.1ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.6ms (= 2379.3ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.19 (= 47.6ms / 39.8ms)

OneFlow resnet50 time: 34.7ms (= 1736.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.1ms (= 2154.4ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.24 (= 43.1ms / 34.7ms)

OneFlow resnet50 time: 149.0ms (= 7448.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 158.9ms (= 7944.6ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 158.9ms / 149.0ms)

OneFlow resnet50 time: 102.7ms (= 5133.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 107.1ms (= 5355.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 107.1ms / 102.7ms)

OneFlow resnet50 time: 81.7ms (= 4086.5ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 91.0ms (= 4550.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 91.0ms / 81.7ms)

OneFlow resnet50 time: 68.4ms (= 3421.8ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 3487.8ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.02 (= 69.8ms / 68.4ms)

OneFlow resnet50 time: 68.8ms (= 3438.1ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.7ms (= 3133.8ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.91 (= 62.7ms / 68.8ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 1d698b5 into master Sep 12, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the cpu_all_reduce branch September 12, 2021 10:38