
Cpu all reduce #5849

Merged: 12 commits merged into master, Sep 12, 2021
Conversation

@lixinqi (Contributor) commented Aug 12, 2021

CPU version of all_reduce, built on top of the transport layer.
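As background, the ring all-reduce this PR implements (a reduce-scatter phase followed by an all-gather phase around the rank ring) can be sketched as a single-process simulation. This is a hypothetical illustration using the textbook chunk-rotation scheme, not the PR's exact transport-based code; sends are modeled as direct buffer copies.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Single-process simulation of ring all-reduce: each "rank" owns one
// buffer; after the call, every buffer holds the element-wise sum.
std::vector<std::vector<float>> RingAllReduce(std::vector<std::vector<float>> bufs) {
  const std::size_t n = bufs.size();     // number of ranks
  const std::size_t len = bufs[0].size();
  const std::size_t part = len / n;      // assumes len is divisible by n
  // Phase 1: reduce-scatter. In step s, rank r receives one chunk from
  // rank (r - 1) and adds it into its own copy of that chunk.
  for (std::size_t s = 0; s + 1 < n; ++s) {
    std::vector<std::vector<float>> next = bufs;
    for (std::size_t r = 0; r < n; ++r) {
      std::size_t src = (r + n - 1) % n;
      std::size_t chunk = (src + n - s) % n;  // chunk traveling the ring
      for (std::size_t i = chunk * part; i < (chunk + 1) * part; ++i) {
        next[r][i] = bufs[r][i] + bufs[src][i];
      }
    }
    bufs = std::move(next);
  }
  // Phase 2: all-gather. Each fully reduced chunk circulates the ring,
  // overwriting the stale copies on the other ranks.
  for (std::size_t s = 0; s + 1 < n; ++s) {
    std::vector<std::vector<float>> next = bufs;
    for (std::size_t r = 0; r < n; ++r) {
      std::size_t src = (r + n - 1) % n;
      std::size_t chunk = (src + n - s + 1) % n;
      for (std::size_t i = chunk * part; i < (chunk + 1) * part; ++i) {
        next[r][i] = bufs[src][i];
      }
    }
    bufs = std::move(next);
  }
  return bufs;
}
```

Each rank sends and receives only len/n elements per step, which is why the PR splits the tensor with BalancedSplitter rather than shipping whole buffers.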

transport_token,
[&](void** buffer, std::size_t* size, std::function<void()>* Cb) -> Maybe<void> {
*buffer = const_cast<T*>(send_ptr);
*size = send_size;
Contributor: The unit here seems wrong.

Contributor (author): Right, it looks like this needs to be multiplied by GetSizeOfDataType.
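The fix discussed here can be sketched as follows: the transport callback reports a buffer size in bytes, while BalancedSplitter ranges count elements. In this hypothetical sketch, sizeof(T) stands in for OneFlow's GetSizeOfDataType.

```cpp
#include <cstddef>

// Sketch: convert an element count into the byte size the transport
// layer expects. sizeof(T) plays the role of GetSizeOfDataType here.
template<typename T>
std::size_t SendSizeInBytes(std::size_t elem_cnt) {
  return elem_cnt * sizeof(T);
}
```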

BalancedSplitter bs(size, thread_num);
MultiThreadLoop(thread_num, [&](size_t thread_idx) {
size_t end = bs.At(thread_idx).end();
for (size_t i = bs.At(thread_idx).begin(); i < end; ++i) { out[i] = in0[i] + in1[i]; }
Contributor:

MultiThreadLoop(size, [&](size_t i) {
  out[i] = in0[i] + in1[i];
});

Can this be written directly like this?

Contributor (author): It should work. At the time I just wanted more locality: written your way, the for loop inside MultiThreadLoop would repeatedly invoke a std::function, which is not efficient.

Contributor: Got it, understood.
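The locality argument above can be illustrated with a minimal stand-in for MultiThreadLoop (the real OneFlow helper differs): each worker receives one chunk callback, and the per-element work runs in a plain inner loop rather than through a std::function call per element.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for MultiThreadLoop: splits [0, n) into even
// chunks and invokes chunk_cb(begin, end) once per worker thread.
void MultiThreadLoopSketch(std::size_t n, std::size_t thread_num,
                           const std::function<void(std::size_t, std::size_t)>& chunk_cb) {
  std::vector<std::thread> workers;
  std::size_t chunk = (n + thread_num - 1) / thread_num;
  for (std::size_t t = 0; t < thread_num; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = std::min(n, begin + chunk);
    if (begin >= end) { continue; }
    workers.emplace_back([=] { chunk_cb(begin, end); });
  }
  for (auto& w : workers) { w.join(); }
}

// The PR's pattern: one indirect call per chunk, tight plain loop
// inside, so the hot path has no per-element std::function dispatch.
void AddVectors(const float* in0, const float* in1, float* out, std::size_t n) {
  MultiThreadLoopSketch(n, 4, [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) { out[i] = in0[i] + in1[i]; }
  });
}
```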

send_ptr = &in[bs.At(send_part_id).begin()];
} else {
send_ptr = &out[bs.At(send_part_id).begin()];
}
Contributor:

const T* send_ptr = &(i == 0 ? in : out)[bs.At(send_part_id).begin()]

When I write it this way and i != 0, no matter what bs.At(send_part_id).begin() is, it always ends up with the base address of out; I don't understand why.
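For reference on precedence only (this does not explain the behavior observed above, whose cause may lie elsewhere in the surrounding code): the subscript operator binds tighter than unary &, so &(cond ? in : out)[k] parses as &((cond ? in : out)[k]). The ternary selects a base pointer first, then the offset k is applied to whichever pointer was chosen. A minimal check with a hypothetical SelectPtr helper:

```cpp
#include <cstddef>

// &(cond ? in : out)[k] == (cond ? in : out) + k, because [] binds
// tighter than unary &. So the false branch should yield out + k,
// not the bare base address of out.
const float* SelectPtr(bool use_in, const float* in, const float* out, std::size_t k) {
  return &(use_in ? in : out)[k];
}
```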

JUST(TransportUtil::ReceiveFromPrevRankInRing(rank_group, transport_token, &ctx));
}
JUST(TransportUtil::WaitUntilDoneOrTimeout(ctx, TransportUtil::TimeoutSeconds()));
const T* cur_in = &in[bs.At(recv_part_id).begin()];
Contributor: The data here is never taken from out.

return JUST(one::functional::ConsistentAllReduce(tensor));
}

COMMAND(RegisterBoxingFunction("cpu-p-to-b", CheckCpuP2B, &CpuP2B));
Contributor: "cpu-p-to-b" could be renamed to "ccl-p-to-b", matching "nccl-p-to-b".

Contributor: How about also renaming identifiers like CheckCpuP2B to match?

Contributor: Done.

…, placement

Conflicts:
	oneflow/core/boxing/ccl_boxing_function.cpp
	oneflow/user/kernels/eager_nccl_kernels.cpp
@@ -95,7 +95,8 @@ Maybe<BoxingExprIf> RawMainBoxingExpr() {
          | JUST(BoxingExpr(JUST(InPlacementAndBroadcast()), JUST(BoxingExpr("nccl-s-to-b")),
                            JUST(BoxingExpr("naive-b-to-p"))))
          | JUST(BoxingExpr("asymmetric-x-to-b")) | JUST(OneToNBoxingExpr()) | JUST(NToOneBoxingExpr())
-         | JUST(BoxingExpr("naive-1-to-1")) | JUST(GenericBoxingExpr());
+         | JUST(BoxingExpr("naive-1-to-1")) | JUST(GenericBoxingExpr())
+         | JUST(BoxingExpr("ccl-p-to-b"));
Contributor: Put this after the nccl BoxingExpr.

Contributor: Changed.

@@ -13,6 +13,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
+#include <atomic>
Contributor (author): Let's revert this file.

Contributor: Reverted.

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 09:10
@oneflow-ci-bot oneflow-ci-bot self-requested a review September 12, 2021 09:11
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.6ms (= 6380.5ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.5ms (= 7023.4ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 140.5ms / 127.6ms)

OneFlow resnet50 time: 74.0ms (= 3699.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.4ms (= 4220.8ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 84.4ms / 74.0ms)

OneFlow resnet50 time: 46.8ms (= 2339.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.6ms (= 2932.4ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.25 (= 58.6ms / 46.8ms)

OneFlow resnet50 time: 43.5ms (= 2174.5ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.9ms (= 2447.4ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 48.9ms / 43.5ms)

OneFlow resnet50 time: 43.3ms (= 2162.8ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.7ms (= 2185.4ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.01 (= 43.7ms / 43.3ms)

OneFlow resnet50 time: 152.2ms (= 7611.7ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 8044.6ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.06 (= 160.9ms / 152.2ms)

OneFlow resnet50 time: 101.5ms (= 5074.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.4ms (= 5071.0ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.00 (= 101.4ms / 101.5ms)

OneFlow resnet50 time: 83.9ms (= 4194.9ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.6ms (= 3980.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.95 (= 79.6ms / 83.9ms)

OneFlow resnet50 time: 79.9ms (= 3996.2ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.0ms (= 3547.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.89 (= 71.0ms / 79.9ms)

OneFlow resnet50 time: 69.4ms (= 3470.2ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 58.7ms (= 2932.6ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.85 (= 58.7ms / 69.4ms)

@github-actions

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 09:59
@liufengwei0103

Port unavailable error, so I re-ran it manually.

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.5ms (= 6375.0ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.3ms (= 7066.5ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 141.3ms / 127.5ms)

OneFlow resnet50 time: 73.9ms (= 3694.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.1ms (= 4154.6ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 83.1ms / 73.9ms)

OneFlow resnet50 time: 47.4ms (= 2368.4ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.2ms (= 2959.5ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.25 (= 59.2ms / 47.4ms)

OneFlow resnet50 time: 39.8ms (= 1992.1ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.6ms (= 2379.3ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.19 (= 47.6ms / 39.8ms)

OneFlow resnet50 time: 34.7ms (= 1736.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.1ms (= 2154.4ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.24 (= 43.1ms / 34.7ms)

OneFlow resnet50 time: 149.0ms (= 7448.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 158.9ms (= 7944.6ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 158.9ms / 149.0ms)

OneFlow resnet50 time: 102.7ms (= 5133.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 107.1ms (= 5355.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 107.1ms / 102.7ms)

OneFlow resnet50 time: 81.7ms (= 4086.5ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 91.0ms (= 4550.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 91.0ms / 81.7ms)

OneFlow resnet50 time: 68.4ms (= 3421.8ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 3487.8ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.02 (= 69.8ms / 68.4ms)

OneFlow resnet50 time: 68.8ms (= 3438.1ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.7ms (= 3133.8ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.91 (= 62.7ms / 68.8ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 1d698b5 into master Sep 12, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the cpu_all_reduce branch September 12, 2021 10:38