
Split send_op into fetch_vars_op and send_vars_op #9161

Closed
@Yancey0623

Currently, the trainer sends all gradients only after every backward op has executed, like:

w1-->opA->w2->opB->opB(backward)->w2'->opA(backward)->w1'->send(w1',w2')

In the above process, the send op does not send any gradient until all the forward and backward ops are done.

Instead, we could send w2' right after opB(backward) and w1' right after opA(backward); running compute ops and IO ops in parallel should improve performance. In addition, the current SendOp not only does the SEND, but also waits for all send requests to finish and then receives parameters from the pserver, so we also need to split these responsibilities into multiple ops.

For sync update:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')->send_barrier()

For async update, there is no send_barrier() op at the end of the process:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')
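
The two schedules above can be illustrated with a small Python sketch. This is not Paddle code: `send`, `run_step`, and the `log` list are hypothetical names used only for illustration, and the RPC to the pserver is simulated with a short sleep. It shows each gradient being handed to an IO thread pool as soon as it is computed, with an optional wait that plays the role of send_barrier() in the sync case.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def send(name, log):
    """Hypothetical stand-in for sending one gradient to a pserver."""
    time.sleep(0.01)  # simulate network latency
    log.append(("sent", name))
    return name


def run_step(io_pool, sync=True):
    """Run one simulated backward pass, sending each gradient as it appears."""
    log = []
    pending = []
    # Backward ops run in reverse order: opB(backward) yields w2',
    # then opA(backward) yields w1'.
    for _op, grad in [("opB(backward)", "w2'"), ("opA(backward)", "w1'")]:
        log.append(("computed", grad))
        # Hand the send to the IO pool so the next backward op is not blocked.
        pending.append(io_pool.submit(send, grad, log))
    if sync:
        # Plays the role of send_barrier(): block until every send finished.
        wait(pending)
        log.append(("barrier", None))
    return log, pending


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        log, _ = run_step(pool, sync=True)
    for event in log:
        print(event)
```

With `sync=False` the function returns without waiting, which corresponds to the async update case; the caller can still inspect the returned futures if it needs to know when the sends completed.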

TODO

  • Implement AsyncSendOp and SendBarrierOp.
  • Implement an IO threadpool to handle async sends.
  • Enhance the distribute transpiler to use the async send op.
  • Update benchmark report.
