release memory of task node on the fly #4735
base: master
* del object_storage.cpp * use name GLOBAL_PARA_SYM2SHARED_OPKENEL_OBJ_MUTEX * mig CheckRefInBlobObjectParallelDesc and OperandBlobObjects rel api * mig _StatelessCall * mig _StatelessCall * [one::OpBuilder] Refactor Operation. * mig StatelessCall api * mig StatefulCall * mig callback api * mig MakeLazyRefBlobObject * refactor CudaHostPinBlob * sort out InstructionsBuilder api * [one::OpBuilder] Refine * mig PhysicalRun and LogicalRun * use oneflow_api.deprecated.LogicalRun & PhysicalRun * delete vm_util.py * change FindOrCreateDelegateBlobObject args * add SetShuttingDown * rm python_interpreter_util.py * add blank line * mig BlobCache * use FindOrCreateDelegateBlobObject in c++ * refactor session_util.cpp * use IsShutDown * refactor BlobRegister * [one::OpBuilder] Refactor OpExpr and OpExprInterpreter. * [one::OpBuilder] Remove member function evaluate. * [one::OpBuilder] Remove OpInterpreter to facilitate CR. * [one::OpBuilder] Refine * fix distribute test exit bug * del comment * [one::OpBuilder] Add more op exprs. * mig id_util and scope_util * use cfg_op_conf and Object* * use Object* * del _ * fix func name error * [one::OpBuilder] Refine * [one::OpBuilder] Modify op input names by InOutBnAccessor. * [one::OpBuilder] Fix op expr python api. * use MapAt and shared_ptr * [one::OpBuilder] Using indexed ibns and obns instead of tensor names. * [one::OpBuilder] Update * [one::OpBuilder] Complete op interpreter in the main. * use shared_ptr or const ref * minor fix * add todo * [one::OpBuilder] Export and extend BoxingUtil in python. * [one::OpBuilder] Support variable op interpretation. * [one::OpBuilder] Refine * [one::OpBuilder] Refine placeholder prefix. * minor fix * minor djustment * minor fix * [one::OpBuilder] Migrate snapshot manager. * minor optimize * minor fix * minor optimize * [one::OpBuilder] Migrate Session. * [one::OpBuilder] Fix return type. * [one::OpBuilder] Refine variable interpretation. 
* minor optimize * fix bug * [one::OpBuilder] Call python FeePath. * [one::OpBuilder] Fix typo. * minor fix * [one::OpBuilder] Fix merge bugs. * [one::OpBuilder] Bugfix. * [one::OpBuilder] Refine * [one::OpBuilder] Bugfix * [one::OpBuilder] Refine * [one::OpBuilder] Fix typo. * [one::OpBuilder] Fix placeholder prefix of op builer. * [one::OpBuilder] Set output blob object after running the instruction. * [one::OpBuilder] Remove TensorNameScope temporarily. * [one::OpBuilder] Fix api. * [one::OpBuilder] Add TensorNameScope. * [one::OpBuilder] Modify the op builder apis to return Maybe type. * [one::OpBuilder] Remove unused header. * [one::OpBuilder] Fix * [one::OpBuilder] Fix typo * [one::OpBuilder] Fix bugs. * [one::OpBuilder] Create tensor from blob object. * [one::OpBuilder] Return Maybe in op interpreter. * [one::OpBuilder] Fix merge conflicts and reformat. * [one::OpBuilder] Remove unused code. * [one::OpBuilder] Go through eager mode. * [one::OpBuilder] Fix typo * [one::OpBuilder] Refine * [one::OpBuilder] Remove redundant file system and refine code style. * [one::OpBuilder] Use TensorTuple instead of TensorList and provide method to access the valide interpreter. * [one::OpBuilder] Fix bug * [one::OpBuilder] Remove state input. * [one::OpBuilder] Create output tensors for lazy mode. * [one::OpBuilder] Change pybind hold type to shared_ptr for Callback, Watch and BoxingUtil. * [one::OpBuilder] Move to deprecated. * [one::OpBuilder] Remove unused code. * [one::OpBuilder] Fix typo. * [one::OpBuilder] Support TensorTuple input and fix typo. * [one::OpBuilder] Fix bugs. * [one::OpBuilder] Change pybind hold type to shared_ptr for Callback, Watch and BoxingUtil. * [one::OpBuilder] Fix bugs. * [one::OpBuilder] Migrate snapshot manager. * [one::OpBuilder] Remove redundant file system and refine code style. * [one::OpBuilder] Migrate Session. * [one::OpBuilder] Refine * [one::OpBuilder] Refine * [one::OpBuilder] Call python FeePath. 
* [one::OpBuilder] Export and extend BoxingUtil in python. * [one::OpBuilder] Deprecate boxing util python api. * [one::OpBuilder] Reformat * Merge * Refine * Use reference op expr. * Fix default value * Fix LocalFS use bug * Fix and refine * Return Maybe * Reformat * Refine * Remove initialization to simplify variable op interpretation. * Revert SnapshotManager and Session * Fix register BoxingUtil * Remove unused header * Revert foreign callback * Refine code style * Fix register BoxingUtil * Get device from parallel desc to make a new tensor. * Boxing call instructions and refine. * Fix ci error * Swap the op conf since the move constructor was not provided by protobuf messages. * Refine * Refine code style * Reformat * Fix error merge * Make interpreter stateless and remove OpExprInterpContext. * Refine * reformat * Fix typo * Revert interpreter interface * Use move assignment since protobuf has been upgrade. * Refine * Refine Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: hjchen2 <hjchen2> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 8c7476c
* refine * fix yml syntax * use pr.* * refine Former-commit-id: ad4dbf3
* add xla back * allow fail * refine * refine Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: b2c5212
* interface_op support parallel_distribution * add JUST * fix * fix Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 9d6ab3f
* refine * add check Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 031dd54
* refactor_builder_instr_pb_list_to_instr_msg_list todo list * modified intruction proto list to msg list * remove header to cpp, add cfg constructor * modified variable life-time, using template for instruction msg and operand * fix bug in instruction builder construction * add cfg and template init for operand * fix id to symbol * fix function parameter name * fix template link error * format Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: eab2f0d
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 14d8c25
* feat(AutoGradMode): add AutoGradMode * style(*): fix typo * feat(AutoGradMode): refine codes * style(*): refine codes * style(*): refine codes * style(*): use namespace instead of static * style(AutoGradMode): use class instead of struct Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 9a11c3f
* always build local rpc backend * only use grpc if it is multi process or multi machine * rm py35 * add flags to enable Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: dbcc369
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 306e0e9
* add min_max_observer and moving_average_min_max_observer conversion in onnx * fix op's name * fix moving_average_min_max_observer * update quantization ops conversion * update quantization ops conversion and tests * update ops version * delete auto-imported package * update min_max_observer op * format * fix test_quantization_aware_training * update test_quantize_op Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: f5eb0fa
* user_kernel support parallel_distribution * refine * fix Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 0f4870f
* refactor_builder_instr_pb_list_to_instr_msg_list todo list * modified intruction proto list to msg list * remove header to cpp, add cfg constructor * modified variable life-time, using template for instruction msg and operand * fix bug in instruction builder construction * add cfg and template init for operand * fix id to symbol * fix function parameter name * fix template link error * format * replace mirror * broadcast object * build send instruction * build recv instruction * cuda host * lazy reference * remain instruction * minor fix * minor fix * minor fix * minor fix * add_del_object_operand * fix typo * add delete dependent instructions * fix mutable operand error * remove useless functions * add ignore instruction function * fix wrong use for intr_msg_list constructor Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 4c26152
Former-commit-id: a7b53ed
* always enable local * add static_assert * add log info Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 54d5520
Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Former-commit-id: 640084e
* add sequential callback instruction * add a test_case for sequential instruction type * refactor RunLogicalInstruction/RunPhysicalInstruction * refactor RunLogicalInstruction/RunPhysicalInstruction Former-commit-id: a6d0307
* hierarchical boxing sub_graph * refine * fix * refine * refine Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 3d84630
* handle ctrl msg from other rank * static way * use check * add todo * add CHECK * add DumpToConsumedRegstDescId2Addr function * fix comment * handle returned_regst_num * optimize code * rename regst_desc_id2regst_desc_addr_ * fix Segfault fault * remove returned_regst_num * rename arg and function * add info im plan * use name producer_task_id Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 4380494
* use symbol::Storage<OperatorConfSymbol> * _NewOpKernelObject * mig OpKernelObject * mig object_storage * make of_format * del comment * std::function<void(Object*) * mig NewOpKernelObject and _StatefulCallOpKernel * mig _StatefulCallOpKernel and GetSharedOpKernelObject4ParallelConfSymbol * del object_storage.cpp * use name GLOBAL_PARA_SYM2SHARED_OPKENEL_OBJ_MUTEX * mig CheckRefInBlobObjectParallelDesc and OperandBlobObjects rel api * mig _StatelessCall * mig _StatelessCall * [one::OpBuilder] Refactor Operation. * mig StatelessCall api * mig StatefulCall * mig callback api * mig MakeLazyRefBlobObject * refactor CudaHostPinBlob * sort out InstructionsBuilder api * [one::OpBuilder] Refine * mig PhysicalRun and LogicalRun * use oneflow_api.deprecated.LogicalRun & PhysicalRun * delete vm_util.py * change FindOrCreateDelegateBlobObject args * add SetShuttingDown * rm python_interpreter_util.py * add blank line * mig BlobCache * use FindOrCreateDelegateBlobObject in c++ * refactor session_util.cpp * use IsShutDown * refactor BlobRegister * [one::OpBuilder] Refactor OpExpr and OpExprInterpreter. * [one::OpBuilder] Remove member function evaluate. * [one::OpBuilder] Remove OpInterpreter to facilitate CR. * [one::OpBuilder] Refine * fix distribute test exit bug * del comment * [one::OpBuilder] Add more op exprs. * mig id_util and scope_util * use cfg_op_conf and Object* * use Object* * del _ * fix func name error * [one::OpBuilder] Refine * [one::OpBuilder] Modify op input names by InOutBnAccessor. * [one::OpBuilder] Fix op expr python api. * use MapAt and shared_ptr * [one::OpBuilder] Using indexed ibns and obns instead of tensor names. * [one::OpBuilder] Update * [one::OpBuilder] Complete op interpreter in the main. * use shared_ptr or const ref * minor fix * add todo * [one::OpBuilder] Export and extend BoxingUtil in python. * [one::OpBuilder] Support variable op interpretation. * [one::OpBuilder] Refine * [one::OpBuilder] Refine placeholder prefix. 
* minor fix * minor djustment * minor fix * [one::OpBuilder] Migrate snapshot manager. * minor optimize * minor fix * minor optimize * [one::OpBuilder] Migrate Session. * [one::OpBuilder] Fix return type. * [one::OpBuilder] Refine variable interpretation. * minor optimize * fix bug * [one::OpBuilder] Call python FeePath. * [one::OpBuilder] Fix typo. * minor fix * [one::OpBuilder] Fix merge bugs. * [one::OpBuilder] Bugfix. * [one::OpBuilder] Refine * [one::OpBuilder] Bugfix * [one::OpBuilder] Refine * [one::OpBuilder] Fix typo. * [one::OpBuilder] Fix placeholder prefix of op builer. * [one::OpBuilder] Set output blob object after running the instruction. * [one::OpBuilder] Remove TensorNameScope temporarily. * [one::OpBuilder] Fix api. * [one::OpBuilder] Add TensorNameScope. * [one::OpBuilder] Modify the op builder apis to return Maybe type. * [one::OpBuilder] Remove unused header. * [one::OpBuilder] Fix * [one::OpBuilder] Fix typo * [one::OpBuilder] Fix bugs. * [one::OpBuilder] Create tensor from blob object. * [one::OpBuilder] Return Maybe in op interpreter. * [one::OpBuilder] Fix merge conflicts and reformat. * [one::OpBuilder] Remove unused code. * [one::OpBuilder] Go through eager mode. * [one::OpBuilder] Fix typo * [one::OpBuilder] Refine * [one::OpBuilder] Remove redundant file system and refine code style. * [one::OpBuilder] Use TensorTuple instead of TensorList and provide method to access the valide interpreter. * [one::OpBuilder] Fix bug * [one::OpBuilder] Remove state input. * [one::OpBuilder] Create output tensors for lazy mode. * [one::OpBuilder] Change pybind hold type to shared_ptr for Callback, Watch and BoxingUtil. * [one::OpBuilder] Move to deprecated. * [one::OpBuilder] Remove unused code. * [one::OpBuilder] Fix typo. * [one::OpBuilder] Support TensorTuple input and fix typo. * [one::OpBuilder] Fix bugs. * [one::OpBuilder] Change pybind hold type to shared_ptr for Callback, Watch and BoxingUtil. * [one::OpBuilder] Fix bugs. 
* [one::OpBuilder] Migrate snapshot manager. * [one::OpBuilder] Remove redundant file system and refine code style. * [one::OpBuilder] Migrate Session. * [one::OpBuilder] Refine * [one::OpBuilder] Refine * [one::OpBuilder] Call python FeePath. * [one::OpBuilder] Export and extend BoxingUtil in python. * [one::OpBuilder] Deprecate boxing util python api. * [one::OpBuilder] Reformat * Merge * Refine * Use reference op expr. * Fix default value * Fix LocalFS use bug * Fix and refine * Return Maybe * Reformat * Refine * Remove initialization to simplify variable op interpretation. * Revert SnapshotManager and Session * Fix register BoxingUtil * Remove unused header * Revert foreign callback * Refine code style * Fix register BoxingUtil * Get device from parallel desc to make a new tensor. * Boxing call instructions and refine. * Boxing call instructions and refine. * Construct variable and user op from Python. * Construct variable and user op from Python. * Fix ci error * Swap the op conf since the move constructor was not provided by protobuf messages. * Refine * Refine code style * Reformat * Fix error merge * Make interpreter stateless and remove OpExprInterpContext. * Refine * reformat * Refine Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: hjchen2 <hjchen2> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 1117632
* operator infer parallel_distribution Former-commit-id: 82b6846
* add sequential callback instruction * add a test_case for sequential instruction type * refactor RunLogicalInstruction/RunPhysicalInstruction * refactor RunLogicalInstruction/RunPhysicalInstruction * refactor front sequential instruction Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: bb4a310
* fix constructor * fix constructor Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 54d069f
* add min_max_observer and moving_average_min_max_observer conversion in onnx * fix op's name * fix moving_average_min_max_observer * update quantization ops conversion * update quantization ops conversion and tests * update ops version * delete auto-imported package * update min_max_observer op * format * fix test_quantization_aware_training * update test_quantize_op * add fake_quantization conversion in onnx * format quantize.py * fix fake_quantization * update fake_quantization conversion * update fake_quantization conversion and its test * update quantization_aware_training * update quantization_aware_training Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 1fef03a
* handle ctrl msg from other rank * static way * use check * add todo * add CHECK * add DumpToConsumedRegstDescId2Addr function * fix comment * handle returned_regst_num * optimize code * rename regst_desc_id2regst_desc_addr_ * fix Segfault fault * remove returned_regst_num * rename arg and function * add info im plan * use name producer_task_id * fix_bug_of_multi_node_with_rank_info_bootstrap * fix bug Former-commit-id: 2fe1b28
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 30f3439
* feat(*): implement autograd engine * feat(OpExprHelper): add op_expr_helper codes * style(*): remove outdated TODO * style(*): refine check message Former-commit-id: 74d7a71
* trainer structure * add test * add nnmodel api * nn Model draft * try run global_func in Model * fit to be refined * model run global_func train & eval * nn Model for function style execution draf test pass * refactor nn model * nn model with nessary component * format * rm nn prefix of Model * flow.Model multi-task numpy-input * (flow.Model)op_dataload support multi job * (flow.Model) auto job_func signature for numpy input * (flow.Model)support auto numpy input job * (flow.Model) nump input multi job train test pass * (flow.Model)fix classmethod * fix test * (oneflow.Model)training_step multi output, refine according to pep8 * (oneflow.Model)pep8 check pass by flake8 * Model refine * Model fix typo * oneflow.Model optimizer variable lazy get, numpy job signature to DataModule * oneflow.Model merge and format * oneflow.Model: comment empty func to be overried * Optimizer: lazy get var add check and tips * oneflow.Model: refactor * oneflow.Model: refactor 2 * oneflow.Model: ModelStage -> SubStep, TrainStage -> TrainStep * fix format * oneflow.Model: SubStep to SubModel * oneflow.Model: infer_oneflow_data_placeholder and _infer_job_signature * add todo for GetCurrentJobName() * fix typo * oneflow.Model: refine error message * fix format * oneflow.Model: rm FunctionConfig in Model * oneflow.Model config_exe to config_execution * oneflow.Model: merge module * Optimizer: user mode to confirm that Optimizer.Variable() is called inside a job * simplify test * oneflow.Model fix according to review * rm useless code Former-commit-id: 4de3595
* need to be reformat * reformat * add docstring and refine test case * add test case * refine according to comments of wyg * refine * add TODO for asymmetric padding Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: dfcd1d7
* add exp_tanh_gelu module * fix comment * fix comment * fix comment Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 4b876e4
* add memory detect info * small fix in opattrref optimize * use bitset * refactor using vector * refine * refine * rename * refine * address review * address review * refine * refine * address review * smaller BITSET_SIZE * refine * refine * refine * refine nameing * refine * refine * refine * update * delete swp file * small update * format fix * format modify * format modify * Update compiler.cpp fix for comment * Update reshape_user_op_util.cpp bug about reshape is fixed Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 3f728c0
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 25f9f17
* b21 boxing add ctrl_edge * refine Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Co-authored-by: cheng cheng <472491134@qq.com> Former-commit-id: a9f70c7
* model_io_v2 process multi variable * refine * reserve Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: f7b5bb0
* add greater_less_argmax module * fix comment * add test_case * fix conflict * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * fix comment * format file Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 4789392
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 32a240c
* add flatten module and unit test * add changes according to review * fix docs * use nn.init * add modification according to review * remove useless code * add more tests & fix doc * fix default parameters for module and function Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 7940841
* NcclLogialOpAllGatherNoncontinuous S1 to B * delete useless file * fix user op attr err * fix tmp_buffer * Update oneflow/user/kernels/nccl_logical_kernels.cpp * fix data size check in 2d s1-b Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Former-commit-id: 92337ef
Former-commit-id: 8d02926
* move default log under log dir * remove default env log from proto * refine Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: c402963
* Refactor * Draft * Refactor * Add MutableCfgAttrValueMap * implement AttrValueMap and ComposedAttrValueMap (#4767) * Attr value util (#4773) * implement AttrValueMap and ComposedAttrValueMap * AttrValueUtil::ToProtoAttrValue * Fix compile * Fix compilation * Rename AttrValueMap by AttrMap. Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: hjchen2 <hjchen2> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Former-commit-id: 354652e
04b328e to feb5d18
No description provided.