Rank task graph merge master #9440
Conversation
* scalar math use primitive * fix * support pow grad * dev scalar pow grad * remove useless code * use std * auto format by CI * Refine Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* add higher order derivative for smooth_l1/nll loss * add higher order derivative for bce/kl_div loss * fix bug and refine testcase * fix wrong sbp signature of bce loss * optimize code and align precision with pytorch * add some index check * disable calc derivative for target in bce loss * remove unnecessary header include * fix sbp setting in testcase, and restore out_grads size check * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* add higher order derivative for softmax/logsoftmax * add higher order derivative for mish/gelu activation * auto format by CI * add comment for constexpr parameter Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* add higher order derivative for pool * refine * optimize * fix ndim check error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* support prob for crossentropy, still has bug for dims > 2 * fix bug of for ndim > 2 inputs, refine code * refine code, use template HasLabelSmoothing * fix grad bug of for ndim > 2 inputs, use pre-calculated factor in kernel * format code, remove redundant including header files * refine op * restore wrong modification * remove op, implement at functor layer * set bind_python to false, remove redundant header files * add docs * fix missing default param in unittest, fix typo in docstr example * auto format by CI * Update loss.py * remove useless file Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* Fix nvjpegGetImageInfo * fix set ROI
* startup: cpu adaptive max pool 2d finished (a draft) * add 1d/2d/3d forward * add return_indices * refine files hieararchy * add adaptive_max_pool2d_grad for test * draft backward op for maxpool 2d * cpu op/kernel finished * reformat * gpu draft kernel * gpu forward finished * draft gpu backward version * refine gpu backward * add nn.AdaptiveMaxPoolnd Module * add docstring * rename avg pool gpu file * refine .td file * refine * refine test case * refine * refine by comments of zzk * refine according to clang_tidy errors * refine * refine by comments of zhuping
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* one_embedding eager forward * deterministic forward gen random * merge master * merge master * grad op add attrs * Revert "grad op add attrs" This reverts commit 33b67c7. * auto format by CI * format * refine * prefetch consume id_shuffle out and exec in advance * add new task_node * sort and add ctrl edge * rm id_shuffle_task_node * add register same output blob regst num * rm tasktype * refine * address review * rename * refine * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* implement eager AMP * skip autocast for inplace and implement make autocast meta * fix * rm unused code * autocast python api * fix * fix * refine * skip autocast if any input is float32 for gray or clear list * refine * fix dead loop * add autocast unittest
* refine worker seed * refine * reifne * use default_generator.seed
* add groupnorm infer * Add groupnorm forward * refine other forawrd situation * groupnorm backward still has bug * fix forward * support backward * add slow groupnorm param grad kernel * use blockreduce * update blocknum * add gradient func * simplify code * refine and add global test * remove annotation * not limit split dim * fix compile error * Add spatialsize pack logic and fix launch blocknum bug * add two stage reduced backward kernel * refine * simplify logic * refine pack logic * use THREAD_CACHED_MUTABLE_ATTR_MAP * fix comment * refine * refine comment * Refine more check * fix affine=False bug * fix bug * tmp use gemm reduce * use ComputeType buf * fix nvbfloat16 compute type * add amp gray list * Revert back * fix clang analysis * refine userops.td * fix userops * remove result_segment_sizes * add dispatch logic for groupnorm grad uncached block impl Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* introduce_bfloat16_type * storage * fix compile error * support bfloat16 ep operator * support create cpu bfloat tensor * refine code * minor fix * fix static check error * reslove comment * add more test case * fix bfloat16 numeric_limits * fix error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* refine check in ibverbs * format * fix typo and test * refine error message when there is no errno Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* init * Add attribute val in Userops.td * simply add paddingidx logic in EncodeLookupKernel * add simple padding_idx EmbeddingGrad * when index is -1 let gather add 0 * skip atomicadd when row index equals to padding_idx * change padding_idx type to int64 * fix compile error * set padding_idx in Pass * 1n1d eval success * refine * remove print * fix compile error * revert * refine * fix compile * refine * Refine * refine * refine store options * remove embedding grad shuffle redundant padding_idx * move gather in datashuffle kernel * remove redundant code * Refine * refine * remove redundant header file * Set padding idx as optional and remove attr has_padding_idx * Add padding_idx unittest * use array equal instead of allclose * remove a test * enlarge timeout
* init * registry * add KernelLaunchFunctionPass * pass ninja and relu test * mlir test script & lowering * relu py * fi * kernel launch * fix * fix op and pass interfaces * add comment * add readme docs * fix typo * kenerl launch function pass is done * use template and rename func.func * declare * pass string through mlir.llvm dialect to c interface: llvm.mlir.global internal constant @"relu-0_var"("relu-0") %0 = "llvm.mlir.addressof"() {global_name = @"relu-0_var"} : () -> !llvm.ptr<array<6 x i8>> %1 = "llvm.mlir.constant"() {value = 0 : index} : () -> i64 %2 = "llvm.getelementptr"(%0, %1, %1) {structIndices = dense<-2147483648> : tensor<2xi32>} : (!llvm.ptr<array<6 x i8>>, i64, i64) -> !llvm.ptr<i8> * use symbol table * use oneflow variable op * fix symboltable * fix * ninja c1 check * split into kernel-launch-function pass and kernel-launch-with-llvm pass * restore pass 1 * Gen kernel example (#9042) * add example * add todo * add basic assertion * add file check * create pass in translation * sanitizeIdentifier * enable print * fix * update test file * kernel llvm pass is ok * pass ctx ptr to func and this ptr will be an operand to call c interface function * restore llvm ptr type to llvm.ptr<i8> * Kernel lookup in launch op (#9059) * add * move function to another unit * create map * add iter * impl TensorDesc4ArgNameAndIndex * set dev tag * load lib when ONEFLOW_MLIR_FUSE_KERNEL_LAUNCH is set * sharedlibs enables and pass enables in commpute * enable c interface callee * impl todo * naming * rm * add invalid * fix invoke arg * typed * rm log * rename pass * Update user_op_kernel_registry.h * Update user_op_kernel_registry.h * Update OneFlowOps.td * Update Passes.cpp * add comp ctx * add todo * refine todo * refactor op infer * minor fix * add check * refine error * refine msg * fix typo * fix typo * remove string in llvm * impl Tensor4ArgNameAndIndex * fix ninja c1 bug * realize gpu and add cuda test * auto format by CI * fix merge * fix ninja with cpu version * auto format by CI * rename * merge def * deduplicate code * fix * refactor * fix license * cache * add back TODO() * add jit arg type check * rm comment * fix typo * fix ci * todo ci * fix code style * rm misadded * rm misadded * Update Passes.cpp * pass ninja without debug about hungry mode of knerel init * fix null parsed module problem * fix dynamic cast of state problem * fix gpu error * fix * fix * auto format by CI * fix * Update kernel_launch_op.cpp * move * fix * auto format by CI * done * fix * fix * auto format by CI * fix * fix * auto format by CI * Update kernel_launch_op.cpp * rename * auto format by CI * fix * done * Update kernel_launch_op.cpp * fix * fix * fix * fix * fix * auto format by CI * Update oneflow/ir/oneflow-extension/kernel_launch_op.cpp Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * fix * fix * fix * fix * fix * Update oneflow/ir/lib/OneFlow/Passes.cpp Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * fix * fix * fix Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
* fix masked_select bug * refine * fix ci error
* align with pytorch RANK env * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* add OneflowHub feature, consistent with PyTorchHub * add oneflow hub docs * refine docs and add test * refine * refine * refine * fix comment * auto format by CI * skip unittest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* fix where op data_type infer bug * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* fix_global_tensor_detach_bug * fix test case
* add new op * add kernel * add deform_conv * add some test * modify test * modify format * modify test * fix the bug and add test * Add error message * modify kernel and add test * adjust the format * add global test * Update python/oneflow/test/modules/test_deform_conv2d.py * add doc and modify global test * adjust OneFlowUserOps.td * remove headfile and modify doc * modify doc * add docs at rst * modify global test * remove unnecessary code * remove unnecessary code * remove debug code * initialize fields * modify global test * modify test * modify test * modify test * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* fix inplace mul 0-size tensor check bug * code format * revert
* align round op * add test * modify doc ,test and kernel * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* rm dict in module apply * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* support broadcast table_ids * address review * fix like op infer dtype * address review * address review * refine
* refine error msg for framework * more error messages * fix size_t comparison with zero * check for incomplete error messages * err msg for inconsistent placement * modify acc. to review * convert enum to string in error msg * fix redundant error info; clean up * refine error msg for consistency check * auto format by CI Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* fix end_factor * fix indent Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
fix(Indexing): fix lazy scalar tensor indexing Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* grouped matmul bias * fiux * grouped_matmul * Amp list * fix cu102 * fix
* optimize upsample nearest2d backward * refine * revert * pack dy * fix comment * fix comment * fix comment * fix comment * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* Move the "-expand" and "-cast" ops backward * Hard-coding for stable diffusion, maximize overlaps * Use op_tyep_name instead of visual string * Change transfer nodes to tributary nodes * Rename tributary to overlap * Prepare to test different decide parameters * Prepare to print and test * {7, 5} seems to be one of the best as before * Find the best straighten mode 973 for stable diffusion * Put cpu nodes into overlap node list * Disable overlap between cpu and gpu if no cpu nodes * Update API * Remove magical number * Update comment * Remove std log message * Remove debug code * Static analysis * Variable op still have activation time in cpu * Rename (address comment)
* profiling tensor.item * SyncAccessInstructionPolicy * FastCopy supports 128-bit data_types * address static analyzer complaints * revert changes about tensor.numpy() * Stream::CheckSizeAndGetTmpSmallPinnedMemPtr * disable busy wait in SyncAccessSmallMem * auto format by CI Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* KernelPriority * concat * fix empty tensor * address review
* rename block to GraphBlock * rename attr of graph block to avoid name colision with origin * rename * add test * revert rename * revert old and test pass * rename BlockConfig to GraphModuleConfig * add test of module property * rename to_graph to trace * refactor block with GraphModule SubGraph gt * refact and test graph pass * all test passed * fix typo of auto_parallel_mainstream_algo * refine ModuleBlock repr * revert auto_parallel_mainstream_algo * auto format by CI * support mixin * auto format by CI * add test of mixin property * auto format by CI * fix auto test error * fix doctest of graph.py * refine doc * fix outdated api and typo * address review * auto format by CI * fix * auto format by CI * rename block to proxy * auto format by CI * avoid use GraphBlock in to * auto format by CI * fix style * add import * fix doc * use new style * support graph tensor set_stage * auto format by CI * update libai commit * Revert "update libai commit" This reverts commit d000c1a. * update libai commit * fix comm barrier * format * auto format by CI * echo oneface commit id * add log * Update test.yml * Update test.yml * Update test.yml * Update test.yml * Update test.yml * add pytest * Update test.yml * Update test.yml * use pytest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* pull in files * port changes on td * refine file check * refine file check * allow opt to run properly * auto format by CI * refine case * refine * gn+silu * add flag * Update oneflow/ir/lib/OneFlow/Transform/CSEWithAttributesIgnored.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * Update oneflow/ir/lib/OneFlow/Transform/CSEWithAttributesIgnored.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * Update oneflow/ir/oneflow-opt/oneflow-opt.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * auto format by CI * update description Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
remove cuda half unittest in maxunpool
fix checkpointv2
// Step1: process some scalar items.
// if (check) { CHECK_NE(chain_id_, -1); }
After merging master this has to be commented out, otherwise the following test does not pass:
ONEFLOW_TEST_DEVICE_NUM=2 python3 -m oneflow.distributed.launch --nproc_per_node 2 ./test_graph_zero.py --failfast --verbose
void RegstDesc::ToProto(RegstDescProto* ret) const {
void RegstDesc::InitFromProtoExceptConsumers(const RegstDescProto& proto) {
  regst_desc_id_ = proto.regst_desc_id();
  CHECK_EQ(proto.producer_task_id(), producer_->task_id());
After merging master, running the 2-GPU per-iter test reports an error here:
ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter ONEFLOW_TEST_DEVICE_NUM=2 python3 -m oneflow.distributed.launch --nproc_per_node 2 ./test_graph_zero.py --failfast --verbose
Error stack:
F20221118 16:13:24.744776 29228 register_desc.cpp:125] Check failed: proto.producer_task_id() == producer_->task_id() (37391985279007 vs. 37391985279025)
*** Check failure stack trace: ***
@ 0x7f112b8163e3 google::LogMessage::Fail()
@ 0x7f112b8189d4 google::LogMessage::SendToLog()
@ 0x7f112b815ecf google::LogMessage::Flush()
@ 0x7f112b818fcf google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1162dac7d8 oneflow::RegstDesc::InitFromProtoExceptConsumers()
@ 0x7f11622f96b9 oneflow::TaskNode::InitFromProtoExceptConsumedRegsts()
@ 0x7f1162245065 oneflow::CompTaskNode::InitFromProtoExceptConsumedRegsts()
@ 0x7f11622cd7d7 oneflow::RankTaskGraph::AddBoxingReletedCompTaskNodesFromProto()
@ 0x7f11622dcc0e oneflow::RankTaskGraph::Init()
@ 0x7f1162475548 oneflow::RankCompiler::Compile()
@ 0x7f1161a26fc4 _ZZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEvENKUlmE0_clEm
@ 0x7f1162dc8ed8 oneflow::SingleThreadLoop()
@ 0x7f1161a22f35 _ZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEv
@ 0x7f1161a23772 _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEEvEZNS0_7NNGraph17MasterRankCompileIXadL_ZNS0_16SingleThreadLoopEmRKSt8functionIFvmEEEEEES2_vEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f1162294fa3 oneflow::OpGraph::WithSingleton()
@ 0x7f1161a238aa oneflow::NNGraph::MasterRankCompile<>()
@ 0x7f1161a0e355 oneflow::NNGraph::CompileAndInitRuntime()
Could you paste the error stack trace?
Confirmed that merging PR 9094 triggers the error, while the commit just before PR 9094 does not.
PR 9094: #9094
Branch after the merge:
https://github.com/Oneflow-Inc/oneflow/tree/rank_task_graph-merge-pr9094
@Yipeng1994 The straightening algorithm changes the order of the TaskNodes in the TaskGraph, and different ranks may produce different results. Could that affect the edge connections here?
In the straightening algorithm, deciding the order of each task node does not change the existing edges; the order itself is derived from the topological order of the graph.
If the logical graph is symmetric, the physical graph produced by the straightening algorithm has the same order on every device; you will not get a-before-b on device 1 but b-before-a on device 2.
The straightening algorithm also contains a step that associates nodes across different devices; without that association, communication can deadlock.
Finally, while developing the straightening algorithm, we found that some task nodes are not symmetric: some tick nodes appear only on device 0 and not on the other devices.
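A minimal sketch of this idea (hypothetical names, not OneFlow's actual straightening code): as long as the sort key is derived only from logical-graph properties, every rank computes the same order for a symmetric graph, so paired communication ops are issued in the same relative order on every device.

#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical per-node info; real task nodes carry far more state.
struct TaskNodeInfo {
  int topo_depth;       // position in the logical graph's topological order
  std::string op_name;  // logical op name, identical across ranks for a symmetric graph
};

// Sort by a key that depends only on the logical graph, never on rank-local
// state, so all ranks agree on the resulting order.
void StraightenOrder(std::vector<TaskNodeInfo>* nodes) {
  std::sort(nodes->begin(), nodes->end(),
            [](const TaskNodeInfo& a, const TaskNodeInfo& b) {
              return std::tie(a.topo_depth, a.op_name) <
                     std::tie(b.topo_depth, b.op_name);
            });
}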
PR 9094 removed the parameters from TaskGraph's initialization function, so this function containing the Init logic became the default constructor.
When BoxingTaskGraph is constructed, it triggers TaskGraph's default constructor but also has its own Init logic. As a result, BoxingTaskGraph gets initialized twice, which leads to the rather strange errors seen afterwards.
This has been fixed.
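A minimal sketch of the double-initialization pattern described above (hypothetical class names, not the real OneFlow classes): once the base default constructor itself runs the init work, a derived class that also runs init in its own constructor initializes twice.

#include <iostream>

struct TaskGraphBase {
  // After the parameter was dropped, this became the default constructor,
  // and it still performs the init work.
  TaskGraphBase() { Init(); }
  void Init() { std::cout << "TaskGraph init\n"; }
};

struct BoxingGraph : TaskGraphBase {
  // The implicitly invoked base default constructor has already run Init();
  // running it again here means the graph is initialized twice.
  BoxingGraph() { Init(); }
};

int main() {
  BoxingGraph g;  // prints "TaskGraph init" twice
  return 0;
}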
Merge master into #9108