-
Notifications
You must be signed in to change notification settings - Fork 825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Operator's Python Stack Trace Debug Info #8028
Conversation
xiacijie
commented
Apr 15, 2022
- add a parameter 'max_stack_depth' to specify the max depth for the stack trace
- refactor other debug related classes.
1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace 2. refactor other debug related classes.
相关的Issue:https://github.com/Oneflow-Inc/OneTeam/issues/1305 原来的Op 的Python Stack debug信息只会显示 两层,对于需要大于两层调用信息的Op来说,就无法去查看追踪了。 所以给 |
@strint comments resolve了 |
@strint debug 相关的test已更新。 另外有一个更新是 block和graph self._debug_max_v_level 在__init__方法底下的默认值为-1, 对应之前self._debug 为 False的情况。表示不调用debug函数的话,debug是默认关闭的。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@strint @BBuf 但内部实现里面我把之前删掉的 所以我加了一个 |
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally. |
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8028/ |
Speed stats:
|
CI failed when running job: cuda-module. PR label automerge has been removed |
This reverts commit 707eec8.
@@ -36,4 +36,13 @@ void SetGraphVerboseStepLr(bool verbose) { | |||
*graph_verbose_step_lr = verbose; | |||
} | |||
|
|||
std::atomic<int32_t>* GetGraphDebugMaxPyStackDepthVar() { | |||
static std::atomic<int32_t> graph_debug_max_py_stack_depth{2}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
现在有地方可以方便复现不,感觉是这里的问题:
static std::atomic<int32_t> graph_debug_max_py_stack_depth
改成
static thread_local int32_t graph_debug_max_py_stack_depth
试试
std::string back_f_str = PyObjectToReprStr((PyObject*)back_frame).GetOrThrow(); | ||
cur_f_str = "Python stack[-2]: " + back_f_str + "; " + cur_f_str; | ||
Py_XDECREF(back_frame); | ||
std::string get_cur_frame_stack_str() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这些地方是不是有些过度设计了。既然当前有DebugScopeContext。那是不是可以注入一个std::function<std::string()>
类型的GetCurrentFrameStackStr的函数,这个函数的实现是python。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python层来获取 stack frame str,传递到 c++ 层?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c++调用python代码。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caishenghang 看看行不行
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里背景是在 c api functor 接口处,统一为 op 生成 python 代码的 location。
目前看都在 c++ 层调用 python c api 做是直接的。调用 python 代码反而是间接的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个我和xiaoyu讨论过了,我觉得不是过度设计的问题,我觉得奇怪的是这里用python的一个类来提供接口,但是C++里面却用一个静态变量来存配置。xiaoyu 指出这是参考autograd的实现来写的。
我觉得比较合理的做法是配置放在debug context上,然后一个或多个graph持有这个debug context的引用。
现在这样的做法看上去在简单的需求中没问题,但是如果以后graph的用法更灵活了,就会有问题,比如graph被提取subgraph,graph被clone,并行编译 graph,而 debug context 可能会被添加更多的变量来实现新功能。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug context 可能会被添加更多的变量
嗯,现在这里创建了2个变量,之前只有一个。后面有计划做一次合并,用一个对象来统一处理。
@@ -29,6 +29,7 @@ limitations under the License. | |||
#include "oneflow/core/common/util.h" | |||
#include "oneflow/core/job/lazy_mode.h" | |||
#include "oneflow/core/framework/op_interpreter/dispatch_frame.h" | |||
#include "oneflow/api/python/env/env.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前发现,是这个include带来的问题。
只 inculde 本文件依赖的两个函数,不 include env.h 中其它代码,就避免了 https://github.com/Oneflow-Inc/OneTeam/issues/1343 中的问题。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是不是有什么全局变量的问题,一旦include会造成一个全局变量存在多个版本?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是不是有什么全局变量的问题
嗯,这种可能性最大。中间发现了一个,但是试验了一下没有解决问题。
这里有个总结:https://github.com/Oneflow-Inc/OneTeam/issues/1343#issuecomment-1107836666
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这一般是不好的实践造成的,譬如在头文件里定义了全局变量,会导致每次被include一次就定义一次
刚刚我开了一个新branch: https://github.com/Oneflow-Inc/oneflow/tree/fix_import_oneflow_seg_fault |