Improve Operator's Python Stack Trace Debug Info #8028

xiacijie · 2022-04-15T06:41:17Z

add a parameter 'max_stack_depth' to specify the max depth for the stack trace
refactor other debug related classes.

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace 2. refactor other debug related classes.

xiacijie · 2022-04-15T06:49:16Z

相关的Issue：https://github.com/Oneflow-Inc/OneTeam/issues/1305

原来的Op 的Python Stack debug信息只会显示两层，对于需要大于两层调用信息的Op来说，就无法去查看追踪了。

所以给Graph.debug() 函数增加了一个max_stack_depth 的参数来声明需要展示的最大Stack深度，默认值是2。并且在debug模式下，会显示还有多少Stack盏未显示例如：( ... 8 more ) 代表还有8个栈帧未显示。

更改之前（深度一直是2）：

更改之后（max_stack_depth 设为 4, 深度总共为8，还有4个未显示）:

(max_stack_depth 设为8）:

python/oneflow/nn/graph/graph.py

python/oneflow/framework/graph_build_util.py

oneflow/api/python/functional/py_function.h

xiacijie · 2022-04-18T09:28:53Z

@strint comments resolve了

oneflow/api/python/functional/py_function.h

strint · 2022-04-18T09:38:33Z

https://docs.python.org/2/c-api/intro.html#reference-count-details

https://python.readthedocs.io/en/stable/c-api/reflection.html#c.PyEval_GetFrame

现在的用法，都属于 borrowed reference.

python/oneflow/nn/graph/graph.py

xiacijie · 2022-04-19T01:45:42Z

@strint debug 相关的test已更新。

另外有一个更新是 block和graph self._debug_max_v_level 在__init__方法底下的默认值为-1，对应之前self._debug 为 False的情况。表示不调用debug函数的话，debug是默认关闭的。

strint

lgtm

xiacijie · 2022-04-21T07:27:12Z

@strint @BBuf nn.Graph.debug的外部接口没变，v_level 设置为-1代表关闭debug mode。

但内部实现里面我把之前删掉的self._debug 字段给加回来了，因为如果仅仅用 self._debug_max_v_level 大于0来判断是否开启了debug 模式，有个问题会是 glog里面的Flags_v 的初始值为0，所以在 class DebugScopeContext(object) 下面的 self._v = max(v_level, self._prev_v) 这一段代码会把一些没有开启debug的scope的v_level 给设置成0，这样会错误的开启debug mode，逻辑就不对了。

所以我加了一个 self._debug来判断是否开启debug mode, 而不是之前用v_level 来判断是否开启debug mode。

github-actions · 2022-04-22T05:52:10Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2022-04-22T12:52:02Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8028/

github-actions · 2022-04-22T13:01:33Z

Speed stats:

GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.6ms (= 12856.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.1ms (= 14111.2ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 141.1ms / 128.6ms)

OneFlow resnet50 time: 80.6ms (= 8064.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 82.5ms (= 8245.2ms / 100, input_shape=[8, 3, 224, 224])
❌ Relative speed: 1.02 (= 82.5ms / 80.6ms)

OneFlow resnet50 time: 53.4ms (= 10688.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.8ms (= 11756.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.10 (= 58.8ms / 53.4ms)

OneFlow resnet50 time: 46.1ms (= 9229.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 38.9ms (= 7775.4ms / 200, input_shape=[2, 3, 224, 224])
❌ Relative speed: 0.84 (= 38.9ms / 46.1ms)

OneFlow resnet50 time: 38.8ms (= 7756.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.9ms (= 7383.7ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.95 (= 36.9ms / 38.8ms)

OneFlow swin dataloader time: 0.266s (= 53.164s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.149s / 200, num_workers=1)
Relative speed: 0.567 (= 0.151s / 0.266s)

OneFlow swin dataloader time: 0.067s (= 13.361s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.085s / 200, num_workers=4)
Relative speed: 0.605 (= 0.040s / 0.067s)

OneFlow swin dataloader time: 0.037s (= 7.424s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.464s / 200, num_workers=8)
Relative speed: 0.601 (= 0.022s / 0.037s)

✔️ OneFlow resnet50 time: 136.1ms (= 13606.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.1ms (= 16209.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 162.1ms / 136.1ms)

OneFlow resnet50 time: 92.2ms (= 9215.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.1ms (= 11205.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 112.1ms / 92.2ms)

OneFlow resnet50 time: 64.0ms (= 12795.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.7ms (= 17541.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 87.7ms / 64.0ms)

OneFlow resnet50 time: 51.9ms (= 10388.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.2ms (= 14235.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 71.2ms / 51.9ms)

OneFlow resnet50 time: 51.3ms (= 10259.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.7ms (= 14932.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 74.7ms / 51.3ms)

github-actions · 2022-04-22T13:37:55Z

CI failed when running job: cuda-module. PR label automerge has been removed

This reverts commit 707eec8.

strint · 2022-04-24T08:40:17Z

oneflow/core/job/graph_scope_vars.cpp

@@ -36,4 +36,13 @@ void SetGraphVerboseStepLr(bool verbose) {
  *graph_verbose_step_lr = verbose;
 }

+std::atomic<int32_t>* GetGraphDebugMaxPyStackDepthVar() {
+  static std::atomic<int32_t> graph_debug_max_py_stack_depth{2};


现在有地方可以方便复现不，感觉是这里的问题：

static std::atomic<int32_t> graph_debug_max_py_stack_depth

改成

static thread_local int32_t graph_debug_max_py_stack_depth

试试

lixinqi · 2022-04-24T09:24:25Z

oneflow/api/python/functional/py_function.h

-  std::string back_f_str = PyObjectToReprStr((PyObject*)back_frame).GetOrThrow();
-  cur_f_str = "Python stack[-2]: " + back_f_str + "; " + cur_f_str;
-  Py_XDECREF(back_frame);
+std::string get_cur_frame_stack_str() {


这些地方是不是有些过度设计了。既然当前有DebugScopeContext。那是不是可以注入一个std::function<std::string()>类型的GetCurrentFrameStackStr的函数，这个函数的实现是python。

python层来获取 stack frame str，传递到 c++ 层？

c++调用python代码。

@caishenghang 看看行不行

这里背景是在 c api functor 接口处，统一为 op 生成 python 代码的 location。

目前看都在 c++ 层调用 python c api 做是直接的。调用 python 代码反而是间接的。

这个我和xiaoyu讨论过了，我觉得不是过度设计的问题，我觉得奇怪的是这里用python的一个类来提供接口，但是C++里面却用一个静态变量来存配置。xiaoyu 指出这是参考autograd的实现来写的。
我觉得比较合理的做法是配置放在debug context上，然后一个或多个graph持有这个debug context的引用。
现在这样的做法看上去在简单的需求中没问题，但是如果以后graph的用法更灵活了，就会有问题，比如graph被提取subgraph，graph被clone，并行编译 graph，而 debug context 可能会被添加更多的变量来实现新功能。

debug context 可能会被添加更多的变量

嗯，现在这里创建了2个变量，之前只有一个。后面有计划做一次合并，用一个对象来统一处理。

strint · 2022-04-24T12:33:41Z

oneflow/api/python/functional/py_function.h

@@ -29,6 +29,7 @@ limitations under the License.
 #include "oneflow/core/common/util.h"
 #include "oneflow/core/job/lazy_mode.h"
 #include "oneflow/core/framework/op_interpreter/dispatch_frame.h"
+#include "oneflow/api/python/env/env.h"


目前发现，是这个include带来的问题。

只 inculde 本文件依赖的两个函数，不 include env.h 中其它代码，就避免了 https://github.com/Oneflow-Inc/OneTeam/issues/1343 中的问题。

是不是有什么全局变量的问题，一旦include会造成一个全局变量存在多个版本？

是不是有什么全局变量的问题

嗯，这种可能性最大。中间发现了一个，但是试验了一下没有解决问题。

这里有个总结：https://github.com/Oneflow-Inc/OneTeam/issues/1343#issuecomment-1107836666

这一般是不好的实践造成的，譬如在头文件里定义了全局变量，会导致每次被include一次就定义一次

xiacijie · 2022-04-25T03:44:55Z

刚刚我开了一个新branch： https://github.com/Oneflow-Inc/oneflow/tree/fix_import_oneflow_seg_fault
这里面就没有include env.h头文件，准备让 @yoonlee888 帮测一下

improve the debug information for the python stack trace

1e979fc

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace 2. refactor other debug related classes.

xiacijie requested review from BBuf, daquexian and jackalcooper as code owners April 15, 2022 06:41

xiacijie requested a review from strint April 15, 2022 06:41

xiacijie changed the title ~~improve the debug information for the python stack trace~~ 改进Op的Python Stack Trace的Debug信息 Apr 15, 2022

strint added graph graph mode feature tooling labels Apr 15, 2022

xiacijie added 2 commits April 17, 2022 12:02

remove parens

e5fc2f1

update

b53d43b