
SegFault in Python unit tests when an input/output parameter name is misspelled #4117

@Xreki

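I mentioned this problem in #3899 (comment), and @qingqing01 ran related experiments in #4107 (comment). The overall symptom: when an input/output parameter name is misspelled in a Python unit test, the test crashes outright and reports SegFault. Take case 2 from #4107 (comment) as the example again. We deliberately misspell the input name in test_mul_op.py, Y -> Y0, as follows: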

class TestMulOp(OpTest):
    def setUp(self):
        self.op_type = "mul"
        self.inputs = {
            'X': np.random.random((32, 84)).astype("float32"),
            'Y0': np.random.random((84, 100)).astype("float32")  # deliberately misspelled; the op expects 'Y'
        }
        self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y0'])}

    def test_check_output(self):
        self.check_output()

Result of running the unit test:

126: Test timeout computed to be: 9.99988e+06
1/1 Test #126: test_mul_op ......................***Exception: SegFault 45.98 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =  46.00 sec

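The OperatorBase constructor checks whether the input/output parameter names exist (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.cc#L165 ). If a name does not exist, the ENFORCE fails and produces the error message and stack trace shown in #3899 (comment). So what actually happens must be this: the Python side allows some input/output parameters to be left unset and still treats the parameter name as existing, but does not create the corresponding Variable. An op therefore has to check whether the Variable is null.
Many op implementations do not check whether the input/output variables are null and dereference them directly, which causes the SegFault. For example, in mul_op.cc: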

class MulOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    auto x_dims = ctx.Input<Tensor>("X")->dims();
    auto y_dims = ctx.Input<Tensor>("Y")->dims();
    ...

I added PADDLE_ENFORCE_NOT_NULL checks to mul_op.cc:

$ git diff mul_op.cc
diff --git a/paddle/operators/mul_op.cc b/paddle/operators/mul_op.cc
index 015e13d..15b48b8 100644
--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
@@ -26,6 +26,8 @@ class MulOp : public framework::OperatorWithKernel {
 
  protected:
   void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) of MulOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"), "Input(Y) of MulOp should not be null.");
     auto x_dims = ctx.Input<Tensor>("X")->dims();
     auto y_dims = ctx.Input<Tensor>("Y")->dims();
     int x_num_col_dims = Attr<int>("x_num_col_dims");

Running the same unit test case again gives:

138: ======================================================================
138: ERROR: test_check_output (__main__.TestMulOp)
138: ----------------------------------------------------------------------
138: Traceback (most recent call last):
138:   File "test_mul_op.py", line 16, in test_check_output
138:     self.check_output()
138:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test.py", line 211, in check_output
138:     self.check_output_with_place(place)
138:   File "/home/liuyiqun01/github/Paddle/python/paddle/v2/framework/tests/op_test.py", line 183, in check_output_with_place
138:     self.op.infer_shape(self.scope)
138: RuntimeError: ctx.InputVar("Y") should not be null
138: Input(Y) of MulOp should not be null. at [/home/liuyiqun01/github/Paddle/paddle/operators/mul_op.cc:30]
138: PaddlePaddle Call Stacks: 
138: 0       0x7f26ca7917a8p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 648
138: 1       0x7f26ca82af7fp paddle::operators::MulOp::InferShape(paddle::framework::InferShapeContext const&) const + 2943
138: 2       0x7f26ca7b5cb1p paddle::framework::OperatorWithKernel::InferShape(paddle::framework::Scope const&) const + 33
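
The difference between the two runs can be modelled in a few lines of Python. This is only a toy sketch of the pattern, not Paddle code: `EnforceNotMet`, `enforce_not_null`, `infer_shape` and the dict-based `scope` are hypothetical stand-ins, and a missing variable is represented by `None` instead of a null pointer.

```python
class EnforceNotMet(RuntimeError):
    """Hypothetical stand-in for the error raised when an enforce check fails."""


def enforce_not_null(value, message):
    # Toy counterpart of a not-null enforce: fail early with a readable message
    # instead of letting a later dereference crash.
    if value is None:
        raise EnforceNotMet(message)
    return value


def infer_shape(scope):
    # `scope` is a hypothetical dict from parameter name to a shape tuple.
    # A parameter that was never set is absent, so .get() returns None.
    x_dims = enforce_not_null(scope.get("X"), "Input(X) of MulOp should not be null.")
    y_dims = enforce_not_null(scope.get("Y"), "Input(Y) of MulOp should not be null.")
    return (x_dims[0], y_dims[1])


# The unit test in this issue bound 'Y0' instead of 'Y'.
scope = {"X": (32, 84), "Y0": (84, 100)}
try:
    infer_shape(scope)
except EnforceNotMet as err:
    print(err)  # Input(Y) of MulOp should not be null.
```

With the check in place the failure names the missing input, which is the same improvement visible in the traceback above.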

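Therefore, when implementing an op, PADDLE_ENFORCE_NOT_NULL must be used to check that each input/output is not null. To make sure every parameter has a check, deliberately misspell each parameter in turn in the unit test and verify the result.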
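
One way to make that verification systematic is to generate the misspelled dictionaries instead of editing the test by hand. A minimal sketch, assuming the parameters are held in a plain dict as in the test above; `misspelled_variants` and the `"0"` suffix are hypothetical names chosen for illustration, and wiring each variant into an OpTest case is not shown.

```python
def misspelled_variants(params, suffix="0"):
    """Yield (original_name, mutated_dict) pairs with exactly one key renamed."""
    for name in params:
        mutated = {}
        for key, value in params.items():
            # Rename only the parameter under test; keep the others untouched.
            mutated[key + suffix if key == name else key] = value
        yield name, mutated


# Placeholder values stand in for the numpy arrays used in the real test.
inputs = {"X": "x_data", "Y": "y_data"}
for name, mutated in misspelled_variants(inputs):
    print(name, sorted(mutated))
```

Each variant would then be run through the op, and the expectation is a RuntimeError naming the missing parameter, not a SegFault.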
