
Memory leaks when training the transformer model #10492

@gongweibao

Description


Background:

I found memory leaks while running the transformer model. Memory grows at a rate of about 100KB/batch. Both the trainer and the pserver have this problem.

Generally, memory usage grows for two reasons:

  • Memory allocated with malloc/new that is never freed.
  • Memory fragmentation.

Using the pprof tool to run all the C++ unit tests, I found two locations where memory was not freed.

But the memory still increases over time even after I fixed those two.
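For reference, pprof here is the gperftools heap profiler/checker; bracketing a suspicious code path in a C++ unit test looks roughly like the sketch below. This is not the exact harness used above, and RunSuspectedLeakyCode is a placeholder, not a real Paddle function.

```cpp
// Sketch: bracket a code path with the gperftools heap checker.
// Link against tcmalloc and run the binary with HEAPCHECK=local so the
// explicit checker is active.
#include <gperftools/heap-checker.h>
#include <cstdio>

void RunSuspectedLeakyCode() {
  new int[16];  // deliberate leak so the checker has something to report
}

int main() {
  HeapLeakChecker checker("transformer_unit_test");
  RunSuspectedLeakyCode();
  // NoLeaks() returns false if allocations made inside the bracketed
  // region are no longer reachable when the check runs.
  if (!checker.NoLeaks()) {
    std::fprintf(stderr, "leak detected in bracketed region\n");
    return 1;
  }
  return 0;
}
```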

Analysis

First, I used pprof and Valgrind to check for leaks while running through the Python interface, but the output contained a lot of warnings:

  • I wrote a C++ executor so the memory-check tools see no Python frames (a rough sketch follows this list), but I found nothing except a leak during initialization.
  • I compiled a debug build of Python for the memory-check tools, and the result was similar to the above.
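The "C++ executor" mentioned above is essentially a loop that deserializes a ProgramDesc saved from the Python side and runs it repeatedly, so Valgrind/pprof never see the Python interpreter. A rough sketch follows; the header paths and the Run() signature are written from memory of the fluid C++ API of this period and may not match the exact revision, and device initialization plus input feeding are omitted.

```cpp
// Rough sketch of a Python-free executor used only for leak checking.
// Header paths / signatures may need adjusting for the actual revision.
#include <fstream>
#include <sstream>

#include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/platform/place.h"

int main() {
  // Load a ProgramDesc that was serialized from the Python side.
  std::ifstream fin("transformer_program.pb", std::ios::binary);
  std::stringstream buffer;
  buffer << fin.rdbuf();
  paddle::framework::ProgramDesc program(buffer.str());

  paddle::platform::CPUPlace place;
  paddle::framework::Scope scope;
  paddle::framework::Executor executor(place);

  // Run many batches so a per-batch leak becomes visible to the tools.
  for (int batch = 0; batch < 1000; ++batch) {
    executor.Run(program, &scope, /*block_id=*/0,
                 /*create_local_scope=*/true, /*create_vars=*/true);
  }
  return 0;
}
```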

Second, I suspected memory fragmentation in the glibc memory pool:

  • Called malloc_trim to release unused memory: it didn't help (see the sketch after this list).
  • Used LD_PRELOAD to load tcmalloc.so and set TCMALLOC_RELEASE_RATE=10.0 (the maximum value): it didn't help.
  • Tried linking tcmalloc into Paddle directly: because of our complicated dependencies and their link order, I hit a free-invalid-pointer error, so the link attempt failed.
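For reference, the two release attempts above boil down to calls like these. WITH_TCMALLOC is a made-up guard here; the tcmalloc call is only available when tcmalloc is actually linked as the allocator.

```cpp
// Sketch: ask the allocator to hand free pages back to the kernel.
#include <malloc.h>  // glibc malloc_trim

#ifdef WITH_TCMALLOC  // hypothetical build flag, only when tcmalloc is linked
#include <gperftools/malloc_extension.h>
#endif

void TryReleaseFreeMemory() {
  // glibc: returns 1 if some memory was actually released to the OS.
  malloc_trim(0);

#ifdef WITH_TCMALLOC
  // tcmalloc: release all free pages in its page heap back to the OS.
  // TCMALLOC_RELEASE_RATE only controls how aggressively this happens
  // automatically; this call forces it.
  MallocExtension::instance()->ReleaseFreeMemory();
#endif
}
```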

Third, I suspected a memory leak on the Python side:

Fourth, I used mallinfo to trace memory consumption. Some memory is indeed consumed every batch, but I can't pin it on a single operator: every operator allocates memory with new or malloc, either in our own code or inside std:: STL code.
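The mallinfo tracing is just a diff of the glibc counters around each batch; a minimal sketch is below, where RunOneBatch stands in for the real per-batch executor call.

```cpp
// Sketch: trace glibc heap usage per batch with mallinfo().
#include <malloc.h>
#include <cstdio>

void RunOneBatch() { /* placeholder for executor.Run(...) on one batch */ }

int main() {
  struct mallinfo before = mallinfo();
  for (int batch = 0; batch < 100; ++batch) {
    RunOneBatch();
    struct mallinfo after = mallinfo();
    // uordblks: bytes currently allocated on the main heap,
    // hblkhd:   bytes held in mmap'd blocks.
    std::printf("batch %d: heap diff %d bytes, mmap diff %d bytes\n",
                batch,
                after.uordblks - before.uordblks,
                after.hblkhd - before.hblkhd);
    before = after;
  }
  return 0;
}
```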

Conclusion: Need help

Maybe memory fragmentation is the real reason for the apparent leak. We could use a malloc hook to manage our memory ourselves.
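As an illustration of the malloc-hook idea, below is a minimal accounting sketch built on the old glibc __malloc_hook/__free_hook mechanism. These hooks are deprecated and not thread-safe, so a production version would interpose malloc/free instead (e.g. via LD_PRELOAD); g_live_bytes and the function names are my own.

```cpp
// Minimal sketch: count live heap bytes through the old glibc malloc hooks.
// Only illustrates the accounting idea; not thread-safe.
#include <malloc.h>

static void* (*old_malloc_hook)(size_t, const void*);
static void (*old_free_hook)(void*, const void*);
static size_t g_live_bytes = 0;  // bytes currently allocated

static void* CountingMalloc(size_t size, const void* /*caller*/) {
  __malloc_hook = old_malloc_hook;  // uninstall to avoid recursion
  void* p = malloc(size);
  if (p != nullptr) g_live_bytes += malloc_usable_size(p);
  old_malloc_hook = __malloc_hook;  // save whatever the call installed
  __malloc_hook = CountingMalloc;   // reinstall our hook
  return p;
}

static void CountingFree(void* ptr, const void* /*caller*/) {
  __free_hook = old_free_hook;
  if (ptr != nullptr) g_live_bytes -= malloc_usable_size(ptr);
  free(ptr);
  old_free_hook = __free_hook;
  __free_hook = CountingFree;
}

void InstallCountingHooks() {
  old_malloc_hook = __malloc_hook;
  old_free_hook = __free_hook;
  __malloc_hook = CountingMalloc;
  __free_hook = CountingFree;
}
```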

Reference:

Valgrind + debug build of Python:

==22452== 616,960 (63,240 direct, 553,720 indirect) bytes in 527 blocks are definitely lost in loss record 13,934 of 13,972
==22452==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22452==    by 0x3D549AA9: google::protobuf::internal::GenericTypeHandler<paddle::framework::proto::OpDesc>::NewFromPrototype(paddle::framework::proto::OpDesc const*, google::protobuf::Arena*) [clone .isra.186] (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3D5509C0: paddle::framework::proto::BlockDesc::UnsafeMergeFrom(paddle::framework::proto::BlockDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3D550D86: paddle::framework::proto::ProgramDesc::UnsafeMergeFrom(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C7A288D: paddle::framework::ProgramDesc::ProgramDesc(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C6F3F61: void pybind11::cpp_function::initialize<paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}, paddle::framework::ProgramDesc*, paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&, pybind11::name, pybind11::scope, pybind11::sibling>(paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}&&, paddle::framework::ProgramDesc* (*)(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x3C713823: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452==    by 0x4BC3F9: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452==    by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452==    by 0x4C1E6E: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452==    by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452==    by 0x4C16E6: PyEval_EvalFrameEx (in /usr/bin/python2.7)
.....
==22452==    definitely lost: 264,370 bytes in 1,939 blocks
==22452==    indirectly lost: 799,257 bytes in 16,449 blocks
==22452==      possibly lost: 6,991,543 bytes in 49,725 blocks
==22452==    still reachable: 247,616,073 bytes in 535,440 blocks
==22452==                       of which reachable via heuristic:
==22452==                         stdstring          : 7,891 bytes in 122 blocks
==22452==         suppressed: 0 bytes in 0 blocks
==22452== Reachable blocks (those to which a pointer was found) are not shown.
==22452== To see them, rerun with: --leak-check=full --show-leak-kinds=all

glibc malloc trace:

[image omitted] Diff is the memory consumed by each operator.
[image omitted] Diff is the memory consumed by the executor for every batch.
