Description
Background:
I found memory leaks while running the transformer model. Memory grows by about 100 KB per batch. Both the trainer and the pserver have the problem.
Generally, memory grows for two reasons:
- Malloc'ed (or new'ed) memory is not freed.
- Memory fragmentation.
Running all the C++ unit tests under the pprof tool, I found two places where memory was not freed.
But memory still increases over time even after I fixed both.
Analysis
First, I used pprof and Valgrind to check for leaks when running through the Python interface, but the output contains a lot of warnings.
- I wrote a C++ executor that is friendlier to memory-checking tools, but I found nothing except a memory leak at initialization.
- I compiled a debug build of Python for the memory-checking tools, and the result was similar to the above.
Second, I thought there might be memory fragmentation in the glibc memory pool:
- Used malloc_trim to release unused memory: it did not help.
- Used LD_PRELOAD with tcmalloc.so and set TCMALLOC_RELEASE_RATE=10.0 (the maximum value): it did not help.
- Tried to link tcmalloc into paddle: because of our complicated dependencies and their link order, I got a free invalid pointer error and failed to link it.
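For anyone who wants to repeat the malloc_trim experiment from the Python side, here is a minimal sketch. It assumes a glibc-based Linux system and calls malloc_trim through ctypes. The helper name trim_glibc_heap is hypothetical and is not part of paddle.

```python
import ctypes
import ctypes.util


def trim_glibc_heap(pad=0):
    """Ask glibc to hand free heap memory back to the OS.

    Returns True if some memory was released, False if nothing was
    released, and None when malloc_trim is not available (for example
    on a non-glibc platform).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific symbol
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # pad is the amount of free space to leave untrimmed at the heap top.
    return bool(malloc_trim(pad))
```

Calling trim_glibc_heap() after each batch and watching the process RSS is one way to check whether free-but-unreturned memory explains the growth. In the case above it did not.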
Third, I thought it might be a Python memory leak:
- Used gc.collect and gc.garbage to find uncollectable objects: there were none.
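The Python-side check can be sketched roughly as follows. The function name collect_and_report is hypothetical. Note that since Python 3.4, cycles containing objects with __del__ are normally collected, so gc.garbage is usually empty unless gc.DEBUG_SAVEALL is set. On Python 2.7, which this issue used, such cycles end up in gc.garbage.

```python
import gc


def collect_and_report():
    """Run a full collection and report what the collector saw.

    Returns a tuple (unreachable_count, uncollectable_objects):
    - unreachable_count: number of unreachable objects found by gc.collect()
    - uncollectable_objects: a copy of gc.garbage, i.e. objects the
      collector found but could not free
    """
    unreachable_count = gc.collect()
    uncollectable_objects = list(gc.garbage)
    return unreachable_count, uncollectable_objects
```

If uncollectable_objects stays empty across batches, as it did here, the growth is unlikely to come from Python reference cycles.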
Fourth, I used mallinfo to trace memory consumption. Some memory is indeed consumed every batch, but I can't locate the responsible operator: every operator uses new or malloc to allocate memory, either in our code or in std:: STL code.
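A rough sketch of this kind of per-batch tracing, done from Python with ctypes instead of from C++. It assumes glibc, and run_batch is a hypothetical callable standing in for one training step. The classic mallinfo struct uses int fields, so the values wrap on heaps larger than 2 GB.

```python
import ctypes
import ctypes.util


class MallInfo(ctypes.Structure):
    # Field layout of glibc's classic `struct mallinfo` (all ints).
    _fields_ = [(name, ctypes.c_int) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]


def in_use_bytes():
    """Bytes in use according to mallinfo, or None if unavailable."""
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        mallinfo = ctypes.CDLL(libc_name).mallinfo
    except (OSError, AttributeError):
        return None
    mallinfo.argtypes = []
    mallinfo.restype = MallInfo
    info = mallinfo()
    # uordblks: bytes in in-use allocations; hblkhd: bytes in mmap'ed regions.
    return info.uordblks + info.hblkhd


def trace_batches(run_batch, num_batches):
    """Return the malloc usage diff around each call to run_batch."""
    diffs = []
    for _ in range(num_batches):
        before = in_use_bytes()
        run_batch()
        after = in_use_bytes()
        if before is None or after is None:
            diffs.append(None)
        else:
            diffs.append(after - before)
    return diffs
```

A diff that stays positive batch after batch points at allocations that are never freed. It does not say which operator made them, which is the limitation described above.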
Conclusion: Need help
Memory fragmentation may be the cause of the memory growth. We could use malloc hooks to manage our memory.
Reference:
- Understanding glibc malloc
- TCMalloc : Thread-Caching Malloc
- mallinfo
- How to use valgrind with python?
valgrind + debug version python:
==22452== 616,960 (63,240 direct, 553,720 indirect) bytes in 527 blocks are definitely lost in loss record 13,934 of 13,972
==22452== at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22452== by 0x3D549AA9: google::protobuf::internal::GenericTypeHandler<paddle::framework::proto::OpDesc>::NewFromPrototype(paddle::framework::proto::OpDesc const*, google::protobuf::Arena*) [clone .isra.186] (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x3D5509C0: paddle::framework::proto::BlockDesc::UnsafeMergeFrom(paddle::framework::proto::BlockDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x3D550D86: paddle::framework::proto::ProgramDesc::UnsafeMergeFrom(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x3C7A288D: paddle::framework::ProgramDesc::ProgramDesc(paddle::framework::proto::ProgramDesc const&) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x3C6F3F61: void pybind11::cpp_function::initialize<paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}, paddle::framework::ProgramDesc*, paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&, pybind11::name, pybind11::scope, pybind11::sibling>(paddle::pybind::pybind11_init()::{lambda(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&)#39}&&, paddle::framework::ProgramDesc* (*)(paddle::framework::ProgramDesc const&, std::vector<std::array<unsigned long, 2ul>, std::allocator<std::array<unsigned long, 2ul> > > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x3C713823: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (in /paddle/build/release_gpu/python/paddle/fluid/core.so)
==22452== by 0x4BC3F9: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452== by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452== by 0x4C1E6E: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==22452== by 0x4B9AB5: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==22452== by 0x4C16E6: PyEval_EvalFrameEx (in /usr/bin/python2.7)
.....
==22452== definitely lost: 264,370 bytes in 1,939 blocks
==22452== indirectly lost: 799,257 bytes in 16,449 blocks
==22452== possibly lost: 6,991,543 bytes in 49,725 blocks
==22452== still reachable: 247,616,073 bytes in 535,440 blocks
==22452== of which reachable via heuristic:
==22452== stdstring : 7,891 bytes in 122 blocks
==22452== suppressed: 0 bytes in 0 blocks
==22452== Reachable blocks (those to which a pointer was found) are not shown.
==22452== To see them, rerun with: --leak-check=full --show-leak-kinds=all
glibc malloc trace:
Diff is the memory consumed by each operator.
Diff is the memory consumed by the executor in each batch.