This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
(Brief description of the problem in no more than 2 sentences.)
My C++ program sometimes core dumps in libmxnet.so when the model is as large as 200 MB;
there is no core dump with a small model.
Environment info (Required)
iMac, macOS 10.13.6
CPU
Compiler:
clang -v
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Build info (Required if built from source)
git diff make/config.mk
@@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
 # whether use opencv during compilation
 # you can disable it, however, you will not able to use
 # imbin iterator
-USE_OPENCV = 1
+USE_OPENCV = 0

 #whether use libjpeg-turbo for image decode without OpenCV wrapper
 USE_LIBJPEG_TURBO = 0
@@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
 USE_LIBJPEG_TURBO_PATH = NONE

 # use openmp for parallelization
-USE_OPENMP = 1
+USE_OPENMP = 0
Error Message:
(Paste the complete error message, including stack trace.)
lldb main -c /cores/core.97762
(lldb) target create "main" --core "/cores/core.97762"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
    import weakref
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
    from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Core file '/cores/core.97762' (x86_64) was loaded.
(lldb) bt
warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.
    frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
    frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
    frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
    frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
    frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
    frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
    frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
(lldb) thread list
Process 0 stopped
  thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
Minimum reproducible example
There is no obvious condition that triggers the core dump.
If I manually send a SIGSTOP signal to my main program, it stops as usual.
I'm curious why the core dump shows SIGSTOP as the stop reason rather than a segmentation fault, an abort, or some other signal.
I first compiled the MXNet master branch, then switched to the release tag '1.2.1.rc1'; the same thing happens.
Maybe the model size is not relevant. The model version is more likely the trigger.
Model that triggers the core dump:
/Users/load/code/python/model/model-r100-gg/model-symbol.json ... 287521 bytes
/Users/load/code/python/model/model-r100-gg/model-0000.params ... 260958682 bytes
[11:34:04] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[11:34:04] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
This core dump turned out to be caused by a bug in other code. But I did find a core dump that can be reproduced very easily: create an inference handle by loading a model, then have the thread exit immediately. The core dump then happens. The code looks something like this:

    // Create Predictor
    MXPredCreate(static_cast<const char*>(json_data.GetBuffer()),
                 static_cast<const char*>(param_data.GetBuffer()),
                 static_cast<int>(param_data.GetLength()),
                 dev_type, dev_id,
                 num_input_nodes, input_keys,
                 input_shape_indptr, input_shape_data,
                 &pred_hnd);
    assert(pred_hnd);
    exit(0);

However, if a sleep() call is inserted before exit(), the issue does not occur.
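
One plausible reading of the sleep() observation (an assumption, not something confirmed in this thread) is that exit() starts tearing down the process while the library's worker threads are still starting up. The sketch below illustrates that pattern without MXNet: ToyEngine, Shutdown, and CreateThenShutdown are hypothetical names invented for this example, not part of any MXNet API. It shows waiting for the workers explicitly instead of relying on a sleep.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for a library engine that owns worker threads.
// None of these names come from MXNet.
class ToyEngine {
 public:
  explicit ToyEngine(int num_workers) {
    assert(num_workers >= 0);
    for (int i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] {
        // Simulate start-up work that touches engine state.
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        ++initialized_;
      });
    }
  }

  // Wait for every worker to finish; safe to call more than once.
  void Shutdown() {
    for (auto& w : workers_) {
      if (w.joinable()) w.join();
    }
  }

  ~ToyEngine() { Shutdown(); }

  int initialized() const { return initialized_.load(); }

 private:
  std::atomic<int> initialized_{0};   // declared first so workers can use it
  std::vector<std::thread> workers_;
};

// Create the engine, then shut it down explicitly before the caller exits.
// Returns how many workers completed their start-up work.
int CreateThenShutdown(int num_workers) {
  ToyEngine engine(num_workers);
  engine.Shutdown();
  return engine.initialized();
}
```

In a real program the equivalent step would be to release the predictor handle and let the library finish its own shutdown before calling exit(), rather than depending on timing.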