Diagnose and fix SIGSEGV/SIGABRTs #2987

Open
wchargin opened this issue Dec 2, 2019 · 2 comments

wchargin commented Dec 2, 2019

All Travis builds are currently failing in the example plugin smoke test:

+ python -m tensorboard_plugin_example.demo
2019-12-02 21:26:34.264757: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-12-02 21:26:34.264791: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2019-12-02 21:26:34.264821: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (travis-job-48f6ca02-7a68-439f-8e39-c272ce59d7d3): /proc/driver/nvidia/version does not exist
2019-12-02 21:26:34.265298: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-02 21:26:34.275750: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 2300000000 Hz
2019-12-02 21:26:34.275976: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4bac2a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-12-02 21:26:34.275990: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
*** Error in `python': corrupted size vs. prev_size: 0x0000000001805c60 ***
Aborted (core dumped)

I reported this as an upstream issue two weeks ago, but it hasn’t been
fixed yet: tensorflow/tensorflow#34427. At
that time, we were also in the process of migrating from Travis .com to
.org. That migration “happened” to fix the problem, but that may have
been a caching issue. We’ve pinned ourselves to non-broken TF nightlies
(#2947), but that doesn’t appear to have sufficed.

As a quick fire extinguisher, I’m disabling the test, but this is
actually a pretty important test to keep around: it’s our sole
integration test for the dynamic plugin system.
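
One way to get more signal out of a crash like the one above is to rerun the failing command in a child process with Python's built-in `faulthandler` enabled, then report which signal ended it. The sketch below is only an illustration, not part of the TensorBoard CI scripts: the helper names `describe_exit` and `run_with_faulthandler` are hypothetical, and it assumes Python 3.5+ on a POSIX host. Note that `faulthandler` dumps only the Python-level traceback, not the native stack where heap corruption actually happens.

```python
import os
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Describe a subprocess return code in human-readable form."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_with_faulthandler(argv):
    """Run argv with faulthandler enabled; return (description, stderr)."""
    # PYTHONFAULTHANDLER=1 makes a Python 3 child call faulthandler.enable()
    # at startup, so SIGSEGV/SIGABRT print a Python traceback to stderr.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    result = subprocess.run(
        argv,
        env=env,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return describe_exit(result.returncode), result.stderr
```

For the smoke test in this issue, the call would look like `run_with_faulthandler([sys.executable, "-m", "tensorboard_plugin_example.demo"])`, assuming the example plugin is installed in the current environment.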

wchargin added a commit that referenced this issue Dec 2, 2019
Summary:
See <#2987> for context.

Test Plan:
See what Travis thinks.

wchargin-branch: disable-dynplugin-smoke
wchargin assigned davidsoergel and unassigned davidsoergel Dec 2, 2019
wchargin added a commit that referenced this issue Dec 2, 2019
Summary:
See <#2987> for context.

Test Plan:
That CI passes suffices.

wchargin-branch: disable-dynplugin-smoke

wchargin commented

@psybuzz: Do we know whether this is still an issue on Xenial?

psybuzz commented Jan 13, 2020

Using #3139 as a test PR, the SIGSEGV/SIGABRT is gone, but now one of the examples is failing with a different error about libcuda. More investigation is needed.

https://travis-ci.org/tensorflow/tensorboard/jobs/636562069

+ python -m tensorboard_plugin_example.demo
2020-01-13 21:15:37.711604: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-01-13 21:15:37.711655: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-01-13 21:15:37.711682: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (travis-job-19d00633-4047-451f-9ec0-3545a32267c0): /proc/driver/nvidia/version does not exist
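
The three libcuda lines in this excerpt also appear at the top of the December 2 log, so when investigating further it may help to separate TensorFlow's own log records from everything else the job printed. The sketch below is a hypothetical helper, not an existing TensorBoard or TensorFlow utility: `split_log` and the `TF_LOG_LINE` pattern are names introduced here, and the pattern is inferred only from the log lines pasted in this thread (timestamp, a severity letter, `file:line]`, message).

```python
import re

# Assumed line shape, inferred from the logs in this thread:
#   2020-01-13 21:15:37.711604: W path/to/file.cc:55] message
TF_LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: "
    r"(?P<severity>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)


def split_log(lines):
    """Split raw job output into TensorFlow log records and other lines.

    Returns (records, other), where records is a list of
    (severity, source_file, message) tuples and other is every
    non-blank line that does not match the assumed log format.
    """
    records, other = [], []
    for raw in lines:
        text = raw.rstrip("\n")
        match = TF_LOG_LINE.match(text)
        if match:
            records.append(
                (
                    match.group("severity"),
                    match.group("source"),
                    match.group("message").strip(),
                )
            )
        elif text.strip():
            other.append(text)
    return records, other
```

Run over the full Travis job output, `other` would hold whatever the job printed outside the TensorFlow logger, which is where messages such as `Aborted (core dumped)` in the earlier log show up.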
