-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI][Docker] xgboost 1.0.1 causes segfault on test_autotvm_xgboost_model.py #4953
Comments
@leandron can you create a minimum reproducible example? e.g. pickle the data that causes segfault in XGBoost. Then we can start to bring attention of the XGBoost dev community. In the meanwhile, we can pin xgboost to 0.9 |
cc @merrymercy @hcho3 Who might be interested in this issue |
It would be nice if we can get a reproducible example. We are currently working on the patch release 1.0.2 and I want to get a patch to fix this issue. |
* Sets xgboost dependency to be 0.90, preventing segfaults during TVM python unit tests execution * This is discussed in issue #4953
@leandron can you please comment about the current state? |
Hi, I only managed to investigate it further, today. XGBoost now is version To give some context, this is the function call that triggers the issue: I tried just creating pickle files of some inputs ( Now, something that could help me a bit to narrow down where the problem is if I run XGBoost in debug mode. @hcho3 what is the simplest way I can do that? |
@leandron Thanks for pointing out which part of TVM test is failing. Not sure if running in debug mode would help, since XGBoost is crashing with segfault here. I will take a look some time this week. |
* Sets xgboost dependency to be 0.90, preventing segfaults during TVM python unit tests execution * This is discussed in issue apache#4953
* Sets xgboost dependency to be 0.90, preventing segfaults during TVM python unit tests execution * This is discussed in issue apache#4953
I compiled TVM from source and tried running the test |
I also tried building |
We applied a workaround, pinning the xgboost version to be 0.90. Which XGBoost version you see in the image you created from scratch? |
@leandron can you also provide a bit more details e.g. does directly run I tried to build a docker image with xgboost==1.0.2 and seems cannot repro the issue. |
I see this reliably with a virtualenv on bionic on AWS. environment: command: python version
installed python packages:
backtrace:
pytest log:
|
Let me try again with |
The trace might offer some insights, @hcho3 , couldit caused by ConfigureGpuId? also cc @trivialfis since it seems to relates to dmlc/xgboost#4961?
|
Is it possible that somehow XGBoost linked a wrong dmlc static library? |
i don't know unless we go and dig deeper, but if the bug is reproducible, then it should not be hard to find the cause |
tried to reproduce with xgboost built from source (at HEAD/e4f5b6c8 and v1.0.2), no luck (test_autotvm_xgboost_model passes). if I reinstall the pip package (1.0.2), I can get it to reproduce again. I built xgboost with this config: any other suggestions to get it to build or install like the pypi package? might it be related to building the package on centos? |
@areusch Are you still running bionic when installing from pip? |
yes |
There is something I wanted to point out, which is an insight after @areusch's comment (thanks for that!). The VM I'm running this test, does not have a GPU. However, the same test used to pass on this very same machine, with @hcho3 do you think this could be something caused by a change in behaviour after |
no GPU on my instance (it is c5.4xlarge) |
I can reproduce it on bionic. Here is what I have found so far:
@hcho3 It would be of great help if I can obtain a debug build or |
I am still unable to reproduce it on my Bionic machine. @areusch Can you share the content of your @trivialfis I'll try to build a wheel using CentOS Docker image. |
@trivialfis And you ran |
@hcho3 Yes. Currently detached at the commit before above linked PR.
|
@areusch Can you try out the latest TVM master on your end? I'm still having trouble reproducing the original issue. |
Finally, I reproduced it. Yes! Note: I used latest TVM as of today. The crash still occured. |
@hcho3 Could you try applying the patches I posted above and use the corresponding cmake flags? |
@trivialfis I applied your patch and changed CMake flags. And now the unit test does not crash any more. You should try it too. Get the wheel at https://xgboost-wheels.s3-us-west-2.amazonaws.com/xgboost-1.0.2-py3-none-manylinux1_x86_64.whl. |
@hcho3 Yup. It works fine on my machine too |
@hcho3 tested with your new wheel on my aws instance and the test now passes! |
@trivialfis Can you elaborate what your patch does? Does it hide certain symbols? |
It hides all the symbols, except for C APIs. So if anyone's using C++ header, it might generate a lots of linker errors. |
@trivialfis I think we can hide all C++ symbols when building Python wheels. I don't think anyone using C++ headers would use the Pip wheel. WDYT? |
@hcho3 Yup. Good idea. We can make it a CMake option: |
@trivialfis Let me file a pull request. We'll include the fix as part of the upcoming 1.1.0 release. |
We need to be careful about this. Rabit and dmlc core are independently built, I'm not sure what will happen if they throw an error, as hiding symbols means exception can not be propagated out. |
Not entirely sure in the context of static linking. |
Got it. How about compiling the wheel using latest Ubuntu (not CentOS) and put it in a S3 bucket? The TVM CI can pull from this bucket instead of PyPI. Current build environment for the Pip wheel is quite old: CentOS 6 + devtoolset-4 (GCC 5.x). When we drop CUDA 9.0 support, we can upgrade the build environment to CentOS 6 + devtoolset-6 (GCC 7.x). Upgrade may fix the issue. |
Before upgrading libc dependency, we can try to add a test that forces rabit to throw an error, see if it crashes XGBoost with segfault. An uneven all reduce can do. Or a test with dmlc core, a file nonexist error seems to be simple. |
Just make sure error is not thrown in header. |
I believe if it works it will be a net gain for XGBoost, hiding symbols is a good practice for shared libraries. |
With dmlc/xgboost#5590, I can now run |
Thanks @hcho3 @areusch @leandron @trivialfis for resolving this problem. |
With the release of XGBoost 1.0.x (i.e
xgboost-1.0.1-py3-none-manylinux1_x86_64.whl
), it seems that installing TVM from scratch (rebuilding Docker containers) makestests/python/unittest/test_autotvm_xgboost_model.py
to fail with a segfault.Investigating it a bit further, if I manually revert it to
xgboost-0.90
it works fine. Usingxgboost-1.0.1
, this is the message I see:@tqchen, I didn't see any PR or discussion about it, but are you aware about any ongoing initiative to move TVM to XGBoost 1.0.x, or shall we pin xgboost to be 0.90, to prevent the error to happen? (note: I'm happy to send a patch to pin the version)
The text was updated successfully, but these errors were encountered: