-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
query for libaio package using known package managers #1250
Conversation
@stas00 , I have these two PRs for the libaio search. This is a bit different from the original plan we discussed. The This adds the package manager query step, though now it's to print a message to the user to help them resolve the issue rather than a check as to whether the build will actually work. I suppose it could still be misleading if one has multiple package managers installed, like Please let me know if you still think we should flip this search method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really good to check that the functionality is actually there - with the compile check! Much much better.
Just a few suggestions to perhaps to make the troubleshooting easier for the user.
(and as I commented in the other PR, probably need to gracefully report if the compiler is not found, rather than possibly sending a cryptic error message).
Thank you!
op_builder/async_io.py
Outdated
f"{self.NAME} requires libaio but it was not found. If libaio is installed, try setting CFLAGS and LDFLAGS." | ||
) | ||
|
||
# Check for the libaio package via known package managers | ||
# to print suggestions on which package to install. | ||
self.check_for_libaio_pkg() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"{self.NAME} requires libaio but it was not found. If libaio is installed, try setting CFLAGS and LDFLAGS." | |
) | |
# Check for the libaio package via known package managers | |
# to print suggestions on which package to install. | |
self.check_for_libaio_pkg() | |
f"{self.NAME} requires the dev libaio .so object and headers but these were not found. | |
) | |
# Check for the libaio package via known package managers | |
# to print suggestions on which package to install. | |
self.check_for_libaio_pkg() | |
self.warning("If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.") |
I'd suggest to first give users a possible easy solution via their package manager and only then the fallback to the complicated solution.
But I also don't fully understand the logic here - if the right package is found already to be installed why would it fail to build io_submit
. User messing up their global ld.conf
?
If we checked that it's already found to be installed, should we not say that? rather than "if libaio is already installed"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing this, @stas00 . If libaio was installed from source or through something like conda, it might actually be installed, but just not in a path where the compiler looks by default. The CFLAGS/LDFLAGS tip is trying to help users who are in that situation.
The original code I had in the PR would print something like:
async_io requires libaio but it was not found. If libaio is installed, try setting CFLAGS and LDFLAGS.
async_io: please install the libaio-devel package with rpm
Flipping them and adopting your new wording would produce something like:
async_io: please install the libaio-devel package with rpm
If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.
Yes, that second option sounds better to me, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally get the CFLAGS/LDFLAGS
case - I probably wasn't clear if it sounded to you that I thought it was redundant. That's why I thought you're in the best position to complete this, since yours is that special case, Adam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but actually it'll produce 3 lines:
async_io requires the dev libaio .so object and headers but these were not found.
please install the libaio-devel package with rpm
If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just realized that I missed that third line you suggested. Those changes should be in there now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add a check for the C compiler.
@@ -59,6 +90,15 @@ def is_compatible(self): | |||
aio_compatible = self.has_function('io_submit', ('aio', )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get a compile error when I build DeepSpeed from source using install.sh. I don't fully understand what this function is doing, but it generates and fails to compile the following source.
main (int argc, char **argv) {
io_submit();
}
The error message is reproducible by running gcc directly on the file:
/tmp/io_submitikjqrmn5.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
main (int argc, char **argv) {
^~~~
/tmp/io_submitikjqrmn5.c: In function ‘main’:
/tmp/io_submitikjqrmn5.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
io_submit();
^~~~~~~~~
/tmp/ccFfhIAq.o: In function `main':
io_submitikjqrmn5.c:(.text+0x15): undefined reference to `io_submit'
collect2: error: ld returned 1 exit status
So I am curious how this is working for you (and in our CI). Can you help? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accidentally hit approve, sorry for the confusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The has_function
method is creating a simple C program that uses a named function, which it then attempts to compile and link. In this case, the link is failing with undefined reference to `io_submit'
. That function should be provided in libaio.so
, so one problem might be that's failing to link with libaio.
Does this cause your whole build fail, or just the test to detect whether libaio is available?
I'll see if I can reproduce it. Can you share the install.sh
command you're using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is failing the entire build. Sorry for not linking install.sh earlier. It is the standard script for building from source.
Thanks for the explanation. As far as I can see that c program is not well-formed since it is missing the #include<libaio.h>
at the minimum. Also, io_submit
takes arguments so I don't expect this call io_submit()
to succeed. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that C program is not exactly well-formed, but something like that is typically enough to test whether a certain function symbol exists. There are two problems I'd like to figure out.
First, I can imagine one gets this error if the test program fails to link with libaio or the libaio that it finds does not provide the io_submit
symbol for some reason.
Second, I'd like to figure out why the failure of that test link kills the whole build rather than just disables the libaio extension.
I can force a similar type of error message if I replace io_submit
with some bogus function name that I know does not exist. However, in my case, the build seems to just disable the aio op and keep going rather than aborting the whole build. I'm running a python setup.py build_dist
command directly rather than using the install.sh
script. I still need to try that myself.
As for that second item where it kills your build, do you see any other errors messages in the install.sh
build output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as I mentioned earlier, this program may crash as well if the user doesn't cc tools installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ccompiler.has_function
method doesn't seem to raise an exception here, even when the link fails due to an undefined symbol. It just seems to return False
in that case.
I'm reworking this to use compile
and link_executable
instead, which do seem to raise exceptions. That should also let us catch compile errors (e.g., no compiler) separately from link errors (io_submit
is undefined). That requires one to put together the test program manually.
I think that will help resolve the missing compiler problem. However, I'm unsure whether this will help with either of the other two problems above, namely, why is io_submit
missing, and why is that causing the whole build to abort rather than just disabling the aio op. We'll keep hunting those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're trying to solve a very complex problem, Adam. This is super challenging, and we haven't even tested it yet on a bunch of different platforms. I'm very grateful you were inspired to work on this.
If you find a way for this compile test to gracefully fail, I think it'd be a huge help - that's to have it optional and not getting in the way of the build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is failing the entire build. Sorry for not linking install.sh earlier. It is the standard script for building from source.
Thanks for the explanation. As far as I can see that c program is not well-formed since it is missing the
#include<libaio.h>
at the minimum. Also,io_submit
takes arguments so I don't expect this callio_submit()
to succeed. Thanks.
Apologies for the delayed response. Below is the error I see with master (without your latest changes). I will share the results of your branch asap. Also, to echo @stas00, thanks for doggedly taking on a very challenging problem.
These printout is in two parts, the first is from the very beginning, and the second is from the end of the build.
olruwase@webxt6cb300009C:/data/users/olruwase/deepseed/public/query_libaio$ bash install.sh -s
Attempting to remove deepspeed/git_version_info_installed.py
Attempting to remove dist
Attempting to remove build
Attempting to remove deepspeed.egg-info
Building deepspeed wheel
DS_BUILD_OPS=0
/tmp/io_submit_w38udcc.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
main (int argc, char **argv) {
^~~~
/tmp/io_submit_w38udcc.c: In function ‘main’:
/tmp/io_submit_w38udcc.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
io_submit();
^~~~~~~~~
/tmp/io_submitk9f18iq8.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
main (int argc, char **argv) {
^~~~
/tmp/io_submitk9f18iq8.c: In function ‘main’:
/tmp/io_submitk9f18iq8.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
io_submit();
^~~~~~~~~
worker-0: Installing collected packages: deepspeed
worker-0: Successfully installed deepspeed-0.4.4+6ae756c
worker-0: Warning: Permanently added '[192.168.0.52]:44896' (ECDSA) to the list of known hosts.
worker-0: /tmp/io_submitlnvp9xtf.c:1:1: warning: return type defaults to 'int' [-Wimplicit-int]
worker-0: main (int argc, char **argv) {
worker-0: ^~~~
worker-0: /tmp/io_submitlnvp9xtf.c: In function 'main':
worker-0: /tmp/io_submitlnvp9xtf.c:2:5: warning: implicit declaration of function 'io_submit' [-Wimplicit-function-declaration]
worker-0: io_submit();
worker-0: ^~~~~~~~~
worker-0: --------------------------------------------------
worker-0: DeepSpeed C++/CUDA extension op report
worker-0: --------------------------------------------------
worker-0: NOTE: Ops not installed will be just-in-time (JIT) compiled at
worker-0: runtime if needed. Op compatibility means that your system
worker-0: meet the required dependencies to JIT install the op.
worker-0: --------------------------------------------------
worker-0: JIT compiled ops requires ninja
worker-0: ninja .................. [OKAY]
worker-0: --------------------------------------------------
worker-0: op name ................ installed .. compatible
worker-0: --------------------------------------------------
worker-0: cpu_adam ............... [NO] ....... [OKAY]
worker-0: fused_adam ............. [NO] ....... [OKAY]
worker-0: fused_lamb ............. [NO] ....... [OKAY]
worker-0: sparse_attn ............ [NO] ....... [OKAY]
worker-0: transformer ............ [NO] ....... [OKAY]
worker-0: stochastic_transformer . [NO] ....... [OKAY]
worker-0: async_io ............... [NO] ....... [OKAY]
worker-0: transformer_inference .. [NO] ....... [OKAY]
worker-0: utils .................. [NO] ....... [OKAY]
worker-0: quantizer .............. [NO] ....... [OKAY]
worker-0: --------------------------------------------------
worker-0: DeepSpeed general environment info:
worker-0: torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
worker-0: torch version .................... 1.8.0a0+17f8c32
worker-0: torch cuda version ............... 11.1
worker-0: nvcc version ..................... 11.1
worker-0: deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
worker-0: deepspeed info ................... 0.4.4+6ae756c, 6ae756c, master
worker-0: deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
worker-0: Warning: Permanently added '[192.168.0.52]:44896' (ECDSA) to the list of known hosts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tjruwase . In the output about, it looks like it's printing the compiler warnings about the implicit return types, but I don't see the linker error about io_submit
being undefined. Actually, the bottom part of the output suggests that it left the async_io op enabled, which probably means it must have also linked successfully.
Do you also see the linker error in your install.sh
output?
I've seen people suggest ways to silence the compile/link output from distutils to hide those warnings by redirecting stderr with os.dup2
. For now, I'm not hiding that output since it should help us debug the problem.
@@ -59,6 +90,15 @@ def is_compatible(self): | |||
aio_compatible = self.has_function('io_submit', ('aio', )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accidentally hit approve, sorry for the confusion
The @tjruwase , when you get a chance, would you please try your build again this lastest version? |
when it works for Tunji, please ping me and I will run a test on my box too before merging. |
@adammoody, I am so sorry! I previously shared a number of incorrect feedback on this PR as I totally misread my logs. Apologies for the unnecessary alarm and confusion from me and the extra work on you. Let me try to set the record straight.
worker-0: async_io ............... [NO] ....... [OKAY]
pytest tests/unit/test_aio.py
======================================================= test session starts ========================================================
platform linux -- Python 3.6.10, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir:
plugins: forked-1.3.0, pythonpath-0.7.3, cov-2.10.1, hypothesis-4.50.8
collected 28 items
tests/unit/test_aio.py ............................ [100%]
================================================== 28 passed in 361.58s (0:06:01) ==================================================
worker-0: Compiler default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
worker-0: Compiler (so) default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
worker-0: Linker default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/']
worker-0: Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/opt/conda/compiler_compat', '-L/opt/conda/lib', '-Wl,-rpath=/opt/conda/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
worker-0: Tempfile: /tmp/tmpbs6r3xbl/test.c
worker-0: Test program: int main(int argc, char** argv) { io_submit(); return 0; }
worker-0: CFLAGS: []
worker-0: creating tmp/tmpbs6r3xbl
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmpbs6r3xbl/test.c -o tmp/tmpbs6r3xbl/test.o
worker-0: /tmp/tmpbs6r3xbl/test.c: In function 'main':
worker-0: /tmp/tmpbs6r3xbl/test.c:1:35: warning: implicit declaration of function 'io_submit' [-Wimplicit-function-declaration]
worker-0: int main(int argc, char** argv) { io_submit(); return 0; }
worker-0: ^~~~~~~~~
worker-0: LDFLAGS: []
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ tmp/tmpbs6r3xbl/test.o -laio -o /tmp/tmpbs6r3xbl/a.out
worker-0: Cleaning up temp directory for the test C program @adammoody, once again I do apologize for the unnecessary alarm and extra work. @stas00, please go ahead with testing on your side. Going forward, I wonder if we should suppress the expected compiler warnings (i.e., |
So I went ahead to test my proposal of suppressing these specific compiler warnings using the following change to this branch: diff --git a/op_builder/builder.py b/op_builder/builder.py
index 033a3b6..db0cfef 100644
--- a/op_builder/builder.py
+++ b/op_builder/builder.py
@@ -204,7 +204,7 @@ class OpBuilder(ABC):
#os.dup2(filestderr.fileno(), sys.stderr.fileno())
# Attempt to compile the C program into an object file.
- cflags = []
+ cflags = ['-Wno-implicit-int', '-Wno-implicit-function-declaration']
if 'CFLAGS' in os.environ:
cflags = os.environ['CFLAGS'].split()
print('CFLAGS:', cflags) And now the corresponding build log snippet looks like the following. What do you think? worker-0: Testing for io_submit
worker-0: Compiler default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
worker-0: Compiler (so) default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
worker-0: Linker default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/']
worker-0: Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/opt/conda/compiler_compat', '-L/opt/conda/lib', '-Wl,-rpath=/opt/conda/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
worker-0: Tempfile: /tmp/tmp871com4z/test.c
worker-0: Test program: int main(int argc, char** argv) { io_submit(); return 0; }
worker-0: CFLAGS: ['-Wno-implicit-int', '-Wno-implicit-function-declaration']
worker-0: creating tmp/tmp871com4z
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Wno-implicit-int -Wno-implicit-function-declaration -c /tmp/tmp871com4z/test.c -o tmp/tmp871com4z/test.o
worker-0: LDFLAGS: []
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ tmp/tmp871com4z/test.o -laio -o /tmp/tmp871com4z/a.out
worker-0: Cleaning up temp directory for the test C program
worker-0: async_io ............... [NO] ....... [OKAY] |
with libaio preinstalled I get:
if you look it's doing the attempt to build the program twice |
If I uninstall libaio-dev:
|
So the only nuance from my side is that many Ubuntu users
Therefore, could we expand that lookup table to have an entry for the program that we tell users to use? e.g. on Ubuntu that would be:
alternatively just dropping the package manager name from the warning, as in:
note I changed the wording to system package, so users won't try to do |
But while we are at it, as I raised this issue in the past. Why are we issuing a warning and not an exception? We know DS_BUILD_AIO=1 will fail to build, and the warning is much harder to see. Any reason why not to make it an exception during the build? this shouldn't succeed w/o libaio, yet it does:
and then will fail at runtime. |
@stas00, so should this case fail because we are building only one extension? I believe by default, we test compatibility of all extensions even though user might not need all of them. |
Yes, this is what I was suggesting. So if So this is an additional suggestion. And if it's just checking for compatibility then warning is right. |
I believe we use |
Awesome. This worked for me in that without libaio-dev this now breaks, File "/data/users/olruwase/deepseed/public/findlibaio2/setup.py", line 122, in <module>
assert False, f"Unable to pre-compile {op_name}, perhaps disable with {env_var}=0"
AssertionError: Unable to pre-compile async_io, perhaps disable with DS_BUILD_AIO=0
DS_BUILD_OPS=0
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 188, in _main
status = self.run(options, args)
File "/opt/conda/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
return func(self, options, args)
File "/opt/conda/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 333, in run
reqs, check_supported_wheels=not options.target_dir
File "/opt/conda/lib/python3.6/site-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
discovered_reqs.extend(self._resolve_one(requirement_set, req)) |
@adammoody, this looks good to merge. @stas00, what say ye? |
@tjruwase , squeezed in one more commit that I think should just run each |
Let me re-test |
And all the surviving DeepSpeed bugs let out a collective sigh of relief 🥇 |
with the lib it's good, w/o it it's not great, I don't think the user can make any sense out of it:
After a very close study I finally found the assert message way before the normal traceback.
and then warnings follow, which is very odd. My vote is at least to kill "perhaps disable with DS_BUILD_AIO=0 DS_BUILD_OPS=0" - as this doesn't suggest anything useful. I enabled these manually because I needed those. And definitely not What this needs is a rethinking of why warnings are printed separately from the error and after it. Since it's illogical to the user. For now, perhaps just changing the assert message to:
will be a good enough first step and then merge. |
Hmm, yeah, the messages are being displayed in a different order than how they are printed, probably due to a mixing of |
@stas00 , how about this change? I modified The order of those messages might still come out mixed ( To fix the order, we probably should have |
Looks good. thank you! Trying to sync std streams can be a madness, when I mentioned re-think I was thinking that different components do their work, and print nothing and return success/failure and any messages instead to the caller. Then the manager that runs these components has the freedom to print any messages it collected from the components in the order it needs. But again, it's probably a job for another time. |
It looks like |
adding a few empty lines before and after the assert? e.g. pytorch recently started doing logs with banners that stand out from the other noise, examples:
may be doing something similar here? |
I can revert that with a new commit if needed, but this last commit adds a |
|
Ah, I see. Great points, @stas00 . A call to I borrowed the color code idea from the DeepSpeed/op_builder/builder.py Lines 13 to 15 in 97f7ed9
I will revert this and go back to the |
This reverts commit 0e4cde6.
I am fine with color code actually. Yes, |
From the previous commit 0e4cde6, we could replace the |
I think your PR got hit by never starting CI - happened a few times to my PRs, you probably need to push an empty commit and it might start:
|
I just started the CI. |
This depends on PR #1247
It extends that PR to add package manager detection and corresponding error messages on which package to install. This is related to #1126 (comment), though it has modified things to only print messages based on the package manager.