Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query for libaio package using known package managers #1250

Merged
merged 13 commits into from
Jul 29, 2021

Conversation

adammoody
Copy link
Contributor

This depends on PR #1247

It extends that PR to add package manager detection and corresponding error messages on which package to install. This is related to #1126 (comment), though it has modified things to only print messages based on the package manager.

@adammoody
Copy link
Contributor Author

adammoody commented Jul 24, 2021

@stas00 , I have these two PRs for the libaio search. This is a bit different from the original plan we discussed.

The distutils.ccompiler test seemed to work well for both my conda install and a system install where libaio is available in the default search paths, which should be the case for most package managers. So I opted to use that as the first search method.

This adds the package manager query step, though now it's to print a message to the user to help them resolve the issue rather than a check as to whether the build will actually work. I suppose it could still be misleading if one has multiple package managers installed, like rpm and dpkg, but I'm not sure how to handle that case. I have worked on systems that did have more than one like that, which each managed a different mount point on the file system.

Please let me know if you still think we should flip this search method.

Copy link
Collaborator

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good to check that the functionality is actually there - with the compile check! Much much better.

Just a few suggestions to perhaps to make the troubleshooting easier for the user.

(and as I commented in the other PR, probably need to gracefully report if the compiler is not found, rather than possibly sending a cryptic error message).

Thank you!

Comment on lines 93 to 98
f"{self.NAME} requires libaio but it was not found. If libaio is installed, try setting CFLAGS and LDFLAGS."
)

# Check for the libaio package via known package managers
# to print suggestions on which package to install.
self.check_for_libaio_pkg()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"{self.NAME} requires libaio but it was not found. If libaio is installed, try setting CFLAGS and LDFLAGS."
)
# Check for the libaio package via known package managers
# to print suggestions on which package to install.
self.check_for_libaio_pkg()
f"{self.NAME} requires the dev libaio .so object and headers but these were not found.
)
# Check for the libaio package via known package managers
# to print suggestions on which package to install.
self.check_for_libaio_pkg()
self.warning("If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.")

I'd suggest to first give users a possible easy solution via their package manager and only then the fallback to the complicated solution.

But I also don't fully understand the logic here - if the right package is found already to be installed why would it fail to build io_submit. User messing up their global ld.conf?

If we checked that it's already found to be installed, should we not say that? rather than "if libaio is already installed"

Copy link
Contributor Author

@adammoody adammoody Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing this, @stas00 . If libaio was installed from source or through something like conda, it might actually be installed, but just not in a path where the compiler looks by default. The CFLAGS/LDFLAGS tip is trying to help users who are in that situation.

The original code I had in the PR would print something like:

async_io requires libaio but it was not found.  If libaio is installed, try setting CFLAGS and LDFLAGS.
async_io: please install the libaio-devel package with rpm

Flipping them and adopting your new wording would produce something like:

async_io: please install the libaio-devel package with rpm
If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.

Yes, that second option sounds better to me, too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally get the CFLAGS/LDFLAGS case - I probably wasn't clear if it sounded to you that I thought it was redundant. That's why I thought you're in the best position to complete this, since yours is that special case, Adam.

Copy link
Collaborator

@stas00 stas00 Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but actually it'll produce 3 lines:

async_io requires the dev libaio .so object and headers but these were not found. 
please install the libaio-devel package with rpm
If libaio is already installed, (perhaps from source), try setting CFLAGS and LDFLAGS environment variables to where it can be found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just realized that I missed that third line you suggested. Those changes should be in there now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to add a check for the C compiler.

@@ -59,6 +90,15 @@ def is_compatible(self):
aio_compatible = self.has_function('io_submit', ('aio', ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a compile error when I build DeepSpeed from source using install.sh. I don't fully understand what this function is doing, but it generates and fails to compile the following source.

main (int argc, char **argv) {
    io_submit();
}

The error message is reproducible by running gcc directly on the file:

/tmp/io_submitikjqrmn5.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 main (int argc, char **argv) {
 ^~~~
/tmp/io_submitikjqrmn5.c: In function ‘main’:
/tmp/io_submitikjqrmn5.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
     io_submit();
     ^~~~~~~~~
/tmp/ccFfhIAq.o: In function `main':
io_submitikjqrmn5.c:(.text+0x15): undefined reference to `io_submit'
collect2: error: ld returned 1 exit status

So I am curious how this is working for you (and in our CI). Can you help? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally hit approve, sorry for the confusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_function method is creating a simple C program that uses a named function, which it then attempts to compile and link. In this case, the link is failing with undefined reference to `io_submit'. That function should be provided in libaio.so, so one problem might be that's failing to link with libaio.

Does this cause your whole build fail, or just the test to detect whether libaio is available?

I'll see if I can reproduce it. Can you share the install.sh command you're using?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is failing the entire build. Sorry for not linking install.sh earlier. It is the standard script for building from source.

Thanks for the explanation. As far as I can see that c program is not well-formed since it is missing the #include<libaio.h> at the minimum. Also, io_submit takes arguments so I don't expect this call io_submit() to succeed. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that C program is not exactly well-formed, but something like that is typically enough to test whether a certain function symbol exists. There are two problems I'd like to figure out.

First, I can imagine one gets this error if the test program fails to link with libaio or the libaio that it finds does not provide the io_submit symbol for some reason.

Second, I'd like to figure out why the failure of that test link kills the whole build rather than just disables the libaio extension.

I can force a similar type of error message if I replace io_submit with some bogus function name that I know does not exist. However, in my case, the build seems to just disable the aio op and keep going rather than aborting the whole build. I'm running a python setup.py build_dist command directly rather than using the install.sh script. I still need to try that myself.

As for that second item where it kills your build, do you see any other errors messages in the install.sh build output?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I mentioned earlier, this program may crash as well if the user doesn't cc tools installed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ccompiler.has_function method doesn't seem to raise an exception here, even when the link fails due to an undefined symbol. It just seems to return False in that case.

I'm reworking this to use compile and link_executable instead, which do seem to raise exceptions. That should also let us catch compile errors (e.g., no compiler) separately from link errors (io_submit is undefined). That requires one to put together the test program manually.

I think that will help resolve the missing compiler problem. However, I'm unsure whether this will help with either of the other two problems above, namely, why is io_submit missing, and why is that causing the whole build to abort rather than just disabling the aio op. We'll keep hunting those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're trying to solve a very complex problem, Adam. This is super challenging, and we haven't even tested it yet on a bunch of different platforms. I'm very grateful you were inspired to work on this.

If you find a way for this compile test to gracefully fail, I think it'd be a huge help - that's to have it optional and not getting in the way of the build.

Copy link
Contributor

@tjruwase tjruwase Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is failing the entire build. Sorry for not linking install.sh earlier. It is the standard script for building from source.

Thanks for the explanation. As far as I can see that c program is not well-formed since it is missing the #include<libaio.h> at the minimum. Also, io_submit takes arguments so I don't expect this call io_submit() to succeed. Thanks.

Apologies for the delayed response. Below is the error I see with master (without your latest changes). I will share the results of your branch asap. Also, to echo @stas00, thanks for doggedly taking on a very challenging problem.

These printout is in two parts, the first is from the very beginning, and the second is from the end of the build.

olruwase@webxt6cb300009C:/data/users/olruwase/deepseed/public/query_libaio$ bash install.sh -s
Attempting to remove deepspeed/git_version_info_installed.py
Attempting to remove dist
Attempting to remove build
Attempting to remove deepspeed.egg-info
Building deepspeed wheel
DS_BUILD_OPS=0
/tmp/io_submit_w38udcc.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 main (int argc, char **argv) {
 ^~~~
/tmp/io_submit_w38udcc.c: In function ‘main’:
/tmp/io_submit_w38udcc.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
     io_submit();
     ^~~~~~~~~
/tmp/io_submitk9f18iq8.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 main (int argc, char **argv) {
 ^~~~
/tmp/io_submitk9f18iq8.c: In function ‘main’:
/tmp/io_submitk9f18iq8.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
     io_submit();
     ^~~~~~~~~
worker-0: Installing collected packages: deepspeed
worker-0: Successfully installed deepspeed-0.4.4+6ae756c
worker-0: Warning: Permanently added '[192.168.0.52]:44896' (ECDSA) to the list of known hosts.
worker-0: /tmp/io_submitlnvp9xtf.c:1:1: warning: return type defaults to 'int' [-Wimplicit-int]
worker-0:  main (int argc, char **argv) {
worker-0:  ^~~~
worker-0: /tmp/io_submitlnvp9xtf.c: In function 'main':
worker-0: /tmp/io_submitlnvp9xtf.c:2:5: warning: implicit declaration of function 'io_submit' [-Wimplicit-function-declaration]
worker-0:      io_submit();
worker-0:      ^~~~~~~~~
worker-0: --------------------------------------------------
worker-0: DeepSpeed C++/CUDA extension op report
worker-0: --------------------------------------------------
worker-0: NOTE: Ops not installed will be just-in-time (JIT) compiled at
worker-0:       runtime if needed. Op compatibility means that your system
worker-0:       meet the required dependencies to JIT install the op.
worker-0: --------------------------------------------------
worker-0: JIT compiled ops requires ninja
worker-0: ninja .................. [OKAY]
worker-0: --------------------------------------------------
worker-0: op name ................ installed .. compatible
worker-0: --------------------------------------------------
worker-0: cpu_adam ............... [NO] ....... [OKAY]
worker-0: fused_adam ............. [NO] ....... [OKAY]
worker-0: fused_lamb ............. [NO] ....... [OKAY]
worker-0: sparse_attn ............ [NO] ....... [OKAY]
worker-0: transformer ............ [NO] ....... [OKAY]
worker-0: stochastic_transformer . [NO] ....... [OKAY]
worker-0: async_io ............... [NO] ....... [OKAY]
worker-0: transformer_inference .. [NO] ....... [OKAY]
worker-0: utils .................. [NO] ....... [OKAY]
worker-0: quantizer .............. [NO] ....... [OKAY]
worker-0: --------------------------------------------------
worker-0: DeepSpeed general environment info:
worker-0: torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
worker-0: torch version .................... 1.8.0a0+17f8c32
worker-0: torch cuda version ............... 11.1
worker-0: nvcc version ..................... 11.1
worker-0: deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
worker-0: deepspeed info ................... 0.4.4+6ae756c, 6ae756c, master
worker-0: deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
worker-0: Warning: Permanently added '[192.168.0.52]:44896' (ECDSA) to the list of known hosts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tjruwase . In the output about, it looks like it's printing the compiler warnings about the implicit return types, but I don't see the linker error about io_submit being undefined. Actually, the bottom part of the output suggests that it left the async_io op enabled, which probably means it must have also linked successfully.

Do you also see the linker error in your install.sh output?

I've seen people suggest ways to silence the compile/link output from distutils to hide those warnings by redirecting stderr with os.dup2. For now, I'm not hiding that output since it should help us debug the problem.

@@ -59,6 +90,15 @@ def is_compatible(self):
aio_compatible = self.has_function('io_submit', ('aio', ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally hit approve, sorry for the confusion

@adammoody
Copy link
Contributor Author

The ./install.sh build also works in my environment, so I'm not able to reproduce the same failure mode. This has been reworked to include try/catch statements using different ccompiler methods. I've also added some print statements to perhaps shed more light on things.

@tjruwase , when you get a chance, would you please try your build again this lastest version?

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

when it works for Tunji, please ping me and I will run a test on my box too before merging.

@tjruwase
Copy link
Contributor

@adammoody, I am so sorry! I previously shared a number of incorrect feedback on this PR as I totally misread my logs. Apologies for the unnecessary alarm and confusion from me and the extra work on you. Let me try to set the record straight.

  1. What self.has_function() was reporting in the build log are compiler warnings, not compiler errors. I recognize that there is a big difference between this two. Moreover, given that the purpose of self.has_function(), as you explained, is simply to use linker errors to detect the presence/absence of a function symbol, then I think it is okay to ignore preceding compiler warnings since they don't prevent the linker step from succeeding.
  2. These compiler warnings did not fail the entire build, as I previously claimed. I simply did not read the previous log carefully. This line from the previous log showed that the async_io is OKAY, i.e. deemed compatible by the build process.
worker-0: async_io ............... [NO] ....... [OKAY]
  1. Still on the previous build, the libaio extension (i.e., deepspeed.ops.aio) was not disabled and is infact usable. I verified this by successfully running the associated unit test as follows:
pytest tests/unit/test_aio.py
======================================================= test session starts ========================================================
platform linux -- Python 3.6.10, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: 
plugins: forked-1.3.0, pythonpath-0.7.3, cov-2.10.1, hypothesis-4.50.8
collected 28 items

tests/unit/test_aio.py ............................                                                                          [100%]

================================================== 28 passed in 361.58s (0:06:01) ==================================================
  1. With the latest updates to this branch, the main difference I see are the extra log messages showing the entire failing compiler command line.
worker-0: Compiler default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
worker-0: Compiler (so) default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
worker-0: Linker default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/']
worker-0: Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/opt/conda/compiler_compat', '-L/opt/conda/lib', '-Wl,-rpath=/opt/conda/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
worker-0: Tempfile: /tmp/tmpbs6r3xbl/test.c
worker-0: Test program: int main(int argc, char** argv) { io_submit(); return 0; }
worker-0: CFLAGS: []
worker-0: creating tmp/tmpbs6r3xbl
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmpbs6r3xbl/test.c -o tmp/tmpbs6r3xbl/test.o
worker-0: /tmp/tmpbs6r3xbl/test.c: In function 'main':
worker-0: /tmp/tmpbs6r3xbl/test.c:1:35: warning: implicit declaration of function 'io_submit' [-Wimplicit-function-declaration]
worker-0:  int main(int argc, char** argv) { io_submit(); return 0; }
worker-0:                                    ^~~~~~~~~
worker-0: LDFLAGS: []
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ tmp/tmpbs6r3xbl/test.o -laio -o /tmp/tmpbs6r3xbl/a.out
worker-0: Cleaning up temp directory for the test C program

@adammoody, once again I do apologize for the unnecessary alarm and extra work.

@stas00, please go ahead with testing on your side.

Going forward, I wonder if we should suppress the expected compiler warnings (i.e., -Wimplicit-int, -Wimplicit-function-declaration) in the build process since they are expected behavior of self.has_function(). What do you guys think?

@tjruwase
Copy link
Contributor

So I went ahead to test my proposal of suppressing these specific compiler warnings using the following change to this branch:

diff --git a/op_builder/builder.py b/op_builder/builder.py
index 033a3b6..db0cfef 100644
--- a/op_builder/builder.py
+++ b/op_builder/builder.py
@@ -204,7 +204,7 @@ class OpBuilder(ABC):
             #os.dup2(filestderr.fileno(), sys.stderr.fileno())

             # Attempt to compile the C program into an object file.
-            cflags = []
+            cflags = ['-Wno-implicit-int', '-Wno-implicit-function-declaration']
             if 'CFLAGS' in os.environ:
                 cflags = os.environ['CFLAGS'].split()
             print('CFLAGS:', cflags)

And now the corresponding build log snippet looks like the following. What do you think?

worker-0: Testing for io_submit
worker-0: Compiler default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
worker-0: Compiler (so) default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
worker-0: Linker default opts: ['gcc', '-pthread', '-B', '/opt/conda/compiler_compat', '-Wl,--sysroot=/']
worker-0: Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/opt/conda/compiler_compat', '-L/opt/conda/lib', '-Wl,-rpath=/opt/conda/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
worker-0: Tempfile: /tmp/tmp871com4z/test.c
worker-0: Test program: int main(int argc, char** argv) { io_submit(); return 0; }
worker-0: CFLAGS: ['-Wno-implicit-int', '-Wno-implicit-function-declaration']
worker-0: creating tmp/tmp871com4z
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Wno-implicit-int -Wno-implicit-function-declaration -c /tmp/tmp871com4z/test.c -o tmp/tmp871com4z/test.o
worker-0: LDFLAGS: []
worker-0: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ tmp/tmp871com4z/test.o -laio -o /tmp/tmp871com4z/a.out
worker-0: Cleaning up temp directory for the test C program
worker-0: async_io ............... [NO] ....... [OKAY]

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

with libaio preinstalled I get:

    Running setup.py (path:/mnt/nvme1/code/github/00optimize/deepspeed/setup.py) egg_info for package from file:///mnt/nvme1/code/github/00optimize/deepspeed
    Created temporary directory: /tmp/pip-pip-egg-info-xsatlyf7
    Running command python setup.py egg_info
    DS_BUILD_OPS=0
    Testing for io_submit
    Compiler default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
    Compiler (so) default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
    Linker default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/']
    Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-L/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,-rpath=/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
    Tempfile: /tmp/tmp7b212fef/test.c
    Test program: int main(int argc, char** argv) { io_submit(); return 0; }
    CFLAGS: []
    creating tmp
    creating tmp/tmp7b212fef
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmp7b212fef/test.c -o tmp/tmp7b212fef/test.o
    /tmp/tmp7b212fef/test.c: In function ‘main’:
    /tmp/tmp7b212fef/test.c:1:35: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
        1 | int main(int argc, char** argv) { io_submit(); return 0; }
          |                                   ^~~~~~~~~
    LDFLAGS: []
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ tmp/tmp7b212fef/test.o -laio -o /tmp/tmp7b212fef/a.out
    Cleaning up temp directory for the test C program
    Testing for io_submit
    Compiler default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
    Compiler (so) default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
    Linker default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/']
    Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-L/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,-rpath=/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
    Tempfile: /tmp/tmp24l1d9ti/test.c
    Test program: int main(int argc, char** argv) { io_submit(); return 0; }
    CFLAGS: []
    creating tmp/tmp24l1d9ti
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmp24l1d9ti/test.c -o tmp/tmp24l1d9ti/test.o
    /tmp/tmp24l1d9ti/test.c: In function ‘main’:
    /tmp/tmp24l1d9ti/test.c:1:35: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
        1 | int main(int argc, char** argv) { io_submit(); return 0; }
          |                                   ^~~~~~~~~
    LDFLAGS: []
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ tmp/tmp24l1d9ti/test.o -laio -o /tmp/tmp24l1d9ti/a.out
    Cleaning up temp directory for the test C program

if you look it's doing the attempt to build the program twice

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

If I uninstall libaio-dev:

    Running setup.py (path:/mnt/nvme1/code/github/00optimize/deepspeed/setup.py) egg_info for package from file:///mnt/nvme1/code/github/00optimize/deepspeed
    Created temporary directory: /tmp/pip-pip-egg-info-femeyzk_
    Running command python setup.py egg_info
    DS_BUILD_OPS=0
    Testing for io_submit
    Compiler default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
    Compiler (so) default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
    Linker default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/']
    Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-L/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,-rpath=/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
    Tempfile: /tmp/tmpkmtnddn5/test.c
    Test program: int main(int argc, char** argv) { io_submit(); return 0; }
    CFLAGS: []
    creating tmp/tmpkmtnddn5
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmpkmtnddn5/test.c -o tmp/tmpkmtnddn5/test.o
    /tmp/tmpkmtnddn5/test.c: In function ‘main’:
    /tmp/tmpkmtnddn5/test.c:1:35: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
        1 | int main(int argc, char** argv) { io_submit(); return 0; }
          |                                   ^~~~~~~~~
    LDFLAGS: []
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ tmp/tmpkmtnddn5/test.o -laio -o /tmp/tmpkmtnddn5/a.out
    /home/stas/anaconda3/envs/py38-pt19/compiler_compat/ld: cannot find -laio
    collect2: error: ld returned 1 exit status
    Failed to link a test C program
    Cleaning up temp directory for the test C program
     [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
     [WARNING]  async_io: please install the libaio-dev package with dpkg
     [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
    Testing for io_submit
    Compiler default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes']
    Compiler (so) default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/', '-Wsign-compare', '-DNDEBUG', '-g', '-fwrapv', '-O3', '-Wall', '-Wstrict-prototypes', '-fPIC']
    Linker default opts: ['gcc', '-pthread', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-Wl,--sysroot=/']
    Linker (so) default opts: ['gcc', '-pthread', '-shared', '-B', '/home/stas/anaconda3/envs/py38-pt19/compiler_compat', '-L/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,-rpath=/home/stas/anaconda3/envs/py38-pt19/lib', '-Wl,--no-as-needed', '-Wl,--sysroot=/']
    Tempfile: /tmp/tmpbjngta83/test.c
    Test program: int main(int argc, char** argv) { io_submit(); return 0; }
    CFLAGS: []
    creating tmp/tmpbjngta83
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -c /tmp/tmpbjngta83/test.c -o tmp/tmpbjngta83/test.o
    /tmp/tmpbjngta83/test.c: In function ‘main’:
    /tmp/tmpbjngta83/test.c:1:35: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
        1 | int main(int argc, char** argv) { io_submit(); return 0; }
          |                                   ^~~~~~~~~
    LDFLAGS: []
    gcc -pthread -B /home/stas/anaconda3/envs/py38-pt19/compiler_compat -Wl,--sysroot=/ tmp/tmpbjngta83/test.o -laio -o /tmp/tmpbjngta83/a.out
    /home/stas/anaconda3/envs/py38-pt19/compiler_compat/ld: cannot find -laio
    collect2: error: ld returned 1 exit status
    Failed to link a test C program
    Cleaning up temp directory for the test C program
     [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
     [WARNING]  async_io: please install the libaio-dev package with dpkg
     [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

So the only nuance from my side is that many Ubuntu users

  1. may not know what dpkg is, as apt is the package manager (I don't see a way to query a package with apt)
  2. dpgk doesn't automatically fetch packages, whereas apt does

Therefore, could we expand that lookup table to have an entry for the program that we tell users to use? e.g. on Ubuntu that would be:

 [WARNING]  async_io: please install the libaio-dev package with apt

alternatively just dropping the package manager name from the warning, as in:

 [WARNING]  async_io: please install the libaio-dev system package

note I changed the wording to system package, so users won't try to do pip install

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

But while we are at it, as I raised this issue in the past. Why are we issuing a warning and not an exception? We know DS_BUILD_AIO=1 will fail to build, and the warning is much harder to see.

Any reason why not to make it an exception during the build?

this shouldn't succeed w/o libaio, yet it does:

DS_BUILD_AIO=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

and then will fail at runtime.

@tjruwase
Copy link
Contributor

@stas00, so should this case fail because we are building only one extension? I believe by default, we test compatibility of all extensions even though user might not need all of them.

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

Yes, this is what I was suggesting. So if DS_BUILD_AIO=1 (or BUILD_ALL) and the setup is incompatible it should assert.

So this is an additional suggestion.

And if it's just checking for compatibility then warning is right.

@adammoody
Copy link
Contributor Author

re: redhat, is yum the standard package manager these days? I haven't used redhat in decades, but from what I remember rpm was a standalone package manager.

This is different with apt, where dpkg is low-level tool and apt is the package manager.

I believe we use yum to install rpm packages on our systems. It plays a role like apt where it can fetch packages from remote repositories and then install them via rpm.

@tjruwase
Copy link
Contributor

@tjruwase , I just added a change to setup.py that might do the trick before I saw your note.

7bf5c0f

If we think it'll take some iterations, we can do that in a different PR.

Awesome. This worked for me in that without libaio-dev this now breaks, sudo DS_BUILD_AIO=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

      File "/data/users/olruwase/deepseed/public/findlibaio2/setup.py", line 122, in <module>
        assert False, f"Unable to pre-compile {op_name}, perhaps disable with {env_var}=0"
    AssertionError: Unable to pre-compile async_io, perhaps disable with DS_BUILD_AIO=0
    DS_BUILD_OPS=0
     [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
     [WARNING]  async_io: please install the libaio-dev package with apt
     [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 188, in _main
    status = self.run(options, args)
  File "/opt/conda/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
    return func(self, options, args)
  File "/opt/conda/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 333, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/opt/conda/lib/python3.6/site-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
    discovered_reqs.extend(self._resolve_one(requirement_set, req))

@tjruwase
Copy link
Contributor

@adammoody, this looks good to merge.

@stas00, what say ye?

@adammoody
Copy link
Contributor Author

@tjruwase , squeezed in one more commit that I think should just run each is_compatible test once. I'll put down the pencil now.

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

Let me re-test

@tjruwase
Copy link
Contributor

@tjruwase , squeezed in one more commit that I think should just run each is_compatible test once. I'll put down the pencil now.

And all the surviving DeepSpeed bugs let out a collective sigh of relief 🥇

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

with the lib it's good, w/o it it's not great, I don't think the user can make any sense out of it:

DS_BUILD_AIO=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

    Running setup.py (path:/mnt/nvme1/code/github/00optimize/deepspeed/setup.py) egg_info for package from file:///mnt/nvme1/code/github/00optimize/deepspeed
    Created temporary directory: /tmp/pip-pip-egg-info-nx64tvmv
    Running command python setup.py egg_info
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/mnt/nvme1/code/github/00optimize/deepspeed/setup.py", line 124, in <module>
        assert False, f"Unable to pre-compile {op_name}, perhaps disable with {env_var}=0"
    AssertionError: Unable to pre-compile async_io, perhaps disable with DS_BUILD_AIO=0
    DS_BUILD_OPS=0
     [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
     [WARNING]  async_io: please install the libaio-dev package with apt
     [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
WARNING: Discarding file:///mnt/nvme1/code/github/00optimize/deepspeed. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.                                                                                                                          
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 180, in _main
    status = self.run(options, args)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
    return func(self, options, args)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 318, in run
    requirement_set = resolver.resolve(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 103, in resolve
    r = self.factory.make_requirement_from_install_req(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 444, in make_requirement_from_install_req
    raise self._build_failures[ireq.link]
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 185, in _make_candidate_from_link
    self._editable_candidate_cache[link] = EditableCandidate(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 334, in __init__
    super().__init__(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 151, in __init__
    self.dist = self._prepare()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 234, in _prepare
    dist = self._prepare_distribution()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 345, in _prepare_distribution
    return self._factory.preparer.prepare_editable_requirement(self._ireq)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 622, in prepare_editable_requirement
    dist = _get_prepared_distribution(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 60, in _get_prepared_distribution
    abstract_dist.prepare_distribution_metadata(finder, build_isolation)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/distributions/sdist.py", line 36, in prepare_distribution_metadata
    self.req.prepare_metadata()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 549, in prepare_metadata
    self.metadata_directory = self._generate_metadata()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 524, in _generate_metadata
    return generate_metadata_legacy(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/operations/build/metadata_legacy.py", line 67, in generate_metadata
    call_subprocess(
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 244, in call_subprocess
    raise InstallationSubprocessError(proc.returncode, command_desc)
pip._internal.exceptions.InstallationSubprocessError: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Removed file:///mnt/nvme1/code/github/00optimize/deepspeed from build tracker '/tmp/pip-req-tracker-3wthsjb4'
Removed build tracker: '/tmp/pip-req-tracker-3wthsjb4'

After a very close study I finally found the assert message way before the normal traceback.

   AssertionError: Unable to pre-compile async_io, perhaps disable with DS_BUILD_AIO=0
       DS_BUILD_OPS=0

and then warnings follow, which is very odd.

My vote is at least to kill "perhaps disable with DS_BUILD_AIO=0 DS_BUILD_OPS=0" - as this doesn't suggest anything useful. I enabled these manually because I needed those. And definitely not DS_BUILD_OPS=0 because it defeats the purpose of pre-buillding completely.

What this needs is a rethinking of why warnings are printed separately from the error and after it. Since it's illogical to the user.
But perhaps some other time.

For now, perhaps just changing the assert message to:

AssertionError: Unable to pre-compile async_io - can't find libaio .so and/or header files. See other messages for more details.

will be a good enough first step and then merge.

@adammoody
Copy link
Contributor Author

adammoody commented Jul 28, 2021

Hmm, yeah, the messages are being displayed in a different order than how they are printed, probably due to a mixing of stdout and stderr streams. One of those is coming from assert and the others are self.warnings(). Let me take another stab at it.

@adammoody
Copy link
Contributor Author

adammoody commented Jul 28, 2021

@stas00 , how about this change?

afc559c

I modified setup.py to only print the suggestion to disable an op if it sees that the environment variable for that op is not already defined. The assert message just says that it failed to build, but offers no suggestion.

The order of those messages might still come out mixed (stderr vs stdout), though maybe it's less confusing.

To fix the order, we probably should have builder.warning() write to stderr or define a builder.error() function that prints to stdout like builder.warning() does.

@stas00
Copy link
Collaborator

stas00 commented Jul 28, 2021

how about this change? I modified setup.py to only print the suggestion to disable an op if it sees that the environment variable for that op is not already defined. The assert message just says that it failed to build, but offers no suggestion.

Looks good. thank you!

Trying to sync std streams can be a madness, when I mentioned re-think I was thinking that different components do their work, and print nothing and return success/failure and any messages instead to the caller. Then the manager that runs these components has the freedom to print any messages it collected from the components in the order it needs.

But again, it's probably a job for another time.

@adammoody
Copy link
Contributor Author

It looks like setup.py only uses assert to bail out when it detects a fatal error. I'm guessing that assert is what generates the (noisy) python stack trace. We could avoid that if there is a better way to exit, but I don't know the convention for bailing with an error from a setup script offhand.

@stas00
Copy link
Collaborator

stas00 commented Jul 29, 2021

adding a few empty lines before and after the assert?

e.g. pytorch recently started doing logs with banners that stand out from the other noise, examples:

******************************************
  /home/ubuntu/tmp/test_script.py FAILED
==========================================
Root Cause:
[0]:
  time: 2021-07-23_02:22:07
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 43258)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
==========================================
Other Failures:
[1]:
  time: 2021-07-23_02:22:07
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 43259)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2021-07-23_02:22:07
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 43260)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2021-07-23_02:22:07
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 43261)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
******************************************

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 3094556 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************

may be doing something similar here?

@adammoody
Copy link
Contributor Author

I can revert that with a new commit if needed, but this last commit adds a abort function that prints and error message in red and then calls sys.exit(1) rather than assert.

@stas00
Copy link
Collaborator

stas00 commented Jul 29, 2021

  1. I'm not sure if all terminals support color codes - typically I don't see those being used other than in some small projects. It's up to @tjruwase to approve or not.

  2. sys.exit(1) can break the "super"-system that calls setup.py - e.g. if installing some other package via pip and deepspeed is just a dependency of some package, I think this would be a very unwelcome behavior.

@adammoody
Copy link
Contributor Author

Ah, I see. Great points, @stas00 .

A call to sys.exit(1) definitely seems like a bad idea.

I borrowed the color code idea from the builder.warning() code:

YELLOW = '\033[93m'
END = '\033[0m'
WARNING = f"{YELLOW} [WARNING] {END}"

I will revert this and go back to the assert.

@tjruwase
Copy link
Contributor

I am fine with color code actually. Yes, sys.exit() seems an overkill.

@adammoody
Copy link
Contributor Author

From the previous commit 0e4cde6, we could replace the sys.exit(1) with an assert False, msg.

@adammoody
Copy link
Contributor Author

Ok, I'll stop now, for real this time :-)

Thanks for all of your help with this, @tjruwase and @stas00 .

@stas00
Copy link
Collaborator

stas00 commented Jul 29, 2021

I think your PR got hit by never starting CI - happened a few times to my PRs, you probably need to push an empty commit and it might start:

git commit --allow-empty -m "Trigger CI"
git push

@tjruwase
Copy link
Contributor

I just started the CI.

@tjruwase tjruwase merged commit e82060d into microsoft:master Jul 29, 2021
@adammoody adammoody deleted the findlibaio2 branch July 29, 2021 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants