
Add a script to build libtorch #955

Merged 1 commit into llvm:main on Jun 30, 2022
Conversation

@powderluv (Collaborator)

After a lot of trial and error, this strikes a good balance for a working static library; in the future we can cull it further.

It links the Linux jit_ir tracer and eager mode into libtorch.a so we can ship to PyPI as a hermetic package, and removes any Python-level probing.

@powderluv powderluv force-pushed the shell-build branch 3 times, most recently from 67ab5ab to 08ee7b1 Compare June 20, 2022 21:49
@rdadolf (Collaborator) commented Jun 20, 2022

Hey. Thanks again for trying to get this rolling. A couple quick comments, and if you want, I'll review in more detail later:

  • I'd like to understand better the compatibility enforcement mechanism here. By adding a static PyTorch library, it looks like you're taking a 2-body problem and making a 3-body one. I.e., before we had PyTorch and torch-llvm, and now we'll have PyTorch, torch-llvm, and the PyTorch static library torch-llvm is linked to. As I understand it, this means that users will now have to be fairly careful about making sure all three pieces match. PyBind11 tracks the compilation parameters of the C++ it wraps, and attempting to mix similar-looking Python objects will result in type exceptions. What's your thinking on best practices for how users (and developers) should be managing these?
  • You'll definitely want to be setting -fabi-version=<...> in your gcc compiler flags, along with -U__GXX_ABI_VERSION -D__GXX_ABI_VERSION=<...>. PyTorch's release builds do this, but AFAIK their source builds do not. It's a bug to have it set incorrectly. Check the old jit cmake file for more info on this.
  • I'd suggest passing in some of the environment variables from cmake instead of relying on the hard-coded defaults. It's good to have those defaults, but cmake already has more information about them, and we should probably leverage that. E.g., the install path, build directory, compiler names, etc. are all known to cmake but not to this script. Since you're calling the script from cmake, it should be fairly straightforward to pass them as part of the exec process.
  • Related to above: the build directory is hard-coded in build_pytorch() and package_pytorch(). I think I'd consider that a bug. It should follow the cmake build location.
  • Less important, but I'd like to request a flag that makes non-detection an error. This build script can be destructive in some ways (for instance, the pip invocation can screw up an environment), so it'd be good to be able to say "I should have this already built. If you can't find it, just quit. Don't build."
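Such a "refuse to build if not found" flag could be sketched as follows, in the Python direction suggested later in this thread. All names here (`find_libtorch`, `--require-existing`, the archive path) are hypothetical illustrations, not the PR's actual interface:

```python
import argparse
import os
import sys

def find_libtorch(search_dir):
    """Return the path to an existing libtorch archive under search_dir, or None.

    The layout checked here is a hypothetical example.
    """
    candidate = os.path.join(search_dir, "libtorch", "lib", "libtorch.a")
    return candidate if os.path.exists(candidate) else None

def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--search-dir", default=".")
    parser.add_argument(
        "--require-existing", action="store_true",
        help="Fail instead of building if libtorch is not found.")
    args = parser.parse_args(argv)

    found = find_libtorch(args.search_dir)
    if found is None and args.require_existing:
        # Non-detection is an error: never fall through to a build step
        # (which could, e.g., invoke pip and disturb the environment).
        print("libtorch not found; refusing to build", file=sys.stderr)
        return 1
    # ... otherwise proceed to download or build as before ...
    return 0
```

With `--require-existing`, a missing libtorch exits with an error instead of silently kicking off a source build.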

@powderluv powderluv force-pushed the shell-build branch 2 times, most recently from 00951c3 to 8651487 Compare June 21, 2022 18:58
@powderluv (Collaborator, Author)

Thanks for the comments; please see replies inline:

> I'd like to understand better the compatibility enforcement mechanism here. By adding a static PyTorch library, it looks like you're taking a 2-body problem and making a 3-body one. I.e., before we had PyTorch and torch-llvm, and now we'll have PyTorch, torch-llvm, and the PyTorch static library torch-llvm is linked to. As I understand it, this means that users will now have to be fairly careful about making sure all three pieces match. PyBind11 tracks the compilation parameters of the C++ it wraps, and attempting to mix similar-looking Python objects will result in type exceptions. What's your thinking on best practices for how users (and developers) should be managing these?

I think the goal here is to make it a 2-body problem (or rather a single-body problem when integrated as a static library).
The PyTorch pip install will not matter or be required when we link against the libtorch/ library. The hope is that the coupling becomes loose enough that, in the future, some other opset versioning (like ONNX's) gets us the matching. That said, we will continue to embed the required PyTorch version in setup.py so we always get alignment with that version, until we are confident enough to relax it. The hope is that we get to a point where torch-mlir with "opset XX" (built with some minimum libtorch version) is compatible from version 1.xx.xx to 1.yy.yy of PyTorch. Happy to discuss any better way you can think of to achieve this.

> You'll definitely want to be setting -fabi-version=<...> in your gcc compiler flags, along with -U__GXX_ABI_VERSION -D__GXX_ABI_VERSION=<...>. PyTorch's release builds do this, but AFAIK their source builds do not. It's a bug to have it set incorrectly. Check the old jit cmake file for more info on this.

Will do. I think the latest patch is hitting this ABI mismatch issue, and I will update it accordingly.

> I'd suggest passing in some of the environment variables from cmake instead of relying on the hard-coded defaults. It's good to have those defaults, but cmake already has more information about them, and we should probably leverage that. E.g., the install path, build directory, compiler names, etc. are all known to cmake but not to this script. Since you're calling the script from cmake, it should be fairly straightforward to pass them as part of the exec process.

I tried and failed in so many ways to get PyTorch's build system to behave. There are quirks from years ago (https://github.com/pytorch/pytorch/issues/42018) that make it fragile, and I am sure there are a few more commits / fixes needed in there. So the goal right now is to wrap up the hairball cleanly and then slowly expose other things as required. Ideally ExternalProject_Add is the right way to do it, passing the CMake flags through correctly.

> Related to above: the build directory is hard-coded in build_pytorch() and package_pytorch(). I think I'd consider that a bug. It should follow the cmake build location.

Agreed. Again, we are dealing with PyTorch warts upstream :( . PyTorch builds from `python setup.py` assume a build/ dir and copy things into it. You can't rename that build/ without dealing with more upstream warts, even if you use `--cmake-only`. There is an option for a pure CMake-only build, but then there seem to be issues with generating the right Python version strings, etc.

The alternative is to make a two-step process explicit: you download / build libtorch however you want, we pass Torch_DIR, and everything follows CMake. But I thought that would be cumbersome, so I was trying to merge them into one step.

> Less important, but I'd like to request a flag that makes non-detection an error. This build script can be destructive in some ways (for instance, the pip invocation can screw up an environment), so it'd be good to be able to say "I should have this already built. If you can't find it, just quit. Don't build."

I made the source build explicit with a `-s` flag. Normal users will just download libtorch and not build.

Let me know your thoughts. None of this is ideal; we are just playing the best hand we have without trying to fix all of PyTorch upstream.

@silvasean (Contributor) commented Jun 22, 2022

I still don't understand the 3-body problem aspect here.

Say a user does this:

```python
import torch
import torch_mlir

class TheModule(torch.nn.Module):
    ...

torch_mlir.compile(TheModule(), ...)
```

Internally, torch_mlir.compile will be calling torch.jit.script, which will create C++ data structures from the torch Python module, and those will then be passed to our statically linked libtorch -- it seems like we are just adding another source of version skew.

In other words, I don't think that "hermetic" is really possible -- the user will always be creating C++ data structures using their own installed PyTorch, which, if we statically link, will potentially skew from ours.

@silvasean (Contributor) commented Jun 22, 2022

Speaking generally about the patch though:

  1. That script is way too hairy for bash -- let's write it in Python.
  2. Let's have this be load-bearing in our CI somehow.
  3. I don't like having this be part of the usual build process. Can we make it an out-of-band thing that we can then point CMake to explicitly, bypassing the usual autoprobing? -DTORCH_MLIR_EXPLICIT_TORCH=... or something?

@rdadolf (Collaborator) commented Jun 23, 2022

> I still don't understand the 3-body problem aspect here.
>
> Internally to torch_mlir.compile we will be calling torch.jit.script which will be creating C++ data structures from the torch Python module, and then those will be passed to our statically linked libtorch -- seems like we are just adding another source of version skew.

I think we're saying the same thing here. The three versions are now: the PyTorch Python library used when constructing the compiler (op reg & such), the static library linked in when compiling the compiler, and the PyTorch Python library used when running the compiler (import torch; import torch_mlir). The first must be op-compatible, the latter two must be binary compatible.

> In other words, I don't think that "hermetic" is really possible -- the user will always be creating C++ data structures using their own installed PyTorch, which, if we statically link, will potentially skew from ours.

Agree. That's more or less where I was going. I guess my takeaway from that line of thought is "statically linking just inherently has this downside". This isn't too far from the move we made to snapshotting PyTorch wheels, but it does add more stuff.

I think one of the things that's been bothering me is that there's not been a good way in torch-mlir to detect skew problems, especially in the ABI realm, and adding another piece compounds that. Perhaps the right answer here is to add some direct tracking information and compare/throw early. Better to have an explicit error rather than a misleading type mismatch exception or data corruption. That sounds like something that can be done independent from this, though.
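A minimal sketch of such an early compare/throw check, assuming a hypothetical baked-in record of the torch version torch-mlir was built against (`_BUILT_AGAINST` and `check_torch_compat` are illustrative names, not part of torch-mlir):

```python
# Hypothetical early skew check. Assumes the build bakes in the torch
# version (and, ideally, ABI parameters) it was compiled against.
_BUILT_AGAINST = "1.12.0"  # illustrative placeholder, set at build time

def check_torch_compat(installed_version, built_against=_BUILT_AGAINST):
    """Raise a clear error early instead of a misleading type mismatch
    later. installed_version would come from torch.__version__."""
    base = installed_version.split("+")[0]  # drop local tags like "+cpu"
    if base != built_against:
        raise RuntimeError(
            "torch-mlir was built against torch %s but found torch %s"
            % (built_against, installed_version))
```

A real check would likely also compare ABI parameters (e.g. the GCC ABI version discussed above), not just the version string.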

@silvasean (Contributor)

So yesterday I looked through a bit of the wheel upload process / auditwheel, and if I'm understanding this right, to more crisply describe the benefit here, essentially it allows us to use a more permissive manylinux tag on the wheels by reducing our shared library dependency surface area.

However, considering the use case of a user pip-installing torch-mlir without any special flags, they would still need to install the exact PyTorch version. I'm not aware of any way to have pip automagically install the right one without special flags. One possibility would be for us to always take the JIT IR data structures, serialize them in the installed PyTorch, and then deserialize them in our hermetic PyTorch -- that seems to be a relatively stable interface. There are perf implications there, though, and it doesn't solve the LTC problem :/

@antoniojkim @henrytwo @wconstab - relatedly, what is the plan for compatibility for LTC backends across PyTorch versions?

@antoniojkim (Collaborator)

> I'm not aware of any way to have pip automagically install the right one without special flags

Isn't there a way to set a hard dependency that will install the correct version? That can be set to the nightly builds for snapshots and the release wheels if Torch-MLIR decides to cut a release every time PyTorch does too.
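For illustration, such a hard pin might look like the following hypothetical setup.py fragment (package names and version values are placeholders, not torch-mlir's actual packaging); a nightly snapshot could pin an exact dated torch build the same way:

```python
# Hypothetical setup.py fragment; the pinned version is illustrative.
from setuptools import setup

setup(
    name="torch-mlir",
    version="0.0.1",
    # Hard dependency: pip resolves exactly this torch release, keeping
    # torch-mlir and torch in lockstep. A nightly snapshot would pin a
    # dated build instead, e.g. "torch==1.13.0.dev20220630".
    install_requires=["torch==1.12.0"],
)
```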

> it doesn't solve the LTC problem :/

What LTC problem are you referring to here?

> what is the plan for compatibility for LTC backends across PyTorch versions?

Unless I'm mistaken, once LTC becomes "stable" the interface won't change much (if at all). So in terms of compatibility, it should be backwards compatible from the first "stable" release. When that first "stable" release happens, though, is not something I'm privy to; maybe @wconstab can comment on it.

@rdadolf (Collaborator) commented Jun 24, 2022

> One possibility would be for us to always take the JIT IR data structures and then serialize them in the installed PyTorch and then deserialize them in our hermetic PyTorch

I think I see where you're going with that, but I suspect we'd still have to bend pybind11 over backwards to make things work. While I'm definitely not a pybind expert, my understanding is that even simple things like passing a string from one Python function to another might break, and we wouldn't get a chance to munge/serialize/protect it; some of that happens inside of pybind, and by the time we see it, it'd be too late. It's definitely a more extreme path to take.

> essentially it allows us to use a more permissive manylinux tag on the wheels by reducing our shared library dependency surface area

Can you elaborate a bit on that? If we can compile with a more permissive tag but we can't actually run against those due to PyTorch linking, then do we gain much? I think I'm probably just misreading you or jumping to conclusions.

@wconstab (Collaborator)

> @antoniojkim @henrytwo @wconstab - relatedly, what is the plan for compatibility for LTC backends across PyTorch versions?

Is this just the 'backend api' interface or are you referring to any other interfaces/IR?

We have not formalized a plan for transitioning LTC through 'beta' and 'stable' milestones, though we left it as 'prototype' intentionally in 1.12, as we expect to make any changes needed for initial backends while they ramp up. PTXLA is going through a migration to LTC, and we have made several small changes to accommodate it. I don't anticipate major changes.

@silvasean (Contributor)

> > @antoniojkim @henrytwo @wconstab - relatedly, what is the plan for compatibility for LTC backends across PyTorch versions?
>
> Is this just the 'backend api' interface or are you referring to any other interfaces/IR?

It would be any C++ ABI that is depended on -- this could include struct layout of any data structure passed to backends (e.g. IR data structures)

> I think I see where you're going with that, but I suspect we'd still have to bend pybind11 backwards to make things work. While I'm definitely not a pybind expert, my understanding is that even simple things like passing a string from one python function to another might break, and we wouldn't get a chance to munge/serialize/protect it; some of that is inside of pybind. By the time we see it, it'd be too late. It's definitely a bit more extreme path to take.

Ouch. Yeah, good point!

> Can you elaborate a bit on that? If we can compile with a more permissive tag but we can't actually run against those due to PyTorch linking, then do we gain much? I think I'm probably just misreading you or jumping to conclusions.

That's exactly what I'm getting at -- I don't think that this PR in isolation makes sense without a larger strategy for how to solve that problem. I asked on dev-discuss to get some feedback: https://dev-discuss.pytorch.org/t/productionizing-native-torch-extensions/668

@wconstab (Collaborator)

@silvasean do you think that in the near term we can get by with a combination of

  • for developers, tracking working combinations of nightlies in some ad hoc way
  • for releases, supporting only 'released' versions (e.g. target a release of torch-mlir to coincide with torch 1.13)

and target a 'stable' API further down the road? Or is this more urgent?

@silvasean (Contributor)

> @silvasean do you think that in the near term we can get by with a combination of
>
> • for developers, tracking working combinations of nightlies in some ad hoc way
> • for releases, supporting only 'released' versions (e.g. target a release of torch-mlir to coincide with torch 1.13)
>
> and target a 'stable' API further down the road? Or is this more urgent?

That makes sense to me -- is there a way for pip install torch-mlir to Just Work for this arrangement? Maybe you could weigh in on https://dev-discuss.pytorch.org/t/productionizing-native-torch-extensions/668

@powderluv -- wdyt? Sounds like a lot of your customers want to work off nightlies though, so I'm not sure that this will help make things much better for them.

@wconstab (Collaborator)

> is there a way for pip install torch-mlir to Just Work for this arrangement?

Well, I'm far from a pip/packaging expert. I guess the answer is no, but my suggestion would be to advertise the latest working command in the format of

`pip install <some latest-known-working torch nightly wheel URL> torch-mlir`

which would install that torch version before torch-mlir. Perhaps you could maintain that one-liner on your docs/landing page (not ideal, I know).

@powderluv (Collaborator, Author)

I think @wconstab's suggestion to track just released versions of torch is perfectly fine.

We do have customers on the nightly / dev train, but that is OK -- we will treat them like developers and keep them on "released" torch versions.

The main reason we are targeting the static build of libtorch for now is that torch-mlir becomes hermetic, and

> for developers, tracking working combinations of nightlies in some ad hoc way

becomes easier, since we can now release PyPI nightly packages without stepping on the installed PyTorch nightly; weak linking causes issues like #853.

@powderluv powderluv force-pushed the shell-build branch 3 times, most recently from de3d575 to a2f834d Compare June 29, 2022 18:38
@powderluv powderluv requested review from rdadolf June 29, 2022 18:38
@powderluv (Collaborator, Author)

@rdadolf @silvasean this is ready to review. It builds at parity and tests pass on Linux and macOS. The follow-on to this will enable a source build of libtorch for macOS, and then we can land the custom-op patches.

This moves torch-mlir to link against libtorch on macOS and linux

TESTS: Tests pass. Tested release builds on linux and macOS
@powderluv powderluv requested a review from silvasean June 29, 2022 20:12
@silvasean (Contributor)

Let's give this a shot. I'm still a little worried about having a download step as part of the build -- let's see if it causes problems in practice.

@rdadolf (Collaborator) commented Jun 30, 2022

I'll be honest: there are some parts of this approach that I'm not sure I like. I don't really buy the 'hermetic' argument, since we break with the wrong external PyTorch regardless of how we're built; the build process is getting even more non-standard and complex; and the auto-checks are kludgy (and I still can't come up with a better one ;) ). But if this is the way to unblock Mac builds, and Mac builds are supported and blocking other stuff, then I'd rather see things unblocked and moving. You've already put a huge amount of work into finding a way that works, so if we can get by with this, we can always revisit the mechanisms behind it later.

@powderluv powderluv merged commit 2b52da9 into llvm:main Jun 30, 2022
@powderluv (Collaborator, Author)

Yes, it's a less-than-ideal workflow based on the hand we are dealt from upstream. We will continue trying to land upstream patches so this becomes cleaner / leaner.

Next I will enable source builds for platforms that don't have prebuilts, and then static linking, with which we should be able to enable custom ops and also have an auditwheel-passable PyPI package.

@silvasean (Contributor)

@rdadolf I agree.

gpetters94 pushed a commit to gpetters94/mlir-npcomp that referenced this pull request Jul 12, 2022

gpetters94 pushed a commit to gpetters94/mlir-npcomp that referenced this pull request Jul 27, 2022