Add a script to build libtorch #955
Conversation
Branch force-pushed from 67ab5ab to 08ee7b1.
Hey. Thanks again for trying to get this rolling. A couple quick comments, and if you want, I'll review in more detail later:
Branch force-pushed from 00951c3 to 8651487.
Thanks for the comments, please see inline:
You'll definitely want to be setting -fabi-version=<...> on your gcc compiler flags and -U__GXX_ABI_VERSION -D__GXX_ABI_VERSION=<...>. PyTorch's release builds do this, but AFAIK their source builds do not. It's a bug to have it set incorrectly. Check the old jit cmake file for more info on this.
I'd suggest passing in some of the environment variables from cmake instead of relying on the hard-coded defaults. It's good to have those defaults, but cmake already has more information about them, and we should probably leverage that. E.g., the install path, build directory, compiler names, etc. are all known to cmake but not to this script. Since you're calling the script from cmake, it should be fairly straightforward to pass them as part of the exec process.
Related to above: the build directory is hard-coded in build_pytorch() and package_pytorch(). I think I'd consider that a bug. It should follow the cmake build location.
Less important, but I'd like to request a flag that makes non-detection an error. This build script can be destructive in some ways (for instance, the pip invocation can screw up an environment), so it'd be good to be able to say "I should have this already built. If you can't find it, just quit. Don't build."
I still don't understand the 3-body problem aspect here. Say a user does this:
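The original snippet isn't preserved in this transcript; a hypothetical reconstruction of the kind of user flow being described might look like the following (the exact `torch_mlir.compile` call and arguments are assumptions for illustration, not necessarily the current API):

```python
# Hypothetical user flow: the user installs their own PyTorch and torch-mlir,
# then compiles a model. torch_mlir.compile internally calls torch.jit.script,
# which builds C++ IR data structures using the *user's* installed PyTorch.
#
#   pip install torch        # user's own PyTorch
#   pip install torch-mlir   # our wheel, possibly built against a different libtorch

import torch
import torch_mlir

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.tanh(x)

# Example input and output_type are illustrative assumptions.
compiled = torch_mlir.compile(
    Model(), torch.ones(2, 3), output_type="linalg-on-tensors"
)
```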
Internally to torch_mlir.compile we will be calling torch.jit.script, which will be creating C++ data structures from the user's installed PyTorch. In other words, I don't think that "hermetic" is really possible -- the user will always be creating C++ data structures using their own installed PyTorch, which, if we statically link, will potentially skew from ours.
Speaking generally about the patch though:
I think we're saying the same thing here. The three versions are now: the PyTorch Python library used when constructing the compiler (op registration and such), the static library linked in when compiling the compiler, and the PyTorch Python library used when running the compiler.
Agree. That's more or less where I was going. I guess my takeaway from that line of thought is "statically linking just inherently has this downside". This isn't too far from the move we made to snapshotting PyTorch wheels, but it does add more stuff. I think one of the things that's been bothering me is that there's not been a good way in torch-mlir to detect skew problems, especially in the ABI realm, and adding another piece compounds that. Perhaps the right answer here is to add some direct tracking information and compare/throw early. Better to have an explicit error than a misleading type mismatch exception or data corruption. That sounds like something that can be done independently of this, though.
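A minimal sketch of what such an early skew check could look like, assuming version/ABI constants stamped in at build time (the `_BUILT_AGAINST_*` names below are made up for illustration):

```python
# Hypothetical import-time skew check for torch-mlir. The _BUILT_AGAINST_*
# constants are assumed to be generated when the wheel is built; the names
# are placeholders, not an existing torch-mlir API.
import torch

_BUILT_AGAINST_TORCH_VERSION = "1.12.0"   # placeholder, stamped at build time
_BUILT_AGAINST_CXX11_ABI = True           # ABI setting of the linked libtorch

def check_torch_skew():
    installed = torch.__version__.split("+")[0]
    if installed != _BUILT_AGAINST_TORCH_VERSION:
        raise ImportError(
            f"torch-mlir was built against torch {_BUILT_AGAINST_TORCH_VERSION}, "
            f"but torch {torch.__version__} is installed."
        )
    if torch.compiled_with_cxx11_abi() != _BUILT_AGAINST_CXX11_ABI:
        raise ImportError(
            "Installed torch uses a different C++ ABI than the libtorch "
            "that torch-mlir was linked against."
        )

check_torch_skew()
```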
So yesterday I looked through a bit of the wheel upload process / auditwheel, and if I'm understanding this right, to more crisply describe the benefit here, essentially it allows us to use a more permissive manylinux tag on the wheels by reducing our shared library dependency surface area. However, when considering the use case of a user pip installing torch-mlir without any special flags, they would still need to install the exact PyTorch version. I'm not aware of any way to have pip automagically install the right one without special flags. One possibility would be for us to always take the JIT IR data structures, serialize them in the installed PyTorch, and then deserialize them in our hermetic PyTorch -- that seems to be a relatively stable interface. There are perf implications there though, and it doesn't solve the LTC problem :/ @antoniojkim @henrytwo @wconstab - relatedly, what is the plan for compatibility for LTC backends across PyTorch versions?
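A rough sketch of that serialize/deserialize hand-off on the Python side. In the scenario described, the "hermetic" side would deserialize via torch::jit::load in C++; the round trip below is Python-only, purely to illustrate the serialized format acting as the stable interface:

```python
# Serialize the TorchScript module with the user's installed PyTorch, then
# hand the bytes across the boundary; a statically linked libtorch would
# deserialize them on the C++ side (torch::jit::load). Shown here as a
# Python-only round trip for illustration.
import io
import torch

scripted = torch.jit.script(torch.nn.Linear(4, 4))

buf = io.BytesIO()
torch.jit.save(scripted, buf)          # produced by the user's PyTorch
serialized_bytes = buf.getvalue()

# Stand-in for the "hermetic" side: the bytes are a self-contained artifact.
reloaded = torch.jit.load(io.BytesIO(serialized_bytes))
```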
Isn't there a way to set a hard dependency that will install the correct version? That could be set to the nightly builds for snapshots and to the release wheels if Torch-MLIR decides to cut a release every time PyTorch does (a sketch of such a pin follows after this comment).
What LTC problem are you referring to here?
Unless I'm mistaken, once LTC becomes "stable" the interface won't change much (if at all) afterwards. So, in terms of compatibility, it should be backwards compatible from the first "stable" release. When this first "stable" release happens though is something I'm not privy to and is maybe something that @wconstab can comment on.
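A minimal sketch of what a hard version pin could look like in torch-mlir's packaging (the version strings are placeholders; for nightly snapshots the user would still have to point pip at the PyTorch nightly index, since a pin alone does not add an index):

```python
# Hypothetical setup.py fragment: pin the exact torch release this torch-mlir
# build was made against. Version numbers are placeholders. Nightly pins would
# additionally require the user to pass --extra-index-url for the PyTorch
# nightly index, as a dependency specifier cannot supply an index by itself.
from setuptools import setup

setup(
    name="torch-mlir",
    version="0.0.1",                      # placeholder
    install_requires=[
        "torch==1.12.0",                  # exact released torch this wheel matches
    ],
)
```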
I think I see where you're going with that, but I suspect we'd still have to bend pybind11 backwards to make things work. While I'm definitely not a pybind expert, my understanding is that even simple things like passing a string from one python function to another might break, and we wouldn't get a chance to munge/serialize/protect it---some of that is inside of pybind. By the time we see it, it'd be too late. It's definitely a bit more extreme path to take.
Can you elaborate a bit on that? If we can compile with a more permissive tag but we can't actually run against those due to PyTorch linking, then do we gain much? I think I'm probably just misreading you or jumping to conclusions.
Is this just the 'backend api' interface or are you referring to any other interfaces/IR? We have not formalized a plan for transitioning LTC through 'beta' and 'stable' milestones, although we left it as 'prototype' intentionally in 1.12 as we expect to make any changes needed for initial backends during this time as they ramp up. PTXLA is going through a migration to LTC and we have made several small changes to accommodate. I don't anticipate major changes.
It would be any C++ ABI that is depended on -- this could include struct layout of any data structure passed to backends (e.g. IR data structures)
Ouch. Yeah, good point!
That's exactly what I'm getting at -- I don't think that this PR in isolation makes sense without a larger strategy for how to solve that problem. I asked on dev-discuss to get some feedback: https://dev-discuss.pytorch.org/t/productionizing-native-torch-extensions/668
@silvasean do you think that in the near term we can get by with a combination of
and target a 'stable' API further down the road? or is this more urgent?
That makes sense to me -- is there a way for @powderluv -- wdyt? Sounds like a lot of your customers want to work off nightlies though, so I'm not sure that this will help make things much better for them.
Well, I'm far from a pip/packaging expert. I guess the answer is no, but my suggestion might be to advertise the latest working command in the format of
Perhaps you could maintain that 'one-liner' on your docs/landing page (not ideal, I know).
I think @wconstab's suggestion to track just released versions of torch is perfectly fine. We do have customers on the nightly / dev train but that is ok -- we will deal with that similarly to developers and keep them on "released" torch versions. The main reason we are trying to target the static build of libtorch for now is that torch-mlir becomes hermetic and the packaging becomes easier, since we can now release pypi nightly packages without stepping on the pytorch nightly we installed (weak linking causes issues like #853).
Branch force-pushed from de3d575 to a2f834d.
@rdadolf @silvasean this is ready to review. It builds at parity and tests pass on Linux and macOS. The follow-on to this will enable source builds of libtorch for macOS, and then we can land the custom op patches.
This moves torch-mlir to link against libtorch on macOS and Linux. TESTS: Tests pass. Tested release builds on Linux and macOS.
Let's give this a shot. I'm still a little worried about having a download step as part of the build -- let's see if it causes problems in practice.
I'll be honest: there are some parts of this approach that I'm not sure I like (I don't really buy the 'hermetic' argument since we break with the wrong external pytorch regardless of how we're built; the build process is getting even more non-standard and complex; and the auto-checks are kludgy---and I still can't come up with a better one ;) ). But if this is the way to unblock Mac builds, and Mac builds are supported and blocking other stuff, then I'd rather see things unblocked and moving. You've already put a huge amount of work into finding a way that works, so if we can get by with this, then we can always revisit the mechanisms behind it later.
Yes, it's a less-than-ideal workflow given what we are dealt from upstream. We will continue trying to land upstream patches so this becomes cleaner / leaner. Next I will enable source builds for platforms that don't have prebuilts, and then static linking, with which we should be able to enable custom ops and also have an auditwheel-passable pypi package.
@rdadolf I agree.
Move the Linux jit_ir tracer and eager mode to libtorch.a so we can ship to pypi with a hermetic package. Remove any python-level probing. After a lot of trial and error it strikes a good balance for a working static library; in the future we can cull it further.