Fix fromIntArrayRef call #1479
Conversation
Okay, great. We discussed in the developer hour that perhaps we should keep LTC off except in a nightly build until some of the SymInt churn quiets down. What do you think @antoniojkim ? It seems like there is breakage here almost weekly these days.
@silvasean I'm sorry I missed the discussion. I don't quite agree with keeping it off. I think catching and fixing these issues as they come up is more in line with how things are done in PyTorch. Of course, it's not really my place to assert this, but I don't think customers should be working off Torch-MLIR HEAD without realizing that there will inevitably be churn, especially since we depend on a package that goes through similar churn every day.

One possible way to alleviate this is to have more eyes and hands available to investigate and fix these failures so that we can get them resolved more quickly (CC: @GlebKazantaev @vaibhavc-cerebras). Another possibility is to consider doing regular (possibly weekly?) releases based off stable(-ish) PyTorch commits. I think it's a reasonable tradeoff for customers to do weekly updates in exchange for a bit more stability, especially considering that Torch-MLIR relies on PyTorch, whose HEAD is not guaranteed to be stable in the first place. Thoughts?
I think the main issue has been the responsiveness in fixing the issues as they arise. If we can get a 1-2 business day response time for fixing LTC issues, then it won't be a problem. I would be okay with a weekly cadence too (which the Cerebras folks could probably own, since LTC seems to be the major source of breakage), but we recently set up the continuous PyTorch rolling. @powderluv any thoughts on switching to a weekly PyTorch update?
I'm not proposing a weekly PyTorch update. I'm in favour of keeping the continuous PyTorch rolling. What I'm proposing is a weekly "release" wheel that can be distributed much like the nightly PyTorch wheel. Unless my understanding of how Torch-MLIR is distributed to customers is incorrect?
I don't understand the relationship between the CI and the releases in the context of this thread. Here we are only talking about the CI.
Wasn't the issue that this was blocking customers' ability to update to the latest Torch-MLIR features? If it's just the CI that we're talking about, then I think Cerebras can commit to adding more people to actively investigate these issues as they come up, so that we can get them resolved with more haste.
I think if we have a reasonable commitment to fix LTC issues (1-2 days), it is ok to leave it on, and if it takes longer we can disable it temporarily. (There is nothing wrong with disabling/enabling a feature to unblock the tree.) We moved from living on the latest PyTorch main to a pinned release, so it gives us at least a day of buffering.
Yes, but it is not the churn, it is the staleness of PyTorch. They are ok with the churn and have been living with it for a while (including turning features on/off), but when torch-mlir gets pinned to an older PyTorch, they need to find and install the corresponding CUDA-enabled PyTorch version as well, and wait for any upstream fixes.
This is welcome, and we can help. It is ok to disable LTC temporarily if the fix can't be done quickly.
This just delays it further, and I would not take this route. We went from living on top of PyTorch main to a daily update (which is currently stretching longer), and delaying it further doesn't remove the need to have eyes on the feature and to fix it. If we delay this, then we will have our downstream customers hacking on their local, old PyTorch, and we can't tell them to "upstream it first". Right now we push for fixes in upstream PyTorch and torch-mlir, and it has served us well. This is an important reason why torch-mlir takes on the burden of keeping up with main.

How about we try this: leave LTC on for now if we have a commitment to watch for LTC breakages in the nightly (this used to be with every CI run in the past) and fix them within 1-2 days max, or we temporarily disable LTC at the end of those 2 days. Alternatively --
This commit in upstream PyTorch: pytorch/pytorch@e8b0bea broke the fromIntArrayRef call. Fixing the usages here and re-enabling LTC.
CC: @GlebKazantaev @vaibhavc-cerebras
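
For reference, a minimal sketch of the kind of call-site update this sort of breakage usually requires. The wrapper shapeToSymInts below is a hypothetical example for illustration, and the assumption that the fix amounts to switching to the explicit c10::fromIntArrayRefSlow helper is mine; this is not necessarily the exact code changed in this PR.

```cpp
// Illustrative sketch only; the exact files and helpers touched by this PR
// may differ. Shows an LTC-style call site that converts a concrete shape
// (IntArrayRef) into the SymInt form current PyTorch expects.
#include <c10/core/SymIntArrayRef.h>
#include <c10/util/ArrayRef.h>

// Hypothetical helper: convert a concrete shape to SymInts for ops that now
// take c10::SymIntArrayRef.
c10::SymIntArrayRef shapeToSymInts(c10::IntArrayRef sizes) {
  // Old spelling (assumed broken by pytorch/pytorch@e8b0bea):
  //   return c10::fromIntArrayRef(sizes);
  // New spelling: the explicit helper checks each value is representable
  // before reinterpreting the buffer as SymInt.
  return c10::fromIntArrayRefSlow(sizes);
}
```

The relevant point for the discussion above is that these tend to be small, mechanical fixes, but they only land quickly if someone is watching the LTC build when PyTorch HEAD moves.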