Fix fromIntArrayRef call #1479

Merged
merged 3 commits into llvm:main on Oct 11, 2022

Conversation

antoniojkim
Collaborator

This commit in upstream PyTorch, pytorch/pytorch@e8b0bea, broke the fromIntArrayRef call.

Fixing the usages here and re-enabling LTC.

CC: @GlebKazantaev @vaibhavc-cerebras
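
For readers outside the LTC backend, the sketch below is a minimal, hypothetical illustration of the kind of SymInt-related signature churn being fixed here: a helper that previously took plain int64_t dimensions now expects symbolic integers, so existing call sites have to wrap their values before the code compiles again. The names SymIntLike, fromSymIntVector, and adaptCallSite are simplified stand-ins, not the actual c10 / torch::lazy API touched by this PR.

```cpp
// Hypothetical sketch only: SymIntLike, fromSymIntVector, and adaptCallSite
// are simplified stand-ins, not the real c10 / torch::lazy declarations.
#include <cstdint>
#include <vector>

// Stand-in for a symbolic integer type (loosely modeled on the idea behind
// c10::SymInt): it wraps a concrete value that may later become symbolic.
struct SymIntLike {
  explicit SymIntLike(int64_t v) : value(v) {}
  int64_t value;
};

// New-style helper: after an upstream-style change, only symbolic ints are
// accepted here, so callers holding plain int64_t values no longer compile.
std::vector<SymIntLike> fromSymIntVector(const std::vector<SymIntLike>& dims) {
  return dims;
}

// Updated call site: explicitly wrap each concrete dimension before calling
// the new helper. This mirrors the kind of mechanical fix the PR describes.
std::vector<SymIntLike> adaptCallSite(const std::vector<int64_t>& shape) {
  std::vector<SymIntLike> symShape;
  symShape.reserve(shape.size());
  for (int64_t d : shape)
    symShape.emplace_back(d);  // wrap each concrete dim as a symbolic int
  return fromSymIntVector(symShape);
}

int main() {
  adaptCallSite({2, 3, 4});
  return 0;
}
```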

antoniojkim self-assigned this on Oct 11, 2022
@silvasean
Contributor

Okay, great. We discussed in the developer hour that perhaps we should keep LTC off except in a nightly build until some of the SymInt churn quiets down. What do you think, @antoniojkim? It seems like there is breakage here almost weekly these days.

@antoniojkim
Collaborator Author

> Okay, great. We discussed in the developer hour that perhaps we should keep LTC off except in a nightly build until some of the SymInt churn quiets down. What do you think, @antoniojkim? It seems like there is breakage here almost weekly these days.

@silvasean I'm sorry I missed the discussion. I don't quite agree with keeping it off. I think catching and fixing these issues as they come up is more in line with how things are done in PyTorch. Of course, it's not really my place to assert this, but I don't think customers should be working off Torch-MLIR head without realizing that there will inevitably be churn, especially since we have a dependency on a package that also goes through similar churn every day.

One possible way to alleviate this is to have more eyes and hands available to investigate and fix these failures so that we can get them resolved more quickly (CC: @GlebKazantaev @vaibhavc-cerebras)

Another possibility is to consider doing regular (possibly weekly?) releases based off stable(-ish) PyTorch commits. I think it's a reasonable tradeoff for customers to be doing weekly updates in exchange for a bit more stability, especially considering that Torch-MLIR relies on PyTorch, whose HEAD is not guaranteed to be stable in the first place.

Thoughts?

@silvasean
Contributor

I think the main issue has been responsiveness in fixing the issues as they arise. If we can get a 1-2 business day response time for fixing the LTC issues, then it won't be a problem.

I would be okay with a weekly cadence too (which the Cerebras folks could probably own, since LTC seems to be the major source of breakage), but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

@antoniojkim
Collaborator Author

> but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

I'm not proposing a weekly PyTorch update. I'm in favour of keeping the continuous PyTorch rolling. What I'm proposing is a weekly "release" wheel that can be distributed much like the nightly PyTorch wheel. Unless my understanding of how Torch-MLIR is distributed to customers is incorrect?

@silvasean
Contributor

> but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

> I'm not proposing a weekly PyTorch update. I'm in favour of keeping the continuous PyTorch rolling. What I'm proposing is a weekly "release" wheel that can be distributed much like the nightly PyTorch wheel. Unless my understanding of how Torch-MLIR is distributed to customers is incorrect?

I don't understand what the relationship is between the CI and the releases in the context of this thread. Here we are only talking about the CI.

@antoniojkim
Collaborator Author

> I don't understand what the relationship is between the CI and the releases in the context of this thread. Here we are only talking about the CI.

Wasn't the issue that this was blocking customers' ability to update to the latest Torch-MLIR features?

If it's just the CI that we're talking about, then I think Cerebras can commit to adding more people to actively investigate these issues as they come up, so that we can get them resolved more quickly.

antoniojkim merged commit 3e08f5a into llvm:main on Oct 11, 2022
@powderluv
Collaborator

I think if we have a reasonable commitment to fix LTC issues (1-2 days), it is ok to leave it on, and if it takes longer we can disable it temporarily. (There is nothing wrong with disabling / enabling a feature to unblock the tree.)

We moved from living on the latest PyTorch main to a pinned release, so it gives us at least a day of buffering.

> but I don't think customers should be working off Torch-MLIR head without realizing that there will inevitably be churn, especially since we have a dependency on a package that also goes through similar churn every day.

Yes, but the issue is not the churn but the staleness of PyTorch. They are ok to live with the churn and have been living with it for a while (including turning features on / off), but when torch-mlir gets pinned to an older PyTorch, they also need to find and install the corresponding CUDA-enabled PyTorch version and wait for any upstream fixes.

> One possible way to alleviate this is to have more eyes and hands available to investigate and fix these failures so that we can get them resolved more quickly (CC: @GlebKazantaev @vaibhavc-cerebras)

This is welcome and we can help. It is ok to disable it temporarily if the fix can't be done quickly.

> Another possibility is to consider doing regular (possibly weekly?) releases based off stable(-ish) PyTorch commits. I think it's a reasonable tradeoff for customers to be doing weekly updates in exchange for a bit more stability, especially considering that Torch-MLIR relies on PyTorch, whose HEAD is not guaranteed to be stable in the first place.

This just delays it further, and I would not take this route. We went from living on top of PyTorch main to a daily pin (which is currently stretching longer), and delaying it further doesn't remove the need to have eyes on the feature and fix it. If we delay this, then we will have our downstream customers hacking on their local, older PyTorch, and we can't tell them to "upstream it first". Right now we push for fixes in upstream PyTorch and torch-mlir, and that has served us well. This is an important reason why torch-mlir takes on the burden of keeping up with main.

How about we try this:

Leave LTC on for now if we have a commitment to watch for LTC breakages in the nightly (this used to be with every CI run in the past) and fix them within 1-2 days max, or we temporarily disable LTC at the end of those 2 days.

Alternatively, we set up and publish a release build with LTC ON, and we fix LTC breakages at our convenience without slowing down the whole train.

dellis23 pushed a commit that referenced this pull request on Oct 25, 2022
* Fix fromSymint call

* Update PyTorch requirement

* Re-enable LTC