Fix fromIntArrayRef call #1479

Merged
merged 3 commits into llvm:main on Oct 11, 2022

Conversation

antoniojkim
Collaborator

This commit in upstream PyTorch, pytorch/pytorch@e8b0bea, broke the fromIntArrayRef call.

Fixing the usages here and re-enabling LTC.

CC: @GlebKazantaev @vaibhavc-cerebras
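
For readers outside the LTC backend, the sketch below is a minimal, hypothetical illustration of the kind of SymInt-related signature churn being fixed here: a helper that previously took plain int64_t dimensions now expects symbolic integers, so existing call sites have to wrap their values before the code compiles again. The names SymIntLike, fromSymIntVector, and adaptCallSite are simplified stand-ins, not the actual c10 / torch::lazy API touched by this PR.

```cpp
// Hypothetical sketch only: SymIntLike, fromSymIntVector, and adaptCallSite
// are simplified stand-ins, not the real c10 / torch::lazy declarations.
#include <cstdint>
#include <vector>

// Stand-in for a symbolic integer type (loosely modeled on the idea behind
// c10::SymInt): it wraps a concrete value that may later become symbolic.
struct SymIntLike {
  explicit SymIntLike(int64_t v) : value(v) {}
  int64_t value;
};

// New-style helper: after an upstream-style change, only symbolic ints are
// accepted here, so callers holding plain int64_t values no longer compile.
std::vector<SymIntLike> fromSymIntVector(const std::vector<SymIntLike>& dims) {
  return dims;
}

// Updated call site: explicitly wrap each concrete dimension before calling
// the new helper. This mirrors the kind of mechanical fix the PR describes.
std::vector<SymIntLike> adaptCallSite(const std::vector<int64_t>& shape) {
  std::vector<SymIntLike> symShape;
  symShape.reserve(shape.size());
  for (int64_t d : shape)
    symShape.emplace_back(d);  // wrap each concrete dim as a symbolic int
  return fromSymIntVector(symShape);
}

int main() {
  adaptCallSite({2, 3, 4});
  return 0;
}
```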

antoniojkim self-assigned this on Oct 11, 2022
@silvasean
Contributor

Okay, great. We discussed in the developer hour that perhaps we should keep LTC off except in a nightly build until some of the SymInt churn quiets down. What do you think, @antoniojkim? It seems like there is breakage here almost weekly these days.

@antoniojkim
Collaborator Author

> Okay, great. We discussed in the developer hour that perhaps we should keep LTC off except in a nightly build until some of the SymInt churn quiets down. What do you think, @antoniojkim? It seems like there is breakage here almost weekly these days.

@silvasean I'm sorry I missed the discussion. I don't quite agree with keeping it off. I think catching and fixing these issues as they come up is more in line with how things are done in PyTorch. Of course, it's not really my place to assert this, but I don't think customers should be working off Torch-MLIR head without realizing that there will inevitably be churn, especially since we have a dependency on a package that also goes through similar churn every day.

One possible way to alleviate this is to have more eyes and hands available to investigate and fix these failures so that we can get them resolved more quickly (CC: @GlebKazantaev @vaibhavc-cerebras)

Another possibility is to consider doing regular (possibly weekly?) releases based off stable(-ish) PyTorch commits. I think it's a reasonable tradeoff for customers to be doing weekly updates in exchange for a bit more stability, especially considering that Torch-MLIR relies on PyTorch, whose HEAD is not guaranteed to be stable in the first place.

Thoughts?

@silvasean
Contributor

I think the main issue has been responsiveness in fixing the issues as they arise. If we can get a 1-2 business day response time for fixing the LTC issues, then it won't be a problem.

I would be okay with a weekly cadence too (which the Cerebras folks could probably own, since LTC seems to be the major source of breakage), but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

@antoniojkim
Collaborator Author

> but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

I'm not proposing a weekly PyTorch update. I'm in favour of keeping the continuous PyTorch rolling. What I'm proposing is a weekly "release" wheel that can be distributed much like the nightly PyTorch wheel. Unless my understanding of how Torch-MLIR is distributed to customers is incorrect?

@silvasean
Contributor

> but we recently set up the continuous PyTorch rolling. @powderluv, any thoughts on switching to a weekly PyTorch update?

> I'm not proposing a weekly PyTorch update. I'm in favour of keeping the continuous PyTorch rolling. What I'm proposing is a weekly "release" wheel that can be distributed much like the nightly PyTorch wheel. Unless my understanding of how Torch-MLIR is distributed to customers is incorrect?

I don't understand what the relationship is between the CI and the releases in the context of this thread. Here we are only talking about the CI.

@antoniojkim
Collaborator Author

> I don't understand what the relationship is between the CI and the releases in the context of this thread. Here we are only talking about the CI.

Wasn't the issue that this was blocking customers' ability to update to the latest Torch-MLIR features?

If it's just the CI that we're talking about, then I think Cerebras can commit to adding more people to actively investigate these issues as they come up, so that we can get them resolved more quickly.

antoniojkim merged commit 3e08f5a into llvm:main on Oct 11, 2022
@powderluv
Collaborator

I think if we have a reasonable commitment to fix LTC issues (1-2 days), it is ok to leave it on, and if it takes longer we can disable it temporarily. (There is nothing wrong with disabling / enabling a feature to unblock the tree.)

We moved from living on the latest PyTorch main to a pinned release, so it gives us at least a day of buffering.

> but I don't think customers should be working off Torch-MLIR head without realizing that there will inevitably be churn, especially since we have a dependency on a package that also goes through similar churn every day.

Yes, but the issue is not the churn but the staleness of PyTorch. They are ok to live with the churn and have been living with it for a while (including turning features on / off), but when torch-mlir gets pinned to an older PyTorch, they also need to find and install the corresponding CUDA-enabled PyTorch version and wait for any upstream fixes.

> One possible way to alleviate this is to have more eyes and hands available to investigate and fix these failures so that we can get them resolved more quickly (CC: @GlebKazantaev @vaibhavc-cerebras)

This is welcome and we can help. It is ok to disable it temporarily if the fix can't be done quickly.

> Another possibility is to consider doing regular (possibly weekly?) releases based off stable(-ish) PyTorch commits. I think it's a reasonable tradeoff for customers to be doing weekly updates in exchange for a bit more stability, especially considering that Torch-MLIR relies on PyTorch, whose HEAD is not guaranteed to be stable in the first place.

This just delays it further, and I would not take this route. We went from living on top of PyTorch main to a daily pin (which is currently stretching longer), and delaying it further doesn't remove the need to have eyes on the feature and fix it. If we delay this, then we will have our downstream customers hacking on their local, older PyTorch, and we can't tell them to "upstream it first". Right now we push for fixes in upstream PyTorch and torch-mlir, and that has served us well. This is an important reason why torch-mlir takes on the burden of keeping up with main.

How about we try this:

Leave LTC on for now if we have a commitment to watch for LTC breakages in the nightly (this used to be with every CI run in the past) and fix them within 1-2 days max, or we temporarily disable LTC at the end of those 2 days.

Alternatively, we set up and publish a release build with LTC ON, and we fix LTC breakages at our convenience without slowing down the whole train.

dellis23 pushed a commit that referenced this pull request on Oct 25, 2022
* Fix fromSymint call

* Update PyTorch requirement

* Re-enable LTC