Introduce the Arm(R) Ethos(TM)-U Cascading Scheduler #37
Conversation
Thanks for the RFC!
@mbaret thanks for the detailed RFC! I think my main question here is whether this could be implemented in roughly two pieces, similar to USMP:
- a piece which could be reused if we e.g. wanted to attempt to apply this to non-Ethos-U kernels
- the Ethos-U-specific logic.
I understand it may not make sense to integrate with AutoTVM; it might make more sense to wait until TensorIR has fully landed. But I'm just curious to understand, in planning terms, what the lift might be.
@@ -41,51 +41,17 @@ Deciding on exactly which operators should be cascaded and with what striping pa
The key piece of information to calculate in order to characterize a cascade is how the stripe size changes throughout. This is a function of the data dependency between an operator's inputs and outputs. For many operators that we're interested in, an affine transform matrix can be used to represent this dependency if we represent the input and output stripe sizes as vectors. Affine transforms typically consider 'augmented' matrices and vectors (https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix), which allow for the representation of constant changes. Concretely, we define the transform matrix M as being the matrix for which the following holds:
$$stripe_{in} = M \cdot stripe_{out}$$
![meta-schedule-workflow](../resources/cascading-formula-1.png)
Let's briefly consider how to derive such a transform matrix for a 3x3 unstrided, undilated and unpadded NHWC convolution. Immediately, the '3x3' kernel tells us something important: a single element in the output depends on a 3x3 region in the height/width of the input. If we were instead to consider a 2x2 region of the output in the height/width dimensions, we'd then need a 4x4 region of the input. So in general, the rule is that we need 2 more elements in height and width when calculating the dependencies of an output stripe; it can be shown that, more generally, this number is kernel_size-1 in each axis. Now consider the channels: in a convolution, no matter how many output elements you are computing, you'll always need every input channel. This is because the input channel axis is a reduction axis in a convolution; in a sense, it isn't 'reflected' in the output. Combining these two observations, we arrive at the following transform matrix:
> Now consider the channels: in a convolution, no matter how many output elements you are computing, you'll always need every input channel.

Just curious: what about depthwise convolutions?
You're right that this will be different for a depthwise convolution, but a depthwise will also use a different transform matrix. So here we're just working out the transform matrix for a standard 2D convolution.
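To make the derivation above concrete, here is a minimal NumPy sketch (illustrative only, not the RFC's actual data structures): it assumes a simplified (height, width, channels) stripe representation with 16 input channels, whereas the RFC works with full NHWC stripes.

```python
import numpy as np

# Illustrative augmented transform matrix for a 3x3, stride-1, undilated,
# unpadded convolution, assuming a (height, width, channels) stripe layout
# and 16 input channels. The final row/column is the affine 'augmentation',
# which expresses the constant +2 (kernel_size - 1) growth in height and
# width and the constant need for all 16 input channels.
M = np.array([
    [1, 0, 0, 2],   # input_h = output_h + 2
    [0, 1, 0, 2],   # input_w = output_w + 2
    [0, 0, 0, 16],  # input_c = 16, independent of the output channel count
    [0, 0, 0, 1],   # augmentation row
])

# A 4x4x8 output stripe (augmented with a trailing 1)...
stripe_out = np.array([4, 4, 8, 1])

# ...requires a 6x6x16 input stripe.
stripe_in = M @ stripe_out
print(stripe_in)  # [ 6  6 16  1]
```

For a depthwise convolution, the channel row would instead pass the output channels through (something like [0, 0, 1, 0]), which is why it needs its own transform matrix, as noted above.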
Thanks for taking a look @junrushao1994!
For Ethos-U, we have a different compilation flow to 'standard' TVM. In particular, those operators which are being offloaded onto Ethos-U don't go through FuseOps but instead are lowered into TE as one large graph. This is what allows us to avoid the problems of Relay level fusion. I agree that were we to want to generalize this technique further, we would need a solution for this.
I hadn't actually noticed the affine analysis infra before, so thanks for pointing me towards it. From my initial look over it, I suspect there may be elements we could reuse, but the representations are unfamiliar to me, as is how to work with them from outside TIR. Perhaps this could be better discussed on a draft PR to demonstrate specifically what we do with the affine transforms and whether the existing infra would be able to achieve the same effect.
Thank you for your response @mbaret 😄
For sure, we are happy to work together and potentially improve the affine analysis infrastructure; note that the iter-affine-map is located under the arith namespace. CC'ing the authors of the affine analysis in TVM: @tqchen @spectrometerHBH @Hzfengsy
Apologies for the late reply @areusch. Due to some of the uncertainties around implementing this generically (Relay fusion/conflict with AutoTVM/deriving matrices), I think it's more pragmatic to have everything initially scoped just to Ethos-U. But within that implementation, all the truly Ethos-U-specific logic is quite self-contained (pretty much entirely in the performance heuristics and matrix definitions). Therefore, should this need reusing in future, we could probably lift it out of the Ethos-U namespace. I think this would be very doable, as the cascader itself is quite well decoupled from the rest of the Ethos-U compiler, exposing only a TE Schedule -> TE Schedule interface. With some effort, although hopefully not too much, that interface could likely be tweaked to be TensorIR -> TensorIR.
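As a rough illustration of the decoupling described above, the entry point could look something like the hypothetical sketch below (the function, class and option names are illustrative, not the actual cascader API):

```python
from typing import NamedTuple

from tvm import te


class CascaderOptions(NamedTuple):
    """Illustrative knobs a cascader entry point might expose."""

    sram_limit_bytes: int = 384 * 1024


def cascade(sch: te.Schedule, options: CascaderOptions) -> te.Schedule:
    """Hypothetical TE Schedule -> TE Schedule interface: take the unscheduled
    TE schedule for an offloaded Ethos-U subgraph and return it with cascading
    and striping decisions applied."""
    # Analysis (transform matrices, performance heuristics) and the resulting
    # scheduling directives would be applied to `sch` here.
    return sch
```

Keeping the interface to schedule-in/schedule-out is what would make a later TensorIR -> TensorIR variant mostly a matter of swapping the schedule type rather than restructuring the cascader.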
Hi Mike, thanks for such a nicely written RFC; you're upping the bar.
I think we all agreed in our last live conversation:
- Though the S-TIR machinery is now in main and could be a viable medium instead of TE, that transition will be left to future work.
- The transition to TE should hopefully not require duplicating the ScheduleBuilder visitor; I think someone else at Arm was getting that in place. Sorry I can't keep up.
- I'm hoping we can make constants first-class so there's no need for any side inputs, e.g. just look for them bound as globals in the IRModule. But that's a refactor which will probably come after this work.
- There's an ongoing tension between working in Relay+TIR vs TIR-only, but the work-exclusively-in-TE/TIR approach is consistent with both the current AOT flow and the USMP analysis work.
If I got that right, perhaps capture it in the 'alternatives considered' section or someplace similar.
I can't say anything about possible reuse of the affine transform machinery when doing the abstract interpretation for the memory footprint. It sounds like it might be similar to the S-TIR situation: maybe not now, but we can go back to it later.
So LGTM from me.
Thanks @mbaret, we should merge this now :)
To follow up on this RFC, @mbaret it would be great to start a follow-up with a rolling-buffer or related primitive in TensorIR, so we can smoothly transition the solution as we start to migrate to TIR schedules.
We don't have any plans in the short term to look at extending the rolling buffer primitive into TensorIR, but I'd be happy to provide support/assistance to anyone interested in implementing it. I would hope the effort would not be too great, as it's already implemented as a TIR pass.