
Introduce the Arm(R) Ethos(TM)-U Cascading Scheduler #37

Merged
merged 3 commits into apache:main on Jan 25, 2022

Conversation

mbaret (Contributor) commented Sep 27, 2021

No description provided.

mbaret (Contributor, Author) commented Sep 27, 2021

junrushao (Member) commented:

Thanks for the RFC!

  • The TE scheduling part (rolling-buffer) looks good to me.
  • Grouping operators together might be challenging in Relay fusion and lowering, and would love to hear more thoughts from you
  • We have some basic affine analysis infra (iter-affine-map) in TVM, and it would be great if the infra could be reused and improved upon the usecases in your RFC

areusch (Contributor) left a comment:

@mbaret thanks for the detailed RFC! i think my main questions here are whether or not this could be implemented in roughly two pieces similar to USMP:

  1. a piece which could be reused if we e.g. wanted to attempt to apply this to non-Ethos-U kernels
  2. the Ethos-U-specific logic.

I understand it may not make sense to integrate with AutoTVM--it might make more sense to wait til TensorIR has fully landed. but just curious to understand in planning terms what the lift might be.

@@ -41,51 +41,17 @@ Deciding on exactly which operators should be cascaded and with what striping pa

The key piece of information to calculate in order to characterize a cascade is how the stripe size changes throughout. This is a function of the data dependency between an operator's inputs and outputs. For many operators that we're interested in, an affine transform matrix can be used to represent this dependency if we represent the input and output stripe sizes as vectors. Affine transforms typically consider 'augmented' matrices and vectors (https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which allow for the representation of constant changes. Concretely, we define the transform matrix M as being the matrix for which the following holds:

$$stripe_{in} = {M} \cdot {stripe_{out}}$$

Let's briefly consider how to derive such a transform matrix for a 3x3 unstrided, undilated and unpadded NHWC convolution. Immediately, the '3x3' kernel tells us something important: a single element in the output depends on a 3x3 region in the height/width of the input. If we were instead to consider a 2x2 region of the output in the height/width dimensions, we'd then need a 4x4 region of the input. So in general, the rule is that we need 2 more elements in height and width when calculating the dependencies of an output stripe; more generally, this number is kernel_size - 1 in each axis. Now to consider the channels: in a convolution, no matter how many output elements you are computing, you'll always need every input channel. This is because the input channel axis is a reduction axis in a convolution; in a sense, it isn't 'reflected' in the output. Combining these two observations, we arrive at the following transform matrix:
A Contributor commented inline:

Now to consider the channels, in a convolution no matter how many output elements you are computing you'll always need every input channel.

just curious: what about depthwise convolutions?

mbaret (Contributor, Author) replied:

In a depthwise you're right that this will be different, but depthwise will also use a different transform matrix. So here we're just working out the transform matrix for a standard 2D convolution.
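The discussion above (standard convolution reduces over input channels; depthwise does not) can be made concrete with a small sketch. This is illustrative only, assuming stripes are represented as augmented (height, width, channels, 1) column vectors and a hypothetical input-channel count; the names and values are not taken from the Ethos-U cascader itself:

```python
import numpy as np

IFM_CHANNELS = 32  # hypothetical input-channel count, for illustration

# 3x3 unstrided, undilated, unpadded NHWC conv2d: an (h, w, c) output
# stripe needs (h + 2, w + 2, all input channels) from the input, since
# the channel axis is a reduction axis.
M_conv2d = np.array([
    [1, 0, 0, 2],             # h_in = h_out + (kernel_h - 1)
    [0, 1, 0, 2],             # w_in = w_out + (kernel_w - 1)
    [0, 0, 0, IFM_CHANNELS],  # c_in = every input channel, regardless of c_out
    [0, 0, 0, 1],             # augmentation row
])

# 3x3 depthwise conv2d: channels are not reduced, so the channel axis
# passes through unchanged — a different transform matrix, as noted above.
M_depthwise = np.array([
    [1, 0, 0, 2],
    [0, 1, 0, 2],
    [0, 0, 1, 0],  # c_in = c_out
    [0, 0, 0, 1],
])

stripe_out = np.array([4, 4, 16, 1])  # a 4x4x16 output stripe, augmented
print(M_conv2d @ stripe_out)     # [ 6  6 32  1]
print(M_depthwise @ stripe_out)  # [ 6  6 16  1]
```

The same stripe_in = M · stripe_out multiplication then propagates stripe sizes backwards through a cascade, one operator at a time.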

mbaret (Contributor, Author) commented Oct 6, 2021

Thanks for taking a look @junrushao1994!

Grouping operators together might be challenging in Relay fusion and lowering, and would love to hear more thoughts from you

For Ethos-U, we have a different compilation flow to 'standard' TVM. In particular, those operators which are being offloaded onto Ethos-U don't go through FuseOps but instead are lowered into TE as one large graph. This is what allows us to avoid the problems of Relay level fusion. I agree that were we to want to generalize this technique further, we would need a solution for this.

We have some basic affine analysis infra (iter-affine-map) in TVM, and it would be great if the infra could be reused and improved upon the usecases in your RFC

I hadn't actually noticed the affine analysis infra before, so thanks for pointing me towards this. From my initial look over it, I suspect there may be elements we could reuse, but the representations are unfamiliar to me, as is how to work with them from outside TIR. Perhaps this could be better discussed on a draft PR to demonstrate specifically what we do with the affine transforms and whether the existing infra would be able to achieve the same effect.

junrushao (Member) commented:

Thank you for your response @mbaret 😄

From my initial look over it, I suspect there may be elements we could reuse, but the representations are unfamiliar to me, as is how to work with them from outside TIR

For sure, we are happy to work together and potentially improve the affine analysis infrastructure. Note that the iter-affine-map is located under tvm/arith/iter_affine_map.h, so it's not deeply coupled with TIR, but only reuses several TIR expressions.

CC'ed the author of affine analysis in TVM: @tqchen @spectrometerHBH @Hzfengsy

mbaret (Contributor, Author) commented Nov 2, 2021

@mbaret thanks for the detailed RFC! i think my main questions here are whether or not this could be implemented in roughly two pieces similar to USMP:

1. a piece which could be reused if we e.g. wanted to attempt to apply this to non-Ethos-U kernels

2. the Ethos-U-specific logic.

I understand it may not make sense to integrate with AutoTVM--it might make more sense to wait til TensorIR has fully landed. but just curious to understand in planning terms what the lift might be.

Apologies for the late reply @areusch. Due to some of the uncertainties around implementing this generically (Relay fusion/conflict with AutoTVM/deriving matrices), I think it's more pragmatic to have everything initially scoped just to Ethos-U. But within that implementation, all the truly Ethos-U specific logic is quite self-contained (pretty much entirely in the performance heuristics and matrix definitions). Therefore, should this need reusing in future we could probably lift it out of the Ethos-U namespace. I think this would be very doable as the cascader itself is quite well decoupled from the rest of the Ethos-U compiler, exposing only a TE Schedule -> TE Schedule interface. With some effort, although hopefully not too much, that interface could likely be tweaked to be TensorIR -> TensorIR.

@mbaret mbaret changed the title Introduce the Arm(R) Ethos(TM)-U Cascading Planner Introduce the Arm(R) Ethos(TM)-U Cascading Scheduler Nov 4, 2021
mbs-octoml (Contributor) left a comment:

Hi Mike, thx for such a nicely written RFC, you're upping the bar.

I think we all agreed in our last live conversation:

  • Though the S-TIR machinery is now in main and could be a viable medium instead of TE, that transition will be left to future work.
  • The transition to TE should hopefully not require duplicating the ScheduleBuilder visitor -- I think someone else at ARM was getting that in place. Sorry I can't keep up.
  • I'm hoping we can make constants first class so there's no need for any side inputs. Eg just look for them bound as globals in the IRModule. But that's a refactor which will probably come after this work.
  • There's an ongoing tension between working in Relay+TIR vs TIR-only, but the work-exclusively-in-TE/TIR approach is consistent with both the current AOT flow and the USMP analysis work.

If I got that right, perhaps capture it in the 'alternatives considered' section or someplace.

I can't say anything about possible reuse of the affine x-form machinery when doing the abstract interpretation for the memory footprint. Sounds like it might be similar to the S-TIR situation: maybe not now but can go back later.

So LGTM from me.

areusch (Contributor) left a comment:

thanks @mbaret we should merge this now :)

@areusch areusch merged commit f9fa824 into apache:main Jan 25, 2022
tqchen (Member) commented Jan 27, 2022

To follow up on this RFC, @mbaret, it would be great to add a rolling-buffer or related primitive in TensorIR, so we can smoothly transition the solution as we start to migrate to the TIR schedule.

mbaret (Contributor, Author) commented Feb 1, 2022

We don't have any plans in the short term to look at extending the rolling buffer primitive into TensorIR, but I'd be happy to provide support/assistance to anyone interested in implementing it. I would hope the effort would not be too great as it's already implemented as a TIR pass.
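For anyone considering that follow-up, the core idea behind rolling buffers is compact enough to sketch. This is an illustrative Python sketch of the addressing scheme, not TVM's actual TIR pass: when cascaded operators are computed stripe-by-stripe, the intermediate tensor only needs to keep as many rows live as the consumer's kernel window, and rows can be addressed modulo that window.

```python
# Illustrative sketch of rolling-buffer addressing (not the TVM TIR pass).
# A hypothetical consumer with a 3-row kernel window only ever reads the
# last 3 produced rows, so the intermediate buffer can recycle 3 slots.

WINDOW = 3  # rows kept live for an assumed 3x3 consumer kernel

def rolling_slot(logical_row: int) -> int:
    """Physical slot in the rolling buffer for a logical tensor row."""
    return logical_row % WINDOW

# Producing 8 logical rows touches only WINDOW physical slots:
slots = [rolling_slot(r) for r in range(8)]
print(slots)  # [0, 1, 2, 0, 1, 2, 0, 1]
```

The memory saving is what makes cascading attractive on Ethos-U-class systems: a full intermediate tensor shrinks to a window-height circular buffer.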
