Project: Arcilator Vectorization
Organization: FOSSi Foundation
Mentors: Fabian Schuiki, Martin Erhart
Contributor: Mohamed Atef
- Implement a pass to collect initial vectors.
- Implement simple canonicalization/optimization passes.
- Implement an initial cost model.
- implement splitting/merging/shuffling passes using the cost model.
Note: We talk much time in each of these goals, the project is so open-ended.
the following simple operations:
a = b + c
d = e + f
After we run our pass, this should be converted into
[a, d] = [b, e] + [c, f]
- The IR is sensitive to timing not the order of instructions (also a huge advantage).
- We need to find the initial vectors without introducing a cycle in IR.
- This is a chip, so every element communicates with the others somehow.
So in simple terms, we need to group isomorphic operations with no dependency between them.
We started with the arc.state
operation and gave each operation a rank if two operations
have the same rank then
this means they are independent so we put them in the same vector.
We now have a full-working pass that can group the operations. After calculating
operations we have after running FindInitialVector
pass the number of operations is reduced
by ~33%. I don't mean that we are 33% faster as the packing/unpacking and shuffling costs are
very high, but that's an indication of a promising result
- Add Initial SLP support: llvm/circt#7061
- Fix related crashes: llvm/circt#7100
- Modify VectorizeOp to accept any type: llvm/circt#7087
- Add statistics to
FindInitialVectors
pass: llvm/circt#7113
I didn't expect this step would take most of our work, but it turned out to be very important.
As mentioned before packing/unpacking and shuffling costs are very high, we need to find
ways to reduce them. how to do this? we need to pull as many operations as possible into the
VectorizeOp
body.
One should research the IR to get one real-world canonicalization pattern.
We introduced three canonicalizations/optimizations:
-
MergeVectorizeOps: A pass to merge two
arc.vectorize
operations if one is fully consumed
by the other. -
KeepOneVecOp: A pass to keep only one input vector if it is repeated multiple times. In reality
one won't write the same multiple input vector many times, but after merging we can have repeated
patterns. -
Add more cases to the MergeVectorizeOps pass to make it able to shuffle input vectors.
- A merging pass and now we can have more ops in the
arc.vectorize
body. We also reduced the unpacking
cost of the used vector as we removed it and replaced it with the defining op operands
// Before
%L:4 = arc.vectorize(%a, %b, %c), (%n, %p, %r): (i8, i8, i8, i8, i8, i8) -> (i8, i8, i8) {
^bb0(%arg0: i8, %arg1: i8):
%ret = comb.and %arg0, %arg1: i8
arc.vectorize.return %ret: i8
}
%C:4 = arc.vectorize(%L#0, %L#1, %L#2), (%o, %v, %q) : (i8, i8, i8, i8, i8, i8) -> (i8, i8, i8) {
^bb0(%arg0 : i8, %arg1: i8):
%1692 = arc.call @Just_A_Dummy_Func(%arg0, %arg1) : (i8, i8) -> i8
arc.vectorize.return %1692 : i8
}
// After
%C:4 = arc.vectorize(%a, %b, %c), (%n, %p, %r), (%o, %v, %q) : (i8, i8, i8, i8, i8, i8) -> (i8, i8, i8) {
^bb0(%arg0 : i8, %arg1: i8, %arg2: i8):
%and = comb.and %arg0, %arg1: i8
%call = arc.call @Just_A_Dummy_Func(%and, %arg2) : (i8, i8) -> i8
arc.vectorize.return %call : i8
}
- After adding the
KeepOneVecOp
pass we have lower packing and unpacking costs, and removed the
redundant input vectors
Example:
// Before
%R:4 = arc.vectorize(%b, %e), (%b, %e) : (i8, i8, i8, i8) -> (i8, i8) {
^bb0(%arg0: i8, %arg1: i8):
%ret = comb.mul %arg0, %arg1: i8
arc.vectorize.return %ret: i8
}
// After
%R:4 = arc.vectorize(%b, %e) : (i8, i8) -> (i8, i8) {
^bb0(%arg0: i8):
%ret = comb.mul %arg0, %arg0: i8
arc.vectorize.return %ret: i8
}
- Shuffling the vectors before merging is a great optimization it lowered the vector shuffling costs
and opened more opportunities for merging. furthermore, it lowered removed the cost of vector unpacking
as now we removed the input
vector and merged its defining op body in the using op.
- Merge VectorizeOps: llvm/circt#7146
- Keep one vector input if it's repeated: llvm/circt#7146
- Related crash: llvm/circt#7429
- Shuffle Input vectors before merging: llvm/circt#7394
To see if the vectorization is profitable we need a way to calculate the cost of the module before
and after the vectorization so we introduced the ArcCostModel
.
- We need the pass to cache results and easy to extend e.g. add more operations in the future.
- We need approximate operation costs.
- A great beta cost model.
- Simple Tester pass.
- Arc Cost Model with Tester pass: llvm/circt#7360
- Simple splitting pass: I am working on this.
- Fancy shuffling/splitting pass: One can research how to map these problems into linear optimization problems.
When I started this project, I knew nothing about MLIR and Arcilator. I am still ignorant and have a long way to go,
but I think I have a good understanding of MLIR and Arcilator. From the MLIR side, I think I can work independently,
and for Arcilator, it's very challenging but I think It's clearer and I can work with easy/intermediate tasks on my own.
The most important thing I think I got from the program was working with great minds in the arcilator community.
I enjoyed the program a lot because of them. I am very thankful to Google and FOSSi for letting me know them.