promotion to registers (aka private memory) #149
Conversation
@tensorCompBot retest this please
auto depthAfter = depthBefore + band->nMember();
return depthBefore < depth && depthAfter >= depth;
};
return functional::Filter(containsDepth, bands);
functional style rules :)
schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
ouch, do we really need to go back to C and non-RAII world?
// Return early if more than one element needs to be stored in registers.
// TODO: support arrays in registers if they are only accessed with constant
// subscripts, e.g. if the inner loops are fully unrolled.
Quick question regarding unrolling.
Are you planning to unroll at this point in the mapping or to have some deferred mechanism that also involves codegen, or something else?
or maybe you just meant full tiles here?
I meant situations like accessing

for (int i = 0; i < 2; ++i) {
  A[2*threadIdx.x + i];
}

That would require having register_A[2] accessed as register_A[i], which would go to local memory instead of registers. Unless the i loop is unrolled, there is no point in promoting this access.
Because autotuner-controlled unrolling happens earlier in the transformation pipeline, I was planning to just check whether the loop is requested for unrolling.
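The distinction being discussed can be sketched in plain C++ (the real code would be CUDA; `sum_rolled`, `sum_unrolled`, and `tid` are hypothetical names used for illustration). With a runtime subscript, the per-thread array cannot be kept in registers and would spill to local memory; once the loop is fully unrolled, every subscript is a compile-time constant and each element can live in its own register:

```cpp
// Illustrative sketch only: plain-C++ stand-in for the CUDA pattern above.

// With a runtime subscript (reg[i]), the per-thread array cannot be kept in
// GPU registers and would spill to local memory.
int sum_rolled(const int* A, int tid) {
  int reg[2];
  for (int i = 0; i < 2; ++i)
    reg[i] = A[2 * tid + i]; // runtime subscript into reg
  return reg[0] + reg[1];
}

// After full unrolling, every subscript is a constant, so each element
// becomes an independent scalar that can live in a register.
int sum_unrolled(const int* A, int tid) {
  int reg0 = A[2 * tid + 0];
  int reg1 = A[2 * tid + 1];
  return reg0 + reg1;
}
```

Both versions compute the same result; only the second one makes the promotion profitable.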
auto depth = computeThreadIdxxScheduleDepth(
                 threadIdxxScheduleDepthState,
                 originalAccesses.domain().intersect(activePoints)) +
    1;
nit: indenting is a bit sad here, is it what clang-format forces on us in this instance?
yes, I tried to convince it otherwise but failed...
maybe having 1 + functioncall() could help
originalAccesses.gist_domain(originalAccesses.domain())
    .apply_domain(schedule);
for (auto sa : isl::UnionAsVector<isl::union_map>(scheduledAccesses)) {
By the time I get to this, a few weeks after last looking at shared promotion, I have forgotten what structure scheduledAccesses has. Mind adding a small comment about the structure of the union_map, please?
namespace {
isl::val getParamValIfFixed(isl::union_set uset, int pos) {
This one feels reminiscent of a function that @abadams was asking for and that @skimo-openhub exposed for us from ISL. Maybe I am remembering wrong but maybe there is a way to avoid writing this function at all?
If we do need to write it indeed, it looks like a good candidate to move to include/tc/external/detail/islpp.h.
it uses the function that @skimo-openhub mentioned, but that function only works for sets, not unions of sets, so I had to write a wrapper
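The shape of such a wrapper can be sketched with a stand-in type (`FakeSet` and its `fixedParam` member are hypothetical; the real code queries each set of the union with isl's per-set "plain get val if fixed" function): the union-level query succeeds only if every piece reports the same fixed value.

```cpp
#include <optional>
#include <vector>

// Sketch of the union-level wrapper logic: ask each set of the union for the
// parameter's fixed value and succeed only if all sets agree. FakeSet stands
// in for an isl set; the optional models "fixed to this value, or not fixed".
struct FakeSet {
  std::optional<long> fixedParam; // parameter value if fixed in this set
};

std::optional<long> getParamValIfFixed(const std::vector<FakeSet>& uset) {
  std::optional<long> result;
  for (const auto& s : uset) {
    if (!s.fixedParam)
      return std::nullopt; // not fixed in this piece
    if (result && *result != *s.fixedParam)
      return std::nullopt; // fixed to different values across pieces
    result = s.fixedParam;
  }
  return result;
}
```

An empty union yields an empty optional, mirroring "no fixed value known".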
size_t nRegisters) {
  using namespace tc::polyhedral::detail;
// Assuming the mapping happens to threads happens in inverse order, i.e.
nit: happens twice
size_t nRegisters) {
  using namespace tc::polyhedral::detail;

// Assuming the mapping happens to threads happens in inverse order, i.e.
Comment at the level of the function rather than inside?
continue;
}

// Keep only those bands for which the this depth was recorded.
spurious the
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
  auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
    if (not(id == mapping::ThreadId::makeId(i))) {
is this better than !=, or is there an underlying compilation issue because of a missing operator?
it was missing, introduced in a follow-up PR; I can cherry-pick it here
// per-thread-group access relations.
auto points = activeDomainPoints(root, band);
size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
nit: use d for dimension?
// TODO: sorting of groups and counting the number of promoted elements
// TODO: check if nvrtc is smart enough to reuse a register for
// variables in disjoint scopes or do we need to reuse the variable?
I would certainly hope that nvrtc does this, but that may be a good question to engage our NVIDIA friends with. Pinging Vinod to see what he thinks.
cc @vinodgro
yes it should allocate the same register to different lifetimes or scopes.
.useSharedMemory(false)
.usePrivateMemory(true);
auto mscop = makeMappedScop(tc, mappingOptions, parameters);
return std::get<0>(mscop->codegen("fun"));
In the following changesets, after register promotion is activated, can we please check that promotion indeed happens in this unit test?
Makes sense, minor nits here and there.
It would be good if the promotion-to-private test actually checked that promotion happens, though.
I am going to play with it a bit and accept after shaking things out with the autotuner.
src/core/polyhedral/scop.cc
Outdated
// in particular if the new promotion is strictly smaller in scope
// and size than the existing ones (otherwise we would need to find
// the all the existing ones and change their copy relations).
std::cerr << "FIXME: not promoting because another promotion of tensor "
actually this spews non-trivial amounts of text; how about LOG(WARNING)?
just kill it
When converting a TC ScheduleTree into an isl schedule_tree, children of an extension node are introduced progressively, starting with the "core" children, i.e. those parts of the extension children that relate to the points present in the root domain. Other children are added before (after) the core children using node.graft_before (node.graft_after) on the first (last) core child. These functions insert the new sibling subtree before (after) the node and return the node itself. Therefore, when calling these functions multiple times, the order in which siblings are added must be the inverse of their expected order of appearance. The existing implementation used the direct order; fix it to use the inverse order.
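The ordering pitfall can be sketched with std::list standing in for the sibling list (`graftSiblingsAfter` is a hypothetical name; the sketch assumes the graft semantics described in this commit message, where each grafted subtree lands immediately adjacent to the anchor node, pushing earlier grafts further away):

```cpp
#include <iterator>
#include <list>
#include <vector>

// Sketch: each "graft" inserts immediately after the anchor (the last core
// child), so every call pushes earlier grafts further from the anchor. To
// obtain a desired final order, siblings must be added in inverse order.
std::vector<int> graftSiblingsAfter(const std::vector<int>& siblings) {
  std::list<int> children = {0}; // 0 stands for the last "core" child
  auto anchor = children.begin();
  for (int s : siblings)
    children.insert(std::next(anchor), s); // insert right after the anchor
  return {children.begin(), children.end()};
}
```

Adding siblings in direct order {1, 2, 3} yields {0, 3, 2, 1}; only the inverse order {3, 2, 1} produces the expected {0, 1, 2, 3}.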
threadIdxxScheduleDepthState is the mapping from active domain points to the schedule depth of the band member mapped to thread x. Since the mapping happens below the given node in fixThreadsBelowFilter, we need to use domain points active in a child of the given node rather than in the given node itself (especially if the given node is a filter). Note that the same domain points are active in all children of a node, as only ancestor filters are taken into account. Take the active domain points of the first child of the given node instead of the node itself. Without this change, threadIdxxScheduleDepthState could have multiple entries for partially overlapping domains, potentially with different values of depth. However, the memory promotion code assumes the depth must be the same. Furthermore, having different depths considered as mapped to thread x for the same statement instance is incorrect with respect to the parallel execution model.
These functions will be reused for promotion to registers in the following commits.
Remove spurious letters. Print newlines after each reference's access relation for better readability.
The initial implementation of TensorReference instances was constructing original access maps by intersecting the domain of scop-wise access maps with the set of active domain points obtained by traversing the tree. This set can contain constraints coming from mapping filters, which would have to be removed for further analyses. The initial implementation was gisting the domain of the access relation with itself in an attempt to remove these constraints. This does not fully work because these constraints typically fix some input dimensions of a map to parameters modulo a constant, which triggers similar fixing of the output dimensions related to those input dimensions. When constructing TensorReferences, do not use the result of intersecting the scop-wise access maps with active domain points. Instead, only check that this intersection is not empty before adding the original (unconstrained) access relation. Whenever it is necessary in the shared memory promotion heuristic, pass the set of active domain points and intersect access relations locally.
If we cannot compute a rectangular approximation of a reference's footprint, do not introduce groups for any other references to this tensor and thus avoid promoting it.
The existing implementation of fullSchedule iteratively takes a flat product of multi union piecewise affine expressions with the partial schedule, starting from the innermost band. It also intersects the domain of the multi union piecewise affine expression with the filter sets. However, if the innermost band is zero-dimensional, it is expected to have an explicit domain, which it does not in this case, and the domain intersection then fails. Collect the domain information separately while taking flat ranges, and intersect the domain of the affine expression only after it has been fully constructed.
Assuming mapping to threads starts from the innermost coincident schedule dimension and from thread x, promote to registers in each subtree below the band member mapped to thread x. Split bands if necessary to ensure that this member is the last one in the band. For each such band, collect references to tensors accessed below it. Group together references that have overlapping footprints where at least one of them is a write, to ensure the most recent value is read. For each group, consider promotion to registers if the footprint contains only one element (hence promotable to a register) and if each element is accessed by at most one thread (registers are private to threads). Do not promote to registers if these references were already promoted to shared memory, as this would require either copying from shared memory to registers, or demoting from shared memory first. Do not insert synchronizations around these copies, as no two threads are accessing the same value. The compiler could load from memory to a register anyway for most arithmetic operations.
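The effect on generated code can be sketched in plain C++ (real output would be CUDA; `scale_promoted` and `tid` are hypothetical names). The group's single element is copied into a private scalar, all intermediate accesses hit that scalar, and the value is copied back, with no synchronization since each thread owns its element:

```cpp
// Sketch of code after register promotion: one element per thread is copied
// into a private (register) variable, used for computation, and copied back.
void scale_promoted(float* A, int tid, float factor) {
  float reg_A = A[tid];   // copy in: the footprint is a single element
  reg_A = reg_A * factor; // intermediate accesses use the register
  reg_A = reg_A + 1.0f;
  A[tid] = reg_A;         // copy out; no synchronization needed
}
```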
A kernel variable can be declared as shared or as automatic (register).
Original implementation of memory promotion (and related parts of code generation) was assuming schedule tree branches only for entire statements. It was therefore using statement ids as a lightweight key for finding relevant active promotions and schedule parts. With the introduction of full/partial tile separation, this assumption no longer holds. Make memory promotion use domain sets instead of statement ids, and find relevant promotions and schedule parts by intersecting these sets with sets of statement instances active in a particular location in the schedule tree.
This function is no longer used. Furthermore, it was supposed to be used under a no-longer-correct assumption about schedule tree structure.
This function wraps isl_set/map_plain_get_val_if_fixed for union sets/maps, and can be useful in different places. Move it to our extensions for the isl bindings and make it a template to accommodate both sets and maps.
This PR is good to go and has been tested along with the followup PR using the autotuner.