This repository was archived by the owner on Apr 28, 2023. It is now read-only.

Conversation

@ftynse (Contributor) commented Mar 13, 2018

Assuming mapping to threads starts from the innermost coincident
schedule dimension and from the thread x, promote to registers in each
subtree below the band member mapped to thread x. Split bands if
necessary to ensure that this member is the last one in the band.
For each such band, collect references to tensors accessed below it.
Group together the references that have overlapping footprints and at
least one of them is a write to ensure the most recent value is read.
For each group, consider promotion to registers if the footprint
contains only one element (hence promotable to a register) and if each
element is accessed by at most one thread (registers are private to
threads). Do not promote to registers if these references were already
promoted to shared memory as this would require either copying from
shared memory to registers, or demoting from shared memory first.
Do not insert synchronizations around these copies as no two threads are
accessing the same value. The compiler could load from memory to a
register anyway for most arithmetic operations.

@prigoyal (Contributor):

@tensorCompBot retest this please

auto depthAfter = depthBefore + band->nMember();
return depthBefore < depth && depthAfter >= depth;
};
return functional::Filter(containsDepth, bands);
Contributor:

functional style rules :)

}

schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
@nicolasvasilache (Contributor) Mar 15, 2018:

ouch, do we really need to go back to C and non-RAII world?


// Return early if more than one element needs to be stored in registers.
// TODO: support arrays in registers if they are only accessed with constant
// subscripts, e.g. if the inner loops are fully unrolled.
Contributor:

Quick question regarding unrolling.
Are you planning to unroll at this point in the mapping or to have some deferred mechanism that also involves codegen, or something else?

Contributor:

or maybe you just meant full tiles here?

Contributor (Author):

I meant situations like accessing

for (int i = 0; i < 2; ++i) {
  A[2 * threadIdx.x + i];
}

that would require having register_A[2] accessed as register_A[i], which would go to local memory instead of registers. Unless the i loop is unrolled, there is no point in promoting this access.

Because autotuner-controlled unrolling happens earlier in the transformation pipeline, I was planning to just check if the loop is requested for unroll.

auto depth = computeThreadIdxxScheduleDepth(
threadIdxxScheduleDepthState,
originalAccesses.domain().intersect(activePoints)) +
1;
Contributor:

nit: the indenting is a bit sad here; is this what clang-format forces on us in this instance?

Contributor (Author):

yes, I tried to convince it otherwise but failed...

Contributor (Author):

maybe having 1 + functioncall() could help

originalAccesses.gist_domain(originalAccesses.domain())
.apply_domain(schedule);

for (auto sa : isl::UnionAsVector<isl::union_map>(scheduledAccesses)) {
Contributor:

By the time I get to this, a few weeks after last looking at shared promotion, I have forgotten what structure scheduledAccesses has. Mind adding a small comment about the structure of the union_map, please?

}

namespace {
isl::val getParamValIfFixed(isl::union_set uset, int pos) {
Contributor:

This one feels reminiscent of a function that @abadams was asking for and that @skimo-openhub exposed for us from ISL. Maybe I am remembering wrong but maybe there is a way to avoid writing this function at all?
If we do need to write it indeed, it looks like a good candidate to move to include/tc/external/detail/islpp.h.

Contributor (Author):

it uses the function that @skimo-openhub mentioned, but that function only works for sets, not unions of sets, so I had to write a wrapper

size_t nRegisters) {
using namespace tc::polyhedral::detail;

// Assuming the mapping happens to threads happens in inverse order, i.e.
Contributor:

nit: happens twice

size_t nRegisters) {
using namespace tc::polyhedral::detail;

// Assuming the mapping happens to threads happens in inverse order, i.e.
Contributor:

Comment at the level of the function rather than inside?

continue;
}

// Keep only those bands for which the this depth was recorded.
Contributor:

spurious the

for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
if (not(id == mapping::ThreadId::makeId(i))) {
Contributor:

is this better than !=, or is there an underlying compilation issue because of a missing operator?

Contributor (Author):

it was missing; the operator is introduced in a follow-up PR. I can cherry-pick it here

// per-thread-group access relations.
auto points = activeDomainPoints(root, band);
size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
Contributor:

nit: use d for dimension?


// TODO: sorting of groups and counting the number of promoted elements
// TODO: check if nvrtc is smart enough to reuse a register for
// variables in disjoint scopes or do we need to reuse the variable?
Contributor:

I would certainly hope that nvrtc would do this, but that may be a good question to engage with our NVIDIA friends. Pinging Vinod to see what he thinks.

Contributor:

Yes, it should allocate the same register to different lifetimes or scopes.

.useSharedMemory(false)
.usePrivateMemory(true);
auto mscop = makeMappedScop(tc, mappingOptions, parameters);
return std::get<0>(mscop->codegen("fun"));
Contributor:

In the following changesets, after register promotion is activated, can we please check that promotion indeed happens in this unit test?

@nicolasvasilache (Contributor) left a comment:

Makes sense, minor nits here and there.
Would be good if the test for promotion to private actually checked that promotion happens, though.
I am going to play with it a bit and accept after shaking things out with the autotuner.

// in particular if the new promotion is strictly smaller in scope
// and size than the existing ones (otherwise we would need to find
// all the existing ones and change their copy relations).
std::cerr << "FIXME: not promoting because another promotion of tensor "
Contributor:

actually this spews non-trivial amounts of text; how about LOG(WARNING)?

Contributor (Author):

just kill it

@nicolasvasilache changed the base branch from dev to dev-rebased March 15, 2018 23:05
@nicolasvasilache changed the base branch from dev-rebased to dev March 15, 2018 23:08
@ftynse force-pushed the register-promotion branch 2 times, most recently from f8b2339 to 8a74de3 March 16, 2018 15:53
@ftynse changed the title [WIP] promotion to registers (aka private memory) → promotion to registers (aka private memory) Mar 16, 2018
@ftynse requested a review from nicolasvasilache March 16, 2018 16:18
@ftynse force-pushed the register-promotion branch from 276b866 to 2ed896e March 16, 2018 17:14
@ftynse removed the wip label Mar 16, 2018
@ftynse force-pushed the register-promotion branch from 2ed896e to b6af70d March 19, 2018 12:32
@ftynse changed the base branch from dev to master March 19, 2018 13:04
@ftynse force-pushed the register-promotion branch from b6af70d to c58b6c0 March 19, 2018 13:08
@ftynse force-pushed the register-promotion branch from fa97bb7 to 1cf6a6a March 21, 2018 15:22
ftynse added 21 commits March 22, 2018 19:55
When converting TC ScheduleTree into isl schedule_tree, children of an
extension nodes are introduced progressively, starting with "core"
children, i.e. those parts of the extension children that relate to the
points present in the root domain.  Other children are added
before(after) the core children using node.graft_before(node.graft_after)
on the first(last) core child.  These functions insert the new sibling
subtree before(after) the node and return the node itself.  Therefore,
when calling these functions multiple times, the order in which siblings
are added must be inverse to their expected order of appearance.
The existing implementation used the direct order; revert it.
threadIdxxScheduleDepthState is the mapping from active domain points to
the schedule depth of the band member mapped to thread x.  Since the
mapping happens below the given node in fixThreadsBelowFilter, we need
to use domain points active in a child of the given node rather than in
the given node itself (especially the given node being a filter).  Note
that the same domain points are active in all children of a node as only
ancestor filters are taken into account.  Take active domain points of
the first child of the given node instead of the node itself.

Without this change, threadIdxxScheduleDepthState could have multiple
entries for partially overlapping domains, potentially with different
values of depth.  However, memory promotion code assumes the depth must
be the same.  Furthermore, having different depths considered as mapped
to thread x for the same statement instance is incorrect with respect to
the parallel execution model.
These functions will be reused for promotion to registers in the following
commits.
Remove spurious letters.  Print newlines after each reference's access
relation for better readability.
The initial implementation of TensorReference instances was constructing
original access maps by intersecting the domain of scop-wise access
maps with the set of active domain points obtained by traversing the
tree.  This set can contain constraints coming from mapping filters,
which would have to be removed for further analyses.  Initial
implementation was gisting the domain of the access relation with itself
in an attempt to remove these constraints.  It does not work fully
because these constraints typically fix some input dimensions of a map
to parameters modulo constant, which triggers similar fixing of the
output dimensions related to the input dimensions.

When constructing TensorReferences, do not use the result of
intersecting the scop-wise access maps with active domain points.
Instead, only check if this intersection is not empty before adding the
original (unconstrained) access relation.  Whenever it is necessary in
shared memory promotion heuristic, pass the set of active domain points
and intersect access relations locally.
If we cannot compute a rectangular approximation of a reference's
footprint, do not introduce groups for any other references to this
tensor and thus avoid promoting it.
Existing implementation of fullSchedule iteratively takes a flat product
of multi union piecewise affine expressions with partial schedule,
starting from the innermost band.  It also intersects the domain of the
multi union piecewise affine expression with the filter sets.  However,
if the innermost band is zero-dimensional, it is expected to have an
explicit domain, which it does not in this case.  The domain
intersection then fails.  Collect the domain information separately from
taking flat ranges and intersect the domain of the affine expression
only after it was fully constructed.
Assuming mapping to threads starts from the innermost coincident
schedule dimension and from the thread x, promote to registers in each
subtree below the band member mapped to thread x.  Split bands if
necessary to ensure that this member is the last one in the band.
For each such band, collect references to tensors accessed below it.
Group together the references that have overlapping footprints and at
least one of them is a write to ensure the most recent value is read.
For each group, consider promotion to registers if the footprint
contains only one element (hence promotable to a register) and if each
element is accessed by at most one thread (registers are private to
threads).  Do not promote to registers if these references were already
promoted to shared memory as this would require either copying from
shared memory to registers, or demoting from shared memory first.
Do not insert synchronizations around these copies as no two threads are
accessing the same value.  The compiler could load from memory to a
register anyway for most arithmetic operations.
A kernel variable can be declared as shared or as automatic (register).
Original implementation of memory promotion (and related parts of code
generation) was assuming schedule tree branches only for entire
statements.  It was therefore using statement ids as a lightweight key
for finding relevant active promotions and schedule parts.  With the
introduction of full/partial tile separation, this assumption no longer
holds.  Make memory promotion use domain sets instead of statement ids,
and find relevant promotions and schedule parts by intersecting these
sets with sets of statement instances active in a particular location
in the schedule tree.
This function is no longer used.  Furthermore, it was supposed to be
used under a no-longer-correct assumption about schedule tree structure.
This function wraps isl_set/map_plain_get_val_if_fixed for union
sets/maps, and can be useful in different places.  Move it to our
extensions for the isl bindings and make it a template to accommodate
both sets and maps.
@ftynse force-pushed the register-promotion branch from 1cf6a6a to 3259e76 March 22, 2018 18:58
@ftynse mentioned this pull request Mar 23, 2018
@nicolasvasilache (Contributor) left a comment:

This PR is good to go and has been tested along with the followup PR using the autotuner.

@nicolasvasilache merged commit f8d91dd into master Mar 26, 2018
@nicolasvasilache deleted the register-promotion branch March 26, 2018 19:48
nicolasvasilache added a commit that referenced this pull request Mar 26, 2018
Test introduced in #149 only performs string comparisons.
The behavior was changed in #202 and merging broke master.

This fixes the test.