promotion to registers (aka private memory) #149
Conversation
@tensorCompBot retest this please
auto depthAfter = depthBefore + band->nMember();
return depthBefore < depth && depthAfter >= depth;
};
return functional::Filter(containsDepth, bands);
functional style rules :)
schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
ouch, do we really need to go back to C and non-RAII world?
// Return early if more than one element needs to be stored in registers.
// TODO: support arrays in registers if they are only accessed with constant
// subscripts, e.g. if the inner loops are fully unrolled.
Quick question regarding unrolling.
Are you planning to unroll at this point in the mapping or to have some deferred mechanism that also involves codegen, or something else?
or maybe you just meant full tiles here?
I meant situations like accessing

for (int i = 0; i < 2; ++i) {
  A[2*threadIdx.x + i];
}

That would require having register_A[2] accessed as register_A[i], which would go to local memory instead of registers. Unless the i loop is unrolled, there is no point in promoting this access.
Because autotuner-controlled unrolling happens earlier in the transformation pipeline, I was planning to just check whether the loop is requested for unrolling.
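The distinction being discussed can be sketched in plain C++ (the real code would be CUDA; `sum_rolled`, `sum_unrolled`, and `tid` are hypothetical names used for illustration). With a runtime subscript, the per-thread array cannot be kept in registers and would spill to local memory; once the loop is fully unrolled, every subscript is a compile-time constant and each element can live in its own register:

```cpp
// Illustrative sketch only: plain-C++ stand-in for the CUDA pattern above.

// With a runtime subscript (reg[i]), the per-thread array cannot be kept in
// GPU registers and would spill to local memory.
int sum_rolled(const int* A, int tid) {
  int reg[2];
  for (int i = 0; i < 2; ++i)
    reg[i] = A[2 * tid + i]; // runtime subscript into reg
  return reg[0] + reg[1];
}

// After full unrolling, every subscript is a constant, so each element
// becomes an independent scalar that can live in a register.
int sum_unrolled(const int* A, int tid) {
  int reg0 = A[2 * tid + 0];
  int reg1 = A[2 * tid + 1];
  return reg0 + reg1;
}
```

Both versions compute the same result; only the second one makes the promotion profitable.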
auto depth = computeThreadIdxxScheduleDepth(
                 threadIdxxScheduleDepthState,
                 originalAccesses.domain().intersect(activePoints)) +
    1;
nit: indenting is a bit sad here, is it what clang-format forces on us in this instance?
yes, I tried to convince it otherwise but failed...
maybe having 1 + functioncall() could help
originalAccesses.gist_domain(originalAccesses.domain())
    .apply_domain(schedule);
for (auto sa : isl::UnionAsVector<isl::union_map>(scheduledAccesses)) {
By the time I get to this, a few weeks after last looking at shared promotion, I have forgotten what structure scheduledAccesses has. Mind adding a small comment about the structure of the union_map, please?
namespace {
isl::val getParamValIfFixed(isl::union_set uset, int pos) {
This one feels reminiscent of a function that @abadams was asking for and that @skimo-openhub exposed for us from ISL. Maybe I am remembering wrong but maybe there is a way to avoid writing this function at all?
If we do need to write it indeed, it looks like a good candidate to move to include/tc/external/detail/islpp.h.
it uses the function that @skimo-openhub mentioned, but that function only works for sets, not unions of sets, so I had to write a wrapper
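The shape of such a wrapper can be sketched with a stand-in type (`FakeSet` and its `fixedParam` member are hypothetical; the real code queries each set of the union with isl's per-set "plain get val if fixed" function): the union-level query succeeds only if every piece reports the same fixed value.

```cpp
#include <optional>
#include <vector>

// Sketch of the union-level wrapper logic: ask each set of the union for the
// parameter's fixed value and succeed only if all sets agree. FakeSet stands
// in for an isl set; the optional models "fixed to this value, or not fixed".
struct FakeSet {
  std::optional<long> fixedParam; // parameter value if fixed in this set
};

std::optional<long> getParamValIfFixed(const std::vector<FakeSet>& uset) {
  std::optional<long> result;
  for (const auto& s : uset) {
    if (!s.fixedParam)
      return std::nullopt; // not fixed in this piece
    if (result && *result != *s.fixedParam)
      return std::nullopt; // fixed to different values across pieces
    result = s.fixedParam;
  }
  return result;
}
```

An empty union yields an empty optional, mirroring "no fixed value known".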
size_t nRegisters) {
  using namespace tc::polyhedral::detail;
// Assuming the mapping happens to threads happens in inverse order, i.e.
nit: happens twice
size_t nRegisters) {
  using namespace tc::polyhedral::detail;

// Assuming the mapping happens to threads happens in inverse order, i.e.
Comment at the level of the function rather than inside?
continue;
}

// Keep only those bands for which the this depth was recorded.
spurious the
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
  auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
    if (not(id == mapping::ThreadId::makeId(i))) {
is this better than !=, or is there an underlying compilation issue because of a missing operator?
it was missing, introduced in a follow-up PR; I can cherry-pick it here
// per-thread-group access relations.
auto points = activeDomainPoints(root, band);
size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
nit: use d for dimension?
// TODO: sorting of groups and counting the number of promoted elements
// TODO: check if nvrtc is smart enough to reuse a register for
// variables in disjoint scopes or do we need to reuse the variable?
I would certainly hope that nvrtc does this, but that may be a good question to engage our NVIDIA friends with. Pinging Vinod to see what he thinks.
cc @vinodgro
yes it should allocate the same register to different lifetimes or scopes.
.useSharedMemory(false)
.usePrivateMemory(true);
auto mscop = makeMappedScop(tc, mappingOptions, parameters);
return std::get<0>(mscop->codegen("fun"));
In the following changesets, after register promotion is activated, can we please check that promotion indeed happens in this unit test?
Makes sense, minor nits here and there.
It would be good if the promotion-to-private test actually checked that promotion happens, though.
I am going to play with it a bit and accept after shaking things out with the autotuner.
src/core/polyhedral/scop.cc
Outdated
// in particular if the new promotion is strictly smaller in scope
// and size than the existing ones (otherwise we would need to find
// the all the existing ones and change their copy relations).
std::cerr << "FIXME: not promoting because another promotion of tensor "
actually this spews non-trivial amounts of text; how about LOG(WARNING)?
just kill it
When converting a TC ScheduleTree into an isl schedule_tree, children of an extension node are introduced progressively, starting with the "core" children, i.e. those parts of the extension children that relate to the points present in the root domain. Other children are added before (after) the core children using node.graft_before (node.graft_after) on the first (last) core child. These functions insert the new sibling subtree before (after) the node and return the node itself. Therefore, when calling these functions multiple times, the order in which siblings are added must be the inverse of their expected order of appearance. The existing implementation used the direct order; fix it to use the inverse order.
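The ordering pitfall can be sketched with std::list standing in for the sibling list (`graftSiblingsAfter` is a hypothetical name; the sketch assumes the graft semantics described in this commit message, where each grafted subtree lands immediately adjacent to the anchor node, pushing earlier grafts further away):

```cpp
#include <iterator>
#include <list>
#include <vector>

// Sketch: each "graft" inserts immediately after the anchor (the last core
// child), so every call pushes earlier grafts further from the anchor. To
// obtain a desired final order, siblings must be added in inverse order.
std::vector<int> graftSiblingsAfter(const std::vector<int>& siblings) {
  std::list<int> children = {0}; // 0 stands for the last "core" child
  auto anchor = children.begin();
  for (int s : siblings)
    children.insert(std::next(anchor), s); // insert right after the anchor
  return {children.begin(), children.end()};
}
```

Adding siblings in direct order {1, 2, 3} yields {0, 3, 2, 1}; only the inverse order {3, 2, 1} produces the expected {0, 1, 2, 3}.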
threadIdxxScheduleDepthState is the mapping from active domain points to the schedule depth of the band member mapped to thread x. Since the mapping happens below the given node in fixThreadsBelowFilter, we need to use domain points active in a child of the given node rather than in the given node itself (especially if the given node is a filter). Note that the same domain points are active in all children of a node, as only ancestor filters are taken into account. Take the active domain points of the first child of the given node instead of the node itself. Without this change, threadIdxxScheduleDepthState could have multiple entries for partially overlapping domains, potentially with different values of depth. However, the memory promotion code assumes the depth must be the same. Furthermore, having different depths considered as mapped to thread x for the same statement instance is incorrect with respect to the parallel execution model.
These functions will be reused for promotion to registers in the following commits.
Remove spurious letters. Print newlines after each reference's access relation for better readability.
The initial implementation of TensorReference instances was constructing original access maps by intersecting the domain of scop-wise access maps with the set of active domain points obtained by traversing the tree. This set can contain constraints coming from mapping filters, which would have to be removed for further analyses. The initial implementation was gisting the domain of the access relation with itself in an attempt to remove these constraints. This does not fully work because these constraints typically fix some input dimensions of a map to parameters modulo a constant, which triggers similar fixing of the output dimensions related to those input dimensions. When constructing TensorReferences, do not use the result of intersecting the scop-wise access maps with active domain points. Instead, only check that this intersection is not empty before adding the original (unconstrained) access relation. Whenever it is necessary in the shared memory promotion heuristic, pass the set of active domain points and intersect access relations locally.
If we cannot compute a rectangular approximation of a reference's footprint, do not introduce groups for any other references to this tensor and thus avoid promoting it.
The existing implementation of fullSchedule iteratively takes a flat product of multi union piecewise affine expressions with the partial schedule, starting from the innermost band. It also intersects the domain of the multi union piecewise affine expression with the filter sets. However, if the innermost band is zero-dimensional, it is expected to have an explicit domain, which it does not in this case, and the domain intersection then fails. Collect the domain information separately while taking flat ranges, and intersect the domain of the affine expression only after it has been fully constructed.
Assuming mapping to threads starts from the innermost coincident schedule dimension and from thread x, promote to registers in each subtree below the band member mapped to thread x. Split bands if necessary to ensure that this member is the last one in the band. For each such band, collect references to tensors accessed below it. Group together references that have overlapping footprints where at least one of them is a write, to ensure the most recent value is read. For each group, consider promotion to registers if the footprint contains only one element (hence promotable to a register) and if each element is accessed by at most one thread (registers are private to threads). Do not promote to registers if these references were already promoted to shared memory, as this would require either copying from shared memory to registers, or demoting from shared memory first. Do not insert synchronizations around these copies, as no two threads are accessing the same value. The compiler could load from memory to a register anyway for most arithmetic operations.
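The effect on generated code can be sketched in plain C++ (real output would be CUDA; `scale_promoted` and `tid` are hypothetical names). The group's single element is copied into a private scalar, all intermediate accesses hit that scalar, and the value is copied back, with no synchronization since each thread owns its element:

```cpp
// Sketch of code after register promotion: one element per thread is copied
// into a private (register) variable, used for computation, and copied back.
void scale_promoted(float* A, int tid, float factor) {
  float reg_A = A[tid];   // copy in: the footprint is a single element
  reg_A = reg_A * factor; // intermediate accesses use the register
  reg_A = reg_A + 1.0f;
  A[tid] = reg_A;         // copy out; no synchronization needed
}
```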
A kernel variable can be declared as shared or as automatic (register).
Original implementation of memory promotion (and related parts of code generation) was assuming schedule tree branches only for entire statements. It was therefore using statement ids as a lightweight key for finding relevant active promotions and schedule parts. With the introduction of full/partial tile separation, this assumption no longer holds. Make memory promotion use domain sets instead of statement ids, and find relevant promotions and schedule parts by intersecting these sets with sets of statement instances active in a particular location in the schedule tree.
This function is no longer used. Furthermore, it was supposed to be used under a no-longer-correct assumption about schedule tree structure.
This function wraps isl_set/map_plain_get_val_if_fixed for union sets/maps, and can be useful in different places. Move it to our extensions for the isl bindings and make it a template to accommodate both sets and maps.
This PR is good to go and has been tested along with the followup PR using the autotuner.