
Heat Stencil

The objective of this tutorial is to illustrate the usage of slightly advanced features offered by the AllScale API. These features are demonstrated through the step-by-step development of a 2D 5-point stencil code. The covered features include:

  • the utilization of asynchronous parallel loops and loop dependencies
  • the usage of the stencil operator

The code snippets presented in this tutorial are extracted from the (not much longer) full implementations of the covered development steps, which can be obtained from the associated Heat Stencil tutorial sources directory.

Overview

This tutorial covers the following steps:

  • Sequential Implementation
  • Using AllScale Data Structures
  • Parallelization
  • Reducing Synchronization Dependencies
  • Overlapping Time Steps
  • Using the Stencil Operator

Sequential Implementation

A basic, sequential C++ implementation of a 2D 5-point heat stencil could look as follows:

#include <array>
#include <memory>
#include <utility>

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = std::array<std::array<double,N>,N>;

auto bufferA = std::make_unique<Grid>();
auto bufferB = std::make_unique<Grid>();

// compute simulation steps
for(int t=0; t<T; t++) {
  // update step
  Grid& A = *bufferA;
  Grid& B = *bufferB;

  for(int i=1; i<N-1; i++) {
    for(int j=1; j<N-1; j++) {
      B[i][j] = A[i][j] + k*(A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1] - 4*A[i][j]);
    }
  }
  std::swap(bufferA,bufferB);
}

A full version of this example can be obtained from here.

The two 2D buffers holding the data for the stencil computation are encoded as nested instances of the std::array container, wrapped in std::unique_ptr for automatic memory management. The body of the innermost loop represents the update function applied to every element.

Using AllScale Data Structures

A plain implementation based on nested std::array instances would exhibit two subtle yet significant drawbacks:

  • the memory of matrices is allocated on the stack; thus, for larger matrices the limitations of the stack will be quickly exceeded, causing the execution to crash
  • due to their internal representation, the entire content of an array is copied whenever it is passed by value as a parameter or returned from a function; while such copies are sometimes optimized away by the compiler, in general they cause unnecessary computational work

To mitigate these issues, the implementation above employs std::unique_ptr to move the allocation to the heap as well as to allow cheap buffer swapping via std::swap. Nevertheless, the C++ STL requires such solutions to be handled explicitly by the user. For ease of use, AllScale offers a corresponding data structure, StaticGrid, which provides statically-sized, multi-dimensional, heap-allocated buffers:

#include "allscale/api/user/data/static_grid.h"

using namespace allscale::api::user;

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = data::StaticGrid<double,N,N>;

Grid bufferA;
Grid bufferB;

// compute simulation steps
for(int t=0; t<T; t++) {
  // update step
  Grid& A = bufferA;
  Grid& B = bufferB;

  for(int i=1; i<N-1; i++) {
    for(int j=1; j<N-1; j++) {
      B[{i,j}] = A[{i,j}] + k*(A[{i-1,j}] + A[{i+1,j}] + A[{i,j-1}] + A[{i,j+1}] + (-4)*A[{i,j}]);
    }
  }
  std::swap(bufferA, bufferB);
}

A full version of this example can be obtained from here.

The StaticGrid<double,N,N> container constitutes a two-dimensional array of doubles, allocated on the heap, making move-operations O(1). The number of dimensions may thereby be freely chosen. For instance, StaticGrid<double,N> could be utilized as a vector of N elements, while StaticGrid<double,N,N,N> provides a cube of NxNxN doubles.

Note the subtle difference in addressing elements of a StaticGrid compared to the syntax used for nested std::arrays. While for the array example two subscript operator ([]) calls are required, the StaticGrid enables access to its elements through a single call, accepting a pair of indices.
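
As a brief illustration (a minimal sketch, not taken from the tutorial sources; all names are hypothetical), the following snippet contrasts the two addressing styles and shows a higher-dimensional grid:

#include <array>

#include "allscale/api/user/data/static_grid.h"

using namespace allscale::api::user;

const int N = 200;

std::array<std::array<double,N>,N> nested;   // nested std::array buffer
data::StaticGrid<double,N,N> grid;           // two-dimensional StaticGrid
data::StaticGrid<double,N,N,N> cube;         // three-dimensional variant

void addressingExample() {
  double a = nested[1][2];   // nested arrays: two subscript calls
  double b = grid[{1,2}];    // StaticGrid: one subscript call with an index pair
  cube[{1,2,3}] = a + b;     // index vectors grow with the dimensionality
}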

Besides compensating for the issues of the std::array approach pointed out above, the switch to StaticGrid has one additional benefit: StaticGrid implements the Data Item concept, and may thus be automatically distributed throughout a distributed memory system when accessed by parallel code utilizing distributed memory parallelism. However, so far, our heat stencil code is still sequential.

Parallelization

The most direct way to parallelize this code is to simply replace every for loop with a pfor loop. Fortunately, the AllScale API provides such a parallel loop construct, allowing us to write:

#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"

using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = data::StaticGrid<double,N,N>;

Grid bufferA;
Grid bufferB;

// compute simulation steps
for(int t=0; t<T; t++) {
  // update step
  Grid& A = bufferA;
  Grid& B = bufferB;

  pfor(1,N-1,[&](int i){
    pfor(1,N-1,[&](int j){
      B[{i,j}] = A[{i,j}] + k*(A[{i-1,j}] + A[{i+1,j}] + A[{i,j-1}] + A[{i,j+1}] + (-4)*A[{i,j}]);
    });
  });
  std::swap(bufferA, bufferB);
}

A full version of this example can be obtained from here.

Note the additional include of the pfor.h header, which provides the implementation of the pfor operator. The original bodies of both loops are passed to pfor in the form of lambdas, and pfor invokes the stencil computation for every index value i and j in the range [1, N-1).

The code is now parallel, without requiring any additional thread management, work mapping or other explicit user interventions.

Note: As the example shows, pfors or any other parallel construct of the AllScale API may be nested, enabling their free utilization within libraries. Doing so will not incur any significant impact on performance, as pfor will automatically fall back to a sequential implementation if nested in a parallel code that already exhibits enough parallelism to saturate a system's concurrent resources.
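
For instance, a library routine may use pfor internally and still be called from code that is itself parallel. The following sketch illustrates this (the helper functions are hypothetical and not part of the tutorial sources):

#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"

using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;

const int N = 200;
using Grid = data::StaticGrid<double,N,N>;

// hypothetical library routine, parallel on its own
void scaleRow(Grid& g, int i, double factor) {
  pfor(0, N, [&](int j) {
    g[{i,j}] *= factor;
  });
}

// calling the parallel routine from parallel code is safe; the nested
// pfor falls back to sequential execution once the outer loop already
// saturates the system's concurrent resources
void scaleGrid(Grid& g, double factor) {
  pfor(0, N, [&](int i) {
    scaleRow(g, i, factor);
  });
}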

Nevertheless, the parallel heat stencil above still exhibits a lot of unnecessary synchronization, which will be reduced step by step in the following sections.

Reducing Synchronization Dependencies

By default, a pfor loop incurs an implicit barrier at its end. In the nested version above, all iterations of the inner loop for a given value of i therefore have to complete before the corresponding outer iteration can finish, even though there are no dependencies between any of the individual (i,j) updates and all of them could be executed in parallel. To eliminate this barrier, the pfor operator supports the definition of iteration spaces of higher dimensions, in this case two. Using this capability results in the following implementation:

#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"

using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;

Grid bufferA;
Grid bufferB;

Grid* A = &bufferA;
Grid* B = &bufferB;

// compute simulation steps
for(int t = 0; t<T; t++) {
  // update step
  pfor(Point{1,1}, Point{N-1,N-1}, [&](const Point& p) {
    int i = p.x;
    int j = p.y;
    (*B)[{i,j}] = (*A)[{i,j}] + k*((*A)[{i-1,j}] + (*A)[{i+1,j}] + (*A)[{i,j-1}] + (*A)[{i,j+1}] + (-4)*(*A)[{i,j}]);
  });
  std::swap(A, B);
}

A full version of this example can be obtained from here.

There is now a single pfor loop that iterates over the two-dimensional range [1,N-1) x [1,N-1). Furthermore, this iteration space is not processed by scanning through it linearly, as the original loops did, but by recursively sub-dividing it along alternating dimensions, leading to a tiling-like localization of data accesses.

Overlapping Time Steps

At this stage we have reduced global synchronization, but not removed it completely. The pfor loop still incurs an implicit global barrier at its end, in every iteration of the outermost loop representing the time steps. This global barrier can also be removed, since each iteration (i,j) at time t only accesses the element itself and its two-dimensional neighbors at time t-1. Hence, it is sufficient to synchronize only on these points of the previous time step instead of on all of them. Fortunately, the AllScale API offers means to express such synchronization with limited scope, in this case neighborhood_sync. Using it results in the following implementation:

#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"

using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;

Grid bufferA;
Grid bufferB;

Grid* A = &bufferA;
Grid* B = &bufferB;

// initialize both buffers (initial values chosen for illustration);
// the returned loop_reference seeds the dependencies of the first time step
auto ref = pfor(Point{0,0}, Point{N,N}, [A,B](const Point& p) {
  (*A)[p] = 0.0;
  (*B)[p] = 0.0;
});

// compute simulation steps
for(int t = 0; t<T; t++) {
  // update step
  ref = pfor(Point{1,1}, Point{N-1,N-1}, [A,B,k](const Point& p) {
    int i = p.x;
    int j = p.y;
    (*B)[{i,j}] = (*A)[{i,j}] + k*((*A)[{i-1,j}] + (*A)[{i+1,j}] + (*A)[{i,j-1}] + (*A)[{i,j+1}] + (-4)*(*A)[{i,j}]);
  }, neighborhood_sync(ref));
  std::swap(A, B);
}

A full version of this example can be obtained from here.

Note that this implementation captures the return value of the pfor operator, which is a loop_reference, and passes it to neighborhood_sync. Before the time loop, ref is initialized with the loop_reference returned by the pfor performing the buffer initialization. neighborhood_sync establishes dependencies such that iteration (i,j) at time t+1 is executed only after iterations (i+1,j), (i-1,j), (i,j+1), (i,j-1), and (i,j) of time t have completed.

Note: neighborhood_sync offers neighborhood synchronization for iteration spaces of any number of dimensions. Additionally, one_on_one is available, which establishes a simple one-to-one dependency between the iterations of two loops.
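
For illustration, the following sketch (reusing the declarations from the listing above; the post-processing step is made up for this example) uses one_on_one to let an element-wise follow-up loop start as soon as the corresponding iteration of the previous loop has finished:

// element-wise post-processing of the previous time loop's result;
// iteration (i,j) only depends on iteration (i,j) of 'ref'
pfor(Point{1,1}, Point{N-1,N-1}, [A](const Point& p) {
  if ((*A)[p] < 0.0) (*A)[p] = 0.0;   // clamp negative values
}, one_on_one(ref));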

Using the Stencil Operator

Since stencil computations are a frequently utilized template for large-scale high-performance applications, the AllScale API provides a generic stencil operator. Utilizing the pfor operator as illustrated above still requires the outermost loop to represent time. The stencil operator provides a fully combined space-time decomposition of the problem, while allowing the user to provide separate update functions for inner elements and boundary elements. Furthermore, contrary to all previous implementations, only one buffer needs to be allocated by the user. An implementation employing the stencil operator could look like this:

#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/stencil.h"

using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;

const int N = 200;
const int T = 100;
const double k = 0.001;   // heat diffusion constant (example value)
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;

Grid temp;

// compute simulation steps
stencil(temp,T,
  // inner elements
  [k,T,N](time_t, const Point& p, const Grid& temp)->double {
    return temp[p] + k*(temp[p+Point{-1,0}] + temp[p+Point{+1,0}] + temp[p+Point{0,-1}] + temp[p+Point{0,+1}] + (-4)*temp[p]);
  },
  // boundaries
  [k,T,N](time_t, const Point&, const Grid&)->double {
    // boundaries are constants
    return 0;
  });

A full version of this example can be obtained from here.
