HeatStencil
The objective of this tutorial is to illustrate the usage of slightly advanced features offered by the AllScale API. These features are demonstrated through the step-by-step development of a 2D 5-point stencil code. The covered features include:
- the utilization of asynchronous parallel loops and loop dependencies
- the usage of the stencil operator
The code snippets presented in this tutorial are extracted from the (not much longer) full implementations of the covered development steps, which can be obtained from the associated Heat Stencil tutorial sources directory.
This tutorial covers the following steps:
- Sequential Implementation - an initial C++ reference version of a 2D 5-point stencil
- Using AllScale Data Structures - adopting AllScale data structures
- Parallelization - distributing computation to parallel resources
- Reducing Synchronization Dependencies - on the space domain
- Overlapping Time Steps - breaking global synchronization over time steps
- Using the Stencil Operator - a higher level stencil operator
A basic, sequential C++ implementation of a 2D 5-point heat stencil could look as follows:
```cpp
#include <algorithm>
#include <array>
#include <memory>
#include <utility>

const int N = 200;
const int T = 100;
// the update coefficient k is defined in the full version of this example

using Grid = std::array<std::array<double,N>,N>;

auto bufferA = std::make_unique<Grid>();
auto bufferB = std::make_unique<Grid>();

// compute simulation steps
for(int t=0; t<T; t++) {

    // update step
    Grid& A = *bufferA;
    Grid& B = *bufferB;
    for(int i=1; i<N-1; i++) {
        for(int j=1; j<N-1; j++) {
            B[i][j] = A[i][j] + k*(A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1] + (-4)*A[i][j]);
        }
    }

    std::swap(bufferA,bufferB);
}
```
A full version of this example can be obtained from the Heat Stencil tutorial sources directory.
The two 2D buffers holding the data for the stencil computation are encoded using nested instances of the std::array container, which provides automatic memory management. The body of the innermost loop represents the update function applied to every element.
Using nested std::array containers directly would entail two subtle yet significant drawbacks:
- the memory of the matrices would be allocated on the stack; thus, for larger matrices the stack's size limit is quickly exceeded, causing the execution to crash
- due to their internal representation, the content of the arrays needs to be copied whenever it is passed by value as a parameter or returned from a function; while sometimes optimized out by the compiler, in general this may cause unnecessary computational work (see the sketch below)
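To make these two points concrete, the following minimal sketch (plain standard C++; the helper function firstElement is illustrative only and not part of the tutorial) contrasts the by-value copy of a nested std::array with the O(1) swap of heap-allocated buffers:
```cpp
#include <array>
#include <memory>
#include <utility>

using Grid = std::array<std::array<double,200>,200>;

// passing the nested array by value copies all 200*200 doubles
double firstElement(Grid g) {
    return g[0][0];
}

int main() {
    // Grid g;  // ~320 kB directly on the stack; larger N quickly exceeds its limit
    auto bufferA = std::make_unique<Grid>();   // heap allocation instead
    auto bufferB = std::make_unique<Grid>();

    (*bufferA)[0][0] = 1.0;
    double v = firstElement(*bufferA);         // triggers a full copy of the grid

    std::swap(bufferA, bufferB);               // O(1): only the two pointers are exchanged
    return v == 1.0 ? 0 : 1;
}
```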
To mitigate these issues, the implementation above employs std::unique_ptr to move the allocation to the heap, as well as to allow cheap buffer swapping via std::swap. Nevertheless, the C++ STL requires such solutions to be explicitly handled by the user. For ease of use, the AllScale API provides a corresponding data structure, StaticGrid, which offers statically-sized, multi-dimensional, heap-allocated buffers:
#include "allscale/api/user/data/static_grid.h"
using namespace allscale::api::user;
const int N = 200;
const int T = 100;
using Grid = data::StaticGrid<double,N,N>;
Grid bufferA;
Grid bufferB;
// compute simulation steps
for(int t=0; t<T; t++) {
// update step
Grid& A = bufferA;
Grid& B = bufferB;
for(int i=1; i<N-1; i++) {
for(int j=1; j<N-1; j++) {
B[{i,j}] = A[{i,j}] + k*(A[{i-1,j}] + A[{i+1,j}] + A[{i,j-1}] + A[{i,j+1}] + (-4)*A[{i,j}]);
}
}
std::swap(bufferA, bufferB);
}A full version of this example can be obtained from here.
The StaticGrid<double,N,N> container constitutes a two-dimensional array of doubles, allocated on the heap, making move-operations O(1). The number of dimensions may thereby be freely chosen. For instance, StaticGrid<double,N> could be utilized as a vector of N elements, while StaticGrid<double,N,N,N> provides a cube of NxNxN doubles.
Note the subtle difference in addressing elements of a StaticGrid compared to the syntax used for nested std::arrays. While for the array example two subscript operator ([]) calls are required, the StaticGrid enables access to its elements through a single call, accepting a pair of indices.
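As a small illustration of this addressing difference (the three-dimensional access below is an extrapolation of the two-dimensional syntax shown above, not taken verbatim from the tutorial sources):
```cpp
#include <array>
#include "allscale/api/user/data/static_grid.h"

using namespace allscale::api::user;

int main() {
    // nested std::array: one subscript call per dimension
    std::array<std::array<double,4>,4> plain{};
    plain[1][2] = 1.0;

    // StaticGrid: a single subscript call taking the full coordinate
    data::StaticGrid<double,4,4> grid;
    grid[{1,2}] = 1.0;

    // presumably extends analogously to higher dimensions
    data::StaticGrid<double,4,4,4> cube;
    cube[{1,2,3}] = 1.0;

    return plain[1][2] == grid[{1,2}] ? 0 : 1;
}
```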
Besides compensating for the issues of the std::array approach pointed out above, the switch to StaticGrid has one additional side effect: StaticGrid implements the Data Item concept and may thus be automatically distributed throughout a distributed-memory system when accessed by parallel code utilizing distributed-memory parallelism.
However, so far, our heat stencil code is still sequential.
To parallelize our operation, the most direct way is to simply replace every for loop with a pfor loop. Fortunately, the AllScale API provides such a parallel loop construct, so we can write:
#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 200;
const int T = 100;
using Grid = data::StaticGrid<double,N,N>;
Grid bufferA;
Grid bufferB;
// compute simulation steps
for(int t=0; t<T; t++) {
// update step
Grid& A = bufferA;
Grid& B = bufferB;
pfor(1,N-1,[&](int i){
pfor(1,N-1,[&](int j){
B[{i,j}] = A[{i,j}] + k*(A[{i-1,j}] + A[{i+1,j}] + A[{i,j-1}] + A[{i,j+1}] + (-4)*A[{i,j}]);
});
});
std::swap(bufferA, bufferB);
}A full version of this example can be obtained from here.
Note the additional include of the pfor.h header, which provides the implementation of the pfor operator. The original bodies of both loops are passed as lambdas to pfor, which invokes the stencil computation for every index value i and j between 1 (inclusive) and N-1 (exclusive), respectively.
The code is now parallel, without requiring any additional thread management, work mapping or other explicit user interventions.
Note: As the example shows, pfor (like any other parallel construct of the AllScale API) may be nested, enabling its free utilization within libraries. Doing so does not incur any significant performance impact, as pfor automatically falls back to a sequential implementation when nested in parallel code that already exhibits enough parallelism to saturate a system's concurrent resources.
Nevertheless, it still exhibits a lot of unnecessary synchronization, which will be reduced step by step in the following sections.
By default, pfor loops incur an implicit global barrier at the end of the loop. For the nested loops above, this implies that every iteration of the inner loop must complete before the next iteration of the outer loop can commence. Since there are no dependencies between any of these individual iterations, they could all be executed in parallel. To exploit this, the pfor operator supports the definition of iteration spaces of higher dimension, in this case two. Using this capability results in the following implementation:
#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 200;
const int T = 100;
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;
Grid* A = &bufferA;
Grid* B = &bufferB;
// compute simulation steps
for(int t = 0; t<T; t++) {
// update step
pfor(Point{1,1}, Point{N-1,N-1}, [&](const Point& p) {
int i = p.x;
int j = p.y;
(*B)[{i,j}] = (*A)[{i,j}] + k*((*A)[{i-1,j}] + (*A)[{i+1,j}] + (*A)[{i,j-1}] + (*A)[{i,j+1}] + (-4)*(*A)[{i,j}]);
});
std::swap(A, B);
}A full version of this example can be obtained from here.
There is now a single pfor loop that iterates over the two-dimensional range from (1,1) to (N-1,N-1). Furthermore, this iteration space is not processed by scanning through it linearly, as the original loops did, but by recursively sub-dividing it along alternating dimensions -- leading to a tiling-like localization of data accesses.
At this stage, we have reduced global synchronization but not removed it completely. The pfor loop still incurs an implicit global barrier at its end, in every iteration of the outermost loop representing the time steps. This global barrier can also be removed, since for each iteration (i,j) at time t the stencil computation only accesses the two-dimensional neighborhood of (i,j) at time t-1. Hence, it is sufficient to synchronize only on these points of t-1 instead of all of them. Fortunately, the AllScale API offers means to express such synchronization of limited scope, in this case neighborhood_sync. Using it results in the following implementation:
#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/pfor.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 200;
const int T = 100;
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;
Grid* A = &bufferA;
Grid* B = &bufferB;
// compute simulation steps
for(int t = 0; t<T; t++) {
// update step
ref = pfor(Point{1,1}, Point{N-1,N-1}, [A,B,k](const Point& p) {
int i = p.x;
int j = p.y;
(*B)[{i,j}] = (*A)[{i,j}] + k*((*A)[{i-1,j}] + (*A)[{i+1,j}] + (*A)[{i,j-1}] + (*A)[{i,j+1}] + (-4)*(*A)[{i,j}]);
}, neighborhood_sync(ref));
std::swap(A, B);
}A full version of this example can be obtained from here.
Note that this implementation captures the return value of the pfor operator, which is a loop_reference, and passes it to neighborhood_sync in the subsequent time step. neighborhood_sync establishes dependencies such that iteration (i,j) at time t+1 is executed only after (i+1,j), (i-1,j), (i,j+1), (i,j-1), and (i,j) of time t have completed.
Note: neighborhood_sync offers neighborhood synchronization for iteration spaces of any number of dimensions. Additionally, one_on_one is available, establishing a simple one-to-one dependency between the iterations of consecutive loops.
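For illustration, here is a minimal sketch of how one_on_one might be used, assuming it is passed to pfor in the same way as neighborhood_sync above; plain std::vector buffers are used to keep the example short, and the final wait() call is likewise an assumption of this sketch rather than code taken from the tutorial sources.
```cpp
#include <vector>
#include "allscale/api/user/algorithm/pfor.h"

using namespace allscale::api::user::algorithm;

int main() {
    const int N = 1000;
    std::vector<double> A(N), B(N);

    // first loop: produce A, keep its loop reference
    auto produced = pfor(0, N, [&](int i) {
        A[i] = i * 0.5;
    });

    // second loop: iteration i only waits for iteration i of the first loop,
    // rather than for a global barrier between the two loops
    auto consumed = pfor(0, N, [&](int i) {
        B[i] = 2 * A[i];
    }, one_on_one(produced));

    // block until the dependent loop has completed
    // (the wait() call is an assumption of this sketch)
    consumed.wait();

    return B[N-1] == (N-1) ? 0 : 1;
}
```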
Since stencil computations are a frequently utilized template for large-scale high-performance applications, the AllScale API provides a generic stencil operator. Utilizing the pfor operator as illustrated above still requires the outermost loop to represent time. The stencil operator provides a fully combined space-time decomposition of the problem, while allowing the user to provide separate update functions for inner elements and boundary elements. Furthermore, contrary to all previous implementations, only one buffer needs to be allocated by the user. An implementation employing the stencil operator could look like this:
#include "allscale/api/user/data/static_grid.h"
#include "allscale/api/user/algorithm/stencil.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 200;
const int T = 100;
using Grid = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;
Grid temp;
// compute simulation steps
stencil(temp,T,
// inner elements
[k,T,N](time_t, const Point& p, const Grid& temp)->double {
return temp[p] + k*(temp[p+Point{-1,0}] + temp[p+Point{+1,0}] + temp[p+Point{0,-1}] + temp[p+Point{0,+1}] + (-4)*temp[p]);
},
// boundaries
[k,T,N](time_t, const Point&, const Grid&)->double {
// boundaries are constants
return 0;
});A full version of this example can be obtained from here.
Part of the AllScale project - http://www.allscale.eu