Standalone Pheno Examples
In the following, we provide some examples to illustrate the usage of kerncraft's stand-alone benchmarking tools.
We run the kernels on an IvyBridge EP E5-2690v2 machine and specify the corresponding machine file. If you work on a different microarchitecture, please use a suitable machine file to reproduce the examples.
All examples are based on a 2D 5pt-stencil Jacobi solver for Laplace's equation with homogeneous boundary conditions. After every run, the error of the solution is printed. Here, we only show the most important parts of the code (initialisation, swaps, etc. are omitted).
The complete code can be found here. Unless stated otherwise, the code snippets relate to the source file jacobi.c. Please note that the examples found there work on linearised arrays. For the sake of readability, we use two-dimensional arrays in the following.
With the chosen boundary conditions, the solution converges to the trivial solution (zero everywhere). We exploit this fact to avoid a complicated error measurement and instead use the value at the domain's mid point as the error.
For the 2D 5pt Jacobi solver, the update of the domain is implemented as follows:
const int N = atoi(argv[1]);
const int M = atoi(argv[2]);
for (int j = 1; j < M - 1; ++j) {
    for (int i = 1; i < N - 1; ++i) {
        y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                        + x[i][j-1] + x[i][j+1]);
    }
}
As we must not iterate over the boundary, the loops have an offset of 1 for the lower and upper bound.
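Since the actual source file works on linearised arrays, the same sweep can be sketched on a flat array. This is a minimal illustration, not the code from jacobi.c: the function name jacobi_sweep and the row-major layout (element (i, j) at index i * M + j) are assumptions.

```c
#include <stddef.h>

/* One Jacobi sweep over the interior of an N x M grid stored in a flat
 * array. Assumed layout: element (i, j) lives at index i * M + j.
 * Boundary entries of y are left untouched. */
void jacobi_sweep(const double *x, double *y, int N, int M)
{
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            size_t c = (size_t)i * M + j;   /* centre point (i, j) */
            y[c] = 0.25 * (x[c - M] + x[c + M]    /* (i-1, j), (i+1, j) */
                         + x[c - 1] + x[c + 1]);  /* (i, j-1), (i, j+1) */
        }
    }
}
```

The loop bounds are the same as in the two-dimensional version, so the boundary is still skipped.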
Now, we will set up the command to benchmark this kernel step-by-step.
As one can easily see in the kernel above, the number of floating-point operations in the innermost loop is 4 (3 ADDs and 1 MULT). Moreover, the array sizes N and M are passed as command-line arguments to the binary. Note that we also need to account for the offset of the loop ranges.
Hence, the most basic command to benchmark this kernel is
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999
The order of the variable defines also dictates the order in which the variables are passed to the binary. Likewise, the loop ranges must be ordered from the outermost loop to the innermost loop.
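To see how the numbers in the command relate to the kernel, the following sketch computes the trip count of a loop range and the resulting floating-point work per sweep. It assumes that a range start:end has an exclusive end, as the loop condition i < N - 1 suggests; the helper names are hypothetical and not part of kc-pheno.

```c
#include <stdint.h>

/* Iterations of `for (i = start; i < end; ++i)`, i.e. a range
 * "start:end" with an exclusive end (an assumption for illustration). */
int64_t trip_count(int64_t start, int64_t end)
{
    return end > start ? end - start : 0;
}

/* Floating-point operations of one 2D sweep over an n x m grid whose
 * boundary is skipped, given the flops per stencil update. */
int64_t flops_per_sweep(int64_t n, int64_t m, int64_t flops_per_update)
{
    return trip_count(1, n - 1) * trip_count(1, m - 1) * flops_per_update;
}
```

For N = M = 1000 and 4 flops per update, each loop runs 998 times, which gives 998 * 998 * 4 = 3984016 flops per sweep.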
With this command, the entire main function (including allocation and initialisation) would be benchmarked for a single repetition of the Jacobi solver. Instead, we mark the kernel code with likwid markers and analyse only the kernel itself, excluding the initialisation etc. The corresponding command is
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --marker
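A marked kernel might look like the sketch below. It wraps the likwid marker calls in macros that compile to nothing unless LIKWID_PERFMON is defined, so the same source builds with and without likwid. The macro names MARKER_*, the function marked_sweep and the flat array layout are assumptions for illustration.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#define MARKER_INIT()      likwid_markerInit()
#define MARKER_START(tag)  likwid_markerStartRegion(tag)
#define MARKER_STOP(tag)   likwid_markerStopRegion(tag)
#define MARKER_CLOSE()     likwid_markerClose()
#else
/* Without likwid, the markers become no-ops. */
#define MARKER_INIT()      ((void)0)
#define MARKER_START(tag)  ((void)0)
#define MARKER_STOP(tag)   ((void)0)
#define MARKER_CLOSE()     ((void)0)
#endif

/* One sweep enclosed in the region "jacobi". MARKER_INIT() and
 * MARKER_CLOSE() belong at the start and end of main, respectively.
 * Assumed layout: element (i, j) at index i * M + j. */
void marked_sweep(const double *x, double *y, int N, int M)
{
    MARKER_START("jacobi");
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            size_t c = (size_t)i * M + j;
            y[c] = 0.25 * (x[c - M] + x[c + M] + x[c - 1] + x[c + 1]);
        }
    }
    MARKER_STOP("jacobi");
}
```

Only the code between the start and stop calls is attributed to the region, so allocation and initialisation no longer enter the measurement.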
As an iterative solver with only one iteration is not very useful, we introduce a repetition loop.
First, we hardcode that the kernel shall be repeated 1000 times:
for(int r = 0; r < 1000; ++r) {...}
const int N = atoi(argv[1]);
const int M = atoi(argv[2]);
likwid_markerStartRegion("jacobi");
for(int r = 0; r < 1000; ++r) {
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}
likwid_markerStopRegion("jacobi");
Of course, we must also account for the repetition loop in our kc-pheno command:
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --repetitions 1000 --marker
Example Output
[...]
Results for region 'jacobi'
Runtime (per cacheline update): 33.59 cy/CL
MEM volume (per repetition): 298000 Byte
Performance: 2857.97 MFLOP/s
Performance: 714.49 MLUP/s
Performance: 714.49 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.01 LOAD/CL  1.02 CL/CL  2.38 CL/CL
L2    |  3.40 CL/CL    0.98 CL/CL  1.99 CL/CL
L3    |  2.97 CL/CL    0.01 CL/CL  0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 5.9 | 0.1 } cy/CL
[...]
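The reported performance figures can be cross-checked against the runtime per cache line. The sketch below assumes a clock frequency of 3.0 GHz (the nominal frequency of the E5-2690v2; the value used by the machine file may differ) and 8 double-precision updates per 64-byte cache line. The function names are hypothetical.

```c
/* Convert a runtime in cycles per cache line (cy/CL) into million
 * lattice updates per second (MLUP/s). */
double mlups_from_cy_per_cl(double cy_per_cl, double clock_hz,
                            double updates_per_cl)
{
    return clock_hz / cy_per_cl * updates_per_cl / 1.0e6;
}

/* Convert MLUP/s into MFLOP/s for a given number of flops per update. */
double mflops_from_mlups(double mlups, double flops_per_update)
{
    return mlups * flops_per_update;
}
```

With 33.59 cy/CL, this yields roughly 714 MLUP/s, and multiplying by the 4 flops per update gives roughly 2858 MFLOP/s, in line with the output above.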
But what if we wanted to pass the number of repetitions as command line argument to be more flexible with the achievable accuracy?
First, we adjust the kernel code as follows
for(int r = 0; r < R; ++r) {...}
const int R = atoi(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);
likwid_markerStartRegion("jacobi");
for(int r = 0; r < R; ++r) {
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}
likwid_markerStopRegion("jacobi");
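The kernel above reads R, N and M from argv with atoi, which silently accepts malformed input. Because the benchmarking tool passes the values positionally, a more defensive parse can help catch a wrong argument order early. The following is a sketch under that assumption; the function names are hypothetical and not part of jacobi.c.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a strictly positive int. Returns 0 on success, -1 otherwise. */
int parse_positive_int(const char *s, int *out)
{
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0 || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Expect exactly: <binary> R N M (in this order). */
int parse_jacobi_args(int argc, char **argv, int *R, int *N, int *M)
{
    if (argc != 4)
        return -1;
    if (parse_positive_int(argv[1], R) != 0) return -1;
    if (parse_positive_int(argv[2], N) != 0) return -1;
    if (parse_positive_int(argv[3], M) != 0) return -1;
    return 0;
}
```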
Then, we need to add another define in our kc-pheno command and use it as the number of repetitions:
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
-D reps 100 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --repetitions reps --marker
Example Output
[...]
Results for region 'jacobi'
Runtime (per cacheline update): 33.66 cy/CL
MEM volume (per repetition): 344000 Byte
Performance: 2851.84 MFLOP/s
Performance: 712.96 MLUP/s
Performance: 712.96 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.01 LOAD/CL  1.02 CL/CL  2.37 CL/CL
L2    |  3.40 CL/CL    0.98 CL/CL  1.99 CL/CL
L3    |  2.97 CL/CL    0.03 CL/CL  0.03 CL/CL

Phenomenological ECM model: { 25.5 || 20.0 | 6.8 | 5.9 | 0.3 } cy/CL
[...]
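The ECM line can be read as { T_OL || T_nOL | T_L1L2 | T_L2L3 | T_L3Mem }. Under the usual ECM assumption for Intel architectures of this generation, the overlapping part T_OL runs concurrently with the sum of all other contributions, so the predicted runtime is the larger of the two. The sketch below implements this combination rule; the source does not state whether kc-pheno reports exactly this value, and the function name is hypothetical.

```c
/* ECM runtime prediction in cy/CL: the overlapping time t_ol runs in
 * parallel with the non-overlapping time t_nol plus all data transfer
 * times (assumed to be serialised). */
double ecm_prediction(double t_ol, double t_nol,
                      const double *transfers, int n)
{
    double serial = t_nol;
    for (int k = 0; k < n; ++k)
        serial += transfers[k];
    return serial > t_ol ? serial : t_ol;
}
```

For { 25.5 || 20.0 | 6.8 | 5.9 | 0.3 }, the serial part sums to about 33.0 cy/CL, which exceeds T_OL and is close to the measured 33.66 cy/CL.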
Things change if we want to solve Laplace's equation up to a certain accuracy of the result, which is a common use case. In this case, we need to replace the for loop with a while loop that iterates as long as the error of the solution is greater than a certain tolerance. Hence, the number of iterations is not known before the simulation and cannot be passed to the kc-pheno command.
But help is on the way: likwid markers. When placing the likwid markers inside the while loop, we can obtain the number of repetitions from the likwid output and use it for analysing the performance.
The corresponding kernel code is found in jacobi_while.c and reads as follows:
while(error > tol) {...}
const double tol = atof(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);
while(x[N/2][M/2] > tol) {
    likwid_markerStartRegion("jacobi");
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                            + x[i][j-1] + x[i][j+1]);
        }
    }
    likwid_markerStopRegion("jacobi");
}
Important: placing the markers inside the while loop introduces overhead that might not be negligible.
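For reference, a complete tolerance-driven solver including the swap of the two arrays might look as follows. This is a sketch, not the code from jacobi_while.c: the flat array layout (element (i, j) at index i * M + j), the function name and the iteration cap max_iter are assumptions. The comments mark where the marker calls would go, which shows that they are executed once per sweep.

```c
#include <stdlib.h>

/* Sweep until the mid point value drops to tol or below, or until
 * max_iter sweeps have been done. The result ends up in x. Returns the
 * number of sweeps, or -1 if the scratch array cannot be allocated. */
long jacobi_until_converged(double *x, int N, int M, double tol,
                            long max_iter)
{
    size_t total = (size_t)N * M;
    double *y = malloc(total * sizeof *y);
    if (y == NULL)
        return -1;
    for (size_t k = 0; k < total; ++k)
        y[k] = x[k];                    /* copies the boundary values */

    double *src = x, *dst = y;
    size_t mid = (size_t)(N / 2) * M + M / 2;
    long sweeps = 0;
    while (src[mid] > tol && sweeps < max_iter) {
        /* likwid_markerStartRegion("jacobi") would go here */
        for (int j = 1; j < M - 1; ++j) {
            for (int i = 1; i < N - 1; ++i) {
                size_t c = (size_t)i * M + j;
                dst[c] = 0.25 * (src[c - M] + src[c + M]
                               + src[c - 1] + src[c + 1]);
            }
        }
        /* likwid_markerStopRegion("jacobi") would go here */
        double *tmp = src; src = dst; dst = tmp;   /* swap arrays */
        ++sweeps;
    }
    if (src != x)                       /* latest result is in y */
        for (size_t k = 0; k < total; ++k)
            x[k] = src[k];
    free(y);
    return sweeps;
}
```

The returned sweep count corresponds to the number of repetitions that the benchmarking tool otherwise takes from the likwid output.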
The benchmarking command does not look much different from the command for known numbers of repetitions:
kc-pheno ./jacobi_while --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
-D tol 1e-4 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --repetitions marker --marker
Example Output
[...]
Results for region 'jacobi'
Runtime (per cacheline update): 33.64 cy/CL
MEM volume (per repetition): 938073 Byte
Performance: 2854.16 MFLOP/s
Performance: 713.54 MLUP/s
Performance: 713.54 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.03 LOAD/CL  1.04 CL/CL  2.39 CL/CL
L2    |  3.43 CL/CL    0.99 CL/CL  2.01 CL/CL
L3    |  2.99 CL/CL    0.02 CL/CL  0.03 CL/CL

Phenomenological ECM model: { 25.7 || 20.0 | 6.9 | 6.0 | 0.2 } cy/CL
[...]
Let us revisit the example from Variable Number of Repetitions. Recall the kc-pheno command
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
-D reps 100 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --repetitions reps --marker
Those of you who have run the example might have noticed a warning in the output (we omitted it above for the sake of readability):
WARNING: Could not extrapolate to a 1.5s run (for at least one region). Measurements might not be accurate.
In the following, we will try to extend the runtime by adjusting different variables linearly or logarithmically. The good news is that the benchmarking tool handles that task for us.
For the Jacobi solver with the for loop, there are two possibilities to extend the runtime: either we increase the number of repetitions or the array size.
In the first case, we allow the benchmarking tool to increase the number of iterations from 100 to 2000. The corresponding command to do so is
kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
-D reps 100:2000 -D N 1000 -D M 1000 \
--loop 1:999 --loop 1:999 --repetitions reps --marker
Example Output
[...]
Results for region 'jacobi'
Runtime (per cacheline update): 32.78 cy/CL
MEM volume (per repetition): 574467 Byte
Performance: 2928.42 MFLOP/s
Performance: 732.10 MLUP/s
Performance: 732.10 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.01 LOAD/CL  1.03 CL/CL  2.37 CL/CL
L2    |  3.40 CL/CL    0.99 CL/CL  2.00 CL/CL
L3    |  3.00 CL/CL    0.00 CL/CL  0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 6.0 | 0.0 } cy/CL
[...]
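The idea behind a linear adjustment can be illustrated as follows: since the runtime grows roughly linearly with the number of repetitions, one measurement suffices to estimate the repetition count needed for a target runtime, clamped to the allowed range. This is a hypothetical sketch of the principle only; it is not kc-pheno's actual implementation, and the function name is an assumption.

```c
/* Estimate the repetitions needed to reach target_s seconds, given that
 * `reps` repetitions took measured_s seconds. The result is rounded up
 * and clamped to [lo, hi]. */
long scale_repetitions(long reps, double measured_s, double target_s,
                       long lo, long hi)
{
    if (measured_s <= 0.0)
        return hi;                      /* no usable measurement */
    double wanted = (double)reps * target_s / measured_s;
    if (wanted >= (double)hi)
        return hi;
    if (wanted <= (double)lo)
        return lo;
    long r = (long)wanted;
    if ((double)r < wanted)
        ++r;                            /* round up */
    return r;
}
```

With a range of 100:2000, an estimate above 2000 would be capped at the upper bound, in which case the target runtime might still not be reached.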
Likewise, we can specify logarithmic extrapolation for, e.g., the tolerance in the Jacobi kernel with the while loop.
Imagine we started with a tolerance of 0.01 and array sizes of 600. This combination of parameters yields a runtime of less than 1.5s in total. As we do not want to increase the array size, we choose to increase the accuracy instead:
kc-pheno ./jacobi_while --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
-D tol 1e-2:1e-9 -D N 600 -D M 600 \
--loop 1:599 --loop 1:599 --repetitions marker --marker
Example Output
[...]
Runtime (per cacheline update): 33.27 cy/CL
MEM volume (per repetition): 22136 Byte
Performance: 2885.52 MFLOP/s
Performance: 721.38 MLUP/s
Performance: 721.38 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.07 LOAD/CL  1.00 CL/CL  2.03 CL/CL
L2    |  3.04 CL/CL    0.98 CL/CL  2.03 CL/CL
L3    |  3.01 CL/CL    0.00 CL/CL  0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.1 | 6.0 | 0.0 } cy/CL
[...]
Again, this output does not differ much from the previous output. But with an increased verbosity level, you can see that the benchmarking tool changed the tolerance from 1e-2 to 1e-6 to obtain a runtime of approximately 1.7s.
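A logarithmic adjustment can be pictured as stepping through the decades between the two bounds of a range such as 1e-2:1e-9. The sketch below merely enumerates these candidate values; how kc-pheno actually selects intermediate values is not described here, so the function and its name are assumptions for illustration.

```c
/* Fill out[] with 10^exp_start, 10^(exp_start-1), ..., 10^exp_end
 * (at most cap values) and return how many were written.
 * Only non-positive exponents with exp_start >= exp_end are handled. */
int decade_candidates(int exp_start, int exp_end, double *out, int cap)
{
    int n = 0;
    double value = 1.0;
    for (int e = 0; e > exp_start; --e)
        value /= 10.0;                  /* value = 10^exp_start */
    for (int e = exp_start; e >= exp_end && n < cap; --e) {
        out[n++] = value;
        value /= 10.0;                  /* next decade */
    }
    return n;
}
```

For the range 1e-2:1e-9 this gives eight candidates, of which 1e-6 is the fifth.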
Finally, we have a look at an example with multiple likwid regions. The source file can be found here.
The kernel codes are (initialisation etc. omitted):
const int R = atoi(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);
likwid_markerStartRegion("jacobi_2d");
for(int r = 0; r < R; ++r) {
    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}
likwid_markerStopRegion("jacobi_2d");
likwid_markerStartRegion("jacobi_3d");
for(int r = 0; r < R; ++r) {
    for (int k = 1; k < N-1; ++k) {
        for (int j = 1; j < N-1; ++j) {
            for (int i = 1; i < N-1; ++i) {
                b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                              a[i][j-1][k] + a[i][j+1][k] +
                              a[i][j][k-1] + a[i][j][k+1]) / 6.0;
            }
        }
    }
}
likwid_markerStopRegion("jacobi_3d");
As the defined variables hold for all regions, we do not have to change them. What is new is that we need to provide a separate flop count for each region and adjust the loop ranges according to the kernels.
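The 3D kernel performs five additions and one division per update, hence 6 flops, and it has three nested loops instead of two. A linearised sketch of one 3D sweep is shown below; the function name and the layout (element (i, j, k) at index (i * N + j) * N + k) are assumptions and not taken from the source file.

```c
#include <stddef.h>

/* One 7-point Jacobi sweep over the interior of an N x N x N grid in a
 * flat array. Assumed layout: (i, j, k) at index (i * N + j) * N + k. */
void jacobi3d_sweep(const double *a, double *b, int N)
{
    const size_t sj = (size_t)N;        /* stride of j */
    const size_t si = (size_t)N * N;    /* stride of i */
    for (int k = 1; k < N - 1; ++k) {
        for (int j = 1; j < N - 1; ++j) {
            for (int i = 1; i < N - 1; ++i) {
                size_t c = (size_t)i * si + (size_t)j * sj + (size_t)k;
                /* 5 additions + 1 division = 6 flops per update */
                b[c] = (a[c - si] + a[c + si] +
                        a[c - sj] + a[c + sj] +
                        a[c - 1]  + a[c + 1]) / 6.0;
            }
        }
    }
}
```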
The corresponding benchmark command is
kc-pheno ./jacobi_multiple --machine IvyBridgeEP_E5-2690v2.yml --flops jacobi_2d:4 --flops jacobi_3d:6 \
-D reps 100 -D N 1000 -D M 1000 --loop jacobi_3d:1:999 --loop jacobi_2d,jacobi_3d:1:999 \
--loop jacobi_2d,jacobi_3d:1:999 --marker jacobi_2d,jacobi_3d --repetitions reps
The output is as follows:
Example Output
[...]
Results for region 'jacobi_2d'
Runtime (per cacheline update): 34.07 cy/CL
MEM volume (per repetition): 738000 Byte
Performance: 2817.55 MFLOP/s
Performance: 704.39 MLUP/s
Performance: 704.39 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 40.01 LOAD/CL  1.02 CL/CL  2.37 CL/CL
L2    |  3.39 CL/CL    0.98 CL/CL  1.99 CL/CL
L3    |  2.98 CL/CL    0.05 CL/CL  0.05 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 6.0 | 0.4 } cy/CL
[...]
Results for region 'jacobi_3d'
Runtime (per cacheline update): 118.85 cy/CL
MEM volume (per repetition): 37466701000 Byte
Performance: 1211.59 MFLOP/s
Performance: 201.93 MLUP/s
Performance: 201.93 It/s

Data Transfers:
cache | accesses       evicts      misses
L1    | 56.06 LOAD/CL  1.14 CL/CL  6.18 CL/CL
L2    |  7.32 CL/CL    0.99 CL/CL  4.01 CL/CL
L3    |  5.00 CL/CL    1.01 CL/CL  3.72 CL/CL

Phenomenological ECM model: { 46.5 || 28.0 | 14.6 | 10.0 | 18.8 } cy/CL
[...]
Of course, variable adjustment can also be performed with more than one likwid region. Assume we have a different number of repetitions for each kernel and want to adjust them separately. We could realise this with
-D rep1 jacobi_2d:100:500 -D rep2 jacobi_3d:10:100 -R jacobi_2d:rep1 -R jacobi_3d:rep2