
Standalone Pheno Examples

Julian Hammer edited this page Jan 9, 2020 · 2 revisions

In the following, we provide some examples to elucidate the usage of kerncraft's stand-alone benchmarking tools.

We run the kernels on an IvyBridge EP E5-2690v2 machine and specify the corresponding machine file. If you work on a different microarchitecture, please use a suitable machine file to reproduce the examples.

General Remarks on Example Kernel

All examples are based on a 2D 5pt-stencil Jacobi solver for Laplace's equation with homogeneous boundary conditions. After every run, the error of the solution is printed. Only the most important parts of the code are shown here; initialisation, swaps, etc. are omitted.
The complete code can be found here. Unless stated otherwise, the code snippets relate to the source file jacobi.c. Please note that the examples found there work on linearised arrays. For the sake of readability, we use two-dimensional arrays in the following.

With the chosen boundary conditions, the solution converges to the trivial solution (zero everywhere). We exploit this fact to avoid a complicated error measurement and use the value at the domain's mid-point as the error instead.

Examples

For the 2D 5pt Jacobi solver, the update of the domain is implemented as follows:

const int N = atoi(argv[1]);
const int M = atoi(argv[2]);

for (int j = 1; j < M - 1; ++j) {
    for (int i = 1; i < N - 1; ++i) {
        y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] 
                        + x[i][j-1] + x[i][j+1]);
    }
}

As we must not iterate over the boundary, the loops have an offset of 1 for the lower and upper bound.
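
Since the repository files work on linearised arrays while the snippets here use two-dimensional notation, the following minimal sketch shows how the update above could look on a linearised array. It assumes C row-major storage of an N x M array (M entries per row); the helper names idx2d and jacobi_sweep are hypothetical and not taken from jacobi.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps the 2D index (i, j) of an N x M array
 * (row-major, M entries per row) to its linearised offset. */
static size_t idx2d(size_t i, size_t j, size_t M)
{
    assert(j < M);
    return i * M + j;
}

/* One Jacobi sweep on linearised arrays; equivalent to
 * y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1]).
 * The boundary (index 0 and N-1 or M-1) is never written. */
static void jacobi_sweep(const double *x, double *y, size_t N, size_t M)
{
    for (size_t j = 1; j + 1 < M; ++j) {
        for (size_t i = 1; i + 1 < N; ++i) {
            y[idx2d(i, j, M)] = 0.25 * (x[idx2d(i - 1, j, M)] + x[idx2d(i + 1, j, M)]
                                      + x[idx2d(i, j - 1, M)] + x[idx2d(i, j + 1, M)]);
        }
    }
}
```

The loop order mirrors the snippet above (j outer, i inner); whether the repository code uses the same memory layout is not stated here.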

Basic Usage

Now, we will set up the command to benchmark this kernel step-by-step.
As one can easily see in the kernel above, the number of floating-point operations per iteration of the innermost loop is 4 (3 ADDs and 1 MULT). Moreover, the array sizes N and M shall be passed as command line arguments to the binary. Note that we also need to account for the offset of the loop ranges.
Hence, the most basic command to benchmark this kernel is

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999

The order of the variable defines also dictates the order in which the variables are passed to the binary. In addition, the loop ranges must be ordered from the outermost loop to the innermost loop.
With this command, the entire main function (including allocation and initialisation) would be benchmarked for a single repetition of the Jacobi solver. Instead, we mark the kernel code with likwid markers and analyse only the kernel itself, not the initialisation etc. The corresponding command is

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --marker
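
For reference, a minimal sketch of how the marker calls could be placed around the kernel. The likwid_marker* functions belong to LIKWID's marker API and are conventionally enabled with the LIKWID_PERFMON define; the no-op fallback macros, the helper name marked_sweep and the row-major linearised layout are assumptions made for this illustration.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* Assumption: without -DLIKWID_PERFMON the marker calls become no-ops,
 * so the same source also builds on machines without LIKWID. */
#define likwid_markerInit()
#define likwid_markerStartRegion(tag) ((void)(tag))
#define likwid_markerStopRegion(tag) ((void)(tag))
#define likwid_markerClose()
#endif

/* Hypothetical helper: one marked sweep over linearised N x M arrays
 * (row-major). Returns the mid-point value of y. */
static double marked_sweep(const double *x, double *y, size_t N, size_t M)
{
    likwid_markerInit();                 /* once per program, before any region */
    likwid_markerStartRegion("jacobi");  /* only code between start and stop is measured */
    for (size_t j = 1; j + 1 < M; ++j) {
        for (size_t i = 1; i + 1 < N; ++i) {
            y[i * M + j] = 0.25 * (x[(i - 1) * M + j] + x[(i + 1) * M + j]
                                 + x[i * M + j - 1] + x[i * M + j + 1]);
        }
    }
    likwid_markerStopRegion("jacobi");
    likwid_markerClose();                /* once per program, after the last region */
    return y[(N / 2) * M + M / 2];
}
```

Allocation and initialisation stay outside the region, so they no longer contribute to the measurement.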

Since an iterative solver with only one iteration is not very useful, we now introduce a repetition loop.

Fixed Number of Repetitions

First, we hardcode that the kernel shall be repeated 1000 times:

for(int r = 0; r < 1000; ++r) {...}

const int N = atoi(argv[1]);
const int M = atoi(argv[2]);

likwid_markerStartRegion("jacobi");

for(int r = 0; r < 1000; ++r) {

    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] 
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}

likwid_markerStopRegion("jacobi");

Of course, we also must account for the repetition loop in our kc-pheno command:

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --repetitions 1000 --marker
Example Output

[...]
Results for region 'jacobi'

Runtime (per cacheline update): 33.59 cy/CL
MEM volume (per repetition): 298000 Byte
Performance: 2857.97 MFLOP/s
Performance: 714.49 MLUP/s
Performance: 714.49 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.01 LOAD/CL   1.02 CL/CL   2.38 CL/CL
L2    |  3.40 CL/CL     0.98 CL/CL   1.99 CL/CL
L3    |  2.97 CL/CL     0.01 CL/CL   0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 5.9 | 0.1 } cy/CL
[...]
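
The reported performance numbers are related by simple arithmetic, which the following sketch spells out. It assumes that one cache line holds 8 updates (64-byte lines, 8-byte doubles) and that the measurement ran at a 3.0 GHz clock; both values and all function names are assumptions for illustration, not part of kc-pheno.

```c
#include <assert.h>

/* Lattice updates of one sweep over the interior of an n x m domain,
 * i.e. the product of the two loop ranges 1:(n-1) and 1:(m-1). */
static long lattice_updates(long n, long m)
{
    assert(n >= 2 && m >= 2);
    return (n - 2) * (m - 2);
}

/* MFLOP/s follow from MLUP/s and the value passed via --flops. */
static double mflops_from_mlups(double mlups, double flops_per_update)
{
    return mlups * flops_per_update;
}

/* MLUP/s from cy/CL, assuming 8 updates per cache line and a clock
 * frequency given in MHz. */
static double mlups_from_cy_per_cl(double cy_per_cl, double clock_mhz)
{
    assert(cy_per_cl > 0.0);
    return clock_mhz * 8.0 / cy_per_cl;
}
```

With the numbers above, 714.49 MLUP/s times 4 flops gives about 2858 MFLOP/s, and 33.59 cy/CL at the assumed 3.0 GHz gives about 714.5 MLUP/s, which is consistent with the output up to rounding.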

Variable Number of Repetitions

But what if we wanted to pass the number of repetitions as a command line argument to be more flexible with the achievable accuracy?
First, we adjust the kernel code as follows:

for(int r = 0; r < R; ++r) {...}

const int R = atoi(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);

likwid_markerStartRegion("jacobi");

for(int r = 0; r < R; ++r) {

    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] 
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}

likwid_markerStopRegion("jacobi");

Then, we need to add another define in our kc-pheno command and pass it as the number of repetitions:

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
     -D reps 100 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --repetitions reps --marker
Example Output

[...]
Results for region 'jacobi'

Runtime (per cacheline update): 33.66 cy/CL
MEM volume (per repetition): 344000 Byte
Performance: 2851.84 MFLOP/s
Performance: 712.96 MLUP/s
Performance: 712.96 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.01 LOAD/CL   1.02 CL/CL   2.37 CL/CL
L2    |  3.40 CL/CL     0.98 CL/CL   1.99 CL/CL
L3    |  2.97 CL/CL     0.03 CL/CL   0.03 CL/CL

Phenomenological ECM model: { 25.5 || 20.0 | 6.8 | 5.9 | 0.3 } cy/CL
[...]

Unknown Number of Repetitions

Things change if we want to solve Laplace's equation up to a given accuracy, which is a common use case. In this case, we need to replace the for loop with a while loop that iterates as long as the error of the solution is greater than a certain tolerance. Hence, the number of iterations is not known before the simulation and cannot be passed to the kc-pheno command.

But help is on the way: likwid markers. When placing the likwid markers inside the while loop, we can obtain the number of repetitions from the likwid output and use it for analysing the performance.
The corresponding kernel code is found in jacobi_while.c and reads as follows:

while(error > tol) {...}

const double tol = atof(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);

while(x[N/2][M/2] > tol) {

    likwid_markerStartRegion("jacobi");

    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] 
                            + x[i][j-1] + x[i][j+1]);
        }
    }

    likwid_markerStopRegion("jacobi");
}

Important: placing the markers inside the while loop introduces overhead that might not be negligible.
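
To make the omitted parts concrete, here is a minimal sketch of a complete while-loop driver including initialisation, the array swap and the mid-point check. It is not the content of jacobi_while.c: the initial interior value of 1.0, the row-major linearised layout and the function name jacobi_until are assumptions, and the marker calls are only indicated by comments. The returned sweep count corresponds to the number of repetitions that would otherwise be taken from the likwid output.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical driver: sweeps until the mid-point value drops to tol or
 * below. Returns the number of sweeps, or -1 if allocation fails. */
static long jacobi_until(size_t N, size_t M, double tol)
{
    assert(N >= 3 && M >= 3);
    double *x = calloc(N * M, sizeof *x);   /* zero boundary in both arrays */
    double *y = calloc(N * M, sizeof *y);
    if (!x || !y) { free(x); free(y); return -1; }

    /* Assumed initial guess: 1.0 in the interior, 0.0 on the boundary. */
    for (size_t i = 1; i + 1 < N; ++i)
        for (size_t j = 1; j + 1 < M; ++j)
            x[i * M + j] = 1.0;

    long sweeps = 0;
    while (x[(N / 2) * M + M / 2] > tol) {
        /* likwid_markerStartRegion("jacobi") would go here */
        for (size_t j = 1; j + 1 < M; ++j)
            for (size_t i = 1; i + 1 < N; ++i)
                y[i * M + j] = 0.25 * (x[(i - 1) * M + j] + x[(i + 1) * M + j]
                                     + x[i * M + j - 1] + x[i * M + j + 1]);
        /* likwid_markerStopRegion("jacobi") would go here */

        double *tmp = x; x = y; y = tmp;    /* swap: the new iterate becomes x */
        ++sweeps;
    }

    free(x);
    free(y);
    return sweeps;
}
```

Because the region is entered once per sweep, the marker overhead is paid in every iteration, which is why it matters more here than with a single region around the whole repetition loop.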

The benchmarking command does not look much different from the command for known numbers of repetitions:

kc-pheno ./jacobi_while --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
     -D tol 1e-4 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --repetitions marker --marker
Example Output

[...]
Results for region 'jacobi'

Runtime (per cacheline update): 33.64 cy/CL
MEM volume (per repetition): 938073 Byte
Performance: 2854.16 MFLOP/s
Performance: 713.54 MLUP/s
Performance: 713.54 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.03 LOAD/CL   1.04 CL/CL   2.39 CL/CL
L2    |  3.43 CL/CL     0.99 CL/CL   2.01 CL/CL
L3    |  2.99 CL/CL     0.02 CL/CL   0.03 CL/CL

Phenomenological ECM model: { 25.7 || 20.0 | 6.9 | 6.0 | 0.2 } cy/CL
[...]

Kernel with Adjustable Variables

Let us revisit the example from Variable Number of Repetitions. Recall the kc-pheno command

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
     -D reps 100 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --repetitions reps --marker

Those of you who have run the example might have noticed a warning in the output (we omitted it above for the sake of readability):

WARNING: Could not extrapolate to a 1.5s run (for at least one region). Measurements might not be accurate.

In the following, we will try to extend the runtime by adjusting different variables linearly or logarithmically. The good news is that the benchmarking tool handles that task for us.

Linear Extrapolation

For the Jacobi solver with for loop, there are two possibilities to extend the runtime: either we increase the number of repetitions or the array size.

In the first case, we allow the benchmarking tool to increase the number of iterations from 100 to 2000. The corresponding command to do so is

kc-pheno ./jacobi --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
     -D reps 100:2000 -D N 1000 -D M 1000 \
     --loop 1:999 --loop 1:999 --repetitions reps --marker
Example Output

[...]
Results for region 'jacobi'

Runtime (per cacheline update): 32.78 cy/CL
MEM volume (per repetition): 574467 Byte
Performance: 2928.42 MFLOP/s
Performance: 732.10 MLUP/s
Performance: 732.10 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.01 LOAD/CL   1.03 CL/CL   2.37 CL/CL
L2    |  3.40 CL/CL     0.99 CL/CL   2.00 CL/CL
L3    |  3.00 CL/CL     0.00 CL/CL   0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 6.0 | 0.0 } cy/CL
[...]

Although the output does not look much different from the previous ones, one can see (when increasing the verbosity level) that the tool increased the number of repetitions to 1500 to obtain a runtime of approximately 2.1s.
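
The idea behind linear adjustment can be sketched as follows: if the runtime grows proportionally with the number of repetitions, a short trial run is enough to estimate how many repetitions are needed for a target runtime. This is a generic illustration under that proportionality assumption, not kc-pheno's actual algorithm; the function name and its parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical estimate: scales reps so that the runtime reaches
 * target_s, assuming runtime is proportional to the repetition count.
 * The result is rounded up and kept within [reps, max_reps]. */
static long scale_repetitions(long reps, double measured_s,
                              double target_s, long max_reps)
{
    assert(reps > 0 && measured_s > 0.0 && max_reps >= reps);

    double exact = (double)reps * target_s / measured_s;
    if (exact >= (double)max_reps)
        return max_reps;            /* upper bound of the range, e.g. 2000 */

    long needed = (long)exact;
    if ((double)needed < exact)
        ++needed;                   /* round up to a whole repetition */
    if (needed < reps)
        needed = reps;              /* never go below the lower bound */
    return needed;
}
```

For instance, a trial of 100 repetitions that took 0.14 s would call for 1072 repetitions to reach 1.5 s under this model; these trial numbers are made up for the example.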

Logarithmic Extrapolation

Likewise, we can specify logarithmic extrapolation, e.g. for the tolerance in the Jacobi kernel with the while loop.
Suppose we start with a tolerance of 0.01 and array sizes of 600. This combination of parameters yields a total runtime of less than 1.5s. If we do not want to increase the array size, we can tighten the tolerance instead:

kc-pheno ./jacobi_while --machine IvyBridgeEP_E5-2690v2.yml --flops 4 \
     -D tol 1e-2:1e-9 -D N 600 -D M 600 \
     --loop 1:599 --loop 1:599 --repetitions marker --marker
Example Output

[...]
Runtime (per cacheline update): 33.27 cy/CL
MEM volume (per repetition): 22136 Byte
Performance: 2885.52 MFLOP/s
Performance: 721.38 MLUP/s
Performance: 721.38 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.07 LOAD/CL   1.00 CL/CL   2.03 CL/CL
L2    |  3.04 CL/CL     0.98 CL/CL   2.03 CL/CL
L3    |  3.01 CL/CL     0.00 CL/CL   0.01 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.1 | 6.0 | 0.0 } cy/CL
[...]

Again, this output does not differ much from previous output. But with increased verbosity level, you can see that the benchmarking tool changed the tolerance from 1e-2 to 1e-6 to obtain a runtime of approximately 1.7s.
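
A logarithmic adjustment changes the variable by a constant factor per step instead of a constant increment. The sketch below illustrates this for the tolerance with steps of one decade; it is a generic illustration and not kc-pheno's actual search strategy, and the function name and the array of measured runtimes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical search: runtimes_s[k] is the runtime measured for the
 * tolerance start_tol / 10^k. Returns the first tolerance whose runtime
 * reaches target_s, or the tightest candidate if none does. */
static double pick_tolerance(double start_tol, const double *runtimes_s,
                             size_t n, double target_s)
{
    assert(n > 0 && start_tol > 0.0);

    double tol = start_tol;
    for (size_t k = 0; k < n; ++k) {
        if (runtimes_s[k] >= target_s)
            return tol;             /* runtime is long enough */
        if (k + 1 < n)
            tol /= 10.0;            /* tighten by one decade */
    }
    return tol;                     /* bound of the range reached */
}
```

Stepping in decades suits a tolerance that spans many orders of magnitude, such as the range 1e-2:1e-9 used above.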

Kernel with Multiple likwid Regions

Finally, we have a look at an example with multiple likwid regions. The source file can be found here.
The kernels are (initialisation etc. omitted):

const int R = atoi(argv[1]);
const int N = atoi(argv[2]);
const int M = atoi(argv[3]);

likwid_markerStartRegion("jacobi_2d");
for(int r = 0; r < R; ++r) {

    for (int j = 1; j < M - 1; ++j) {
        for (int i = 1; i < N - 1; ++i) {
            y[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] 
                            + x[i][j-1] + x[i][j+1]);
        }
    }
}
likwid_markerStopRegion("jacobi_2d");

likwid_markerStartRegion("jacobi_3d");
for(int r = 0; r < R; ++r) {

    for (int k = 1; k < N-1; ++k) {
        for (int j = 1; j < N-1; ++j) {
            for (int i = 1; i < N-1; ++i) {
                b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                              a[i][j-1][k] + a[i][j+1][k] +
                              a[i][j][k-1] + a[i][j][k+1]) / 6.0;
            }
        }
    }

}
likwid_markerStopRegion("jacobi_3d");

As the defined variables hold for all regions, we do not have to change them. What is new is that we need to provide a separate flop count for each region and adjust the loop ranges according to the kernels.
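
The flop count of the 3D kernel follows from its update: 5 additions and 1 division, i.e. 6 flops per lattice update, and it has three nested loops instead of two. A minimal sketch on a linearised N x N x N array, assuming row-major storage; the helper names idx3d and jacobi3d_sweep are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linearised offset of (i, j, k) in an
 * N x N x N array stored in row-major order. */
static size_t idx3d(size_t i, size_t j, size_t k, size_t N)
{
    assert(i < N && j < N && k < N);
    return (i * N + j) * N + k;
}

/* One 3D sweep: 5 additions and 1 division per update (6 flops). */
static void jacobi3d_sweep(const double *a, double *b, size_t N)
{
    for (size_t k = 1; k + 1 < N; ++k) {
        for (size_t j = 1; j + 1 < N; ++j) {
            for (size_t i = 1; i + 1 < N; ++i) {
                b[idx3d(i, j, k, N)] =
                    (a[idx3d(i - 1, j, k, N)] + a[idx3d(i + 1, j, k, N)] +
                     a[idx3d(i, j - 1, k, N)] + a[idx3d(i, j + 1, k, N)] +
                     a[idx3d(i, j, k - 1, N)] + a[idx3d(i, j, k + 1, N)]) / 6.0;
            }
        }
    }
}
```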

The corresponding benchmark command is

kc-pheno ./jacobi_multiple --machine IvyBridgeEP_E5-2690v2.yml --flops jacobi_2d:4 --flops jacobi_3d:6 \
    -D reps 100 -D N 1000 -D M 1000 --loop jacobi_3d:1:999 --loop jacobi_2d,jacobi_3d:1:999 \
    --loop jacobi_2d,jacobi_3d:1:999 --marker jacobi_2d,jacobi_3d --repetitions reps

The output is as follows:

Example Output

[...]
Results for region 'jacobi_2d'

Runtime (per cacheline update): 34.07 cy/CL
MEM volume (per repetition): 738000 Byte
Performance: 2817.55 MFLOP/s
Performance: 704.39 MLUP/s
Performance: 704.39 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 40.01 LOAD/CL   1.02 CL/CL   2.37 CL/CL
L2    |  3.39 CL/CL     0.98 CL/CL   1.99 CL/CL
L3    |  2.98 CL/CL     0.05 CL/CL   0.05 CL/CL

Phenomenological ECM model: { 25.6 || 20.0 | 6.8 | 6.0 | 0.4 } cy/CL

[...]

Results for region 'jacobi_3d'

Runtime (per cacheline update): 118.85 cy/CL
MEM volume (per repetition): 37466701000 Byte
Performance: 1211.59 MFLOP/s
Performance: 201.93 MLUP/s
Performance: 201.93 It/s

Data Transfers:
cache | accesses        evicts       misses
L1    | 56.06 LOAD/CL   1.14 CL/CL   6.18 CL/CL
L2    |  7.32 CL/CL     0.99 CL/CL   4.01 CL/CL
L3    |  5.00 CL/CL     1.01 CL/CL   3.72 CL/CL

Phenomenological ECM model: { 46.5 || 28.0 | 14.6 | 10.0 | 18.8 } cy/CL
[...]

Of course, variable adjustment can also be performed with more than one likwid region. Assume we have a different number of repetitions for each kernel and want to adjust them separately. We could realise this with

-D rep1 jacobi_2d:100:500 -D rep2 jacobi_3d:10:100 -R jacobi_2d:rep1 -R jacobi_3d:rep2