Update profiling section #266

Merged Jun 2, 2023 · 7 commits
8 changes: 2 additions & 6 deletions pages/docs/running/running-output-files.md
@@ -65,13 +65,9 @@ TimeWindow Iteration ResRel(Temperature) ResRel(Heat-Flux)
* `Iteration` is the coupling iteration counter within each time window. So, in the first time window, 6 iterations were necessary to converge; in the second time window, 3.
* And then, two convergence measures are defined in the example, both relative ones -- hence the `...Rel(...)`. The two columns `ResRel(Temperature)` and `ResRel(Heat-Flux)` give the relative residual for temperature and heat flux, respectively, at the start of each iteration.

## precice-MySolver-events.json
## precice-events/*

Recorded events with timestamps. See page on [performance analysis](tooling-performance-analysis.html).

## precice-MySolver-events-summary.log

Summary of all events timings. See page on [performance analysis](tooling-performance-analysis.html).
Recorded events of all participants and ranks. See page on [performance analysis](tooling-performance-analysis.html) for more information.

## precice-postProcessingInfo.log

283 changes: 244 additions & 39 deletions pages/docs/tooling/tooling-performance-analysis.md
@@ -1,70 +1,275 @@
---
title: Performance analysis
permalink: tooling-performance-analysis.html
keywords: tooling, json, performance, events, analysis, cpu time
keywords: tooling, json, performance, events, analysis, cpu time, hpc, profiling
summary: "preCICE comes with an internal performance analysis framework"
---

## Working with events
## Introduction

preCICE uses the [EventTimings](https://github.com/precice/EventTimings) library to profile major logical blocks of work.
The library generates files during the finalization step of each participant and writes them to their current working directories.
preCICE uses an internal profiling framework to measure code sections inside and between API calls of the library.
Existing external frameworks cannot cope with the complexity of the multi-executable setup of preCICE.
To understand the performance of a coupled run, it is necessary to look at the interplay of all coupled participants.

For a participant called `MySolver`, the files are called as follows:
These named code sections are called `events` and each event generates records during runtime.
Each rank of each participant gathers local records and writes them to a JSON file. To store these JSON files, preCICE creates a directory called `precice-events` in the current working directory of each participant.
All these files of all participants then need to be post-processed and merged into a single file using the `precice-events` Python script.
You can then use the resulting file to analyze the profiling data.

* `precice-MySolver-events.json`
* `precice-MySolver-events-summary.log`
## Fundamental Events

## The events summary file
Some events are useful for everyone, while others are only useful for developers.
To keep this feature as useful as possible, we pre-selected a set of events that we deem fundamental.
These include the main preCICE API functions and the time spent between these calls inside the solvers.
Fundamental events should give you insight into the overhead of preCICE as well as load imbalance between ranks.

The events summary file contains a table of events, their occurrences and some statistics on their runtime.
This can be helpful to quickly identify where the preCICE library spends most of its time.
Fundamental events are:

It is especially helpful to focus on [noteworthy events](#noteworthy-events).
* `_GLOBAL`: time spent from the initialization of the events framework to its finalization. Starts in the construction of the participant and ends in `finalize()` or the destructor.
* `construction`: time spent in the construction of the Participant, including configuration and setting up the intra-communication of each participant.
* `solver.initialize`: time spent in the solver until `initialize()` is called. This normally includes setting meshes, defining initial data, and preparing the solver.
* `initialize()`: time spent in preCICE `initialize()`. This includes establishing communication between participants, mesh and data transfer, as well as mapping computation.
* `solver.advance`: time spent in the solver between `advance()` calls, including the time between `initialize()` and the first `advance()` call.
* `advance()`: time spent in preCICE `advance()`. This includes data mapping, data transfer, and acceleration.
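
To relate these events to the structure of an adapter, here is a minimal sketch of a coupled solver loop, assuming the preCICE v3 Python bindings; the comments indicate which fundamental event covers each span (mesh setup and the actual solve are elided):

```py
import precice

# construction: configuration and intra-communication setup
participant = precice.Participant("MySolver", "precice-config.xml", 0, 1)

# solver.initialize: set meshes and define initial data here

# initialize(): communication buildup, data transfer, mapping computation
participant.initialize()

while participant.is_coupling_ongoing():
    dt = participant.get_max_time_step_size()
    # solver.advance: the solver's own compute time between advance() calls

    # advance(): data mapping, data transfer, acceleration
    participant.advance(dt)

# _GLOBAL ends in finalize() or the destructor
participant.finalize()
```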

This is an example output:
## Measuring Blocking Operations

```text
Run finished at Wed Aug 1 09:41:10 2018
Global runtime = 12859ms / 12s
Number of processors = 4
# Rank: 0
Some parts of preCICE involve communication, which cannot be interleaved efficiently with other computations.
Measuring the runtime of such operations can be tricky, as time spent waiting for other ranks can be misattributed to the operation itself.
Synchronizing all ranks using a barrier solves this issue, but has an impact on performance.

Event | Count | Total[ms] | Max[ms] | Min[ms] | Avg[ms] | T[%] |
---------------------------------------------------------------------------------------
_GLOBAL | 1 | 12859 | 12859 | 12859 | 12859 | 99 |
advance | 1 | 84 | 84 | 84 | 84 | 0 |
To keep this waiting component of the overall measurement to a minimum without affecting performance, we added a configuration option to toggle the synchronization before required events.
Use the `sync-mode` attribute to enable such synchronization if you need it.

Name | Max | MaxOnRank | Min | MinOnRank | Min/Max |
--------------------------------------------------------------------------
_GLOBAL | 12859 | 0 | 12859 | 0 | 1 |
advance | 119 | 2 | 83 | 1 | 0 |
```xml
<precice-configuration sync-mode="1">
...
</precice-configuration>
```

`T[%]` prints the relative runtime. Note that the summed percentages can exceed 100%, since events can be nested, as in the example above.
## Configuration

## The events JSON file
You can configure the profiling with the `<profiling/>` tag located inside `<precice-configuration>`.
The default settings enable profiling of fundamental events and write the event files to the working directory of each solver.

The events JSON file contains the full picture of events and attached data.
The full configuration consists of:

* `mode`: either `off`, `fundamental` (default), or `all`
* `directory`: the location to create the `precice-events` folder in and write the event files to
* `flush-every`: the flush frequency; `0` means flush only at the end of the simulation, `n>0` means flush every `n` records

### Examples

To turn the profiling off:

```xml
<precice-configuration>
<profiling mode="off" />
</precice-configuration>
```

To write the files of all participants into a common directory (often `..` works too):

```xml
<precice-configuration>
<profiling directory="/path/to/dir" />
</precice-configuration>
```

To prevent any write operations to the filesystem until the end of the simulation:

```xml
<precice-configuration>
<profiling flush-every="0" />
</precice-configuration>
```

To profile blocking operations:

```xml
<precice-configuration sync-mode="1">
<profiling mode="all" />
</precice-configuration>
```
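
These options can also be combined; for instance, a sketch enabling all events, a shared output directory, and flushing every 50 records (the directory path and flush value are placeholders):

```xml
<precice-configuration>
<profiling mode="all" directory="/path/to/dir" flush-every="50" />
</precice-configuration>
```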
## Post-processing workflow

The general workflow looks as follows:

1. Run the simulation
2. Merge the event files
3. Analyze each participant that executes a mapping or computes acceleration
4. Analyze each participant to check for load-imbalance between ranks
5. Visualize the simulation to check for load-imbalance between participants

The rest of the section will go through the process of analyzing participants step by step.
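
Assuming two participants `A` and `B`, the whole workflow condenses to a few commands (a sketch; each subcommand is explained in the following subsections):

```console
$ precice-events merge A B   # step 2: combine all event files into events.json
$ precice-events analyze A   # steps 3 and 4: per-participant timing tables
$ precice-events trace       # step 5: export trace.json for visualization
```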

### Merging Event Files

After the simulation completes, you can find `precice-events` folders in the configured location, defaulting to the working directory of each participant.
An example could look like this:

```txt
.
├── A
│   └── precice-events
│       └── A-0-1.json
├── B
│   └── precice-events
│       └── B-0-1.json
└── precice-config.xml
```

To find and merge these files run:

```console
$ ls
A
B
$ precice-events merge A B
Searching A : found 1 files in A/precice-events
Searching B : found 1 files in B/precice-events
Found 2 unique event files
Found a single run for participant B
Found a single run for participant A
Loading event files
Globalizing event names
Grouping events
Aligning B (-179us) with A
Writing to events.json
$ ls
A
B
events.json
```

The merge command searches the passed directories for event files.
You can also pass individual files if you are not interested in all ranks.
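
For example, to merge only the file of rank 0 of participant `A` together with all files found under `B` (a sketch reusing the file names from above):

```console
$ precice-events merge A/precice-events/A-0-1.json B
```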

The merge command is written in pure Python without external dependencies to make it easy to use on clusters.
After you run `precice-events merge`, you end up with a single file, which can be additionally compressed and transferred to another machine.
This is especially handy for very large and/or long simulations on clusters or supercomputers.

The result of this step is a single `events.json` file.
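
For example, to compress the merged file and copy it to another machine for analysis (a sketch; the host and target path are placeholders):

```console
$ gzip -k events.json
$ scp events.json.gz user@workstation:/path/to/analysis/
```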

### Visualizing the simulation

You can run `precice-events trace` to export the `events.json` file as `trace.json` in the [trace events format](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview).

```console
$ precice-events trace
Reading events file events.json
Writing to trace.json
```

You can use the [events2trace](https://raw.githubusercontent.com/precice/EventTimings/master/extra/events2trace.py) tool to convert events to the [trace format](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/view).
The tool allows merging the events output of multiple participants into a single output in the trace format.
This trace format can then be visualized using the following tools:

* [speedscope.app](https://www.speedscope.app/)
* [ui.perfetto.dev](https://ui.perfetto.dev)
* [profiler.firefox.com](https://profiler.firefox.com/)
* [`chrome://tracing/`](chrome://tracing/) in Chromium browsers [_(see full list)_](https://en.wikipedia.org/wiki/Chromium_(web_browser)#Active)

An example trace visualization using `chrome://tracing/` of the elastictube1d example looks as follows:
These visualization tools cannot handle large runs, though.
There are two options to reduce the trace size:
you can select the first `N` ranks using `-l N`, and you can pick specific ranks using `-r RANK ...`.
When both are given, the two selections are combined.

As an example, to select the first 3 ranks and in addition ranks 10 and 200:

```console
$ precice-events trace -l 3 -r 10 200
Reading events file events.json
Selected ranks: 0,1,2,10,200
Writing to trace.json
```

![Trace example](images/docs/tooling/elastictube1d-events.png)
An example trace visualization using `ui.perfetto.dev` of the [elastic tube 1d tutorial](tutorials-elastic-tube-1d.html) looks as follows.
Note the alternating executions of the solver due to the serial coupling scheme.

You can also evaluate the data in the events files yourself.
Pay special attention to the timestamps, as they are based on the system clock.
Larger clusters are often subject to clock drift, so you need to shift and normalize the time scales yourself.
![example of the elastic tube 1d tutorial visualized by perfetto](images/docs/tooling/profiling-aste-perfetto-serial.png)

## Noteworthy events
An example trace visualization using `ui.perfetto.dev` of a parallel [ASTE](tooling-aste.html) run on two and four ranks looks as follows.
This first version contains only fundamental events, which is the default profiling setting.

Noteworthy events are
![example of parallel ASTE with fundamental events only visualized by perfetto](images/docs/tooling/profiling-aste-perfetto-parallel-fundamental.png)

1. The communication buildup, which can become very expensive on clusters. This is generally bound to the filesystem; the main latency here is the distributed file-system.
This second version contains all events using the configuration `<profiling mode="all" />`.

![example of parallel ASTE with all events visualized by perfetto](images/docs/tooling/profiling-aste-perfetto-parallel-all.png)

### Analyzing participants

You can run `precice-events analyze NAME` to analyze the participant `NAME` and display a table of recorded events.
The output differs for serial and parallel participants.

The output for serial solvers contains a table displaying the name of the event, followed by the sum, count, mean, min, and max runtime.

<div style="display:contents;overflow-x:auto" markdown="1">

```console
$ precice-events analyze Fluid
Reading events file events.json
Output timing are in us.
event | sum count mean min max
_GLOBAL | 34125973.0 1 34125973.0 34125973.0 34125973.0
advance | 32893268.0 793 41479.530895334174 116.0 45942.0
construction | 59470.0 1 59470.0 59470.0 59470.0
construction/configure | 59265.0 1 59265.0 59265.0 59265.0
finalize | 42664.0 1 42664.0 42664.0 42664.0
initialize | 181787.0 1 181787.0 181787.0 181787.0
initialize/m2n.requestPrimaryRankConnection.Solid | 266.0 1 266.0 266.0 266.0
solver.advance | 947727.0 794 1193.610831234257 550.0 15470.0
solver.initialize | 29.0 1 29.0 29.0 29.0
```

</div>

The output for parallel solvers is slightly more complex.
After the name of the event, the table contains three blocks, each containing the sum, count, mean, min, and max runtime for a specific rank.

1. The first block displays the primary rank (0).
2. The second block displays the secondary rank which spent the least total time in `advance`.
3. The third block displays the secondary rank which spent the most total time in `advance`.

<div style="display:block;overflow-x:auto" markdown="1">

```console
$ precice-events analyze B
Reading events file events.json
Output timing are in us.
Selection contains the primary rank 0, the cheapest secondary rank 2, and the most expensive secondary rank 1.
event | R0:sum R0:count R0:mean R0:min R0:max | R2:sum R2:count R2:mean R2:min R2:max | R1:sum R1:count R1:mean R1:min R1:max
_GLOBAL | 66888.0 1 66888.0 66888.0 66888.0 | 66953.0 1 66953.0 66953.0 66953.0 | 68462.0 1 68462.0 68462.0 68462.0
advance | 170.0 1 170.0 170.0 170.0 | 154.0 1 154.0 154.0 154.0 | 905.0 1 905.0 905.0 905.0
construction | 3179.0 1 3179.0 3179.0 3179.0 | 3171.0 1 3171.0 3171.0 3171.0 | 3132.0 1 3132.0 3132.0 3132.0
construction/com.initializeIntraCom | 130.0 1 130.0 130.0 130.0 | 157.0 1 157.0 157.0 157.0 | 153.0 1 153.0 153.0 153.0
construction/configure | 2863.0 1 2863.0 2863.0 2863.0 | 2828.0 1 2828.0 2828.0 2828.0 | 2786.0 1 2786.0 2786.0 2786.0
finalize | 291.0 1 291.0 291.0 291.0 | 365.0 1 365.0 365.0 365.0 | 251.0 1 251.0 251.0 251.0
initialize | 54588.0 1 54588.0 54588.0 54588.0 | 54008.0 1 54008.0 54008.0 54008.0 | 54450.0 1 54450.0 54450.0 54450.0
initialize/m2n.acceptPrimaryRankConnection.A | 1227.0 1 1227.0 1227.0 1227.0 | 1569.0 1 1569.0 1569.0 1569.0 | 1123.0 1 1123.0 1123.0 1123.0
solver.advance | 7063.0 2 3531.5 621.0 6442.0 | 8018.0 2 4009.0 454.0 7564.0 | 8016.0 2 4008.0 635.0 7381.0
solver.initialize | 1557.0 1 1557.0 1557.0 1557.0 | 1201.0 1 1201.0 1201.0 1201.0 | 1678.0 1 1678.0 1678.0 1678.0
```

</div>

### Processing in other software

You can run `precice-events export` to export the `events.json` file as `events.csv`.
This contains CSV data including all individual events from all participants and ranks.
It also contains additional data entries attached to events.
The header of the result CSV is `participant,rank,size,event,timestamp,duration,data`.

```console
$ precice-events export
Reading events file events.json
Writing to events.csv
```

Example of loading the CSV into pandas and extracting rank 0 of participant A:

```py
import pandas

# Load the exported profiling data
df = pandas.read_csv("events.csv")
# Keep only the events recorded by rank 0 of participant A
selection = df[(df["participant"] == "A") & (df["rank"] == 0)]
```
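
From there, the usual pandas operations apply. For instance, to sum the recorded durations per event on the selected rank (durations are presumably in microseconds, matching the `analyze` output above):

```py
# Total recorded duration per event, largest first
totals = selection.groupby("event")["duration"].sum().sort_values(ascending=False)
print(totals.head(10))
```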