# XPTI Tracing Framework
To understand the different parts of an application or library, one must be able to capture information about its execution and build meaningful representations from it, such as call graphs or execution trace views. The XPTI tracing framework lets developers instrument their code with the XPTI API and forward events they deem interesting or useful during the application's execution.

XPTI is a lightweight tracing framework that provides a simple API for instrumenting code. It allows developers to capture relationships as nodes and edges, and to track the execution of those nodes as well as any other functions of interest. The API can also associate each trace point with end-user source information, such as the source file, function name, and line number. The goal of this framework is to provide a low-overhead solution that tools can use to build performance-analysis models. This document describes the components of the framework and a testing methodology for determining the cost of using it in your applications.
The current implementation of the framework uses standard C++ containers by default. There is also an implementation that relies on the concurrent containers in Threading Building Blocks (TBB), which can be enabled by passing the define `-DXPTI_ENABLE_TBB=ON` to cmake. The std-container-based implementation is thread-safe, but it has not been optimized for performance: increasing the number of threads accessing the framework increases contention costs and may degrade the framework's performance.

To build the framework and tests with TBB enabled, use the commands shown below:

```bash
% cd xptifw
% cmake -DXPTI_ENABLE_TBB=ON -DXPTI_SOURCE_DIR=$SYCL_HOME/xpti ./
```
NOTE: This document is best viewed with a Markdown reader plugin for Chrome or the Markdown preview extension for Visual Studio Code.
## Architecture

The framework consists of a proxy library, a static library that you link into your applications. The proxy library consists of stubs that forward calls to a dynamic component, if one is available and tracing has been enabled. If tracing has not been enabled, or the dynamic component is not found at run time, the stub implementations return immediately.

The framework currently uses environment variables to determine whether tracing has been enabled and where to find the dynamic component. Both of these must be set for event streams to be dispatched successfully; a minimal shell setup is sketched after the list below.
- The environment variable `XPTI_TRACE_ENABLE` indicates whether tracing has been enabled for the run. To enable tracing, the possible values are `XPTI_TRACE_ENABLE=1` or `XPTI_TRACE_ENABLE=true`; to disable it, the possible values are `XPTI_TRACE_ENABLE=0` or `XPTI_TRACE_ENABLE=false`. Currently, if the variable is not defined, it is assumed to be `true`.
- The environment variable `XPTI_FRAMEWORK_DISPATCHER` points to the XPTI dispatcher, the dynamic component that the static library loads into memory in order to dispatch event streams to subscribers. The dispatcher manages the subscribers to the event streams through the environment variable `XPTI_SUBSCRIBERS`, but this variable has no bearing on the functionality supported by the static library portion of the framework.
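For example, on Linux the required variables and the subscriber list can be set as follows (the paths are illustrative; the library names match the reference dispatcher and sample collector shipped with the framework, described later in this document):

```bash
# Enable tracing for this run
export XPTI_TRACE_ENABLE=1
# Dynamic component loaded by the static proxy library
export XPTI_FRAMEWORK_DISPATCHER=/path/to/libxptifw.so
# Subscriber(s) loaded by the dispatcher
export XPTI_SUBSCRIBERS=/path/to/libbasic_collector.so
```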
The above diagram describes the dependencies and interactions between a sample application or library that has been instrumented with the XPTI API, the XPTI dispatcher, and the subscriber loaded by the dispatcher. In this diagram, all API calls made by the application and the library go to the static library. If `XPTI_TRACE_ENABLE` is not enabled, or the path to the dispatcher is not provided in the environment variable `XPTI_FRAMEWORK_DISPATCHER`, then calls into the static library, such as `xptiTraceEnabled()`, return immediately.

If both variables are set, the calls are forwarded to the dynamic library, which attempts to load the subscribers pointed to by the environment variable `XPTI_SUBSCRIBERS`. The hypothetical trace data captured by the subscriber is shown under the "Resulting trace data" part of the diagram.
### The Dispatcher

The dispatcher is a dynamic library that implements the XPTI API and is loaded at run time by the static proxy library, when the static library is what the application or library links against. Linking directly against the dynamic library is also an option; however, the dynamic library must then be shipped with the application. Using the static library allows instrumented applications or libraries to avoid this problem.
### The Subscriber

A subscriber in XPTI is a shared object that is dynamically loaded by the dispatcher. Events generated by an instrumented application or library are forwarded to the subscriber by the dispatcher through a callback mechanism. Subscribers are owned by the tools or applications that consume the generated event streams and must follow the protocol or handshake defined for each event stream.
There are three important things a subscriber must implement to be functional: (1) `xptiTraceInit`, (2) `xptiTraceFinish`, and (3) the callback handlers. The `xptiTraceInit` and `xptiTraceFinish` entry points are used by the dispatcher, when loading the subscriber dynamically, to determine whether the subscriber is valid. If these entry points are not present, the subscriber is not loaded.

The `xptiTraceInit` callback is invoked by the dispatcher when the generator of a new stream of data calls `xptiInitialize` for that stream. The implementation of `xptiTraceInit` is where the subscriber follows the specification or protocol defined for the stream and subscribes to events from the trace point types it cares about. The code snippet below shows an example of such an implementation for the stream `"foo"`.
#include "xpti_data_types.h"
XPTI_CALLBACK_API void xptiTraceInit
(
unsigned int major_version, ///< Major version
unsigned int minor_version, ///< Minor version
const char *version_str, ///< Version as a string
const char *stream_name ///< Stream name
) {
if (stream_name) {
// Only register callbacks if the major version is the
// expected version and the stream is the stream type
// you care about
if(std::string("foo") != stream_name &&
major_version > 0 && major_version < 3)
return;
g_stream_id = xptiRegisterStream(stream_name);
xptiRegisterCallback(g_stream_id, graph_create, graph_create);
xptiRegisterCallback(g_stream_id, node_create, node_create);
xptiRegisterCallback(g_stream_id, edge_create, edge_create);
xptiRegisterCallback(g_stream_id, region_begin, algorithm_begin);
xptiRegisterCallback(g_stream_id, region_end, algorithm_end);
xptiRegisterCallback(g_stream_id, task_begin, trace_point_begin);
xptiRegisterCallback(g_stream_id, task_end, trace_point_end);
...
}
else {
// handle the case when a bad stream name has been provided
}
}
XPTI_CALLBACK_API void xptiTraceFinish(const char *stream_name) {
// Free any dynamically allocated memory for the stream
// and any additional cleanup operation
...
}
The above code snippet also shows the `xptiTraceFinish` function, which is used to clean up memory and any other data structures that were allocated for handling the stream. The dispatcher calls `xptiTraceFinish` when the instrumented code winds down a data stream by calling `xptiFinalize` for that stream.
The implementation of the callbacks is where attention must be paid to the handshake protocol or specification of the stream the subscriber wants to attach to and consume. The instrumented library may send additional user data with each trace point notification, and this data may be of a different type for each trace point type.
```cpp
XPTI_CALLBACK_API void trace_point_begin(uint16_t trace_type,
                                         xpti::trace_event_data_t *parent,
                                         xpti::trace_event_data_t *event,
                                         uint64_t instance,
                                         const void *user_data) {
  // Capture a timestamp here
  my_time_t begin = get_my_time();
  // Capture the thread ID, CPU ID, or whatever else is of interest
  my_device_t dev = get_my_device_instance();
  // If access to the payload is required, it can be queried from the event
  auto p = xptiQueryPayload(event);
  // Send the data off to some serializing buffer asynchronously
  emit_data(begin, dev, ...);
}
```
For example, the specification for a given event stream might state that the trace point type `graph_create` sends a data structure of a certain type as `user_data`, while `task_begin` sends a character string. Resolving the `user_data` parameter therefore requires strict conformance to the specification for the stream.
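As a minimal sketch of what such conformance can look like in a handler, assume a hypothetical stream specification where `graph_create` sends a `graph_info_t` structure and `task_begin` sends a C string; `graph_info_t`, `record_graph`, and `record_task` are illustrative names, not part of the XPTI API:

```cpp
// Hypothetical per-stream payload type, defined by the stream's
// specification rather than by XPTI itself
struct graph_info_t {
  const char *name;
  int num_nodes;
};

XPTI_CALLBACK_API void my_handler(uint16_t trace_type,
                                  xpti::trace_event_data_t *parent,
                                  xpti::trace_event_data_t *event,
                                  uint64_t instance,
                                  const void *user_data) {
  if (trace_type == (uint16_t)xpti::trace_point_type_t::graph_create) {
    // The stream specification says user_data is a graph_info_t here
    if (auto *info = static_cast<const graph_info_t *>(user_data))
      record_graph(info->name, info->num_nodes); // illustrative helper
  } else if (trace_type == (uint16_t)xpti::trace_point_type_t::task_begin) {
    // The stream specification says user_data is a C string here
    record_task(static_cast<const char *>(user_data)); // illustrative helper
  }
  // Ignore trace point types this tool does not handle
}
```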
In addition to the `user_data`, the unique ID that describes the event is available as `event->unique_id`. In most cases, this is sufficient to resolve a given event. However, a particular event may be exercised within a loop: since a trace point event is based on the instrumentation at a specific location in the code, its `unique_id` always remains the same, but each instance of the event may emit an instance ID that keeps track of that instance. The combination of the `unique_id` and the `instance_id` should always be unique.
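For example, a tool that needs to distinguish individual visits to the same trace point can key its bookkeeping on the (`unique_id`, instance) pair, as in this sketch, which reuses the illustrative `my_time_t`/`get_my_time` placeholders from the earlier snippet:

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Begin timestamps keyed by (unique_id, instance); as noted above,
// this combination is always unique
static std::map<std::pair<uint64_t, uint64_t>, my_time_t> g_begin_times;

void on_task_begin(xpti::trace_event_data_t *event, uint64_t instance) {
  g_begin_times[{event->unique_id, instance}] = get_my_time();
}
```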
NOTE: A subscriber must implement the `xptiTraceInit` and `xptiTraceFinish` APIs for the dispatcher to successfully load the subscriber.
NOTE: The specification for a given event stream must be consulted before implementing the callback handlers for various trace types.
The XPTI framework package provides a reference implementation of the XPTI dispatcher and a sample subscriber that can be used to see what is being emitted by any stream generated with XPTI. If you wish to skip the rest of the document and inspect a generated stream, follow the steps outlined below.
- Build the XPTI framework dispatcher. The instructions below show how to build the library with standard containers; if you have access to TBB, you can add `-DXPTI_ENABLE_TBB=ON` to the cmake command.

  ```bash
  % cd xptifw
  % cmake -DXPTI_SOURCE_DIR=$SYCL_HOME/xpti ./
  % make
  ```

  The binaries will be built and installed in `lib/Release`. These include the dispatcher, a sample subscriber that prints the contents of the stream, the unit tests, and a performance characterization application for the framework.
- Run an instrumented SYCL application. To enable the dispatcher and subscriber, set the following environment variables (the commands below are for Linux environments):

  ```bash
  % export XPTI_TRACE_ENABLE=1
  % export XPTI_FRAMEWORK_DISPATCHER=/path/to/libxptifw.so
  % export XPTI_SUBSCRIBERS=/path/to/libbasic_collector.so
  ```

  You can now run a SYCL application that has been linked with a runtime that supports the XPTI instrumentation and inspect the resulting stream. An example collector that subscribes to a specific stream is also provided under `xptifw/samples/syclpi_collector`. This example demonstrates how a tool can selectively subscribe to a known stream and ignore all other traces; trace notifications for streams with no registered callbacks return immediately.
- Run the unit tests. The included unit tests cover the exported API and incorporate some correctness tests.

  ```bash
  % <xptifw-dir>/lib/Release/xpti_tests
  ```

- Understand the throughput of the framework. This document discusses the performance of the framework in detail in the sections Performance of the Framework and Modeling and Projection; refer to those sections for details on the command-line arguments.
NOTE: These tests rely on the availability of TBB for creating the multi-threaded tests and will not be built if TBB has not been enabled during the build process.

```bash
% <xptifw-dir>/lib/Release/run_test --trace-points 1000 --type performance --overhead 1.5 --num-threads 0,1,2,3 --test-id 1,2 --tp-frequency 50
```

The above command runs the performance tests, creating 1000 trace points and visiting each trace point twice. The trace point creation and notification costs are measured in single-threaded and multi-threaded scenarios, and the output shows the throughput projection of the framework using the events/sec metric at a 1.5% overhead to the application runtime.
The current version of the instrumentation API adopts a model where traces are generated in pairs for a given trace scope, and a scoped class is provided to assist developers in instrumenting their code. The APIs are divided into two parts: (1) the public API exported by the tracing framework, implemented by the static library and the dispatcher, and (2) the callback API that tools implement to create a subscriber.

The XPTI framework exports a small set of API functions, part of the static library and dispatcher exports, that are deemed sufficient for the use cases considered so far. Since the primary goal is to gather execution traces of the compute elements in an application, the APIs currently address this scope. They allow developers to capture relationship information as nodes and edges in a graph, where the nodes represent compute elements or actions with an associated latency, and the edges represent dependencies between compute elements, such as events or memory objects. In addition to such relationship events, the API allows you to trace arbitrary regions of code, similar to conventional tracing methods.

For each interesting trace point in an application, a notification can be sent out by the framework. However, if there are no subscribers to consume the notification, the framework returns immediately. This allows developers who instrument applications or runtimes to limit the overheads considerably.
The API is documented in the file `xpti_trace_framework.h`, which can be found under `xpti/doc`. Some of the API functions and concepts that warrant additional insight are discussed below.
### xptiInitialize

When instrumenting an application, developers can decide how many streams they want to generate, since in some cases organizing trace data by streams may be preferable. The `xptiInitialize` function facilitates this organization by allowing each component that generates a stream of data to make a call to `xptiInitialize` for its stream. The types of events captured by the instrumentation and the handshake protocol for a given stream must be defined as a contract or specification for that stream. This allows subscribers to rely on the specification to implement a tool that can consume the data and do something useful with it.
The `xptiInitialize` function reports to all subscribers that a new stream of data is about to be generated; the name of the stream and version information for the stream are sent to the subscriber. The version information is provided primarily so that subscribers to the event stream can choose not to handle an event stream of an unsupported version. The up-to-date documentation of the XPTI API is always maintained in the `xpti_trace_framework.h` header file. The application that is instrumented must attempt to send the initialization only once, but the subscriber must be prepared to handle multiple initialization callbacks for a given stream, as shown in the sketch below.
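For example, a subscriber could guard against repeated initialization callbacks for the same stream with a sketch like the following (the set and mutex names are illustrative):

```cpp
#include "xpti_data_types.h"

#include <cstdint>
#include <mutex>
#include <set>
#include <string>

// Streams this subscriber has already set up, kept in case the
// initialization callback is delivered more than once per stream
static std::set<std::string> g_seen_streams;
static std::mutex g_stream_mutex;

XPTI_CALLBACK_API void xptiTraceInit(unsigned int major_version,
                                     unsigned int minor_version,
                                     const char *version_str,
                                     const char *stream_name) {
  if (!stream_name)
    return;
  std::lock_guard<std::mutex> lock(g_stream_mutex);
  // Register callbacks only on the first initialization of a stream
  if (!g_seen_streams.insert(stream_name).second)
    return;
  uint8_t stream_id = xptiRegisterStream(stream_name);
  // ... register callbacks for stream_id as shown earlier ...
  (void)stream_id;
}
```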
### xptiFinalize

The application or library being instrumented to generate a stream of data must attempt to finalize the stream by calling `xptiFinalize` for that stream. This allows the dispatcher to notify all subscribers that the stream is about to end.
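As an illustration, an instrumented component would typically pair the two calls around the lifetime of its stream; the stream name and version values below mirror the hypothetical `myStream` used elsewhere in this document:

```cpp
// At component startup: announce the stream and its version
xptiInitialize("myStream", /*major_version=*/1, /*minor_version=*/0,
               "myStream 1.0");

// ... emit trace events for "myStream" during execution ...

// At component shutdown: tell subscribers the stream is ending
xptiFinalize("myStream");
```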
### xptiTraceEnabled

To recap some of the discussion in the Architecture section, the `xptiTraceEnabled` API call returns `true` in the following situations:

- When `XPTI_TRACE_ENABLE` is not set, but the `XPTI_FRAMEWORK_DISPATCHER` and `XPTI_SUBSCRIBERS` variables are set to valid libraries. This assumes that a tool has been created and is pointed to by `XPTI_SUBSCRIBERS`, and that the tool has been linked against the dynamic component or dispatcher. In general, the dynamic component (the dispatcher) and the tool component (the subscriber) are owned by the tool attempting to listen to the instrumented stream of data.
- When the static library is used for linking in the instrumented application or library, this call returns `true` only if `XPTI_FRAMEWORK_DISPATCHER` is set to a valid library and `XPTI_TRACE_ENABLE` is not set to `false`.
We will now detail how the public APIs exported by the framework are used when instrumenting code. The framework API is what developers instrumenting their code will use; its primary goal is to support the instrumentation of code that may or may not fall on function boundaries.
- First, the places in the code where instrumentation is warranted should be identified. Each trace point is unique and is associated with a `payload` data structure that encapsulates: (1) a unique name, such as a function or kernel name, or something meaningful if the trace point marks a section of code, (2) the source file it is located in, (3) the line number where the interesting event occurs, and (4) the column number of the interesting event. Compilers such as `gcc`, `clang`, and the Intel compilers can generate all of this information easily through builtin functions.
- Secondly, an event must be created for this trace point region; this creation process uses the `payload` information. If the payload has already been registered, the previously registered and associated event is returned. This process also creates a `unique_id` for the event.
- Thirdly, the scope of the trace point must be determined, and for a given scope, as in a related pair of events, the `unique_id` created at the begin trace point must be preserved and used for the end trace point as well.
- Finally, the callbacks registered for these types of events must be notified. A sketch of these four steps is shown below.
The trace point event describes the event used to notify the subscriber and is usually associated with a payload that describes it. Since application code is being instrumented with XPTI, the payload may consist of a `function name`, `source file name`, and `line number`, a combination of strings and numbers that is used to create the `unique_id` associated with an event. Using the `event` or the `unique_id`, one should be able to query the `payload` information. When a notification occurs for a trace point, the trace point event and the trace point type information are sent to the subscriber. A given event may be used to notify subscribers as multiple trace point types. For example, a node may represent a computational entity, and an event created for the node may be emitted as `node_create`, `task_begin`, and `task_end` notifications to record, respectively, the creation of the node, the beginning of the execution of an instance of the node, and the completion of that instance's execution.
We will first look at the `xpti::payload_t` data structure, which is defined in `xpti_data_types.h`.

```cpp
xpti::payload_t p("function1", "main.cpp", 104, 5, function1);
```
The payload data structure can be created with a set of unique descriptors for the region of code being instrumented, such as a function name, source file name, and line number. However, it can also take a function name and a pointer to the function, or just a pointer to the function, as the unique description of the payload that will be used to create a trace point event. This information is used by the `xptiMakeEvent` function to create a `unique_id` for the trace point event.
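As a sketch of the three forms described above (the exact constructor overloads are declared in `xpti_data_types.h`; consult that header for the authoritative signatures):

```cpp
// (1) Full source information: name, file, line, column, code pointer
xpti::payload_t full("function1", "main.cpp", 104, 5, (void *)function1);
// (2) A function name and a pointer to the function
xpti::payload_t named("function1", (void *)function1);
// (3) Just a pointer that uniquely identifies the code location
xpti::payload_t addr_only((void *)function1);
```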
The next section looks at using the payload information to create a trace point event. Each trace point is unique from a language or code-section standpoint. A trace point may be visited multiple times, but the payload and the event describing the trace point will always be the same. The tracing framework must guarantee that when a trace point is visited, the same `unique_id` is retrieved for it. For frequent visits to the same trace point site, we must either be able to look up the `unique_id` of the payload efficiently or cache the information at the trace point location, as shown in the caching example near the end of this document.
Once a payload structure has been created, it is used to associate the trace point it represents with an event that captures additional information about the trace point. The framework has a list of predefined trace point types that may be used to mark various trace points with an appropriate type. They are declared in the header file `xpti_data_types.h` and are used to describe, for example, the creation of a graph, node, or edge, or the instantiation of a node as a `task_begin` and `task_end` pair of trace point notifications.
These trace point types represent the kinds of actions commonly associated with instrumenting a library or application. However, where there is no direct mapping from the predefined trace point types to a language construct, or where one needs to describe an action orthogonal to code structure, such as diagnostic information for executing code, the framework allows each instrumentation stream to extend the available trace point types with new ones that map to these unsupported constructs.
### xptiRegisterUserDefinedTracePoint

This API allows streams to extend the existing trace point types with new types that map to the action being described. The description of the API is followed by a code example that shows its use.
```cpp
uint16_t xptiRegisterUserDefinedTracePoint(
    const char *vendor_name,
    uint8_t user_defined_tp);
```
| Argument | Description |
|---|---|
| `vendor_name` | The name of the tool or vendor implementing the tool that is describing this extension to the trace points. |
| `user_defined_tp` | The user-defined trace point ID, of type `uint8_t`. The returned 16-bit trace point type encodes the `vendor_name` in its 8 most significant bits and this ID in its 8 least significant bits. A maximum of 128 new user-defined trace points can be created per `vendor_name`. |
A code example extending the default, predefined trace point types is shown below. As you can see in the example, the user-defined trace point types are initialized in the `enum` using the macros `XPTI_TRACE_POINT_BEGIN` and `XPTI_TRACE_POINT_END` with the same trace point ID (`0` for the read pair, `1` for the allocate pair). By default, trace point types are designed to delimit the scope of the action being described. This requires 1 bit to represent begin or end, which leaves the remaining 7 bits to describe 128 unique trace point extensions for a given `vendor_name`.
```cpp
typedef enum {
  my_read_begin = XPTI_TRACE_POINT_BEGIN(0),
  my_read_end = XPTI_TRACE_POINT_END(0),
  my_allocate_begin = XPTI_TRACE_POINT_BEGIN(1),
  my_allocate_end = XPTI_TRACE_POINT_END(1)
} tp_extension_t;
...
uint16_t tp1_start = xptiRegisterUserDefinedTracePoint("myTest", my_read_begin);
uint16_t tp1_end = xptiRegisterUserDefinedTracePoint("myTest", my_read_end);
uint16_t tp2_start = xptiRegisterUserDefinedTracePoint("myTest", my_allocate_begin);
uint16_t tp2_end = xptiRegisterUserDefinedTracePoint("myTest", my_allocate_end);
...
xptiNotifySubscribers(stream_id, tp1_start, parent, event, instance, nullptr);
```
If the callback handler for this stream needs to know whether it received an extension or a predefined type, it can use the following macros to decode the trace point type.
```cpp
uint8_t tool_vendor_id = XPTI_TOOL_ID(tp1_start);
uint8_t tp_type = XPTI_EXTRACT_USER_DEFINED_ID(tp1_start);
if (tool_vendor_id == 0) {
  // Default predefined type
} else {
  // User-defined trace point type
  // Here, tp_type will be tp_extension_t::my_read_begin
}
```
This mechanism allows different kinds of information to be captured, with the trace point type describing the kind of information expected by the notification. The trace point type is only used when notifying the subscribers of an event, acting as a qualifier for the event.
### xptiRegisterUserDefinedEventType

This API allows streams to extend the existing trace event types with new types that map to the semantic description of the trace event being created. The description of the API is followed by a code example that shows its use.
```cpp
uint16_t xptiRegisterUserDefinedEventType(
    const char *vendor_name,
    uint8_t user_defined_event);
```
| Argument | Description |
|---|---|
| `vendor_name` | The name of the tool or vendor implementing the tool that is describing this extension to the event types. |
| `user_defined_event` | The user-defined event ID, of type `uint8_t`. The returned 16-bit event type encodes the `vendor_name` in its 8 most significant bits and this ID in its 8 least significant bits. A maximum of 128 new user-defined event types can be created per `vendor_name`. |
Similar to trace point types, the `xpti::trace_event_type_t` can also be extended. The events predefined by the framework fall under {`graph`, `algorithm`, `barrier`, `scheduler`, `async`, `lock`, `offload_read`, `offload_write`, `user_defined`}. Let's take the example of extending the event types to include a diagnostic category.
```cpp
typedef enum {
  my_diagnostic_A = XPTI_EVENT(0),
  my_diagnostic_B = XPTI_EVENT(1)
} event_extension_t;
...
uint16_t my_ev1 = xptiRegisterUserDefinedEventType("myTest", my_diagnostic_A);
...
uint64_t InstanceNo;
xpti::trace_event_data_t *MyEvent =
    xptiMakeEvent("application_foo", &Payload, my_ev1,
                  xpti::trace_activity_type_t::active, &InstanceNo);
```
When this information is provided to the callback handlers in subscribers through notifications, the handler can decide what to do with the extended types. If it is not designed to handle them, it can ignore the event. On the other hand, a subscriber that is designed to handle them must conform to the specification defined by the stream generating the extended-type events.
```cpp
uint8_t tool_vendor_id = XPTI_TOOL_ID(my_ev1);
uint8_t ev_type = XPTI_EXTRACT_USER_DEFINED_ID(my_ev1);
if (tool_vendor_id == 0) {
  // Default predefined type
} else {
  // User-defined event type
  // Here, ev_type will be event_extension_t::my_diagnostic_A
}
```
### xptiMakeEvent

The `xptiMakeEvent` function combines the payload information with information about the trace point being defined to create an `xpti::trace_event_data_t`.
```cpp
xpti::trace_event_data_t *xptiMakeEvent(const char *name,
                                        xpti::payload_t *payload,
                                        uint16_t event,
                                        xpti::trace_activity_type_t activity,
                                        uint64_t *instance_no);
```
| Argument | Description |
|---|---|
| `name` | The name of the event, typically a function or kernel name. |
| `payload` | The payload this trace event represents. A payload in XPTI represents the source file, function name, and line number, if available. If source information is not available, it may contain a function name and the code pointer's virtual address, or just the virtual address. This allows the payload to be retrieved as metadata for a given trace point. |
| `event` | The event type this trace point represents. It can be one of the predefined types or an extended type. |
| `activity` | Describes the activity type, such as active time or overhead time. |
| `instance_no` | If `xptiMakeEvent` is called each time this code location is visited, to create or look up the previously created event, the `instance_no` parameter is incremented to indicate the instance ID of the current visit. |
The created trace event data structure is returned; if the payload information matches that of a previously created event, the previously created event is returned. If global `user_data` needs to be attached to the trace event for use by tools, it can be allocated and stored in the `xpti::trace_event_data_t` structure under `user_data`.
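For instance, a hypothetical annotation could be attached as follows; `my_annotation_t` and its lifetime management are illustrative, since the framework only stores the pointer under the `user_data` field described above:

```cpp
// Illustrative global annotation owned by the instrumented code
struct my_annotation_t {
  const char *category;
};

uint64_t instance_no = 0;
xpti::payload_t p("function1", "main.cpp", 104, 5, (void *)function1);
xpti::trace_event_data_t *event =
    xptiMakeEvent("function1", &p,
                  (uint16_t)xpti::trace_event_type_t::algorithm,
                  xpti::trace_activity_type_t::active, &instance_no);
// Attach the annotation once, on first creation of the event
if (event && !event->user_data)
  event->user_data = new my_annotation_t{"compute"};
```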
The code sample below creates such a trace point event using a payload and uses the created event to notify all subscribers, with a trace point type qualifying the event.
```cpp
// Example
uint64_t instance_no = 0;
uint8_t stream_id = 0;
xpti::trace_event_data_t *event = nullptr;
if (xptiTraceEnabled()) {
  stream_id = xptiRegisterStream("myStream");
  xptiInitialize("myStream", 1, 0, "myStream 1.0");
  xpti::payload_t p("application_graph");
  event = xptiMakeEvent("app", &p,
                        (uint16_t)xpti::trace_event_type_t::graph,
                        xpti::trace_activity_type_t::active,
                        &instance_no);
}
...
if (event && xptiTraceEnabled()) {
  // If the event has been created, notify all
  // subscribers about the graph creation
  xptiNotifySubscribers(stream_id,
                        (uint16_t)xpti::trace_point_type_t::graph_create,
                        nullptr,  // no parent
                        event,
                        instance_no,
                        nullptr); // no user data
}
```
### xptiNotifySubscribers

As discussed in previous sections, creating a trace point is only one part of a trace point definition. The part that actually lets a tool know that a trace event occurred is the notification of the aforementioned event. This section describes the notification API and its use.
```cpp
xpti::result_t xptiNotifySubscribers(uint8_t stream_id,
                                     uint16_t trace_type,
                                     xpti::trace_event_data_t *parent,
                                     xpti::trace_event_data_t *event,
                                     uint64_t instance,
                                     const void *temporal_user_data);
```
| Argument | Description |
|---|---|
| `stream_id` | The stream this notification belongs to. The stream ID is obtained from `xptiRegisterStream`. |
| `trace_type` | The trace point type that describes the current notification. It can be one of the predefined types or a user extension. |
| `parent` | A parent trace event, if present. |
| `event` | The current trace event for which the notification is being sent out. |
| `instance` | Indicates the instance of the current trace event. If this is being used to monitor functions, this value should indicate the call count at that point. |
| `temporal_user_data` | A field that holds per-instance user data, valid only for that value of `instance`. |
The code example below shows code instrumented with the framework API; it generates traces for the functions in the program. In this example, we use the helper scoped class provided by the framework to automatically emit notifications for the begin and end of the scope (e.g., `xpti::trace_point_type_t::task_begin` and `xpti::trace_point_type_t::task_end`). Per-instance user data is not sent in this example, and `scoped_notify` defaults it to `nullptr`.
```cpp
void function1() {
  uint64_t instance_id = 0;
  xpti::trace_event_data_t *event = nullptr;
  if (xptiTraceEnabled()) {
    xpti::payload_t p("function1", "main.cpp", 104, 2, (void *)function1);
    event = xptiMakeEvent("function1", &p,
                          (uint16_t)xpti::trace_event_type_t::algorithm,
                          xpti::trace_activity_type_t::active, &instance_id);
  }
  xpti::framework::scoped_notify ev("myStream",
      (uint16_t)xpti::trace_point_type_t::region_begin, nullptr, event,
      instance_id);
  for (int i = 0; i < 5; ++i) {
    function2();
  }
}
```
## Performance of the Framework

To estimate the overheads one could experience when using the framework, this section outlines a couple of scenarios. Some of the key operations that result in overheads are listed below. For each of these operations, we construct scenarios that provide measurements to determine how many events/sec we can process before the overheads become a problem. We use an overhead limit of 1%, as this data will be used to build an analytical model in the future.
| Data structure | Operation | Description |
|---|---|---|
| String table | String insert | Every string exposed to the framework is registered with the string table, and only the string IDs are used thereafter. |
| String table | String lookup | The lookup of a string in the string table can be implicit or explicit. For example, when a previously defined payload is submitted again, the strings present are registered; however, there is an early exit if the string already exists in the table, due to an implicit lookup. |
| Payload | Creation | A payload constructor may take a function name, source file name, line number, column number, and an address; or a function name and an address; or just an address. Whenever strings are presented, they are first registered with the string table, and the string IDs are used in the creation of a hash that generates a `unique_id` for the event. |
| Trace point event | Metadata add | An arbitrary number of `<key, value>` pairs can be added to an event as strings. These strings are registered with the string table, and the metadata is maintained as a map of two string IDs. |
| Trace point event | Metadata lookup | The metadata associated with a trace point event is looked up explicitly in the string table. |
| Trace point event | Creation | Includes string table registration for the strings contained in the payload, hash generation for the payload, and insertion of the payload and event into a map. |
| Trace point event | Framework lookup | If a trace point has already been created, the event can be looked up using its `unique_id`. |
| Trace point event | Notification | Events that are created are notified to all subscribers. |
| Subscriber | Callback handler | When a notification is received by the subscriber, the event is handled. Callback handlers must be efficient to minimize overheads. |
Using the operations described in the table above, a set of tests was designed to evaluate the performance of these operations in near real-world scenarios. The tests are located under `xptifw/basic_test` and report their results in tabular form. The performance test in particular accepts various configuration parameters from the command line. An example command is shown below.

```bash
run_test --trace-points 1000 --type performance --overhead 1.5 --num-threads 0,1,2,3 --test-id 1,2 --tp-frequency 50
```
The above command runs the test with 1000 trace points and compiles the performance measurements for the given set of threads, with a maximum overhead of 1.5%; every trace point created is visited twice. The command-line arguments are described in detail below:
- `--type, -y` [required]
  - This flag takes in the type of tests that need to be run. The allowed options are [semantic, performance].
  - semantic: Runs semantic tests that check the correctness of the framework operations; they are split into three separate tests:
    - Performs string table tests on 1000 strings.
    - Performs tests on trace point events, checking, for example, that the same event is returned for the same payload.
    - Performs notification tests to see whether trace events are notified correctly, and records the instances of notifications per event.
  - performance: Runs performance tests on the provided input configuration; these tests measure the cost of various operations used in the framework and are split into two separate tests:
    - Data structure tests that capture the average cost of string table inserts and lookups, trace point event creation, lookup using the same payload or the event's `unique_id`, and notification.
    - Instrumentation tests that project the number of events that can be serviced per second using the configuration provided on the command line. These tests are where the `--overhead` and `--tp-frequency` arguments are used.
- `--trace-points, -t` [required]
  - Number of trace point events to create and use for the test. The expected range is [10-100000].
- `--test-id, -i` [required]
  - Takes a comma-separated list of tests to run. This argument also accepts a range; the formats are described below:
    - Comma-separated sequence: `--test-id 1,2,3`
    - Range description: `--test-id 1:3:1`
- `--num-threads, -n` [required]
  - Takes a list of thread counts to use for the requested tests.
    - Comma-separated sequence: `--num-threads 0,1,2,4,8,12,16`
    - Range description: `--num-threads 0:2:1,4:16:4`
- `--tp-frequency, -f`
  - The trace point creation frequency determines the total number of trace point visits to perform for every trace point event that is created. If the trace point creation frequency is 10%, then every trace point event that is created must be visited 10 times. Since the number of trace point events is given on the command line (`--trace-points N`), we multiply this value N by 100/f, where f is the trace point frequency in percent, to get the total number of trace point visits.
  - So, if the number of trace points is 5000 and the trace point frequency is 10%, the total number of trace point visits the test performs is 5000 x 100/10 = 50000.
- `--overhead`
  - The overhead input allows the test framework to use the measured performance of trace point creation and notification to estimate how many events can be serviced per second with the given configuration.
  - The default overhead for which the events/sec value is computed is 1%.
  - If the desired overhead is 1%, the following formulas are used to compute the events/sec:

    Total cost of instrumentation I = cost of trace point creation + cost of notification

    So, with `--trace-points 5000 --tp-frequency 10`, this is: I = 5000 x Cost(TP create) + 50000 x Cost(Notify)

    Average cost A = I / 50000, for the 50000 events notified

    This cost A does not take into account the cost of the callback handler. In our projections, we use handler costs of 10ns, 100ns, and 500ns to compute the events/sec that can be serviced. On average, handler costs for real-world cases will fall somewhere between 80ns and 400ns.

    If the average cost is A and this must amount to a 1% overhead, the total run time must be 100 x A ns, so:

    Events/second E = 1,000,000,000 ns / (100 x A) ns
Using the metrics described above, we run the tests with varying overheads and trace point creation frequencies to determine the maximum number of events that can be serviced for each configuration. Some sample configurations are shown below:

- Configuration where each trace point event created is visited only once:

  ```bash
  run_test --trace-points 5000 --type performance --num-threads 0,1,2,4 --test-id 1,2 --tp-frequency 100
  ```

- Configuration where each trace point event is visited twice:

  ```bash
  run_test --trace-points 5000 --type performance --num-threads 0,1,2,4 --test-id 1,2 --tp-frequency 50
  ```

- Configuration where each trace point event is visited ten times:

  ```bash
  run_test --trace-points 5000 --type performance --num-threads 0,1,2,4 --test-id 1,2 --tp-frequency 10
  ```
## Modeling and Projection

To determine the number of events the framework can service in a second, the performance tests use the following approach: if the total instrumentation cost per event is 1µs, then for this cost to stay under a 1% overhead, the amount of work accomplished for every trace event must be 1µs x 100 = 100µs. In this case, the maximum number of events that can be notified/serviced is:

1 sec / 100µs = 1,000,000µs / 100µs = 10,000 events/sec

The total instrumentation cost includes time spent in the framework infrastructure and the cost of handling each notification through callbacks in the subscriber.
On average, some trace points are visited only once while others are visited tens to hundreds of times. We assume that each trace point created is visited at least 10 times. The command line that tests such a configuration is shown below.

```bash
run_test --trace-points 10000 --type performance --num-threads 0,1,2,4 --test-id 1,2 --tp-frequency 10
```
We take the average cost of a trace point event creation plus the cost of 10 notifications for each such event (since each is visited 10 times) as the basis for the cost incurred within the framework. This information is reported by the performance test. The total instrumentation cost, as discussed in the previous section, comprises a framework cost and a callback-handler cost in the subscriber:

Framework cost: FWcost = Avg{TPcreate + 10 x TPnotify}

Subscriber cost: Callbackcost = Ct, which could be anywhere in the range [10-10000]ns

Total cost: Costtotal = FWcost + Ct

Using the information from the report, or one such instance captured in the measurements table below, we know that FWcost = ~55ns.

Using different values of Ct = [10, 100, 500, 1000]ns, we get the table that shows the events/sec that can be serviced at the total instrumentation cost for the configuration. Notice that as the callback-handler cost increases, the events/sec figure is inversely proportional to it. The work-unit cost for determining the number of events/sec is given by:

Wcost = 100 x [FWcost + Ct], for the configuration that limits overheads to 1%.
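For example, with FWcost = ~55ns and a 1% overhead limit, this model reproduces the 1% rows of the projection table at the end of this document:

- Ct = 10ns: E = 1,000,000,000 / (100 x 65) ≈ 154K events/sec
- Ct = 100ns: E = 1,000,000,000 / (100 x 155) ≈ 65K events/sec
- Ct = 500ns: E = 1,000,000,000 / (100 x 555) ≈ 18K events/sec
- Ct = 1000ns: E = 1,000,000,000 / (100 x 1055) ≈ 9.5K events/sec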
The more often a trace point event is visited, the more events per second the framework can service, because the cost of creating the trace point event is amortized over all visits to the same trace point. However, Ct will eventually limit the achievable events/sec once it becomes significantly larger than FWcost.
NOTE: All measurements reported in this document were taken on a NUC form-factor machine with an Intel® Core™ i7-8559U @ 2.7 GHz processor running Ubuntu 18.04. The tests were compiled to use Threading Building Blocks concurrent containers for these runs.
| Operation | Statistic | Scenario | Count | Framework cost (ns) |
|---|---|---|---|---|
| String table insertion | Cost/insertion | Create a large number of strings (>1000) and insert them into the table. Measure the average cost of multi-threaded insertion. | 10000 | ~150-500ns |
| String table lookup | Cost/lookup | Look up the strings added in the insertion test in random order and measure the average cost of the lookup. | 20000 | ~40ns |
| String table insert/lookup | Cost of insert/lookup | Strings added to the string table may be looked up multiple times. On average, we assume that every string added to the string table is looked up twice. If strings are looked up more often than the twice assumed in this test, the average cost of insertion/lookup will be lower. | 30000 | ~130ns |
| Trace point creation | Cost/creation | Create unique trace points and measure the average cost of each creation. | 10000 | ~1100ns |
| Trace point creation | Cost/creation | Attempt to create previously created trace points again and measure the average cost of each creation. Since the payloads are the same, there should be an early exit once it is determined that they are the same. | 100000 | ~275ns |
| Trace point lookup using `unique_id` (FW lookup) | Cost/lookup | Purely the cost of finding the trace point event, given its `unique_id`. | 100000 | ~35ns |
| Trace point event caching | Cost/lookup | If the trace point event is cached at the event site, this cost is 0. This is the most efficient mode of use and amortizes the cost of trace point event creation the best. | 100000 | ~0ns |
| Trace event notification | Cost/notification | Measure the average cost of notification. Here the registered callback handler returns immediately; callback-handler overheads are modeled separately when computing the maximum number of events that can be serviced per second. | 100000 | ~10ns |
| Trace event composite cost [FWcost] | Average cost/trace point | Create N unique trace points and perform MxN trace point lookups + MxN notifications. Measure the total time and compute the average using MxN as the denominator. | 100000 | ~55ns |
NOTE: The trace point, as implemented in the earlier code block, can incur significant runtime overheads, because the framework has to look up the `unique_id` of the payload provided at the trace point, and this lookup can be costly. It is recommended that the event created for each trace point be cached locally for subsequent use. An example implementation of this optimization using a static variable is shown in the code block below. If the instrumentation is part of a class, the event can instead be saved as a member variable.
```cpp
void function1() {
  uint64_t instance_id = 0;
  // Cache the event across calls so that the payload lookup is
  // performed only on the first visit to this trace point
  static xpti::trace_event_data_t *f1_event = nullptr;
  if (xptiTraceEnabled() && !f1_event) {
    xpti::payload_t p("function1", "main.cpp", 104, 2, (void *)function1);
    f1_event = xptiMakeEvent("function1", &p,
                             (uint16_t)xpti::trace_event_type_t::algorithm,
                             xpti::trace_activity_type_t::active,
                             &instance_id);
  }
  xpti::framework::scoped_notify ev("myStream",
      (uint16_t)xpti::trace_point_type_t::region_begin, nullptr, f1_event,
      instance_id);
  for (int i = 0; i < 5; ++i) {
    function2();
  }
}
```
NOTE: Using the framework cost FWcost = ~55ns and a set of callback-handler costs, we get the following table, which shows the events/sec that XPTI can handle for a given configuration.
| Trace points | Threads | Overhead | Events/sec @ Ct=10ns | Events/sec @ Ct=100ns | Events/sec @ Ct=500ns | Events/sec @ Ct=1000ns |
|---|---|---|---|---|---|---|
| 10000 | Serial | 1% | ~150K | ~64K | ~18K | ~9.5K |
| 10000 | 4 | 1% | ~150K | ~64K | ~18K | ~9.5K |
| 10000 | Serial | 2% | ~300K | ~127K | ~36K | ~19K |
| 10000 | 4 | 2% | ~290K | ~125K | ~36K | ~19K |
| 1000 | Serial | 1% | ~165K | ~66K | ~18K | ~9.5K |
| 1000 | 4 | 1% | ~165K | ~66K | ~18K | ~9.5K |
| 1000 | Serial | 2% | ~360K | ~137K | ~37K | ~19K |
| 1000 | 4 | 2% | ~345K | ~135K | ~37K | ~19K |
The data in the table above is provided only as a guideline for what to expect from the instrumentation. The events/sec figure can be quite high if the callback handlers are written efficiently: depending on the cost of handling the callbacks, the number of events that can be serviced ranges from roughly 10,000 to ~300,000 per second.