affinity/cpp-20/d0796r2.md (32 additions & 14 deletions)
@@ -96,9 +96,10 @@ Some systems give additional user control through explicit binding of threads to
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions. These include:

- * How to represent, identify and navigate the topology of execution resources available within a heterogeneous or distributed system.
- * How to query and measure the relative affininty between different execution resources within a system.
- * How to bind execution and allocation particular execution resource(s).
+ * How to represent, identify and navigate the topology of execution and memory resources available within a heterogeneous or distributed system.
+ * How to query and measure the relative affinity between execution and memory resources within a system.
+ * How to bind execution to particular execution resource(s).
+ * How to bind allocation to particular memory resource(s).
* What kind of and level of interface(s) should be provided by C++ for affinity.

Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems.
@@ -110,25 +111,39 @@ There are also some additional challenges which we have been investigating but a
### Querying and representing the system topology

- The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
+ The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources* and *memory resources*.

The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C++. The current proposal for executors [[22]][p0443r4] leaves the *execution resource* largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
- Two important considerations when defining a unified interface for querying the *resource topology* of a *system*, are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources*. As both the level of abstraction of an *execution resource* and the granularity that it is described in will vary greatly from one implementation to another, it’s important for the interface to be generic enough to support any level of abstraction. To achieve this we propose a generic hierarchical structure of *execution resources*, each *execution resource* being composed of other *execution resources* recursively. Each *execution resource* within this hierarchy can be used to place memory (i.e., allocate memory within the *execution resource’s* memory region), place execution (i.e. bind an execution to an *execution resource’s execution agents*), or both.
+ Two important considerations when defining a unified interface for querying the *resource topology* of a *system* are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources* and *memory resources*. As both the level of abstraction of resources and the granularity at which they are described will vary greatly from one implementation to another, it’s important for the interface to be generic enough to support any level of abstraction. To achieve this we propose generic hierarchical structures for *execution resources* and *memory resources*, where each *resource* is composed of other *resources* recursively.

- For example, a NUMA system will likely have a hierarchy of nodes, each capable of placing memory and placing agents. A CPU + GPU system may have GPU local memory regions capable of placing memory, but not capable of placing agents.
+ Each *execution resource* within its hierarchy can be used to place execution (i.e. bind an execution to an *execution resource*).
+ Each *memory resource* within its hierarchy can be used to place memory (i.e., allocate memory within the *memory resource*).
+ For example, a NUMA system will likely have a hierarchy of nodes, each capable of placing memory and placing execution. A CPU + GPU system may have GPU local memory regions capable of placing memory, but not capable of placing execution.

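To make the recursive composition concrete, here is a minimal, self-contained sketch. It is illustrative only, not proposed wording; the struct and field names are invented for this example. It models a toy CPU + GPU system in which each resource is a list of finer-grained resources and records whether execution, memory, or both can be placed on it.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative only: each resource is recursively composed of child resources,
// and records whether execution and/or memory can be placed on it.
struct resource {
  std::string name;
  bool can_place_execution;
  bool can_place_memory;
  std::vector<resource> resources;  // child resources, recursively
};

void print(const resource &r, int depth = 0) {
  std::cout << std::string(depth * 2, ' ') << r.name << "\n";
  for (const auto &child : r.resources) print(child, depth + 1);
}

int main() {
  // A toy CPU + GPU system: the GPU's local memory region can place memory
  // but not execution, matching the example in the text above.
  resource core0{"core-0", true, false, {}};
  resource core1{"core-1", true, false, {}};
  resource numa0{"numa-node-0", true, true, {core0, core1}};
  resource gpu_mem{"gpu-local-memory", false, true, {}};
  resource gpu0{"gpu-0", true, false, {gpu_mem}};
  resource system{"system", false, false, {numa0, gpu0}};
  print(system);
}
```
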
- Nowadays, there are various APIs and libraries that enable this functionality. One of the most commonly used is the [Portable Hardware Locality (hwloc)][hwloc]. Hwloc presents the hardware as a tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath: Placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [[21]][lstopo] shows more interesting examples of topologies that appear on today's systems.
+ Nowadays, there are various APIs and libraries that enable this functionality. One of the most commonly used is the [Portable Hardware Locality (hwloc)][hwloc]. Hwloc presents the execution and memory hardware as a single tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath: Placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [[21]][lstopo] shows more interesting examples of topologies that appear on today's systems.

[Figure: lstopo output for a 2-socket Xeon E5300 server]

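As a point of reference for what such a query can look like today, below is a short sketch that uses hwloc's C API from C++ to print the topology tree of the current machine. It assumes hwloc is installed and linked (e.g. `g++ topo.cpp -lhwloc`), and its output depends entirely on the hardware it runs on.

```cpp
#include <hwloc.h>
#include <cstdio>

// Recursively print an hwloc object and its children, indented by depth.
static void print_tree(hwloc_obj_t obj, int depth) {
  char type[64];
  // Render the object type ("Machine", "Package", "Core", "PU", ...).
  hwloc_obj_type_snprintf(type, sizeof(type), obj, 0);
  std::printf("%*s%s L#%u\n", 2 * depth, "", type, obj->logical_index);
  for (unsigned i = 0; i < obj->arity; ++i)
    print_tree(obj->children[i], depth + 1);
}

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);  // create an empty topology object
  hwloc_topology_load(topo);   // discover the topology of the current machine
  print_tree(hwloc_get_root_obj(topo), 0);
  std::printf("cores: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
  hwloc_topology_destroy(topo);
}
```
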
- The interface of `thread_execution_resource_t` proposed in the execution context proposal [[23]][p0737r0] proposes a hierarchical approach where there is a root resource and each resource has a number of child resources. However, systems are becoming increasingly non-hierarchical and a traditional tree-based representation of a *system’s resource topology* may not suffice any more [[24]][exposing-locality]. The HSA standard solves this problem by allowing a node in the topology to have multiple parent nodes [19].
+ The interface of `thread_execution_resource_t` in the execution context proposal [[23]][p0737r0] proposes a hierarchical approach where there is a root resource and each resource has a number of child resources.

- The interface for querying the *resource topology* of a *system* must be flexible enough to allow querying all *execution resources* available under an *execution context*, querying the *execution resources* available to the entire system, and constructing an *execution context* for a particular *execution resource*. This is important, as many standards such as OpenCL [[6]][opencl-2-2] and HSA [[7]][hsa] require the ability to query the *resource topology* available in a *system* before constructing an *execution context* for executing work.
+ In some heterogeneous systems, execution and memory resources are not naturally represented by a single tree [[24]][exposing-locality]. The HSA standard solves this problem by allowing a node in the topology to have multiple parent nodes [19].
+
+ The interface for querying the *resource topology* of a *system* must be flexible enough to allow querying the *execution resources* and *memory resources* available to the entire system, querying the affinity between an *execution resource* and a *memory resource*, querying the *execution resource* associated with an *execution context*, and constructing an *execution context* for a particular *execution resource*. This is important, as many standards such as OpenCL [[6]][opencl-2-2] and HSA [[7]][hsa] require the ability to query the *resource topology* available in a *system* before constructing an *execution context* for executing work.

> For example, an implementation may provide an execution context for a particular execution resource such as a static thread pool or a GPU context for a particular GPU device, or an implementation may provide a more generic execution context which can be constructed from a number of CPU and GPU devices queryable through the system resource topology.
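A rough sketch of the flow this enables is shown below. It is an assumption-laden illustration, not proposed wording: `execution::this_system::resources()` appears in a later listing of this paper, while the `execution_context` constructor and `executor()` member are assumed here based on the prose description of the execution context interface above.

```cpp
// Sketch only: names other than execution::this_system::resources() are assumptions.
void run_on_first_resource() {
  auto resources = execution::this_system::resources();  // query the whole-system topology
  execution_context ctx{*resources.begin()};             // construct a context for one resource
  auto ex = ctx.executor();                               // executor that places work on that resource
  // ... submit work via `ex` using the executors interface [[22]][p0443r4] ...
}
```
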
- ### Topology discovery & fault tolerance
+ ### Dynamic resource discovery & fault tolerance: currently out of scope

In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
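For reference, the guarantees described above are what make the following minimal example valid on any conforming implementation today: at least one thread always exists, thread-local storage is always available, and the hardware concurrency can always be queried (though it may report 0 when the value is not computable).

```cpp
#include <iostream>
#include <thread>

thread_local int per_thread_counter = 0;  // thread-local storage is always available

int main() {
  // At least one thread of execution always exists; hardware_concurrency()
  // is only a hint and may return 0 if the value is not computable.
  unsigned hint = std::thread::hardware_concurrency();
  std::cout << "hardware concurrency hint: " << hint << "\n";
  ++per_thread_counter;
}
```
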
@@ -158,7 +173,7 @@ The initial solution should target systems with a single addressable memory region
### Querying the relative affinity of partitions
- In order to make decisions about where to place execution or allocate memory in a given *system’s resource topology*, it is important to understand the concept of affinity between different *execution resources*. This is usually expressed in terms of latency between two resources. Distance does not need to be symmetric in all architectures.
+ In order to make decisions about where to place execution or allocate memory in a given *system’s resource topology*, it is important to understand the concept of affinity between an *execution resource* and a *memory resource*. This is usually expressed in terms of latency between these resources. Distance does not need to be symmetric in all architectures.

The relative position of two components in the topology does not necessarily indicate their affinity. For example, two cores from two different CPU sockets may have the same latency to access the same NUMA memory node.
@@ -187,9 +202,12 @@ traditional *context of execution* usage that refers
to the state of a single executing callable; *e.g.*,
program counter, registers, stack frame. *--end note*]
- The **concurrency** of an execution resource is the maximum number of execution agents that could make concurrent forward progress on that execution resource.
+ The **concurrency** of an execution resource is an upper bound of the number of execution agents that could make concurrent forward progress on that execution resource.
+ It is guaranteed that no more than **concurrency** execution agents could make concurrent forward progress; it is not guaranteed that **concurrency** execution agents will ever make concurrent forward progress.

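As a hedged illustration of how this definition is meant to be used (the `concurrency()` member name is an assumption here), a scheduler would treat the value as a capacity limit rather than a promise of parallel progress:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: ExecutionResource::concurrency() is assumed to return the
// upper bound defined above.
template <class ExecutionResource>
std::size_t choose_worker_count(const ExecutionResource &res,
                                std::size_t work_items) {
  // Never create more agents than could ever make concurrent forward progress
  // on `res`; fewer may still be all that actually run at any one time.
  return std::min<std::size_t>(res.concurrency(), work_items);
}
```
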
### Execution resources
@@ -217,7 +235,7 @@ for (auto res : execution::this_system::resources()) {
### Querying relative affinity
- The `affinity_query` class template provides an abstraction for a relative affinity value between two `execution_resource`s, derived from a particular `affinity_operation` and `affinity_metric`. The `affinity_query` is templated by `affinity_operation` and `affinity_metric` and is constructed from two `execution_resource`s. An `affinity_query` does not mean much on it's own, instead a relative magnitude of affinity can be queried by using comparison operators. If nessesary the value of an `affinity_query` can also be queried through `native_affinity`, though the return value of this is implementation defined.
+ The `affinity_query` class template provides an abstraction for a relative affinity value between an execution resource and a memory resource, derived from a particular `affinity_operation` and `affinity_metric`. The `affinity_query` is templated by `affinity_operation` and `affinity_metric` and is constructed from an execution resource and a memory resource. An `affinity_query` does not mean much on its own; instead, a relative magnitude of affinity can be queried by using comparison operators. If necessary, the value of an `affinity_query` can also be queried through `native_affinity`, though the return value of this is implementation defined.

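A rough sketch of this comparison-based usage is shown here; the `affinity_operation` and `affinity_metric` enumerator values are assumptions, and only the identifiers quoted in the paragraph above are taken from the proposal. It picks whichever memory resource the comparison reports as having the stronger affinity to a given execution resource.

```cpp
// Sketch only: enumerator names such as affinity_operation::read and
// affinity_metric::latency are assumed for illustration.
template <class ExecResource, class MemResource>
const MemResource &closer_memory(const ExecResource &exec,
                                 const MemResource &mem_a,
                                 const MemResource &mem_b) {
  affinity_query<affinity_operation::read, affinity_metric::latency> qa{exec, mem_a};
  affinity_query<affinity_operation::read, affinity_metric::latency> qb{exec, mem_b};
  // Only the relative magnitude of two queries is meaningful; the raw,
  // implementation-defined value is available via native_affinity() if needed.
  // Assumption: a query that compares lower corresponds to lower latency.
  return (qa < qb) ? mem_a : mem_b;
}
```
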
Below *(listing 3)* is an example of how you can query the relative affinity between two `execution_resource`s.