
Commit 81dd4d4

tzanussi authored and vinodkoul committed
dmaengine: idxd: Add IDXD performance monitor support
Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging.

The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support:

  - it does not support sampling
  - it does not support per-thread counting

However it also has some unique features not present in the core and uncore support:

  - all general-purpose counters are identical, thus there are no event constraints
  - operation is always system-wide

While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line).

More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver creates a sysfs cpumask file for the device and inserts the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device.

The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from the idxd device, and registers itself under the Linux perf framework.

The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (see [1] for specifics).

With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and the previously tracked counter values, and display the final incremental count as the event count for the cycle.
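[Editor's note] To make the cpumask and hotplug behavior described above concrete, here is a minimal sketch of the pattern; it is illustrative only and not the perfmon.c code added by this commit (which is not shown on this page). The perfmon_dsa_cpu_mask variable and the cpumask_show()/perfmon_cpu_offline() helpers are made-up names; the sketch assumes the struct idxd_pmu definition added to idxd.h further down this page.

	/* Illustrative sketch of the "single counting cpu + cpumask sysfs file +
	 * hotplug migration" pattern; not the actual drivers/dma/idxd code.
	 */
	#include <linux/cpumask.h>
	#include <linux/device.h>
	#include <linux/perf_event.h>
	#include "idxd.h"			/* struct idxd_pmu, shown below */

	static cpumask_t perfmon_dsa_cpu_mask;	/* the cpu currently doing the counting */

	static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
				    char *buf)
	{
		return cpumap_print_to_pagebuf(true, buf, &perfmon_dsa_cpu_mask);
	}
	static DEVICE_ATTR_RO(cpumask);

	/* cpu hotplug offline callback: if the cpu going away is the one in the
	 * mask, pick another online cpu and migrate the perf context to it.
	 */
	static int perfmon_cpu_offline(unsigned int cpu, struct hlist_node *node)
	{
		struct idxd_pmu *idxd_pmu;
		unsigned int target;

		idxd_pmu = hlist_entry_safe(node, typeof(*idxd_pmu), cpuhp_node);

		if (!cpumask_test_and_clear_cpu(cpu, &perfmon_dsa_cpu_mask))
			return 0;

		target = cpumask_any_but(cpu_online_mask, cpu);
		if (target >= nr_cpu_ids)
			return 0;	/* no other online cpu left */

		cpumask_set_cpu(target, &perfmon_dsa_cpu_mask);
		idxd_pmu->cpu = target;
		perf_pmu_migrate_context(&idxd_pmu->pmu, cpu, target);

		return 0;
	}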
An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt.

Below are a couple of examples of perf usage for monitoring DSA events.

The following monitors all events in the 'engine' category. Because no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are:

  event 0x1:  total input data processed, in 32-byte units
  event 0x2:  total data written, in 32-byte units
  event 0x4:  number of work descriptors that read the source
  event 0x8:  number of work descriptors that write the destination
  event 0x10: number of work descriptors dispatched from batch descriptors
  event 0x20: number of work descriptors dispatched from work queues

  # perf stat -e dsa0/event=0x1,event_category=0x1/,
      dsa0/event=0x2,event_category=0x1/,
      dsa0/event=0x4,event_category=0x1/,
      dsa0/event=0x8,event_category=0x1/,
      dsa0/event=0x10,event_category=0x1/,
      dsa0/event=0x20,event_category=0x1/
      modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1

   Performance counter stats for 'system wide':

              5,332      dsa0/event=0x1,event_category=0x1/
              5,327      dsa0/event=0x2,event_category=0x1/
                 19      dsa0/event=0x4,event_category=0x1/
                 19      dsa0/event=0x8,event_category=0x1/
                  0      dsa0/event=0x10,event_category=0x1/
                 19      dsa0/event=0x20,event_category=0x1/

       21.977436186 seconds time elapsed

The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which restrict counting to only those events that meet all of the filter criteria.

In this case, the filters specify that only MEM_MOVE operations serviced by work queue wq0, engine engine0, and traffic class tc0, with transfer sizes between 0 and 4k and page sizes between 0 and 1G, result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything.

  # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,
      filter_eng=0x1,event=0x8,event_category=0x3/
      modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1

   Performance counter stats for 'system wide':

                 19      dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,
                         filter_eng=0x1,event=0x8,event_category=0x3/

       21.865914091 seconds time elapsed

The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria.

[1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html

[ Based on work originally by Jing Lin.
]

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
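[Editor's note] As an aside on the "calculate the difference between the latest counter values and the previously tracked counter values" step described in the commit message above: this is the standard pmu accumulation pattern. The sketch below is illustrative only, assuming the struct idxd_pmu fields added to idxd.h further down this page; read_event_counter() is a hypothetical stand-in for the MMIO read of the hardware counter, and example_event_update() is a made-up name.

	#include <linux/bits.h>
	#include <linux/perf_event.h>
	#include "idxd.h"			/* struct idxd_pmu, shown below */

	static u64 read_event_counter(struct perf_event *event);	/* hypothetical */

	static void example_event_update(struct perf_event *event)
	{
		struct idxd_pmu *idxd_pmu = container_of(event->pmu, struct idxd_pmu, pmu);
		struct hw_perf_event *hwc = &event->hw;
		u64 prev, now, delta;

		do {
			prev = local64_read(&hwc->prev_count);
			now  = read_event_counter(event);
		} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);

		/* counters are counter_width bits wide, so mask the wrapped difference */
		delta = (now - prev) & GENMASK_ULL(idxd_pmu->counter_width - 1, 0);
		local64_add(delta, &event->count);
	}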
1 parent a161046 commit 81dd4d4

8 files changed: +979 −0 lines changed
Documentation/ABI/testing/sysfs-bus-event_source-devices-dsa

+30

@@ -0,0 +1,30 @@
+What:		/sys/bus/event_source/devices/dsa*/format
+Date:		April 2021
+KernelVersion:	5.13
+Contact:	Tom Zanussi <tom.zanussi@linux.intel.com>
+Description:	Read-only. Attribute group to describe the magic bits
+		that go into perf_event_attr.config or
+		perf_event_attr.config1 for the IDXD DSA pmu. (See also
+		ABI/testing/sysfs-bus-event_source-devices-format).
+
+		Each attribute in this group defines a bit range in
+		perf_event_attr.config or perf_event_attr.config1.
+		All supported attributes are listed below (See the
+		IDXD DSA Spec for possible attribute values)::
+
+		  event_category = "config:0-3"    - event category
+		  event          = "config:4-31"   - event ID
+
+		  filter_wq      = "config1:0-31"  - workqueue filter
+		  filter_tc      = "config1:32-39" - traffic class filter
+		  filter_pgsz    = "config1:40-43" - page size filter
+		  filter_sz      = "config1:44-51" - transfer size filter
+		  filter_eng     = "config1:52-59" - engine filter
+
+What:		/sys/bus/event_source/devices/dsa*/cpumask
+Date:		April 2021
+KernelVersion:	5.13
+Contact:	Tom Zanussi <tom.zanussi@linux.intel.com>
+Description:	Read-only. This file always returns the cpu to which the
+		IDXD DSA pmu is bound for access to all dsa pmu
+		performance monitoring events.
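[Editor's note] For readers opening counters without the perf tool, the format attributes above map directly onto perf_event_attr.config and perf_event_attr.config1. The following is a minimal userspace example, not part of this commit, that opens the EV_MEM_MOVE counter (event 0x8, category 0x3, as in the commit message) filtered to wq0 and engine0; it assumes a dsa0 pmu is present, sufficient perf_event permissions, and uses the standard per-pmu sysfs "type" file to find the pmu type id.

	/* Pack config/config1 per the bit ranges documented above and open one
	 * system-wide counter.  Illustrative example only.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	int main(void)
	{
		struct perf_event_attr attr;
		unsigned long long count;
		FILE *f;
		int type, fd;

		f = fopen("/sys/bus/event_source/devices/dsa0/type", "r");
		if (!f || fscanf(f, "%d", &type) != 1)
			return 1;
		fclose(f);

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = type;
		/* event=0x8 in config:4-31, event_category=0x3 in config:0-3 */
		attr.config  = (0x8ULL << 4) | 0x3;
		/* filter_wq=0x1 in config1:0-31, filter_eng=0x1 in config1:52-59 */
		attr.config1 = 0x1ULL | (0x1ULL << 52);

		/* system-wide counting: pid = -1, a specific cpu (here cpu 0) */
		fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
		if (fd < 0)
			return 1;

		/* ... run the workload of interest, e.g. the dmatest module ... */

		if (read(fd, &count, sizeof(count)) == sizeof(count))
			printf("MEM_MOVE descriptors: %llu\n", count);
		close(fd);
		return 0;
	}

The perf tool performs the same packing automatically from the format strings, which is why the commit-message examples can name the fields symbolically.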

drivers/dma/Kconfig

+12
@@ -300,6 +300,18 @@ config INTEL_IDXD_SVM
 	depends on PCI_PASID
 	depends on PCI_IOV
 
+config INTEL_IDXD_PERFMON
+	bool "Intel Data Accelerators performance monitor support"
+	depends on INTEL_IDXD
+	help
+	  Enable performance monitor (pmu) support for the Intel(R)
+	  data accelerators present in Intel Xeon CPU. With this
+	  enabled, perf can be used to monitor the DSA (Intel Data
+	  Streaming Accelerator) events described in the Intel DSA
+	  spec.
+
+	  If unsure, say N.
+
 config INTEL_IOATDMA
 	tristate "Intel I/OAT DMA support"
 	depends on PCI && X86_64

drivers/dma/idxd/Makefile

+2
@@ -1,2 +1,4 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
 idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+
+idxd-$(CONFIG_INTEL_IDXD_PERFMON) += perfmon.o
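[Editor's note] Because INTEL_IDXD_PERFMON is a bool and the Makefile change above adds perfmon.o to the existing idxd object list, perfmon support is compiled directly into the idxd driver when enabled; there is no separate perfmon module to load.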

drivers/dma/idxd/idxd.h

+45
@@ -9,6 +9,8 @@
 #include <linux/wait.h>
 #include <linux/cdev.h>
 #include <linux/idr.h>
+#include <linux/pci.h>
+#include <linux/perf_event.h>
 #include "registers.h"
 
 #define IDXD_DRIVER_VERSION	"1.00"
@@ -29,6 +31,7 @@ enum idxd_type {
 };
 
 #define IDXD_NAME_SIZE		128
+#define IDXD_PMU_EVENT_MAX	64
 
 struct idxd_device_driver {
 	struct device_driver drv;
@@ -61,6 +64,31 @@ struct idxd_group {
 	int tc_b;
 };
 
+struct idxd_pmu {
+	struct idxd_device *idxd;
+
+	struct perf_event *event_list[IDXD_PMU_EVENT_MAX];
+	int n_events;
+
+	DECLARE_BITMAP(used_mask, IDXD_PMU_EVENT_MAX);
+
+	struct pmu pmu;
+	char name[IDXD_NAME_SIZE];
+	int cpu;
+
+	int n_counters;
+	int counter_width;
+	int n_event_categories;
+
+	bool per_counter_caps_supported;
+	unsigned long supported_event_categories;
+
+	unsigned long supported_filters;
+	int n_filters;
+
+	struct hlist_node cpuhp_node;
+};
+
 #define IDXD_MAX_PRIORITY	0xf
 
 enum idxd_wq_state {
@@ -241,6 +269,8 @@ struct idxd_device {
 	struct work_struct work;
 
 	int *int_handles;
+
+	struct idxd_pmu *idxd_pmu;
 };
 
 /* IDXD software descriptor */
@@ -437,4 +467,19 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
 int idxd_wq_add_cdev(struct idxd_wq *wq);
 void idxd_wq_del_cdev(struct idxd_wq *wq);
 
+/* perfmon */
+#if IS_ENABLED(CONFIG_INTEL_IDXD_PERFMON)
+int perfmon_pmu_init(struct idxd_device *idxd);
+void perfmon_pmu_remove(struct idxd_device *idxd);
+void perfmon_counter_overflow(struct idxd_device *idxd);
+void perfmon_init(void);
+void perfmon_exit(void);
+#else
+static inline int perfmon_pmu_init(struct idxd_device *idxd) { return 0; }
+static inline void perfmon_pmu_remove(struct idxd_device *idxd) {}
+static inline void perfmon_counter_overflow(struct idxd_device *idxd) {}
+static inline void perfmon_init(void) {}
+static inline void perfmon_exit(void) {}
+#endif
+
 #endif
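[Editor's note] The #else stubs above let the rest of the driver call the perfmon entry points unconditionally, whether or not CONFIG_INTEL_IDXD_PERFMON is set. The sketch below is a rough illustration of the expected call sites based on the commit-message description (probe-time registration, an overflow hook on the irq path); the function names are made up, and the real hook-ups live in the parts of this commit that are not shown on this page. perfmon_init()/perfmon_exit() are similarly expected to run once from the driver's module init/exit paths to set up and tear down the shared cpu hotplug state.

	/* Illustrative only; not the actual probe/irq code from this commit. */
	#include <linux/interrupt.h>
	#include <linux/printk.h>
	#include "idxd.h"

	static int example_probe(struct idxd_device *idxd)
	{
		/* ... existing device bring-up ... */

		/* with perfmon disabled this is a static inline no-op returning 0 */
		if (perfmon_pmu_init(idxd) < 0)
			pr_warn("idxd: perfmon registration failed, continuing without it\n");

		return 0;
	}

	static void example_remove(struct idxd_device *idxd)
	{
		perfmon_pmu_remove(idxd);
		/* ... existing teardown ... */
	}

	static irqreturn_t example_misc_irq_thread(int vec, void *data)
	{
		struct idxd_device *idxd = data;

		/* the real handler checks the interrupt cause for the
		 * counter-overflow bit before handing off to perfmon
		 */
		perfmon_counter_overflow(idxd);

		return IRQ_HANDLED;
	}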
