This repository was archived by the owner on Oct 31, 2024. It is now read-only.

Commit 66a5074

Merge pull request #6 from hsane2001/main
Updating documentation and adding support for containerd and crio
2 parents c885b1b + fced42c commit 66a5074

6 files changed: +122 -215 lines changed


Dockerfile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+FROM ubuntu:20.04
+ENV DEBIAN_FRONTEND=noninteractive
+WORKDIR /
+RUN apt update ; apt-get install apt-transport-https ca-certificates -y ; update-ca-certificates
+RUN apt-get update && \
+    apt-get upgrade -y && \
+    apt-get install --no-install-recommends -y \
+    zip bison build-essential cmake flex git libedit-dev \
+    libllvm12 llvm-12-dev libclang-12-dev python zlib1g-dev libelf-dev libfl-dev python3-setuptools \
+    liblzma-dev arping netperf iperf linux-tools-generic python3-pip && rm -rf /var/lib/apt/lists/*
+RUN rm /usr/bin/perf
+RUN ln -s /usr/lib/linux-tools/*/perf /usr/bin/perf
+RUN git clone https://github.com/iovisor/bcc.git
+RUN mkdir bcc/build; cd bcc/build ; cmake .. ; make ; make install ; cmake -DPYTHON_CMD=python3 .. ; cd src/python/ ; make ; make install ; cd ../..
+COPY procmon/ .
+COPY requirements.txt .
+RUN pip install -r requirements.txt

README.md

Lines changed: 62 additions & 175 deletions
@@ -1,193 +1,80 @@
-# Workload Interference Detector
-
-## Introduction
-
-Workload Interference Detector is a tool that leverages the Intel Performance Monitoring Units (PMU) to monitor and detect interference between workloads. Traditional PMU drivers that work in counting mode (i.e., emon, perf-stat) provide system level analysis with very little overhead. However, these drivers lack the ability to breakdown the system level metrics (CPI, cache misses, etc) at a process or application level. With eBPF, it is possible to associate the process context with the HW counter data, providing the ability to breakdown PMU metrics by process at a system level. Additionally, since eBPF runs filters in the kernel and uses perf in counting mode, this incurs very little overhead, allowing for real-time performance tracking.
-
-## Contents:
-
-*_procmon_*: Dumps performance metrics per process in counting mode through eBPF functionality using perf interface.
-
-*_dockermon_*: Shows the same performance metrics but on the container level (i.e. a single record for each container-core, or a single record for each container). It also has the option to export data to cloudwatch. Please check cloudwatch pricing: https://aws.amazon.com/cloudwatch/pricing/
-
-*_NN_detect_*: Monitors the performance for a given workload (process or container) and compares it to a reference-signature. If any of the performance metrics deviates by an amount > a user-specified threshold (10% by default), the workload is flagged as a noisy neighbor victim and a list of workloads that likely caused the performance degradation is shown.
-
-## Installation
-
-1. Install all distribution-specific requirements for [compiling BCC from source.](https://github.com/iovisor/bcc/blob/master/INSTALL.md#source)
-
-2. Test it using a quick example:
+<div align="center">
+
+<div id="user-content-toc">
+<ul>
+<summary><h1 style="display: inline-block;">Workload Interference Detector</h1></summary>
+</ul>
+</div>
+
+![CodeQL](https://github.com/intel/interferencedetector/actions/workflows/codeql.yml/badge.svg)[![License](https://img.shields.io/badge/License-MIT-blue)](https://github.com/intel/interferencedetector/blob/master/LICENSE)
+
+[Requirements](#requirements) | [Usage](#usage) | [Demo](#demo) | [Notes](#notes)
+</div>
+
+Workload Interference Detector uses a combination of hardware events and eBPF to capture a holistic signature of a workload's performance at very low overhead.
+1. instruction efficiency
+    - cycles
+    - instructions
+    - cycles per instruction
+2. disk IO
+    - local bandwidth (MB/s)
+    - remote bandwidth (MB/s)
+    - disk reads (MB/s)
+    - disk writes (MB/s)
+3. network IO
+    - network transmitted (MB/s)
+    - network received (MB/s)
+4. cache
+    - L1 instruction misses per instruction
+    - L1 data hit ratio
+    - L1 data miss ratio
+    - L2 miss ratio
+    - L3 miss ratio
+5. scheduling
+    - scheduled count
+    - average queue length
+    - average queue latency (ms)
+
+## Requirements
+1. Linux Perf
+2. [BCC compiled from source.](https://github.com/iovisor/bcc/blob/master/INSTALL.md#source)
+3. `pip install -r requirements.txt`
+4. Access to PMU
+    - Bare-metal
+    - VM with vPMU exposed (uncore metrics like disk IO will be zero)
+5. Intel Xeon chip
+    - Skylake
+    - Cascade Lake
+    - Ice Lake
+    - Sapphire Rapids
+6. Python
+
+## Usage
+1. Monitor processes
 ```
-cd procmon
 sudo python3 procmon.py
 ```
-
-3. For monitoring docker containers, run the following command:
-```
-cd procmon
-sudo python3 dockermon.py
+2. Monitor containers (can also export to cloudwatch)
 ```
-
-4. For monitoring the performance of a process, run the following command:
+sudo python3 cmon.py
 ```
-cd procmon
-sudo python3 NN_detect.py --pid <process-pid> --ref_signature <processes's reference signature> --distance_ratio 0.15
+3. Detect process or container interference. A list of workloads that likely caused the performance degradation is shown.
 ```
+# process
+sudo python3 NN_detect.py --pid <process-pid> --ref_signature <processes's reference signature> --distance_ratio 0.15
 
-5. For monitoring the performance of a container, run the following command:
-```
-cd procmon
+# container
 sudo python3 NN_detect.py --cid <container id> --ref_signature <container's reference signature> --distance_ratio 0.15
 ```
 
+## Demo
 
-## Usage and Example Output
-
-### Procmon
-```
-usage: procmon.py [-h] [-f SAMPLE_FREQ] [-p PID] [-c CPU] [-d DURATION] [-i INTERVAL] [--aggregate_cpus] [--aggregate_cgroup] [--acc] [-v]
-
-eBPF based Core metrics by PID
-
-options:
-  -h, --help            show this help message and exit
-  -f SAMPLE_FREQ, --sample_freq SAMPLE_FREQ
-                        Sample one in this many number of events
-  -p PID, --pid PID     PID
-  -c CPU, --cpu CPU     cpu number
-  -d DURATION, --duration DURATION
-                        duration
-  -i INTERVAL, --interval INTERVAL
-                        interval in seconds
-  --aggregate_cpus      Aggregate all the counters across CPUs, the cpu field will be set to zero for all PIDs/Containers
-  --aggregate_cgroup    Aggregate all the counters on cgroup level, every contaiiner will then have a single row
-  --acc                 collect events in accumulate mode. If not set, all counter cleared in each round
-  -v, --verbose         show raw counters in every interval
-
-```
-
-### Example output
-```
-Timestamp,PID,process,cgroupID,core,cycles,insts,cpi,l1i_mpi,l1d_hit_ratio,l1d_miss_ratio,l2_miss_ratio,l3_miss_ratio,local_bw,remote_bw,disk_reads,disk_writes,network_tx,network_rx,avg_q_len
-1676052270.426364,4203,mlc,6759,10,3034000000,5222000000,0.58,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
-1676052270.426398,4257,python3,5534,60,169000000,57000000,2.96,0.06,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
-1676052270.426417,4203,mlc,6759,8,3094000000,5225000000,0.59,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.00
-1676052270.42643,4203,mlc,6759,7,3262000000,5225000000,0.62,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.00
-1676052270.426441,4203,mlc,6759,9,2936000000,5220000000,0.56,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.00
----------------------------------------------------------------------------------
-Timestamp,PID,process,cgroupID,core,cycles,insts,cpi,l1i_mpi,l1d_hit_ratio,l1d_miss_ratio,l2_miss_ratio,l3_miss_ratio,local_bw,remote_bw,disk_reads,disk_writes,network_tx,network_rx,avg_q_len
-1676052271.429533,4203,mlc,6759,10,3094000000,4808000000,0.64,0.00,0.00,1.00,0.19,0.33,4134.40,0.00,0.00,0.00,0.00,0.00,2.00
-1676052271.429563,4257,python3,5534,60,9000000,8000000,1.12,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
-1676052271.429583,2756,sshd,5534,52,1000000,1000000,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1280.00,0.00,0.00
-1676052271.429605,4203,mlc,6759,8,3094000000,4663000000,0.66,0.00,0.00,1.00,0.30,0.42,6323.20,0.00,0.00,0.00,0.00,0.00,2.00
-1676052271.429619,4203,mlc,6759,7,3095000000,4653000000,0.67,0.00,0.00,1.00,0.30,0.42,6080.00,0.00,0.00,0.00,0.00,0.00,2.00
-1676052271.429632,4203,mlc,6759,9,3095000000,4673000000,0.66,0.00,0.00,1.00,0.30,0.42,6323.20,0.00,0.00,0.00,0.00,0.00,2.00
-
-```
-### Dockermon
-```
-usage: dockermon.py [-h] [-v] [--collect_signatures] [-d DURATION] [--aggregate_on_core | --aggregate_on_containerID]
-                    [--export_to_cloudwatch] [--cloudwatch_sampling_duration_in_sec CLOUDWATCH_SAMPLING_DURATION_IN_SEC]
-
-Display procmon data on docker container level
-
-options:
-  -h, --help            show this help message and exit
-  -v, --verbose         show raw verbose logging info.
-  --collect_signatures  collect signatures of running containers and dump to: signatures.json
-  -d DURATION, --duration DURATION
-                        Collection duration in seconds. Default is 0 (indefinitely)
-  --aggregate_on_core   Show a single aggregated record for each containerID + core. This option is mutually exclusive with '--
-                        aggregate_on_containerID'
-  --aggregate_on_containerID
-                        Show a single aggregated record for each containerID. This option is mutually exclusive with '--
-                        aggregate_on_core'
-  --export_to_cloudwatch
-                        Export collected data to cloudwatch. Expects the following AWS parameters to be configured in `aws cli`:
-                        aws_access_key_id, aws_secret_access_key, aws_region.
-  --cloudwatch_sampling_duration_in_sec CLOUDWATCH_SAMPLING_DURATION_IN_SEC
-                        Duration between samples of data points sent to cloudwatch. Default is 10 (one sample every 10 seconds). The
-                        minimum duration is 1 second. Note: this argument is only effective when --export_to_cloudwatch is set.
-```
-
-### Example output
-```
----------------------------------------------------------------------------------
-Timestamp,containerID,PID,process,cgroupID,core,cycles,insts,cpi,l1i_mpi,l1d_hit_ratio,l1d_miss_ratio,l2_miss_ratio,l3_miss_ratio,local_bw,remote_bw,disk_reads,disk_writes,network_tx,network_rx,avg_q_len
-1676052363.966291,f775ddd0c164,4700,mlc,6824,8,3241000000,1446000000,2.24,0.00,0.00,1.00,1.00,0.41,10771.20,0.00,0.00,0.00,0.00,0.00,2.00
-1676052363.966381,f775ddd0c164,4700,mlc,6824,10,3240000000,1425000000,2.27,0.00,0.00,1.00,1.00,0.44,11249.92,0.00,0.00,0.00,0.00,0.00,0.00
-1676052363.966419,f775ddd0c164,4700,mlc,6824,9,3240000000,1439000000,2.25,0.00,0.00,1.00,1.00,0.41,11249.92,0.00,0.00,0.00,0.00,0.00,2.00
-1676052363.966453,f775ddd0c164,4700,mlc,6824,7,3238000000,1396000000,2.32,0.00,0.00,1.00,1.00,0.47,11010.56,0.00,0.00,0.00,0.00,0.00,2.00
----------------------------------------------------------------------------------
-Timestamp,containerID,PID,process,cgroupID,core,cycles,insts,cpi,l1i_mpi,l1d_hit_ratio,l1d_miss_ratio,l2_miss_ratio,l3_miss_ratio,local_bw,remote_bw,disk_reads,disk_writes,network_tx,network_rx,avg_q_len
-1676052364.968383,f775ddd0c164,4700,mlc,6824,8,3093000000,1399000000,2.21,0.00,0.00,1.00,1.00,0.45,10622.72,0.00,0.00,0.00,0.00,0.00,1.00
-1676052364.968449,f775ddd0c164,4700,mlc,6824,10,3093000000,1371000000,2.26,0.00,0.00,1.00,1.00,0.43,11610.88,0.00,0.00,0.00,0.00,0.00,1.00
-1676052364.968496,f775ddd0c164,4700,mlc,6824,9,3093000000,1375000000,2.25,0.00,0.00,1.00,1.00,0.45,11610.88,0.00,0.00,0.00,0.00,0.00,1.00
-1676052364.968533,f775ddd0c164,4700,mlc,6824,7,3093000000,1341000000,2.31,0.00,0.00,1.00,1.00,0.46,11363.84,0.00,0.00,0.00,0.00,0.00,1.00
-```
-
-### NN\_detect
-```
-usage: NN_detect.py [-h] [-p PID] [-c CID] [--outfile OUTFILE] [-s SYSTEM_WIDE_SIGNATURES_PATH | -r REF_SIGNATURE] [-d DISTANCE_RATIO]
-
-Detect Noisy Neighbors for a given PID (process-level) or container ID (container-level).
-
-options:
-  -h, --help            show this help message and exit
-  -p PID, --pid PID     PID (process-level)
-  -c CID, --cid CID     Container ID (container-level)
-  --outfile OUTFILE     Output file to save live-updated performance data
-  -s SYSTEM_WIDE_SIGNATURES_PATH, --system_wide_signatures_path SYSTEM_WIDE_SIGNATURES_PATH
-                        path to signatures_*.csv CSV file with referernce signatures per container ID, as generated by dockermon.
-  -r REF_SIGNATURE, --ref_signature REF_SIGNATURE
-                        The tool will use this signature as a baseline. Use the output of either procmon or dockermon to collect the signature. The first element in the signature is `cycles`. All live updated signatures will be compared
-                        to this reference signature. Use a standalone signature (when the process is the only process executing in the system), or any signature collected over a performance-acceptable duration.
-  -d DISTANCE_RATIO, --distance_ratio DISTANCE_RATIO
-                        Acceptable ratio of change in signature from reference, default is 0.1. If the distance is higher than this value, the monitored workload will flagged as a noisy neighbor victim.
-```
-### Example output
-```
------------------------------------------------------------------
-Header: Timestamp,containerID,core,cycles,insts,cpi,l1i_mpi,l1d_hit_ratio,l1d_miss_ratio,l2_miss_ratio,l3_miss_ratio,local_bw,remote_bw,disk_reads,disk_writes,network_tx,network_rx,avg_q_len
-Reference Signature: [3097000000.0, 1305000000.0, 2.37, 0.0, 0.0, 1.0, 1.0, 0.41, 10925.44, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
-Detected Signature on core 7 : [3093000000.0, 1361000000.0, 2.27, 0.0, 0.0, 1.0, 1.0, 0.47, 11791.36, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
-Distance from reference: 6.0% ==> Performance is OK
-Detected Signature on core 8 : [3092000000.0, 1408000000.0, 2.2, 0.0, 0.0, 1.0, 1.0, 0.43, 11289.6, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
-Distance from reference: 7.89% ==> Performance is OK
-Detected Signature on core 10 : [3091000000.0, 1391000000.0, 2.22, 0.0, 0.0, 1.0, 1.0, 0.44, 11791.36, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
-Distance from reference: 6.59% ==> Performance is OK
-Detected Signature on core 9 : [3092000000.0, 1403000000.0, 2.2, 0.0, 0.0, 1.0, 1.0, 0.42, 12042.24, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
-Distance from reference: 7.51% ==> Performance is OK
-```
-=======
-## Units:
-| Metric           | Unit         |
-| -----------------| -------------|
-| cycles           | RAW          |
-| insts            | RAW          |
-| cpi              | RAW          |
-| l1i_mpi          | Percentage   |
-| l1d_hit_ratio    | Percentage   |
-| l1d_miss_ratio   | Percentage   |
-| l2_miss_ratio    | Percentage   |
-| l3_miss_ratio    | Percentage   |
-| local_bw         | MB/sec       |
-| remote_bw        | MB/sec       |
-| disk_reads       | MB/sec       |
-| disk_writes      | MB/sec       |
-| network_tx       | MB/sec       |
-| network_rx       | MB/sec       |
-| scheduled_count  | RAW          |
-| avg_q_len        | RAW          |
-| avg_q_latency    | milliseconds |
+![basic_stats](https://raw.githubusercontent.com/wiki/intel/interferencedetector/NN_demo1.gif)
 
 ## Notes:
 ** Interference Detector was developed using the following as references:
 1. github.com/iovisor/bcc/tools/llcstat.py (Apache 2.0)
 2. github.com/iovisor/bcc/tools/tcptop.py (Apache 2.0)
 3. github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py (Apache 2.0)
 4. github.com/iovisor/bcc/blob/master/tools/runqlen.py (Apache 2.0)
-5. github.com/iovisor/bcc/blob/master/tools/runqlat.py (Apache 2.0)
-
-** Interference Detector currently supports "Skylake", "Cascade Lake", "Ice Lake", and "Sapphire Rapids" platforms only. It also supports AWS metal instances where PMUs are available (e.g., r5.metal, m5.metal, m6i.metal, etc.). For AWS Single socket instances (r.g., c5.12xlarge, c6i.16xlarge), offcore counters are not available. Hence offcore metrics (e.g., local_bw, remote_bw) will be zeroed out.
-
+5. github.com/iovisor/bcc/blob/master/tools/runqlat.py (Apache 2.0)
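
Note on the detection rule: the README text removed in this commit says a workload is flagged when any metric deviates from its reference signature by more than a user-specified threshold, and the `--distance_ratio` help text gives 0.1 as the default. A minimal sketch of that reading follows. The function name `flag_deviations`, the use of a per-metric relative difference, the skipping of zero baselines, and the sample values are all assumptions made for illustration; the distance computation that `NN_detect.py` actually performs is not shown in this diff and may aggregate metrics differently.

```python
from typing import Dict


def flag_deviations(
    reference: Dict[str, float],
    detected: Dict[str, float],
    threshold: float = 0.1,
) -> Dict[str, float]:
    """Return the metrics whose relative deviation from the reference exceeds the threshold."""
    flagged: Dict[str, float] = {}
    for name, ref_value in reference.items():
        if ref_value == 0:
            # Assumption: a zero baseline gives no meaningful ratio, so skip it.
            continue
        deviation = abs(detected[name] - ref_value) / abs(ref_value)
        if deviation > threshold:
            flagged[name] = deviation
    return flagged


# Hypothetical values, not measurements produced by the tool.
reference = {"cpi": 2.0, "l3_miss_ratio": 0.40, "local_bw": 10000.0, "remote_bw": 0.0}
detected = {"cpi": 2.1, "l3_miss_ratio": 0.50, "local_bw": 10500.0, "remote_bw": 0.0}

flagged = flag_deviations(reference, detected)
for metric, deviation in flagged.items():
    print(f"{metric} deviates from the reference by {deviation:.0%}")
```

With these sample values only `l3_miss_ratio` crosses the 0.1 threshold; `cpi` and `local_bw` each move by about 5% and stay within it.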

procmon/NN_detect.py

Lines changed: 8 additions & 8 deletions
@@ -39,13 +39,13 @@ class bcolors:
         "-s",
         "--system_wide_signatures_path",
         type=str,
-        help="path to signatures_*.csv CSV file with referernce signatures per container ID, as generated by dockermon.",
+        help="path to signatures_*.csv CSV file with referernce signatures per container ID, as generated by cmon.",
     )
     group.add_argument(
         "-r",
         "--ref_signature",
         type=str,
-        help="The tool will use this signature as a baseline. Use the output of either procmon or dockermon to collect the signature. The first element in the signature is `cycles`. All live updated signatures will be compared to this reference signature. Use a standalone signature (when the process is the only process executing in the system), or any signature collected over a performance-acceptable duration.",
+        help="The tool will use this signature as a baseline. Use the output of either procmon or cmon to collect the signature. The first element in the signature is `cycles`. All live updated signatures will be compared to this reference signature. Use a standalone signature (when the process is the only process executing in the system), or any signature collected over a performance-acceptable duration.",
     )
     parser.add_argument(
         "-t",
@@ -168,7 +168,7 @@ def get_signatures_from_csv(cvs_signatures_path):
         dataframe = pandas.read_csv(cvs_signatures_path)
     except FileNotFoundError:
         print(
-            "Signatures file not found. Please provie the path to signatures .csv file"
+            "Signatures file not found. Please provide the path to signatures .csv file"
         )
         sys.exit(1)
     key_col_name = dataframe.columns[0]
@@ -303,9 +303,9 @@ def run_NN_detect(id_to_ref_signatures_dict):
             stderr=PIPE,
         )
     else:
-        # Run dockermon
+        # Run cmon
         proc = Popen(
-            ["python3", "dockermon.py"],
+            ["python3", "cmon.py"],
             stdin=PIPE,
             stdout=PIPE,
             stderr=PIPE,
@@ -317,13 +317,13 @@ def run_NN_detect(id_to_ref_signatures_dict):
 
     while True:
         if not proc.stdout:
-            print("Reading procmon's or dockermon's stdout failed. Exiting...")
+            print("Reading procmon's or cmon's stdout failed. Exiting...")
             return
 
         line = proc.stdout.readline().decode("utf-8").rstrip()
         if not line or "Exiting.." in line:
             error_message = line
-            print("Calling procmon or dockermon failed. Exiting...", error_message)
+            print("Calling procmon or cmon failed. Exiting...", error_message)
             return
 
         parts = line.split(",")
@@ -338,7 +338,7 @@ def run_NN_detect(id_to_ref_signatures_dict):
 
         elif (
             "------------" in line
-        ):  # indicates new collection interval in procmon/dockermon
+        ):  # indicates new collection interval in procmon/cmon
             # Clear console screen
             clear_screen()
             # Write to console
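
Note on the read loop: the hunks above show `run_NN_detect` reading the monitor's stdout one line at a time, splitting each line on commas, and treating a dashed line as the start of a new collection interval. The sketch below illustrates that parsing on captured text rather than a live `Popen` pipe. The helper name `split_intervals` is hypothetical, the sample rows are trimmed from the procmon example output to a few columns, and, unlike the loop in the diff (which exits on an empty line), this sketch simply skips blank lines.

```python
from typing import Dict, Iterable, Iterator, List


def split_intervals(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of row dicts per collection interval."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if "------------" in line:
            # A dashed line separates collection intervals.
            if rows:
                yield rows
            rows = []
        elif line.startswith("Timestamp"):
            header = line.split(",")
        else:
            rows.append(dict(zip(header, line.split(","))))
    if rows:
        yield rows


# Rows trimmed from the procmon example output (most columns dropped).
sample = """\
Timestamp,PID,process,core,cycles,insts,cpi
1676052270.426364,4203,mlc,10,3034000000,5222000000,0.58
---------------------------------------------------------------------------------
Timestamp,PID,process,core,cycles,insts,cpi
1676052271.429533,4203,mlc,10,3094000000,4808000000,0.64
1676052271.429563,4257,python3,60,9000000,8000000,1.12
"""

intervals = list(split_intervals(sample.splitlines()))
for rows in intervals:
    for row in rows:
        # Cycles per instruction recomputed from the raw counters.
        cpi = int(row["cycles"]) / int(row["insts"])
        print(row["process"], row["core"], round(cpi, 2), row["cpi"])
```

Recomputing `cycles / insts` for each row gives a quick consistency check against the `cpi` column that the monitor reports.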
