______ _
(_____ \ | |
_____) )__ ____ __| |
| ____/ _ \| _ \ / _ |
| | | |_| | | | ( (_| |
|_| \___/|_| |_|\____| - Compute Express Link (CXL) based Memory Pooling Systems
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms
The preprint of the paper can be found here.
@InProceedings{pond.asplos23,
author = {Huaicheng Li and Daniel S. Berger and Lisa Hsu and Daniel Ernst and Pantea Zardoshti and Stanko Novakovic and Monish Shah and Samir Rajadnya and Scott Lee and Ishwar Agarwal and Mark D. Hill and Marcus Fontoura and Ricardo Bianchini},
title = "{Pond: CXL-Based Memory Pooling Systems for Cloud Platforms}",
booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
address = {Vancouver, BC Canada},
month = {March},
year = {2023}
}
This repo open-sources our approach for CXL simulation and evaluations (mainly some scripts). It is not a full-fledged open-source version of Pond design.
We mainly emulate the following two characteristics of Compute Express Link (CXL) attached DRAM:
- Latency: ~150ns
- No local CPU which can directly accesses it, i.e., CXL-memory treated as a "computeless/cpuless" node
The CXL latency is similar to the latency of one NUMA hop on modern 2S systems, thus, we simulate the CXL memory using the memory on the remote NUMA node and disables all the cores on the remote NUMA node to simulate the "computeless" node behavior.
In this repo, the CXL-simulation is mainly done via a few scripts, check
cxl-global.sh
and the run-xxx.sh
under each workload folder (e.g.,
cpu2017
, gapbs
, etc.).
These scripts dynamically adjust the system configuration to simulate a certain percentage of CXL memory in the system, and run workloads against such scenarios.
One major setup we are benchmarking is to adjust the percentage of CXL-memory being used to run a certain workload and observe the performance impacts (compared to pure local DRAM "ideal" cases). For example, the common ratios the scripts include "100/0" (the 100% local DRAM case, no CXL), "95/5", "90/10" (90% local DRAM + 10% CXL), "85/15", "80/20", "75/25", "50/50", "25/75", etc.
To provision the correct amount of local/CXL memory for the above split ratios, we
need to profile the peak memory usage of the target workload. This is usually
done via monitoring tools such as pidstat
(the RSS
field reported in
the memory usage).
To limit the size of local DRAM, we use memeater
to consume a desirable amount of local memory. This works under Linux First-Touch policy.
However, if you're experimenting with memory tiering systems, do not use it. Even
though memory pages in "memeater" are lock()
ed, they can still be migrated by the OS
(from one node to another).
Under folder cpu2017
, run-cpu2017.sh
is the main entry to run a series
of experiments under various split configurations. The script reads the
profiled information in a workload input file (e.g., wi.txt
), where the
first column is the workload name and the second column is the peak memory
consumption of the workload. Based on these, run-cpu2017.sh
will iterate
over a series of predefined split-ratios and run the experiments one by one.
The scripts writes the logs and outputs to rst
folder.
One could co-run profiling utilities such as emon
or Intel Vtune
together with the workload to collect architecture-level metrics for
performance analysis. Make sure Intel Vtune is installed first before running
the script.