A heap dumper and analyzer for CPython based on GDB.
The product consists of two parts:
- The dumper which uses GDB.
- The Flask-based UI for heap dump visualization.
It does not require any modification to the target program code, CPython code or installation. A restart is also not needed. In case the target is dockerized, the container also doesn't need any modification.
The dumper needs the following:
- GDB must be installed where the dumper runs (e.g. on the machine host), but is not needed near a target process (e.g. in a container).
- CPython 3.8 - 3.12.
- Docker CLI for working with Docker containers directly (e.g. calling
docker inspect
).
Only Linux is supported at the moment.
The dumper has been tested on x86 (x86_64) and ARM (AArch64) processors.
The dumper is compatible with a target process running on CPython 3.8 - 3.12.
The target process were tested in the following OSes:
- Alpine Linux;
- Ubuntu;
- Fedora;
- Debian.
Some popular libraries were tested:
- Django;
- FastAPI;
- Flask;
- SQLAlchemy
- Jupyter.
Find the PID of a running CPython process you're interested in.
Run:
$ python3 pyheap_dump --pid <pid> --file heap.pyheap
The heap file is written by the target process in its /tmp
, but is moved subsequently under the specified path.
If the target process belongs to a different user, use sudo
.
See
$ python3 pyheap_dump -h
for additional options.
The dumper also can be run in a Docker container.
If the target process is also running in a Docker container, it's possible to attach the dumper container directly to it:
docker run \
--rm \
--pid=container:<container_name_or_id> \
--cap-add=SYS_PTRACE \
--volume $(pwd):/heap-dumps \
ivanyu/pyheap-dumper:latest \
--pid 1 \
--file /heap-dumps/heap.pyheap
You can replace latest
with a release version.
If you need to run it against a process on the host, use --pid=host
instead.
PyHeap can attach to targets that are running in Linux namespaces. Docker containers is the most common example of this situation.
Note: Some Docker setups doesn't have real processes running on the same (virtual) machine where docker ...
control commands are executed. One example is WSL 2 + Docker Desktop on Windows. PyHeap doesn't work in such environments.
If you want to use PyHeap on the root process in a Docker container, use --docker-container
instead of --pid/-p
and specify the name or ID:
$ sudo python3 pyheap_dump --docker-container <container_name> --file heap.pyheap
If it's not the root process in the container, or you work with another container system (e.g. systemd-nspawn) or just generic Linux namespaces, you need to find the target PID. Please mind that this must be the PID from the dumper point of view: processes in namespaces can have their own PID numbers. For example, if you're about to run the dumper on a Linux host and the target process is running in a container, check the process list with ps
or top
on the host. Use --pid/-p
for the dumper.
If the target process is running under a different user (normal for Docker), you need to use sudo
with python3 pyheap_dump ...
.
PyHeap dumper will automatically transfer the heap file from the target namespace to the specified location.
The browser-based PyHeap UI is a convenient way to explore heap dumps. It can show threads, objects with the most retained heap. It allows exploring individual objects as well.
Running the PyHeap UI with Docker is simple:
docker run -it --rm \
--userns=host --user=$(id -u):$(id -g) \
-v ${PWD}:/pyheap-workdir \
--name pyheap-ui -p 5000:5000 \
ivanyu/pyheap-ui:latest \
heap.pyheap
and open http://127.0.0.1:5000.
You can replace latest
with a release version.
The images are published on Docker Hub.
You need a Python installation with Flask to run it. There's a Poetry environment for your convenience in pyheap-ui/.
To view the heap dump with the browser-based UI, go to pyheap-ui/ and run:
PYTHONPATH=src poetry run python -m pyheap_ui --file heap.pyheap
and open http://127.0.0.1:5000.
In case you cannot use the browser-based UI
Analyze the heap with the analyzer
module:
$ PYTHONPATH=src poetry run python -m analyzer retained-heap --file heap.pyheap
[2022-09-07 09:40:46,594] INFO Loading file heap.json.gz
[2022-09-07 09:40:46,633] INFO Loading file finished in 0.04 seconds
[2022-09-07 09:40:46,633] INFO Heap dump contains 18269 objects
[2022-09-07 09:40:46,646] INFO 1761 unknown objects filtered
[2022-09-07 09:40:46,681] INFO Indexing inbound references
[2022-09-07 09:40:46,695] INFO Inbound references indexed in 0.01 seconds
[2022-09-07 09:40:46,701] INFO Loaded retained heap cache
heap.json.gz.ce7ade900911c6edac5fe332a36d43d0a76ac103.retained_heap
Address | Object type | Retained heap size | String representation
--------------------------------------------------------------------------------
140494124474176 | dict | 1101494 | {'__name__': '__main__'
140494121988112 | str | 1000049 | xxxxxxxxxxxxxxxxxxxxxxx
140494125217792 | list | 100113 | ['xxxxxxxxxxxxxxxxxxxxx
94613255597520 | str | 100049 | xxxxxxxxxxxxxxxxxxxxxxx
140494126265024 | dict | 89546 | {'/usr/lib/python310.zi
140494124519104 | dict | 70465 | {'__name__': 'os', '__d
140494123404608 | dict | 64157 | {'__name__': 'typing',
140494126265984 | dict | 35508 | {'__name__': 'builtins'
140494125686720 | dict | 32920 | {94613227788704: <weakr
94613255487824 | ABCMeta | 32790 | <class 'collections.Use
140494125072000 | dict | 31566 | {'__module__': 'collect
140494124621856 | _Printer | 28111 | Type license() to see t
140494124550272 | dict | 28063 | {'_Printer__name': 'lic
140494105358656 | list | 27229 | ['A. HISTORY OF THE SOF
140494125744640 | frozenset | 25447 | frozenset({'_curses', '
140494124629056 | FileFinder | 22804 | FileFinder('/usr/lib/py
140494124679104 | dict | 22756 | {'_loaders': [('.cpytho
...
(in the repo root directory).
PyHeap uses GDB to attach to a running CPython process.
After the debugger is attached, a break point is set at the _PyEval_EvalFrameDefault
function inside CPython, which indicated the Python stack frame execution. It's a good spot to intervene into the CPython's normal job.
When the break point is hit by one of the threads, the Python script injector.py
is loaded and executed (as $dump_python_heap
function) in the context of the GDB's own Python interpreter. The main purpose of this script is to make the target CPython process to load the dumper_inferior.py
script and execute it in the context of the target process.
The dumper script uses the Python standard modules gc
and sys
to collect some information about heap objects and their sizes. It does some job to avoid irrelevant garbage created by itself to appear in the heap dump, but some traces of it will be there.
A dump is not a fair snapshot in time as some threads and the garbage collector continue working while it's being done.
Currently, the dumper sees objects traced by the CPython garbage collector and the objects they reference to (more precisely, the ones they return in their tp_traverse
).
Integration tests run on CI. However, end-to-end tests that use the real GDB cannot be run in GitHub Actions. You can run them locally using
make clean integration-tests
You need pyenv with Python 3.8, 3.9, 3.10, 3.11, and 3.12 installed and Poetry.