lzmalloc: High-Performance, Lock-Free & NUMA-Aware Memory Allocator

Technical Summary

lzmalloc is a POSIX-compliant, drop-in replacement memory allocator (malloc, free, calloc, realloc) engineered for high-concurrency Linux environments. Implemented in strict C11, it is designed with mechanical sympathy at its core: control structures are aligned to 64-byte cache lines so that structures owned by different cores never share a line, which removes false sharing between them.
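
As an illustration of the cache-line alignment described above, here is a minimal C11 sketch. The `thread_stats_t` type, its fields, and the `g_stats` array are hypothetical and not taken from the lzmalloc sources; the sketch also assumes a 64-bit target where `atomic_size_t` is 8 bytes.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size, matching the 64-byte figure above */

/* Hypothetical per-thread statistics slot (not an lzmalloc type).
 * alignas on the first member raises the alignment of the whole struct,
 * and the compiler pads its size up to a multiple of that alignment. */
typedef struct {
    alignas(CACHE_LINE) atomic_size_t allocs;
    atomic_size_t frees;
} thread_stats_t;

static_assert(sizeof(thread_stats_t) == CACHE_LINE,
              "each slot must fill exactly one cache line");
static_assert(alignof(thread_stats_t) == CACHE_LINE,
              "each slot must start on a cache-line boundary");

/* Neighbouring array elements therefore never share a cache line. */
static thread_stats_t g_stats[4];

/* Byte distance between two neighbouring slots. */
static size_t slot_stride(void) {
    return (size_t)((char *)&g_stats[1] - (char *)&g_stats[0]);
}
```

With this layout, a thread updating `g_stats[0]` does not invalidate the line that holds `g_stats[1]`. Data that several cores write inside the same structure can still contend, so alignment only addresses sharing between separate structures.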

The allocator minimizes CPU pipeline stalls through vDSO topology routing and maximizes physical memory reclamation via an active Virtual Memory Manager (VMM) using cache hysteresis and madvise RSS deflation.
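
The madvise-based RSS deflation mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the VMM's code: the helper names are hypothetical, the hysteresis policy that decides when to deflate is not shown, and only MADV_DONTNEED on a private anonymous mapping is demonstrated.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and madvise under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK_SIZE ((size_t)2 << 20) /* 2 MB, the chunk size used in this README */

/* Map one anonymous chunk; returns NULL on failure. (Hypothetical helper.) */
static void *chunk_map(void) {
    void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Return the physical pages to the kernel while keeping the virtual range
 * mapped, so the chunk can be reused later without another mmap call. */
static int chunk_deflate(void *chunk) {
    assert(chunk != NULL);
    return madvise(chunk, CHUNK_SIZE, MADV_DONTNEED);
}

/* Dirty a chunk, deflate it, and confirm the old contents are gone.
 * Returns 0 on success, 1 if stale data survived, -1 on a syscall failure. */
static int deflate_roundtrip(void) {
    unsigned char *c = chunk_map();
    if (!c) return -1;
    memset(c, 0xAB, CHUNK_SIZE); /* touch every page so it counts toward RSS */
    if (chunk_deflate(c) != 0) {
        munmap(c, CHUNK_SIZE);
        return -1;
    }
    /* On Linux, private anonymous pages read back as zero after MADV_DONTNEED. */
    int zeroed = (c[0] == 0 && c[CHUNK_SIZE - 1] == 0);
    munmap(c, CHUNK_SIZE);
    return zeroed ? 0 : 1;
}
```

MADV_FREE differs in that the kernel reclaims the pages lazily and the old contents may remain readable until memory pressure occurs, so the zero-fill check above would not hold for it.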

Key Engineering Features

NUMA-Aware Routing: Prevents inter-socket latency penalties via vDSO (getcpu) topology detection.
Lock-Free Concurrency: Eliminates global locks using C11 atomics, Treiber stacks, and deferred atomic batching.
O(1) Metadata Resolution: Lock-free 2-level Radix Tree index for instantaneous pointer resolution.
Hardware Sympathy: Strict cache-line alignment (64-byte) for all control structures.
Active Security: Cryptographic pointer obfuscation (Safe Linking) and 64-byte chunk canaries to mitigate Use-After-Free (UAF) vulnerabilities.
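
The lock-free building block named in the list above, a Treiber stack on C11 atomics, can be sketched as follows. The type and function names are hypothetical. Reading "deferred atomic batching" as "detach the whole list with one exchange" is an assumption about the design, not something confirmed by the source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Intrusive node: the link is stored inside the object being stacked. */
typedef struct node {
    struct node *next;
} node_t;

typedef struct {
    _Atomic(node_t *) head;
} treiber_stack_t;

/* Push one node: retry the CAS until head is swung from `old` to `n`. */
static void ts_push(treiber_stack_t *s, node_t *n) {
    assert(n != NULL);
    node_t *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old; /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Pop one node, or NULL if empty. With several concurrent poppers this
 * naive form is exposed to the ABA problem and to reading a node that was
 * already reused; a real design needs tags, hazard pointers, or a single
 * consumer. */
static node_t *ts_pop(treiber_stack_t *s) {
    node_t *old = atomic_load_explicit(&s->head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &s->head, &old, old->next,
                      memory_order_acquire, memory_order_acquire))
        ;
    return old;
}

/* Detach the entire list in one atomic exchange. No pointer is compared,
 * so this path is not subject to ABA. */
static node_t *ts_pop_all(treiber_stack_t *s) {
    return atomic_exchange_explicit(&s->head, (node_t *)NULL,
                                    memory_order_acquire);
}
```

One plausible usage pattern: remote threads call `ts_push` to hand freed objects back to their owning heap, and the owner periodically calls `ts_pop_all` to reclaim them as a batch.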

Core Architecture: Triple-Hierarchy Memory Engines

The architecture operates on a segregated fit routing model, protected by cryptographic metadata and an O(1) lock-free index.
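
To make the metadata protection concrete, here is a minimal sketch of Safe Linking in the form popularised by glibc, where a stored link is XORed with the page-shifted address of the slot that holds it. Whether lzmalloc uses the same shift, key, or alignment check is an assumption; all names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJ_ALIGN 16u /* assumed minimum object alignment */

/* Intrusive free-list node: the mangled link lives inside the freed object. */
typedef struct free_node {
    uintptr_t mangled_next;
} free_node_t;

/* Mangle a pointer with the address of the slot that will store it. */
static inline uintptr_t protect_ptr(const void *slot, const void *ptr) {
    return ((uintptr_t)slot >> 12) ^ (uintptr_t)ptr;
}

/* Inverse of protect_ptr for the same slot. */
static inline void *reveal_ptr(const void *slot, uintptr_t mangled) {
    return (void *)(((uintptr_t)slot >> 12) ^ mangled);
}

/* Push a freed object onto a thread-local (non-atomic) free list. */
static void list_push(free_node_t **head, void *obj) {
    assert(((uintptr_t)obj & (OBJ_ALIGN - 1u)) == 0);
    free_node_t *n = obj;
    n->mangled_next = protect_ptr(&n->mangled_next, *head);
    *head = n;
}

/* Pop an object, or NULL if the list is empty. If the revealed link is
 * misaligned, the link was probably overwritten: *corrupt is set to 1 and
 * NULL is returned. A production allocator would likely abort instead. */
static void *list_pop(free_node_t **head, int *corrupt) {
    free_node_t *n = *head;
    *corrupt = 0;
    if (!n) return NULL;
    void *next = reveal_ptr(&n->mangled_next, n->mangled_next);
    if ((uintptr_t)next & (OBJ_ALIGN - 1u)) {
        *corrupt = 1;
        return NULL;
    }
    *head = next;
    return n;
}
```

The XOR mask depends on address-space layout rather than on a secret key, so this is obfuscation plus a cheap integrity check. It raises the cost of forging a free-list link after a use-after-free write but does not make such a write impossible.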

Thread-Local Heap (TLH): Cache-aligned routing matrix providing O(1) allocation for 88 logarithmic size classes. Employs intrusive, obfuscated free-lists and deferred atomic batching for cross-thread memory reclamation.
Slab Engine (<= 32KB): Geometric object partitioning within 2MB superblocks (Chunks). Utilizes lazy bump-pointer initialization and isolates metadata to the first cache line.
Span Engine (32KB - 1MB): Contiguous 4KB page allocator. Uses a 512-bit binned bitmap within 2MB chunks to eliminate internal fragmentation for medium-sized objects.
Virtual Memory Manager (VMM): Direct OS mmap router for large objects (> 1MB) and 2MB extent provisioning. Manages NUMA-isolated Treiber stacks for chunk caching and executes global RSS deflation (MADV_DONTNEED / MADV_FREE).
Radix Tree Index: A lock-free 2-level index (13-bit root, 14-bit leaves) enabling O(1) metadata resolution from arbitrary virtual addresses.
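
The 13-bit root and 14-bit leaf split fits a 48-bit virtual address if the remaining 21 bits are the offset inside a 2MB chunk (13 + 14 + 21 = 48). The sketch below assumes exactly that layout on a 64-bit target. The names are hypothetical, and `calloc` stands in for whatever internal page source a real allocator would use to build leaves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SHIFT 21u /* 2MB chunks */
#define LEAF_BITS   14u
#define ROOT_BITS   13u

static_assert(ROOT_BITS + LEAF_BITS + CHUNK_SHIFT == 48,
              "the split must cover a 48-bit virtual address");

typedef struct {
    _Atomic(void *) slot[1u << LEAF_BITS]; /* one metadata pointer per chunk */
} radix_leaf_t;

static _Atomic(radix_leaf_t *) g_root[1u << ROOT_BITS];

static inline uint32_t radix_leaf_index(uintptr_t a) {
    return (uint32_t)((a >> CHUNK_SHIFT) & ((1u << LEAF_BITS) - 1u));
}

static inline uint32_t radix_root_index(uintptr_t a) {
    return (uint32_t)((a >> (CHUNK_SHIFT + LEAF_BITS)) & ((1u << ROOT_BITS) - 1u));
}

/* Start address of the 2MB chunk that contains `a`. */
static inline uintptr_t chunk_base(uintptr_t a) {
    return a & ~(((uintptr_t)1 << CHUNK_SHIFT) - 1u);
}

/* Two dependent loads, independent of how many chunks are registered. */
static void *radix_lookup(const void *p) {
    uintptr_t a = (uintptr_t)p;
    radix_leaf_t *leaf = atomic_load_explicit(&g_root[radix_root_index(a)],
                                              memory_order_acquire);
    if (!leaf) return NULL;
    return atomic_load_explicit(&leaf->slot[radix_leaf_index(a)],
                                memory_order_acquire);
}

/* Register metadata for the chunk containing `p`. A missing leaf is
 * installed with one CAS; the thread that loses the race frees its copy. */
static int radix_insert(const void *p, void *meta) {
    uintptr_t a = (uintptr_t)p;
    _Atomic(radix_leaf_t *) *rslot = &g_root[radix_root_index(a)];
    radix_leaf_t *leaf = atomic_load_explicit(rslot, memory_order_acquire);
    if (!leaf) {
        radix_leaf_t *fresh = calloc(1, sizeof *fresh);
        if (!fresh) return -1;
        radix_leaf_t *expected = NULL;
        if (atomic_compare_exchange_strong_explicit(
                rslot, &expected, fresh,
                memory_order_acq_rel, memory_order_acquire)) {
            leaf = fresh;
        } else {
            free(fresh);
            leaf = expected; /* another thread installed a leaf first */
        }
    }
    atomic_store_explicit(&leaf->slot[radix_leaf_index(a)], meta,
                          memory_order_release);
    return 0;
}
```

Any pointer inside a registered chunk resolves to the same metadata, which is what lets `free` find the owning chunk header from an arbitrary interior address.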

Benchmark Comparison: Allocator Latency & Throughput

Baseline evaluation against standard industry allocators under high-concurrency workloads.

| Metric | glibc (ptmalloc) | jemalloc | mimalloc | lzmalloc V2.1 |
|---|---|---|---|---|
| Slab Alloc Latency (ns) | 80.48 | 14.88 | 14.14 | 21.81* |
| Slab Free Latency (ns) | 12.34 | 13.93 | 10.61 | 11.22 |
| Throughput (M ops/s) | 17.01 | 9.39 | 26.80 | 24.18 |
| Chaos Alloc Time (s) | 7.35 | 2.46 | 3.24 | 11.88 |
| Final RSS (After GC) | 2.7 MB | 1071 MB | 209 MB | 24.7 MB |

*Note: Slab allocation latency typically oscillates between 21 ns and 35 ns depending on CPU frequency scaling and L1/L2 cache state. Chaos allocation time for medium objects is currently unoptimized (see roadmap item 2.1).

Technical Roadmap & Future Capabilities

Phase 2: v0.1.1 Core Optimization

2.1 Hardware-Accelerated Spans: Implement bit-leaping via __builtin_ctzll to reduce Chaos Alloc Time.
2.2 Software TLB: O(1) metadata resolution in TLH to stabilize allocation latency bounds.
2.3 Lock-free VMM: Transition to Treiber Stacks for NUMA pools to eliminate spinlock contention.
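
Item 2.1 can be pictured with the following sketch of a first-fit run search over a 512-bit page bitmap (512 pages of 4KB per 2MB chunk). It uses `__builtin_ctzll`, a GCC/Clang builtin, to jump straight to the next free or used page instead of testing bits one at a time. The data layout and function names are assumptions, and the sketch is single-threaded.

```c
#include <assert.h>
#include <stdint.h>

#define SPAN_PAGES 512u
#define SPAN_WORDS (SPAN_PAGES / 64u)

/* Bit set = page in use. (Hypothetical layout.) */
typedef struct {
    uint64_t used[SPAN_WORDS];
} span_bitmap_t;

/* Index of the first page at or after `from` that is used (want_used = 1)
 * or free (want_used = 0); SPAN_PAGES if there is none. */
static unsigned next_page(const span_bitmap_t *bm, unsigned from, int want_used) {
    while (from < SPAN_PAGES) {
        unsigned w = from / 64u, b = from % 64u;
        uint64_t word = want_used ? bm->used[w] : ~bm->used[w];
        word &= ~UINT64_C(0) << b; /* ignore pages below `from` */
        if (word)
            return w * 64u + (unsigned)__builtin_ctzll(word);
        from = (w + 1u) * 64u;     /* nothing here: leap to the next word */
    }
    return SPAN_PAGES;
}

/* First-fit search for `n` contiguous free pages.
 * Returns the first page of the run, or -1 if no run is long enough. */
static int span_find_run(const span_bitmap_t *bm, unsigned n) {
    unsigned start = next_page(bm, 0, 0);
    while (start < SPAN_PAGES) {
        unsigned end = next_page(bm, start, 1); /* first used page after the run */
        if (end - start >= n)
            return (int)start;
        start = next_page(bm, end, 0);          /* skip the used pages in one step */
    }
    return -1;
}

/* Mark pages [start, start + n) as used. */
static void span_mark_used(span_bitmap_t *bm, unsigned start, unsigned n) {
    assert(start + n <= SPAN_PAGES);
    for (unsigned i = start; i < start + n; i++)
        bm->used[i / 64u] |= UINT64_C(1) << (i % 64u);
}
```

Each loop iteration in `span_find_run` consumes a whole free run or a whole used run, so the number of steps is bounded by the number of runs and words rather than by the number of pages.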

Phase 3: Advanced Systems Engineering

3.1 Asynchronous Janitor: Offload madvise and munmap syscalls to background threads.
3.2 rseq Integration: Implement per-core heaps using Linux Restartable Sequences (rseq).
3.3 Thread Adoption: Recycling of orphan heaps from terminated zombie threads.

Phase 4: Hardware Security & Modern Kernel

4.1 Memory Tagging: ARM MTE and Intel MPK integration for near-zero-overhead UAF protection.
4.2 Mesh-style Compaction: Virtual memory remapping for physical page merging.
4.3 Huge Page Awareness: Dynamic 2MB promotion to reduce Translation Lookaside Buffer (TLB) pressure.

Phase 5: Production Hardening

5.1 Fuzzing: Integration with LLVM libFuzzer and AddressSanitizer (ASan).
5.2 Specialized Backends: Support for CXL-attached memory and NVDIMM hardware.

Build, Deployment, and Injection Instructions

Local Compilation (CMake & LTO)

The build system is managed via CMake with a simplified Makefile wrapper enforcing Link-Time Optimization (LTO) and native architecture tuning (-march=native).

Release Build: make release (Outputs artifact to build/release/liblzmalloc.so)
Debug Build: make debug (Outputs artifact to build/debug/liblzmalloc.so)
Test Suite: make test

Runtime Injection via LD_PRELOAD

To inject lzmalloc into any dynamically linked POSIX application without recompilation, utilize the LD_PRELOAD environment variable:

LD_PRELOAD=./build/release/liblzmalloc.so ./your_target_executable
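
A quick way to exercise an injected allocator is a small check that goes through each of the four replaced entry points. The function below is a hypothetical example and not part of the lzmalloc test suite. It verifies standard allocator behaviour only and does not prove which library served the calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 if every check passes, otherwise the number of the failed check. */
static int allocator_smoke_check(void) {
    unsigned char *p = malloc(100);
    if (!p) return 1;
    /* malloc must return storage aligned for any fundamental type. */
    if ((uintptr_t)p % _Alignof(max_align_t) != 0) { free(p); return 2; }
    memset(p, 0x5A, 100);

    /* Grow from a small object to 2MB; the first 100 bytes must be preserved. */
    unsigned char *q = realloc(p, (size_t)2 << 20);
    if (!q) { free(p); return 3; }
    for (size_t i = 0; i < 100; i++)
        if (q[i] != 0x5A) { free(q); return 4; }

    /* calloc must hand back zeroed memory. */
    unsigned char *z = calloc(64, 64);
    if (!z) { free(q); return 5; }
    for (size_t i = 0; i < 64 * 64; i++)
        if (z[i] != 0) { free(z); free(q); return 6; }

    free(z);
    free(q);
    return 0;
}
```

Called from a `main` that returns its result, the resulting executable can be launched with the LD_PRELOAD line above; a non-zero exit status then points at the failing check.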

Docker Orchestration

A containerized matrix is provided via docker-compose.yml to ensure reproducible testing environments.

Interactive debugging: docker-compose run dev-sandbox
Automated tests: docker-compose up ci-test-suite
Benchmarking: docker-compose up benchmark-suite
