High-performance thread-safe (No-GIL–friendly) data structures and parallel operations for Python 3.13+.
NOTE
ThreadFactory is designed and tested against Python 3.13+ in No-GIL mode.
This library will only function on Python 3.13 and higher.
Please see the benchmarks at the bottom of this page if interested; more are available in the repository.
Jump to Benchmarks 🔥
- A thread-safe "multiset" collection that allows duplicates.
- Methods like `add`, `remove`, `discard`, etc.
- Ideal for collections where duplicate elements matter.
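
A minimal usage sketch. The `thread_factory` package name comes from the benchmarks below, but the class name `ConcurrentBag` is an assumption; only the methods are named in this excerpt.

```python
from thread_factory import ConcurrentBag  # class name is assumed

bag = ConcurrentBag()
bag.add("job")
bag.add("job")          # duplicates are allowed in a multiset
bag.discard("missing")  # discard is a no-op when the item is absent
bag.remove("job")       # removes a single occurrence; one "job" remains
```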
- A thread-safe dictionary.
- Supports typical dict operations (`update`, `popitem`, etc.).
- Provides `map`, `filter`, and `reduce` for safe, bulk operations.
- Freeze support: when frozen, the dictionary becomes read-only. Lock acquisition is skipped during reads, dramatically improving performance in high-read workloads.
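
A short sketch, assuming the class is exposed as `ConcurrentDict` and that freezing is done via a `freeze()` method; both names are assumptions, since only the behavior is described above.

```python
from thread_factory import ConcurrentDict  # class name is assumed

d = ConcurrentDict({"a": 1, "b": 2})  # dict-style construction is assumed
d.update({"c": 3})
doubled = d.map(lambda v: v * 2)      # bulk transform; exact signature is assumed
d.freeze()                            # assumed method name; reads now skip the lock
print(d["c"])                         # 3
```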
- A thread-safe list supporting concurrent access and modification.
- Slice assignment, in-place operators (`+=`, `*=`), and advanced operations (`map`, `filter`, `reduce`).
- Freeze support: prevents structural modifications while enabling safe, lock-free reads (e.g., `__getitem__`, iteration, and slicing). Ideal for caching and broadcast scenarios.
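
A sketch of the list API under the same caveat: `ConcurrentList` and `freeze()` are assumed names.

```python
from thread_factory import ConcurrentList  # class name is assumed

lst = ConcurrentList([1, 2, 3])
lst += [4, 5]        # in-place operators are thread-safe
lst[0:2] = [10, 20]  # slice assignment
lst.freeze()         # assumed method name; the reads below are now lock-free
print(lst[0], lst[2:4])
```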
- A thread-safe FIFO queue built atop `collections.deque`.
- Tested against a plain `deque` and outperforms it by up to 64% in our benchmark.
- Supports `enqueue`, `dequeue`, `peek`, `map`, `filter`, and `reduce`.
- Raises `Empty` when `dequeue` or `peek` is called on an empty queue.
- Outperforms multiprocessing queues by over 400% in some cases; clone the repository and run the unit tests to see.
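
A usage sketch. `ConcurrentQueue` is named in the benchmarks below; the `Empty` exception is assumed here to be the standard library's `queue.Empty`.

```python
from queue import Empty  # assumption: the stdlib Empty exception is raised
from thread_factory import ConcurrentQueue

q = ConcurrentQueue()
q.enqueue("task-1")
q.enqueue("task-2")
print(q.peek())      # "task-1" (FIFO: oldest item first)
print(q.dequeue())   # "task-1"
print(q.dequeue())   # "task-2"
try:
    q.dequeue()
except Empty:
    print("queue drained")
```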
- A thread-safe LIFO stack.
- Supports `push`, `pop`, and `peek` operations.
- Ideal for last-in, first-out workloads.
- Built on `deque` for fast appends and pops.
- Similar performance to `ConcurrentQueue`.
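
A LIFO sketch; `ConcurrentStack` is an assumed class name.

```python
from thread_factory import ConcurrentStack  # class name is assumed

s = ConcurrentStack()
s.push("first")
s.push("second")
print(s.peek())  # "second" (last in, first out)
print(s.pop())   # "second"
print(s.pop())   # "first"
```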
- A high-performance, thread-safe buffer using sharded deques for low-contention access.
- Designed to handle massive producer/consumer loads with better throughput than standard queues.
- Supports `enqueue`, `dequeue`, `peek`, `clear`, and bulk operations (`map`, `filter`, `reduce`).
- Timestamp-based ordering ensures approximate FIFO behavior across shards.
- Outperforms `ConcurrentQueue` by up to 60% at mid-range concurrency in an evenly matched producer/consumer configuration with 10 shards.
- Automatically balances items across shards; ideal for parallel pipelines and low-latency workloads.
- Best used with `shard_count ≈ thread_count / 2` for optimal performance, but keep the shard count at or below 10.
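
A sketch applying the shard-count guideline above. `ConcurrentBuffer` is named in the benchmarks below, but the `shard_count` constructor parameter is an assumption.

```python
from thread_factory import ConcurrentBuffer

thread_count = 20
shard_count = min(thread_count // 2, 10)  # guideline: ~threads/2, capped at 10

buf = ConcurrentBuffer(shard_count=shard_count)  # parameter name is assumed
buf.enqueue("item-1")
buf.enqueue("item-2")
print(buf.dequeue())  # approximate FIFO across shards
```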
- An unordered, thread-safe alternative to `ConcurrentBuffer`.
- Optimized for high-concurrency scenarios where strict FIFO ordering is not required.
- Uses fair circular scans seeded by bit-mixed monotonic clocks to distribute dequeues evenly.
- Benchmarks (10 producers / 20 consumers, 2M ops) show ~5.6% higher throughput than `ConcurrentBuffer`:
  - ConcurrentCollection: 108,235 ops/sec
  - ConcurrentBuffer: 102,494 ops/sec
- Better scaling under thread contention.
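
A sketch assuming `ConcurrentCollection` mirrors `ConcurrentBuffer`'s `enqueue`/`dequeue` interface; the excerpt names the class but not its methods.

```python
from thread_factory import ConcurrentCollection

c = ConcurrentCollection()
c.enqueue("work-1")  # method names are assumed to match ConcurrentBuffer
c.enqueue("work-2")
item = c.dequeue()   # note: ordering is deliberately not strict FIFO
print(item)
```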
- A thread-safe set implementation supporting all standard set algebra operations.
- Supports `add`, `discard`, `remove`, and all bitwise set operations (`|`, `&`, `^`, `-`) along with their in-place forms.
- Provides `map`, `filter`, `reduce`, and `batch_update` to safely perform bulk transformations.
- Freeze support: once frozen, the set cannot be modified, but read operations become lock-free and extremely efficient.
- Ideal for workloads where the set is mutated during setup but then used repeatedly in a read-only context (e.g., filters, routing tables, permissions).
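
A sketch of the setup-then-freeze pattern described above; `ConcurrentSet` and `freeze()` are assumed names.

```python
from thread_factory import ConcurrentSet  # class name is assumed

allowed = ConcurrentSet({"read", "write"})
allowed |= {"admin"}       # in-place set algebra during setup
allowed.discard("write")
allowed.freeze()           # assumed method name; membership tests are now lock-free
print("admin" in allowed)  # True
```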
ThreadFactory provides a collection of parallel programming utilities inspired by .NET's Task Parallel Library (TPL).
- Executes a traditional `for` loop in parallel across multiple threads.
- Accepts `start`, `stop`, and a `body` function to apply to each index.
- Supports:
  - Automatic chunking to balance load.
  - Optional `local_init` / `local_finalize` for per-thread local state.
  - Optional `stop_on_exception` to abort on the first error.
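
A sketch of the `(start, stop, body)` shape described above; the import path is assumed.

```python
from thread_factory import parallel_for  # import path is assumed

results = [0] * 1_000

def body(i):
    # Each index is handled exactly once; writing to distinct
    # list slots from different threads is safe.
    results[i] = i * i

parallel_for(0, 1_000, body)
print(results[10])  # 100
```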
- Executes an `action` function on each item of an iterable in parallel.
- Supports:
  - Both pre-known-length and streaming iterables.
  - An optional `chunk_size` to tune batch sizes.
  - Optional `stop_on_exception` to halt execution when an exception occurs.
- Efficient when processing large datasets or streaming data without loading everything into memory.
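
A sketch; the argument order (`action` first, iterable second) and the `chunk_size` keyword are assumptions based on the description above.

```python
from thread_factory import parallel_foreach  # import path is assumed

def save(record):
    print("saving", record)  # stand-in for real per-item work

# Streaming-friendly: the iterable need not have a known length.
parallel_foreach(save, (f"row-{i}" for i in range(1_000)), chunk_size=64)
```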
- Executes multiple independent functions concurrently.
- Accepts an arbitrary number of functions as arguments.
- Returns a list of futures representing the execution of each function.
- Optionally waits for all functions to finish (or fail).
- Simplifies running unrelated tasks in parallel with easy error propagation.
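
A sketch assuming the returned futures follow the `concurrent.futures.Future` interface; the excerpt says "a list of futures" without naming the type.

```python
from thread_factory import parallel_invoke  # import path is assumed

def load_config():
    return {"retries": 3}

def warm_cache():
    return "cache ready"

futures = parallel_invoke(load_config, warm_cache)
print([f.result() for f in futures])  # assumes Future-style .result()
```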
- Parallel equivalent of Python's built-in `map()`.
- Applies a `transform` function to each item in an iterable concurrently.
- Maintains the order of results.
- Automatically splits the work into chunks for efficient multi-threaded execution.
- Returns a fully materialized list of results.
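
A sketch; only the import path is assumed, since the call shape mirrors the built-in `map()`.

```python
from thread_factory import parallel_map  # import path is assumed

squares = parallel_map(lambda x: x * x, range(10))
print(squares)  # order is preserved: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```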
- All utilities automatically default to `max_workers = os.cpu_count()` if unspecified.
- `chunk_size` can be manually tuned, or defaults to roughly `4 × #workers` for balanced performance.
- Exceptions raised inside tasks are properly propagated to the caller.
Full API reference and usage examples are available at:
➡️ https://threadfactory.readthedocs.io
```bash
# Clone the repository
git clone https://github.com/Synaptic724/ThreadFactory.git
cd ThreadFactory

# Create a Python 3.13+ virtual environment (a No-GIL/free-threaded build is recommended)
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install the library in editable mode
pip install -e .
```
Below are benchmark results from live multi-threaded scenarios using 10–40 real threads,
with millions of operations processed under load.
These benchmarks aren't just numbers:
they show that ThreadFactory's concurrent collections outperform traditional Python structures by 2x–5x,
especially in the No-GIL world that Python 3.13+ is unlocking.
Performance under pressure.
Architecture built for the future.
These are just our concurrent data structures, not even the full library.
ThreadFactory is coming soon...
All of the benchmarks below can be reproduced by cloning the repository and running the tests.
See the Benchmark Details 🚀 for more benchmark stats.
| Queue Type | Time (sec) | Throughput (ops/sec) | Notes |
|---|---|---|---|
| `multiprocessing.Queue` | 119.99 | ~83,336 | Not suited for thread-only workloads; incurs unnecessary overhead. |
| `thread_factory.ConcurrentBuffer` | 23.27 | ~429,651 | ⚡ Dominant here. Consistent and efficient under moderate concurrency. |
| `thread_factory.ConcurrentQueue` | 37.87 | ~264,014 | Performs solidly. Shows stable behavior even at higher operation counts. |
| `collections.deque` | 64.16 | ~155,876 | Suffers from contention. Simplicity comes at the cost of throughput. |
- `ConcurrentBuffer` outperformed `multiprocessing.Queue` by 96.72 seconds.
- `ConcurrentBuffer` outperformed `ConcurrentQueue` by 14.6 seconds.
- `ConcurrentBuffer` outperformed `collections.deque` by 40.89 seconds.
- `ConcurrentBuffer` continues to be the best performer under moderate concurrency.
- `ConcurrentQueue` maintains consistent performance but is outperformed by `ConcurrentBuffer`.
- All queues emptied correctly (`final length = 0`).
| Queue Type | Time (sec) | Throughput (ops/sec) | Notes |
|---|---|---|---|
| `multiprocessing.Queue` | 249.92 | ~80,020 | Severely limited by thread-unfriendly IPC locks. |
| `thread_factory.ConcurrentBuffer` (10 shards) | 138.64 | ~144,270 | Solid under moderate producer-consumer balance. Benefits from shard windowing. |
| `thread_factory.ConcurrentBuffer` (20 shards) | 173.89 | ~115,010 | Too many shards increased internal complexity, leading to lower throughput. |
| `thread_factory.ConcurrentQueue` | 77.69 | ~257,450 | ⚡ Fastest overall. Ideal for large-scale multi-producer, multi-consumer scenarios. |
| `collections.deque` | 190.91 | ~104,771 | Still usable, but scalability is poor compared to specialized implementations. |
- `ConcurrentBuffer` performs better with 10 shards than with 20 shards at this concurrency level.
- `ConcurrentQueue` continues to be the most stable performer under moderate-to-high thread counts.
- `multiprocessing.Queue` remains unfit for thread-only workloads due to its heavy IPC-oriented design.
- Shard-count tuning in `ConcurrentBuffer` is crucial; too many shards can reduce performance.
- Bit-flip balancing in `ConcurrentBuffer` helps under moderate concurrency but hits diminishing returns with excessive sharding.
- `ConcurrentQueue` is proving to be the general-purpose winner for most balanced threaded workloads.
- At ~40 threads, `ConcurrentBuffer` shows a ~25% throughput drop when the shard count doubles, due to increased dequeue complexity.
- All queues emptied correctly (`final length = 0`).