Demonstrating Python 3.14t's Free-Threading Performance for LLM Preprocessing
This benchmark compares tokenization throughput across different thread counts, showing the dramatic speedup when Python's Global Interpreter Lock (GIL) is removed.
- Python 3.11 (with GIL): ~1x speedup regardless of thread count (GIL bottleneck)
- Python 3.14t (no-GIL): 6-8x speedup on 8-core systems (true parallelism)
python tokenizer_benchmark.py

# Install Python 3.14t using uv and run with it
uv python install 3.14t
uv run --python 3.14t tokenizer_benchmark.py
# Or download from python.org and run:
python3.14t tokenizer_benchmark.py

- Generates Dataset: Creates 10,000 synthetic text samples simulating real LLM preprocessing
- Tokenizes with tiktoken: Uses OpenAI's fast BPE tokenizer (cl100k_base encoding)
- Tests Multiple Thread Counts: Runs benchmarks with 1, 2, 4, 8, and 16 threads
- Measures Performance: Tracks tokens/sec, speedup ratios, and total time
- Creates Visualizations: Generates publication-ready charts for analysis
- Exports Results: Saves data to CSV and JSON for further analysis
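For orientation, here is a minimal sketch of how such a measurement loop can be structured; `generate_samples` and `run_benchmark` are illustrative names, not the script's actual API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import tiktoken

def generate_samples(n: int = 10_000) -> list[str]:
    """Create synthetic text samples of varying length."""
    base = "The quick brown fox jumps over the lazy dog. "
    return [base * (10 + i % 50) for i in range(n)]

def run_benchmark(samples: list[str], num_threads: int) -> float:
    """Tokenize all samples with num_threads threads; return tokens/sec."""
    enc = tiktoken.get_encoding("cl100k_base")
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        token_counts = list(pool.map(lambda text: len(enc.encode(text)), samples))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed

if __name__ == "__main__":
    samples = generate_samples()
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} threads: {run_benchmark(samples, n):,.0f} tokens/sec")
```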
- `benchmark_results.png` - Visualization showing throughput and speedup curves
- `benchmark_results.csv` - Detailed results in spreadsheet format
- `benchmark_results.json` - Complete benchmark data in JSON format
- tiktoken (0.12.0): Fast BPE tokenizer for LLM preprocessing
- matplotlib: Visualization and plotting
- pandas: Data analysis and CSV export
- psutil: System information and CPU detection
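As an example of what the psutil dependency is used for (a sketch; the script's exact calls may differ):

```python
import psutil

physical = psutil.cpu_count(logical=False)   # physical cores
logical = psutil.cpu_count(logical=True)     # hardware threads
print(f"CPU: {physical} physical cores / {logical} logical threads")
```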
- Tests thread counts: 1, 2, 4, 8, 16 (adapts to available CPU cores)
- Uses concurrent.futures.ThreadPoolExecutor for thread management
- Measures wall-clock time with time.perf_counter()
- Calculates speedup relative to single-threaded baseline
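Speedup is simply the ratio of single-threaded to multi-threaded wall-clock time; with hypothetical timings:

```python
# elapsed[n] = wall-clock seconds for the n-thread run, measured with time.perf_counter()
elapsed = {1: 12.4, 2: 6.5, 4: 3.4, 8: 1.7}   # hypothetical example values
speedup = {n: elapsed[1] / t for n, t in elapsed.items()}
print(speedup)   # -> 1: 1.0x, 2: ~1.9x, 4: ~3.6x, 8: ~7.3x
```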
Tokenization is a critical bottleneck in LLM preprocessing pipelines:
- Required for every text sample before training/inference
- CPU-intensive (no I/O waits)
- Embarrassingly parallel (independent samples)
- Perfect candidate for multi-threading
The benchmark automatically generates a LinkedIn caption based on your results:
🚀 LLM preprocessing just got multi-core superpowers!
I benchmarked Python 3.14's free-threaded build (no-GIL) tokenizing
10,000 text samples with tiktoken.
Results: 7.5x speedup on 8 threads!
Peak throughput: 850,000 tokens/sec
The removal of the Global Interpreter Lock enables true parallel processing
for CPU-bound tasks like tokenization, preprocessing, and feature extraction.
This is a game-changer for ML/AI pipelines. The future of Python is parallel! 🐍⚡
#Python #MachineLearning #AI #LLM #Performance #GIL
With the GIL (e.g., Python 3.11):
- Adding more threads doesn't improve performance
- The GIL allows only one thread to execute Python bytecode at a time
- Speedup stays close to 1.0x regardless of thread count

Without the GIL (Python 3.14t):
- Linear or near-linear speedup with thread count
- True parallel execution across all CPU cores
- 6-8x speedup on 8-core systems
- Dramatic improvement for CPU-bound workloads
Edit tokenizer_benchmark.py to customize:
# Change number of samples
num_samples = 50000 # Default: 10000
# Change thread counts to test
thread_counts = [1, 2, 4, 8, 16, 32] # Default: [1, 2, 4, 8, 16]
# Change tokenizer encoding
benchmark = TokenizerBenchmark(encoding_name="o200k_base")  # Default: cl100k_base

Python 3.14 was released on October 7, 2025, with official support for free-threaded builds (PEP 703, PEP 779).
Key Features:
- Optional no-GIL build (indicated by 't' suffix: 3.14t)
- True parallel execution on multi-core CPUs
- 2-4x speedup for CPU-bound multi-threaded tasks
- Uses biased reference counting for memory safety
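You can confirm which build you are actually running from Python itself; both checks below exist in CPython 3.13+ (`Py_GIL_DISABLED` is a standard sysconfig variable):

```python
import sys
import sysconfig

print(sys.version)
# True only on a free-threaded ('t') build
print("Free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))
# False when the GIL is actually disabled at runtime
print("GIL currently enabled:", sys._is_gil_enabled())
```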
Installation:
# Using uv (recommended)
uv python install 3.14t
# Or download from python.org
https://www.python.org/downloads/

Feel free to extend this benchmark:
- Add more tokenizers (sentencepiece, rs-bpe, kitoken)
- Test with real datasets (Wikipedia, code, multilingual text)
- Add memory profiling
- Measure CPU utilization per core
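For instance, per-core utilization and peak memory could be captured with psutil plus the standard library's tracemalloc; a rough sketch, where `profile` is a hypothetical helper:

```python
import tracemalloc
import psutil

def profile(fn, *args):
    """Run fn(*args) while capturing peak Python memory and per-core CPU usage."""
    tracemalloc.start()
    psutil.cpu_percent(percpu=True)               # prime the per-core counters
    result = fn(*args)
    per_core = psutil.cpu_percent(percpu=True)    # % utilization since priming
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Peak Python memory: {peak / 1e6:.1f} MB")
    print("Per-core CPU %:", per_core)
    return result
```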
This benchmark is provided as-is for educational and demonstration purposes.