# Performance Analysis
The Universal RNG Library demonstrates exceptional performance characteristics, with AVX2 batch implementations achieving 3-5x speedups over single-mode operation and competitive or superior performance compared to standard library implementations across most bit widths.
- Peak speedup: 4.6x at 64-bit width (AVX2 WyRand)
- Consistent 3-4x improvements across 64-256 bit ranges
- Diminishing returns at 512+ bit widths due to SIMD width limitations
- AVX2 Xoroshiro128++ (Batch): Best overall performance, peak 1355 M ops/sec
- AVX2 WyRand (Batch): Excellent consistency, strong 128-bit performance
- Xoroshiro128+ (Reference): Solid baseline, good single-mode performance
- std::mt19937_64: Competitive at lower bit widths, falls behind at scale
- AVX2 Single Modes: Significant underperformance vs. reference implementations
- Performance range: 264-973 M ops/sec
- Best: AVX2 Xoroshiro128++ Batch (973 M ops/sec @ 32-bit)
- Pattern: batch modes excel, single modes competitive

- Performance range: 162-1355 M ops/sec
- Best: AVX2 Xoroshiro128++ Batch (1355 M ops/sec @ 64-bit)
- Pattern: massive batch speedups (3-5x), clear SIMD advantage

- Performance range: 42-360 M ops/sec
- Pattern: batch advantage diminishes, single-mode competitive

- Performance range: 20-96 M ops/sec
- Pattern: algorithm choice becomes critical, memory bandwidth limiting
| Bit Width | AVX2 WyRand | std::mt19937_64 | Xoroshiro128+ |
|---|---|---|---|
| 16 | 3.3x | 1.2x | 1.0x |
| 32 | 3.4x | 1.3x | 1.0x |
| 64 | 4.6x | 0.7x | 0.9x |
| 128 | 4.5x | 2.1x | 0.9x |
| 256 | 3.4x | 0.8x | 0.8x |
| 512 | 3.3x | 0.8x | 0.8x |
| 1024 | 3.8x | 1.2x | 0.8x |
*Speedup relative to the single-mode baseline.*
Issue: AVX2 single implementations significantly underperform reference algorithms
Root Causes:
- Function pointer overhead in Universal RNG architecture
- Suboptimal scalar code paths within SIMD implementations
- Missing compiler optimizations in critical loops
Impact: 30-70% performance penalty vs. reference implementations
Issue: Diminishing returns at higher bit widths
Analysis:
- 256-bit AVX2 registers become constraining factor
- Memory bandwidth saturation at 1024-bit operations
- Cache coherency overhead increases with data size
```cpp
// Current: Function pointer dispatch
auto generator = factory.create(algorithm_type);
result = generator->next();

// Target: Template-based direct dispatch
template <Algorithm A>
constexpr auto optimized_next() { /* direct implementation */ }
```
- Reduce memory copies in batch processing
- Implement aggressive loop unrolling
- Optimize register allocation for hot paths
- Cache-line aligned buffers for batch operations
- Prefetch instructions for large data sets
- NUMA-aware memory allocation for multi-threaded scenarios
Target Speedups:
- 2x theoretical improvement for 512-bit operations
- 4x potential for 1024-bit with AVX-512F
- Reduced instruction count for complex operations
```sh
# Target flags for maximum performance
-O3 -march=native -funroll-loops -ffast-math -flto
```
- 2x improvement in single-mode performance
- 1.5x boost in batch mode efficiency
- Parity or better with xoroshiro128+ across all bit widths
- AVX-512 implementations: 2-4x speedup potential
- GPU acceleration: 10-100x for large batch operations
- Algorithmic improvements: Better scaling characteristics
- CPU: Modern x86_64 with AVX2 support
- Compiler: GCC/Clang with optimization flags
- Measurement: High-resolution timing across multiple runs
- Validation: Statistical significance testing
- Throughput: Operations per second
- Latency: Single operation timing
- Variance: Performance consistency
- Memory: Cache utilization patterns
- Warmup phases to eliminate cold cache effects
- Multiple iteration averaging for statistical validity
- Cross-platform validation across different architectures
For applications prioritizing throughput or latency:

**Batch generation**
- Best for: bulk random number generation
- Bit widths: 64-256 for optimal performance
- Expected gain: 3-5x over single-mode

**Single-value generation**
- Best for: interactive applications
- Current limitation: AVX2 single-mode optimization needed
- Workaround: use standard xoroshiro128+ for now

**Bit-width guidance**
- 16-32 bit: good SIMD utilization
- 64-128 bit: peak efficiency zone
- 256+ bit: consider algorithm alternatives
Performance analysis based on comprehensive benchmarking suite | Updated: August 2025
There is currently data lost off the bottom of the page - a search party needs to be sent in to rescue it!
**Please bear in mind above all else:** at the current state of development, the C++ standard library's Mersenne Twister (std::mt19937_64) still outperforms this library for single-value generation on machines without SIMD support. These implementations require at least AVX2 to beat the standard generators at single-number generation tasks.