

# Universal RNG C++ Library

Overview

The Universal RNG Library is a high-performance, SIMD-optimized random number generator collection designed for applications requiring exceptional speed and flexibility across multiple bit widths. Built with AVX2 optimizations and scalable architecture, it provides both single-value generation and high-throughput batch processing capabilities.

🚀 Key Features

  • Multi-bit width support: 16, 32, 64, 128, 256, 512, and 1024-bit generators
  • SIMD acceleration: AVX2 optimized implementations with AVX-512 roadmap
  • Dual operation modes: Single-value and batch generation
  • Algorithm diversity: Xoroshiro128++, WyRand, and standard library implementations (a scalar Xoroshiro128++ sketch follows this list)
  • Runtime CPU detection: Automatic hardware capability detection
  • Benchmarking suite: Comprehensive performance analysis tools
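
For orientation, the scalar form of Xoroshiro128++ is small enough to show in full. The sketch below follows the published Blackman/Vigna reference algorithm and is not necessarily identical to this library's AVX2 implementation, which vectorizes the same update across multiple independent streams:

```cpp
#include <cstdint>

// Scalar Xoroshiro128++ (published reference algorithm). Shown for orientation
// only; the library's SIMD kernels run several such streams in parallel.
struct Xoroshiro128pp {
    uint64_t s[2];  // state; must be seeded to anything other than all zeros

    static uint64_t rotl(uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }

    uint64_t next() {
        const uint64_t s0 = s[0];
        uint64_t s1 = s[1];
        const uint64_t result = rotl(s0 + s1, 17) + s0;  // the "++" output scrambler

        s1 ^= s0;
        s[0] = rotl(s0, 49) ^ s1 ^ (s1 << 21);
        s[1] = rotl(s1, 28);
        return result;
    }
};
```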

📊 Performance Highlights

Peak Performance Achievements

  • 4.6x speedup in batch mode at 128-bit width (AVX2 WyRand)
  • 1355 M ops/sec peak throughput (AVX2 Xoroshiro128++ batch, 64-bit)
  • Consistent 3-4x batch improvements across 64-256 bit ranges
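
For scale, 1355 million 64-bit outputs per second corresponds to roughly 1355 × 10⁶ × 8 bytes ≈ 10.8 GB/s of raw random data.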

Implementation Comparison Matrix

| Algorithm | 16-bit | 32-bit | 64-bit | 128-bit | 256-bit | 512-bit | 1024-bit |
|---|---|---|---|---|---|---|---|
| AVX2 Xoroshiro128++ (Batch) | 901 | 973 | 1355 | 764 | 360 | 180 | 96 |
| AVX2 WyRand (Batch) | 879 | 889 | 1345 | 779 | 355 | 170 | 91 |
| Xoroshiro128+ (Reference) | 874 | 906 | 920 | 434 | 243 | 123 | 61 |

Throughput in millions of operations per second (M ops/sec).

🏗️ Architecture

Core Components

  • CPU Detection Engine: Runtime hardware capability assessment (see the detection sketch after this list)
  • Algorithm Factory: Template-based generator instantiation
  • SIMD Kernels: Hand-optimized AVX2 implementations
  • Batch Processors: High-throughput bulk generation
  • Benchmarking Framework: Performance measurement and analysis
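
The CPU Detection Engine decides at runtime whether the AVX2 code paths can be used. As a minimal illustration (not the library's actual detection code, which may use raw CPUID and also verify OS support via XGETBV), a compiler-builtin check looks like this:

```cpp
#include <iostream>

// Minimal sketch of a runtime AVX2 check (illustrative only).
bool cpu_supports_avx2() {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx2");  // wraps the CPUID feature flags
#else
    return false;  // unknown compiler: conservatively fall back to scalar paths
#endif
}

int main() {
    std::cout << (cpu_supports_avx2() ? "AVX2 available\n" : "AVX2 not available\n");
}
```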

Memory Layout Optimization

The library employs cache-conscious data structures and aligned memory access patterns to maximize SIMD efficiency and minimize memory bandwidth bottlenecks.
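
For example, batch output buffers can be allocated on 32-byte boundaries so that AVX2 loads and stores hit aligned addresses. The helper below is a generic, hypothetical pattern for such allocation (C++17 `std::aligned_alloc`; MSVC would need `_aligned_malloc`), not the library's internal allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter matching std::aligned_alloc, which pairs with std::free.
struct AlignedFree {
    void operator()(void* p) const noexcept { std::free(p); }
};

// Allocate a 32-byte-aligned uint64_t buffer (32 bytes = one 256-bit AVX2 register).
std::unique_ptr<std::uint64_t[], AlignedFree> make_avx2_buffer(std::size_t count) {
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // so round the byte count up to the next 32-byte boundary.
    std::size_t bytes = ((count * sizeof(std::uint64_t) + 31) / 32) * 32;
    void* p = std::aligned_alloc(32, bytes);
    if (p == nullptr) throw std::bad_alloc{};
    return std::unique_ptr<std::uint64_t[], AlignedFree>(static_cast<std::uint64_t*>(p));
}
```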

📈 Current Status & Roadmap

✅ Completed

  • AVX2 implementations for core algorithms
  • Comprehensive benchmarking suite
  • Multi-bit width support (16-1024 bits)
  • Batch mode optimization
  • Performance analysis framework

🔄 In Progress

  • AVX-512 detection and implementation
  • Smart pointer memory management migration
  • Enhanced logging and verbose modes
  • Build system improvements

🎯 Future Plans

  • Cryptographically secure algorithms
  • Rust language port
  • JavaScript/WebAssembly bindings
  • OpenCL GPU acceleration
  • Extended algorithm library

🔧 Quick Start

#include "universal_rng.hpp"

// Create a high-performance batch generator
auto rng = UniversalRNG::create_batch_generator<64>();

// Generate a batch of random numbers
std::vector<uint64_t> batch(1000);
rng->generate_batch(batch.data(), batch.size());

// Single value generation
auto single_rng = UniversalRNG::create_single_generator<32>();
uint32_t value = single_rng->next();

📚 Documentation Structure

  • [Performance Analysis](performance-analysis.md) - Detailed benchmark results and optimization insights
  • [Architecture Guide](architecture-guide.md) - Implementation details and design decisions
  • [API Reference](api-reference.md) - Complete function and class documentation
  • [Optimization Guide](optimization-guide.md) - Performance tuning recommendations
  • [Build Instructions](build-guide.md) - Compilation and dependency management
  • [Contributing](contributing.md) - Development guidelines and contribution process

🎖️ Performance Philosophy

"Speed is not just about raw throughput - it's about predictable, sustainable performance that scales with your computational needs."

The Universal RNG Library embodies this principle through careful algorithm selection, aggressive SIMD utilization, and thoughtful memory management patterns that maintain performance characteristics across diverse usage scenarios.

The AVX2 batch implementations are the library's strongest point, delivering up to 4.6x batch-mode speedups and a peak of 1355 M ops/sec at 64-bit. The single-value mode is the current weak spot: it runs roughly 30-70% slower than it should because of function pointer dispatch overhead and missing optimizations. The planned fix combines template-based dispatch, elimination of intermediate memory copies, and more aggressive compiler optimization, which together should recover most of that lost performance (a dispatch sketch follows below).

The wiki supports this work at several levels: this main page gives the executive overview and quick start, the Performance Analysis page breaks the benchmark data down in detail, and the Optimization Guide provides concrete code examples and implementation priorities. Every page aims for full working examples, clear architectural explanations, and data-driven recommendations, so the performance characteristics can be explored and understood at a systems level.
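
To illustrate the dispatch issue in general terms (a generic sketch with hypothetical names, not the Universal RNG API): calling the generator through a function pointer on every `next()` blocks inlining, while making the concrete engine a template parameter lets the compiler inline the hot path into the caller's loop.

```cpp
#include <cstddef>
#include <cstdint>

// Indirect dispatch: one function-pointer call per value, which prevents inlining.
struct DynamicRng {
    uint64_t (*next_fn)(void* state);
    void* state;
    uint64_t next() { return next_fn(state); }  // opaque call every iteration
};

// Static dispatch: the engine type is known at compile time, so next() inlines.
template <typename Engine>
struct StaticRng {
    Engine engine;
    uint64_t next() { return engine.next(); }   // resolved at compile time
};

// Stand-in engine: SplitMix64, a well-known public-domain mixer, used here
// only so the sketch is self-contained.
struct SplitMix64 {
    uint64_t s = 0x9E3779B97F4A7C15ull;
    uint64_t next() {
        uint64_t z = (s += 0x9E3779B97F4A7C15ull);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return z ^ (z >> 31);
    }
};

uint64_t sum_static(std::size_t n) {
    StaticRng<SplitMix64> rng{};
    uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += rng.next();  // tight, inlinable loop
    return acc;
}
```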

High-priority pages identified during planning:

  • Troubleshooting/FAQ - important given the known AVX-512 detection issues and cross-platform build challenges; users will hit these problems and need quick solutions.
  • Algorithm Deep Dive - the SIMD implementation details are covered elsewhere, but the algorithm theory and mathematics still need a home, especially the Xoroshiro128++ vs WyRand trade-offs.
  • Platform Support Matrix - what works where and what does not, with known limitations by OS, compiler, and CPU; the contributing guide notes cross-platform testing gaps.
  • Examples & Tutorials - practical, cookbook-style "how do I actually use this" content for real-world scenarios, beyond the examples already in the API reference.
  • Roadmap - the changelog mentions AVX-512, GPU acceleration, a Rust port, and cryptographic algorithms; a dedicated roadmap page collects these in one place.

Nice-to-Have Pages:

  • Benchmarking Methodology - How to reproduce the published performance numbers, including what hardware and settings to use
  • Security Considerations - When to use what algorithms, crypto-secure vs fast trade-offs
  • Integration Guide - How to embed in CMake projects, package managers, etc.

📚 Complete Wiki Structure

  • ✅ Main Wiki Page - Executive overview with performance highlights
  • ✅ Performance Analysis - Deep dive into benchmark data with optimization insights
  • ✅ Optimization Guide - Concrete code examples and improvement strategies
  • ✅ Build Guide - Platform-specific setup with performance configurations
  • ✅ API Reference - Complete interface documentation with usage patterns
  • ✅ Contributing Guide - Performance-focused development workflow
  • ✅ Troubleshooting & FAQ - Problem-solving for common issues
  • ✅ Algorithm Deep Dive - Mathematical foundations and implementation details
  • ✅ Platform Support Matrix - Comprehensive compatibility information
  • ✅ Examples & Tutorials - Real-world usage patterns and best practices
  • ✅ Roadmap - Future development vision and timeline


Last updated: August 2025 | Version: 1.2-dev

Important note on the current state of development: the C++ standard library's Mersenne Twister (std::mt19937 / std::mt19937_64) still outperforms this library for single-value generation on machines without SIMD support. At minimum, AVX2 is required for these generators to beat the standard library on single-value workloads; a fallback pattern is sketched below.
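
Given that constraint, one reasonable integration pattern is to use the library's fast path only when AVX2 is present and otherwise stay with the standard library. The sketch below is illustrative: it reuses the Quick Start call shape and assumes the `<64>` single-value generator mirrors the `<32>` one shown above, with a compiler-builtin AVX2 check.

```cpp
#include <cstdint>
#include <random>

#include "universal_rng.hpp"  // Quick Start API shown above

// Illustrative fallback: prefer the AVX2-accelerated generator when available,
// otherwise keep std::mt19937_64, which currently wins for single-value
// generation on non-SIMD machines.
inline bool has_avx2() {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx2");
#else
    return false;
#endif
}

uint64_t draw_one(std::mt19937_64& fallback) {
    if (has_avx2()) {
        // Mirrors the Quick Start call; a real integration would cache the
        // generator rather than rely on a function-local static.
        static auto rng = UniversalRNG::create_single_generator<64>();
        return rng->next();
    }
    return fallback();  // standard library path for non-SIMD hardware
}
```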
