# 🏗️ Architecture Overview

> "The design pattern is quite elegant, with a dispatching mechanism that selects the best available SIMD implementation at runtime."
The Universal RNG Library embodies Conway's Law in reverse - instead of letting our organizational structure dictate our architecture, we designed the architecture to mirror the optimal communication patterns between hardware and software for maximum performance.
```mermaid
flowchart TD
    A[Application Start] --> B[CPU Feature Detection]
    B --> C{AVX-512 Available?}
    C -->|Yes| D[AVX-512 Implementation]
    C -->|No| E{AVX2 Available?}
    E -->|Yes| F[AVX2 Implementation]
    E -->|No| G{SSE2 Available?}
    G -->|Yes| H[SSE2 Implementation]
    G -->|No| I[Scalar Fallback]
    D --> J[Performance: 8x Parallel]
    F --> K[Performance: 4x Parallel]
    H --> L[Performance: 2x Parallel]
    I --> M[Performance: 1x Baseline]
    J --> N[Automatic Optimization]
    K --> N
    L --> N
    M --> N
```
### 🔍 Detect Once, Optimize Forever
- CPU feature detection happens at initialization
- Zero runtime overhead after selection (see the sketch below)
- Future-proof against new instruction sets
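A minimal, self-contained sketch of the detect-once idea; the detection body here is a placeholder, not the library's real logic:

```cpp
#include <cstdio>

enum class ImplType { Scalar, SSE2, AVX2, AVX512 };

// Hypothetical stand-in for the library's detection: the function-local
// static runs the lambda exactly once, on first call; every later call
// is just a cached read.
ImplType detect_best_impl_once() {
    static const ImplType cached = [] {
        std::puts("detecting CPU features...");  // prints only once
        return ImplType::AVX2;                   // placeholder result
    }();
    return cached;
}

int main() {
    detect_best_impl_once();  // detection happens here
    detect_best_impl_once();  // cached; no second detection
}
```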
### 🔄 Polymorphic Performance
- Same API, different implementations
- Template-based dispatch eliminates virtual function overhead (see the sketch below)
- Smart pointer management ensures memory safety
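To illustrate why template dispatch matters in the hot path, a self-contained sketch; the `ScalarImpl` stand-in is hypothetical, but mirrors the non-virtual `next_u64()` shape of the real implementation classes:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal implementation type.
struct ScalarImpl {
    uint64_t state;
    uint64_t next_u64() { return state = state * 6364136223846793005ULL + 1; }
};

// The hot loop is stamped out once per implementation type, so the call
// is statically bound and inlinable - no per-value virtual dispatch.
template <typename Impl>
void fill(Impl& impl, uint64_t* dest, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = impl.next_u64();
}

int main() {
    ScalarImpl impl{42};
    uint64_t buf[8];
    fill(impl, buf, 8);  // fill<ScalarImpl> instantiated at compile time
}
```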
### 📦 Batch-First Design
- Optimized for bulk generation scenarios
- SIMD implementations excel at parallel streams
- Single-value generation as optimized special case
```cpp
// Platform-agnostic feature detection
class CPUFeatures {
public:
    enum class Feature {
        SSE2, AVX, AVX2, NEON,
        AVX512F, AVX512DQ, AVX512BW, AVX512VL,
        // ... and more
    };

    bool hasFeature(Feature feature) const;
    static std::unique_ptr<CPUFeatures> detect();
};
```

| Platform | Method | Notes |
|---|---|---|
| Windows | `__cpuid` | Native MSVC intrinsic |
| Linux | `__builtin_cpu_supports` | GCC built-in |
| macOS | `sysctlbyname` | System information query |
| ARM | `/proc/cpuinfo` parsing | Feature flags |
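A usage sketch of the detection API declared above; only `detect()` and `hasFeature()` come from the excerpt, the branching is illustrative:

```cpp
// Query once at startup, then wire up the best code path.
auto features = CPUFeatures::detect();

if (features->hasFeature(CPUFeatures::Feature::AVX512F)) {
    // select the AVX-512 implementation
} else if (features->hasFeature(CPUFeatures::Feature::AVX2)) {
    // select the AVX2 implementation
} else {
    // scalar fallback
}
```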
Each candidate is guarded independently, so a build compiled with (say) AVX-512 support still falls back gracefully at runtime on CPUs that lack it:

```cpp
ImplType detect_best_impl() {
#if defined(USE_OPENCL)
    if (gpu_available()) return ImplType::OpenCL;
#endif
#if defined(USE_AVX512)
    if (has_avx512f() && has_avx512dq()) return ImplType::AVX512;
#endif
#if defined(USE_AVX2)
    if (has_avx2()) return ImplType::AVX2;
#endif
#if defined(USE_NEON)
    if (has_neon()) return ImplType::NEON;
#endif
#if defined(USE_SSE2)
    if (has_sse2()) return ImplType::SSE2;
#endif
    return ImplType::Scalar;  // Guaranteed fallback on any hardware
}
```

Every SIMD implementation follows the same architectural blueprint:
```cpp
template<typename Impl, size_t BufferSize = RNG_PARALLEL_STREAMS>
class BufferedRNG : public RNGBase {
public:
    explicit BufferedRNG(uint64_t seed) : impl_(seed), buffer_pos_(BufferSize) {}

    uint64_t next_u64() override {
        if (buffer_pos_ >= BufferSize) {
            refill_buffer();
            buffer_pos_ = 0;
        }
        return buffer_[buffer_pos_++];
    }

private:
    Impl impl_;                                // Actual SIMD implementation
    std::array<uint64_t, BufferSize> buffer_;  // Pre-generated values
    size_t buffer_pos_;                        // Current position
};
```

| Implementation | Parallel Streams | Buffer Size | Target Hardware |
|---|---|---|---|
| Scalar | 1 | 1 | Any CPU |
| SSE2 | 2 | 2 | Intel Pentium 4+ |
| AVX2 | 4 | 4 | Intel Haswell+ |
| AVX-512 | 8 | 8 | Intel Skylake-X+ |
| NEON | 2 | 2 | ARM Cortex-A |
| OpenCL | 1024+ | 10000+ | GPU |
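As an illustrative instantiation of the blueprint (the alias is hypothetical, and it assumes `Xoroshiro128ppAVX2`, shown in the next section, is constructible from a seed):

```cpp
// Hypothetical alias: four xoroshiro streams behind the buffered wrapper.
using Avx2Xoroshiro = BufferedRNG<rng::xoroshiro::Xoroshiro128ppAVX2, 4>;

Avx2Xoroshiro gen(42);
uint64_t v = gen.next_u64();  // first call triggers refill_buffer(); the
                              // next three are served straight from the buffer
```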
The architecture seamlessly supports multiple RNG algorithms with identical interfaces:
```cpp
namespace rng {
    namespace xoroshiro {
        class Xoroshiro128ppFactory : public RNGFactory { /* ... */ };
        class Xoroshiro128ppScalar  : public RNGBase    { /* ... */ };
        class Xoroshiro128ppAVX2    { /* ... */ };
        // ... all SIMD variants
    }
    namespace wyrand {
        class WyRandFactory : public RNGFactory { /* ... */ };
        class WyRandScalar  : public RNGBase    { /* ... */ };
        class WyRandAVX2    { /* ... */ };
        // ... all SIMD variants
    }
}
```

| Algorithm | Speed | Quality | Period | Use Case |
|---|---|---|---|---|
| Xoroshiro128++ | Fast | Excellent | 2^128 - 1 | C++ std library compatible |
| WyRand | Faster | Superior | 2^64 | High-quality scientific applications |
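A usage sketch under the assumption that `RNGFactory` exposes a `create(seed)` method returning `std::unique_ptr<RNGBase>`; the excerpt above elides the factory interface, so these calls are illustrative only:

```cpp
// Hypothetical factory calls; create() is assumed, not shown above.
std::unique_ptr<RNGBase> a = rng::xoroshiro::Xoroshiro128ppFactory{}.create(42);
std::unique_ptr<RNGBase> b = rng::wyrand::WyRandFactory{}.create(42);

uint64_t x = a->next_u64();  // identical interface...
uint64_t y = b->next_u64();  // ...regardless of algorithm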
```cpp
// Aligned memory allocation with automatic cleanup
template<typename T>
class AlignedBuffer {
public:
    AlignedBuffer(size_t count, size_t alignment = 64) {
#if defined(_MSC_VER)
        m_data = reinterpret_cast<T*>(_aligned_malloc(count * sizeof(T), alignment));
#else
        // posix_memalign handles aligned allocation for GCC/Clang
        if (posix_memalign(reinterpret_cast<void**>(&m_data), alignment, count * sizeof(T)) != 0)
            m_data = nullptr;
#endif
    }

    ~AlignedBuffer() {
#if defined(_MSC_VER)
        _aligned_free(m_data);
#else
        free(m_data);
#endif
    }

    T* data() { return m_data; }

private:
    T* m_data = nullptr;
};
```
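For example, a staging buffer for 64-byte AVX-512 stores:

```cpp
AlignedBuffer<uint64_t> staging(1024);   // 64-byte aligned by default
uint64_t* p = staging.data();            // suitable for aligned SIMD stores
```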
- `std::unique_ptr` - Single ownership of RNG implementations
- `std::shared_ptr` - Shared state in multi-threaded scenarios
- RAII Wrappers - Automatic cleanup of aligned memory
- Move Semantics - Zero-copy transfers between implementations
```cpp
// Prioritized GPU vendor selection
const std::vector<std::string> preferred_vendors = {
    "NVIDIA Corporation",            // First choice: CUDA cores
    "Intel(R) Corporation",          // Second choice: Integrated GPUs
    "Advanced Micro Devices, Inc."   // Third choice: Radeon
};
```

```c
// Initialization kernel
__kernel void xoroshiro128pp_init(__global ulong2* states, ulong seed, uint num_streams);

// Generation kernel
__kernel void xoroshiro128pp_generate(__global ulong2* states, __global ulong* results, uint num_streams);

// Precision conversion kernels
__kernel void convert_uint64_to_double(__global ulong* input, __global double* output, uint num_streams);
```
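A host-side sketch of dispatching the generation kernel using the standard OpenCL C API; the `program`, `queue`, buffer handles, and `num_streams` (a `cl_uint`) are assumed to be set up already, and error handling is omitted:

```cpp
cl_int err = CL_SUCCESS;
cl_kernel kernel = clCreateKernel(program, "xoroshiro128pp_generate", &err);

// Bind the kernel arguments declared above.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &states_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &results_buf);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &num_streams);

// One work-item per parallel stream.
size_t global_size = num_streams;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                       nullptr, 0, nullptr, nullptr);
```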
For bulk requests, the buffered wrapper first drains any leftover buffered values, then generates directly into the destination to skip the intermediate copy:

```cpp
void generate_batch(uint64_t* dest, size_t count) override {
    size_t pos = 0;

    // Use remaining buffer values
    while (pos < count && buffer_pos_ < BufferSize) {
        dest[pos++] = buffer_[buffer_pos_++];
    }

    // Direct generation for large batches
    if ((count - pos) >= BufferSize) {
        size_t direct_count = (count - pos) / BufferSize * BufferSize;
        impl_.generate_batch(dest + pos, direct_count);
        pos += direct_count;
    }

    // Fill buffer for remaining values
    if (pos < count) {
        refill_buffer();
        buffer_pos_ = 0;
        while (pos < count) {
            dest[pos++] = buffer_[buffer_pos_++];
        }
    }
}
```

| Technique | AVX2 | AVX-512 | OpenCL |
|---|---|---|---|
| Parallel Streams | 4 | 8 | 1024+ |
| Memory Alignment | 32-byte | 64-byte | GPU optimal |
| Rotation Optimization | `_mm256_or_si256` | `_mm512_or_si512` | Bit manipulation |
| State Updates | Vectorized | Advanced masking | Workgroup local |
- Prefetching - Strategic memory access patterns
- Alignment - SIMD-friendly data layout
- Locality - Minimize cache misses in batch generation
- False Sharing - Thread-local buffers prevent contention (see the sketch below)
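A minimal sketch of the false-sharing countermeasure, assuming one generator state per thread; the struct itself is illustrative:

```cpp
#include <cstdint>

// alignas(64) keeps each thread's RNG state on its own cache line, so
// concurrent state updates never ping-pong a shared line between cores.
struct alignas(64) PerThreadState {
    uint64_t s[2];  // xoroshiro128++-sized state
};

thread_local PerThreadState tls_state;
```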
```cpp
// High-level interface
auto rng = universal_rng_new(seed, algorithm, precision);
uint64_t value = universal_rng_next_u64(rng);

// Compiles to optimal SIMD implementation with no runtime overhead
// after initial detection phase
```

The C API provides seamless integration with existing codebases:
```c
// Pure C interface
universal_rng_t* rng = universal_rng_new(42, RNG_ALGORITHM_XOROSHIRO, RNG_PRECISION_DOUBLE);
uint64_t random = universal_rng_next_u64(rng);
universal_rng_free(rng);
```

- Architecture mirrors optimal hardware communication patterns
- Minimal abstraction overhead
- Direct mapping to SIMD capabilities
- New instruction sets require only new implementations
- API remains stable across hardware generations
- Automatic optimization without code changes
- Runtime detection eliminates guesswork
- Template dispatch avoids virtual function overhead
- SIMD implementations maximize parallel execution
- Smart pointers prevent leaks
- RAII ensures cleanup
- Zero-overhead abstractions maintain performance
Next: [⚡ SIMD Implementations](SIMD-Implementations) - Deep dive into each optimization level
There is currently data lost off the bottom of the page - a search party needs to be sent in to rescue it!
**Please bear in constant mind above all else:** in the current state of development, the C++ standard library's Mersenne Twister still outperforms this library for single-value generation on machines without SIMD support. These libraries require at least AVX2 to beat the standard generators for single-number generation tasks.
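For reference, the standard-library baseline in question; on hardware without AVX2, a plain `std::mt19937_64` draw like this remains the faster option for one-off values:

```cpp
#include <cstdint>
#include <random>

int main() {
    std::mt19937_64 mt(42);   // std Mersenne Twister, 64-bit variant
    uint64_t one_off = mt();  // a single draw - the case std still wins
    (void)one_off;
}
```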