
Architecture Overview

whisprer edited this page Aug 4, 2025 · 1 revision

πŸ—οΈ Architecture Overview

"The design pattern is quite elegant, with a dispatching mechanism that selects the best available SIMD implementation at runtime."


🎯 Core Design Philosophy

The Universal RNG Library embodies Conway's Law in reverse - instead of letting our organizational structure dictate our architecture, we designed the architecture to mirror the optimal communication patterns between hardware and software for maximum performance.

πŸ”„ The Universal Dispatch System

Runtime Intelligence Flow

flowchart TD
    A[Application Start] --> B[CPU Feature Detection]
    B --> C{AVX-512 Available?}
    C -->|Yes| D[AVX-512 Implementation]
    C -->|No| E{AVX2 Available?}
    E -->|Yes| F[AVX2 Implementation]
    E -->|No| G{SSE2 Available?}
    G -->|Yes| H[SSE2 Implementation]
    G -->|No| I[Scalar Fallback]
D --> J[Performance: 8x Parallel]
F --> K[Performance: 4x Parallel]
H --> L[Performance: 2x Parallel]
I --> M[Performance: 1x Baseline]

J --> N[Automatic Optimization]
K --> N
L --> N
M --> N

Key Architectural Principles

  1. πŸ” Detect Once, Optimize Forever

    • CPU feature detection happens at initialization
    • Zero runtime overhead after selection
    • Future-proof against new instruction sets
  2. 🎭 Polymorphic Performance

    • Same API, different implementations
    • Template-based dispatch eliminates virtual function overhead
    • Smart pointer management ensures memory safety
  3. πŸ“¦ Batch-First Design

    • Optimized for bulk generation scenarios
    • SIMD implementations excel at parallel streams
    • Single-value generation as optimized special case

🧠 Runtime Detection System

Cross-Platform CPU Feature Detection

// Platform-agnostic feature detection
class CPUFeatures {
public:
    enum class Feature {
        SSE2, AVX, AVX2, NEON,
        AVX512F, AVX512DQ, AVX512BW, AVX512VL,
        // ... and more
    };
bool hasFeature(Feature feature) const;
static std::unique_ptr<CPUFeatures> detect();

};

Detection Strategy by Platform

Platform Method Instruction
Windows __cpuid Native MSVC intrinsic
Linux __builtin_cpu_supports GCC built-in
macOS sysctlbyname System information
ARM /proc/cpuinfo parsing Feature flags

3. Cache Optimization

  • Prefetching - Strategic memory access patterns
  • Alignment - SIMD-friendly data layout
  • Locality - Minimize cache misses in batch generation
  • False Sharing - Thread-local buffers prevent contention

πŸ”„ API Design Philosophy

Zero-Overhead Abstraction

// High-level interface
auto rng = universal_rng_new(seed, algorithm, precision);
uint64_t value = universal_rng_next_u64(rng);

// Compiles to optimal SIMD implementation with no runtime overhead // after initial detection phase

Backward Compatibility

The C API provides seamless integration with existing codebases:

// Pure C interface
universal_rng_t* rng = universal_rng_new(42, RNG_ALGORITHM_XOROSHIRO, RNG_PRECISION_DOUBLE);
uint64_t random = universal_rng_next_u64(rng);
universal_rng_free(rng);

🎯 Why This Architecture Wins

1. Conway's Law Optimization

  • Architecture mirrors optimal hardware communication patterns
  • Minimal abstraction overhead
  • Direct mapping to SIMD capabilities

2. Future-Proof Design

  • New instruction sets require only new implementations
  • API remains stable across hardware generations
  • Automatic optimization without code changes

3. Performance Through Intelligence

  • Runtime detection eliminates guesswork
  • Template dispatch avoids virtual function overhead
  • SIMD implementations maximize parallel execution

4. Memory Safety Without Cost

  • Smart pointers prevent leaks
  • RAII ensures cleanup
  • Zero-overhead abstractions maintain performance

Next: ⚑ SIMD Implementations - Deep dive into each optimization level

# πŸ—οΈ Architecture Overview

> "The design pattern is quite elegant, with a dispatching mechanism that selects the best available SIMD implementation at runtime."


## 🎯 Core Design Philosophy

The Universal RNG Library applies Conway's Law in reverse: rather than letting an organizational structure dictate the architecture, the architecture was designed to mirror the communication patterns between hardware and software that yield maximum performance.

## 🔄 The Universal Dispatch System

### Runtime Intelligence Flow

```mermaid
flowchart TD
    A[Application Start] --> B[CPU Feature Detection]
    B --> C{AVX-512 Available?}
    C -->|Yes| D[AVX-512 Implementation]
    C -->|No| E{AVX2 Available?}
    E -->|Yes| F[AVX2 Implementation]
    E -->|No| G{SSE2 Available?}
    G -->|Yes| H[SSE2 Implementation]
    G -->|No| I[Scalar Fallback]

    D --> J[Performance: 8x Parallel]
    F --> K[Performance: 4x Parallel]
    H --> L[Performance: 2x Parallel]
    I --> M[Performance: 1x Baseline]

    J --> N[Automatic Optimization]
    K --> N
    L --> N
    M --> N
```

### Key Architectural Principles

  1. πŸ” Detect Once, Optimize Forever

    • CPU feature detection happens at initialization
    • Zero runtime overhead after selection
    • Future-proof against new instruction sets
  2. 🎭 Polymorphic Performance

    • Same API, different implementations
    • Template-based dispatch eliminates virtual function overhead
    • Smart pointer management ensures memory safety
  3. πŸ“¦ Batch-First Design

    • Optimized for bulk generation scenarios
    • SIMD implementations excel at parallel streams
    • Single-value generation as optimized special case
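
The first principle can be sketched concretely. In the minimal illustration below (all names are invented for the sketch, not the library's actual internals), the CPU probe runs exactly once, its result is cached in a function pointer, and every later call dispatches through that pointer with no feature check left on the hot path:

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Two stand-in "implementations"; real code would contain SIMD intrinsics.
uint64_t next_scalar() { return 1; }
uint64_t next_simd()   { return 2; }

using NextFn = uint64_t (*)();

bool cpu_has_simd() {
#if defined(__AVX2__)
    return true;   // compile-time hint only; a real detector probes CPUID at runtime
#else
    return false;
#endif
}

// Function-local static: the probe runs once, the chosen pointer is reused forever.
NextFn dispatch() {
    static const NextFn chosen = cpu_has_simd() ? next_simd : next_scalar;
    return chosen;
}

} // namespace sketch
```

After the first call, dispatch cost is a single indirect call through a cached pointer; no branch on CPU features survives on the generation path.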

## 🧠 Runtime Detection System

### Cross-Platform CPU Feature Detection

```cpp
// Platform-agnostic feature detection
class CPUFeatures {
public:
    enum class Feature {
        SSE2, AVX, AVX2, NEON,
        AVX512F, AVX512DQ, AVX512BW, AVX512VL,
        // ... and more
    };

    bool hasFeature(Feature feature) const;
    static std::unique_ptr<CPUFeatures> detect();
};
```

### Detection Strategy by Platform

| Platform | Method | Notes |
|----------|--------|-------|
| Windows | `__cpuid` | Native MSVC intrinsic |
| Linux | `__builtin_cpu_supports` | GCC built-in |
| macOS | `sysctlbyname` | System information call |
| ARM | `/proc/cpuinfo` parsing | Feature flags |
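
The table above collapses behind one small probe. A hedged sketch for the x86 case (the free function here is illustrative; the library's `CPUFeatures` class presumably wraps the same intrinsics behind its `detect()` interface):

```cpp
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Runtime check for AVX2 support, covering the MSVC and GCC/Clang rows
// of the table; other platforms conservatively report "not present".
inline bool cpu_supports_avx2() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, 7, 0);             // CPUID leaf 7, subleaf 0
    return (regs[1] & (1 << 5)) != 0;  // EBX bit 5 = AVX2
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;                      // unknown toolchain or non-x86 target
#endif
}
```

Because the answer cannot change while the process runs, the result is safe to cache at initialization, which is exactly what the detect-once principle relies on.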

### The Selection Algorithm

```cpp
ImplType detect_best_impl() {
#if defined(USE_OPENCL)
    if (gpu_available()) return ImplType::OpenCL;
#endif
#if defined(USE_AVX512)
    if (has_avx512f() && has_avx512dq()) return ImplType::AVX512;
#endif
#if defined(USE_AVX2)
    if (has_avx2()) return ImplType::AVX2;
#endif
#if defined(USE_NEON)
    if (has_neon()) return ImplType::NEON;
#endif
#if defined(USE_SSE2)
    if (has_sse2()) return ImplType::SSE2;
#endif
    return ImplType::Scalar;  // always-safe fallback when no runtime check passes
}
```

Note the independent `#if` blocks: with a chained `#elif`, only the first compiled-in branch would exist, and a failed runtime check would fall off the end of the function without returning. Each tier instead falls through to the next, ending at the scalar path.

## ⚡ SIMD Implementation Hierarchy

### Consistent Design Pattern

Every SIMD implementation follows the same architectural blueprint:

```cpp
template<typename Impl, size_t BufferSize = RNG_PARALLEL_STREAMS>
class BufferedRNG : public RNGBase {
public:
    explicit BufferedRNG(uint64_t seed) : impl_(seed), buffer_pos_(BufferSize) {}

    uint64_t next_u64() override {
        if (buffer_pos_ >= BufferSize) {
            refill_buffer();
            buffer_pos_ = 0;
        }
        return buffer_[buffer_pos_++];
    }

private:
    void refill_buffer() { impl_.generate_batch(buffer_.data(), BufferSize); }

    Impl impl_;                                // Actual SIMD implementation
    std::array<uint64_t, BufferSize> buffer_;  // Pre-generated values
    size_t buffer_pos_;                        // Current position
};
```

### Parallelism Scaling

| Implementation | Parallel Streams | Buffer Size | Target Hardware |
|----------------|------------------|-------------|-----------------|
| Scalar | 1 | 1 | Any CPU |
| SSE2 | 2 | 2 | Intel Pentium 4+ |
| AVX2 | 4 | 4 | Intel Haswell+ |
| AVX-512 | 8 | 8 | Intel Skylake-X+ |
| NEON | 2 | 2 | ARM Cortex-A |
| OpenCL | 1024+ | 10000+ | GPU |

## 🎲 Algorithm Integration

### Dual-Algorithm Support

The architecture seamlessly supports multiple RNG algorithms with identical interfaces:

```cpp
namespace rng {
    namespace xoroshiro {
        class Xoroshiro128ppFactory : public RNGFactory { /* ... */ };
        class Xoroshiro128ppScalar : public RNGBase { /* ... */ };
        class Xoroshiro128ppAVX2 { /* ... */ };
        // ... all SIMD variants
    }

    namespace wyrand {
        class WyRandFactory : public RNGFactory { /* ... */ };
        class WyRandScalar : public RNGBase { /* ... */ };
        class WyRandAVX2 { /* ... */ };
        // ... all SIMD variants
    }
}
```

### Algorithm Characteristics

| Algorithm | Speed | Quality | Period | Use Case |
|-----------|-------|---------|--------|----------|
| Xoroshiro128++ | Fast | Excellent | 2^128 − 1 | C++ std library compatible |
| WyRand | Faster | Superior | 2^64 | High-quality scientific applications |
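
For reference, the scalar core of Xoroshiro128++ is tiny. The sketch below follows the published Blackman–Vigna algorithm; the 128-bit state is expanded from a single seed with splitmix64, a common seeding choice assumed here rather than taken from this library's sources:

```cpp
#include <cstdint>

// Scalar xoroshiro128++ as published by Blackman & Vigna; the SIMD
// variants described above run several of these state pairs in parallel lanes.
struct Xoroshiro128pp {
    uint64_t s[2];

    // splitmix64 expands one 64-bit seed into the full 128-bit state,
    // avoiding the all-zero state that xoroshiro cannot escape.
    explicit Xoroshiro128pp(uint64_t seed) {
        for (auto& word : s) {
            seed += 0x9e3779b97f4a7c15ULL;
            uint64_t z = seed;
            z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
            z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
            word = z ^ (z >> 31);
        }
    }

    static uint64_t rotl(uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }

    uint64_t next() {
        const uint64_t s0 = s[0];
        uint64_t s1 = s[1];
        const uint64_t result = rotl(s0 + s1, 17) + s0;
        s1 ^= s0;
        s[0] = rotl(s0, 49) ^ s1 ^ (s1 << 21);
        s[1] = rotl(s1, 28);
        return result;
    }
};
```

The whole hot path is a handful of adds, rotates, shifts, and XORs, which is why it vectorizes so cleanly across SIMD lanes.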

## 🔧 Memory Management Strategy

### Modern C++ RAII Excellence

```cpp
// Aligned memory allocation with automatic cleanup
template<typename T>
class AlignedBuffer {
public:
    AlignedBuffer(size_t count, size_t alignment = 64) {
#if defined(_MSC_VER)
        m_data = static_cast<T*>(_aligned_malloc(count * sizeof(T), alignment));
#else
        // posix_memalign is the portable route on GCC/Clang targets
        void* p = nullptr;
        if (posix_memalign(&p, alignment, count * sizeof(T)) != 0) p = nullptr;
        m_data = static_cast<T*>(p);
#endif
    }

    ~AlignedBuffer() {
#if defined(_MSC_VER)
        _aligned_free(m_data);
#else
        free(m_data);
#endif
    }

    // Non-copyable: the buffer owns its allocation outright
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;

    T* data() { return m_data; }

private:
    T* m_data = nullptr;
};
```

### Smart Pointer Hierarchy

- **`std::unique_ptr`** - Single ownership of RNG implementations
- **`std::shared_ptr`** - Shared state in multi-threaded scenarios
- **RAII wrappers** - Automatic cleanup of aligned memory
- **Move semantics** - Zero-copy transfers between implementations
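
The ownership model is easiest to see in miniature. In the simplified sketch below (`RNGInterface`, `CountingRNG`, and `make_rng` are illustrative stand-ins, not the library's real types), the factory hands back a `std::unique_ptr` to the base interface, so the implementation is destroyed automatically when the handle leaves scope:

```cpp
#include <cstdint>
#include <memory>

// Minimal ownership sketch: unique_ptr owns the polymorphic RNG, and
// RAII guarantees the destructor (and any buffers it frees) always runs.
struct RNGInterface {
    virtual ~RNGInterface() = default;
    virtual uint64_t next_u64() = 0;
};

struct CountingRNG : RNGInterface {
    uint64_t state;
    explicit CountingRNG(uint64_t seed) : state(seed) {}
    uint64_t next_u64() override { return state++; }  // stand-in generator
};

std::unique_ptr<RNGInterface> make_rng(uint64_t seed) {
    return std::make_unique<CountingRNG>(seed);
}
```

Returning by `unique_ptr` keeps the call site leak-free even on early returns and exceptions, with no reference-counting cost on the generation path.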

## 🖥️ OpenCL GPU Architecture

### Device Selection Strategy

```cpp
// Prioritized GPU vendor selection
const std::vector<std::string> preferred_vendors = {
    "NVIDIA Corporation",            // First choice: CUDA cores
    "Intel(R) Corporation",          // Second choice: Integrated GPUs
    "Advanced Micro Devices, Inc."   // Third choice: Radeon
};
```

### Kernel Organization

```c
// Initialization kernel
__kernel void xoroshiro128pp_init(__global ulong2* states, ulong seed, uint num_streams);

// Generation kernel
__kernel void xoroshiro128pp_generate(__global ulong2* states, __global ulong* results, uint num_streams);

// Precision conversion kernels
__kernel void convert_uint64_to_double(__global ulong* input, __global double* output, uint num_streams);
```

## 📊 Performance Optimization Techniques

### 1. Batch Processing Strategy

```cpp
void generate_batch(uint64_t* dest, size_t count) override {
    size_t pos = 0;

    // Use remaining buffer values
    while (pos < count && buffer_pos_ < BufferSize) {
        dest[pos++] = buffer_[buffer_pos_++];
    }

    // Direct generation for large batches
    if ((count - pos) >= BufferSize) {
        size_t direct_count = (count - pos) / BufferSize * BufferSize;
        impl_.generate_batch(dest + pos, direct_count);
        pos += direct_count;
    }

    // Fill buffer for remaining values
    if (pos < count) {
        refill_buffer();
        buffer_pos_ = 0;
        while (pos < count) {
            dest[pos++] = buffer_[buffer_pos_++];
        }
    }
}
```

### 2. SIMD-Specific Optimizations

| Technique | AVX2 | AVX-512 | OpenCL |
|-----------|------|---------|--------|
| Parallel streams | 4 | 8 | 1024+ |
| Memory alignment | 32-byte | 64-byte | GPU optimal |
| Rotation optimization | `_mm256_or_si256` | `_mm512_or_si512` | Bit manipulation |
| State updates | Vectorized | Advanced masking | Workgroup local |

### 3. Cache Optimization

- **Prefetching** - Strategic memory-access patterns
- **Alignment** - SIMD-friendly data layout
- **Locality** - Minimize cache misses in batch generation
- **False-sharing avoidance** - Thread-local buffers prevent line contention
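
The false-sharing point can be shown directly: padding each thread's slot out to its own 64-byte cache line (a typical line size, assumed here rather than queried) guarantees that two threads writing adjacent slots never invalidate each other's lines.

```cpp
#include <cstdint>

// Each per-thread slot is forced onto its own 64-byte cache line, so
// concurrent writers to neighboring slots never share (or ping-pong) a line.
struct alignas(64) PaddedCounter {
    uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == 64, "one slot per cache line");
static_assert(alignof(PaddedCounter) == 64, "line-aligned");
```

An array of `PaddedCounter` places consecutive elements exactly one cache line apart, which is the layout a thread-local buffer scheme relies on.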

## 🔄 API Design Philosophy

### Zero-Overhead Abstraction

```cpp
// High-level interface
auto rng = universal_rng_new(seed, algorithm, precision);
uint64_t value = universal_rng_next_u64(rng);

// Compiles to the optimal SIMD implementation with no runtime overhead
// after the initial detection phase
```

### Backward Compatibility

The C API provides seamless integration with existing codebases:

```c
// Pure C interface
universal_rng_t* rng = universal_rng_new(42, RNG_ALGORITHM_XOROSHIRO, RNG_PRECISION_DOUBLE);
uint64_t random = universal_rng_next_u64(rng);
universal_rng_free(rng);
```

## 🎯 Why This Architecture Wins

### 1. Conway's Law Optimization

- Architecture mirrors optimal hardware communication patterns
- Minimal abstraction overhead
- Direct mapping to SIMD capabilities

### 2. Future-Proof Design

- New instruction sets require only new implementations
- API remains stable across hardware generations
- Automatic optimization without code changes

### 3. Performance Through Intelligence

- Runtime detection eliminates guesswork
- Template dispatch avoids virtual-function overhead
- SIMD implementations maximize parallel execution

### 4. Memory Safety Without Cost

- Smart pointers prevent leaks
- RAII ensures cleanup
- Zero-overhead abstractions maintain performance

**Next:** [⚡ SIMD Implementations](SIMD-Implementations) - Deep dive into each optimization level

> **Important caveat:** at the current state of development, the C++ standard library's Mersenne Twister still outperforms this library for single-value generation on machines without SIMD support. These implementations require at least AVX2 before they beat the standard generators on single-number generation tasks.
