# 🏗️ Architecture Overview

> "The design pattern is quite elegant, with a dispatching mechanism that selects the best available SIMD implementation at runtime."
The Universal RNG Library embodies Conway's Law in reverse - instead of letting our organizational structure dictate our architecture, we designed the architecture to mirror the optimal communication patterns between hardware and software for maximum performance.
```mermaid
flowchart TD
    A[Application Start] --> B[CPU Feature Detection]
    B --> C{AVX-512 Available?}
    C -->|Yes| D[AVX-512 Implementation]
    C -->|No| E{AVX2 Available?}
    E -->|Yes| F[AVX2 Implementation]
    E -->|No| G{SSE2 Available?}
    G -->|Yes| H[SSE2 Implementation]
    G -->|No| I[Scalar Fallback]
    D --> J[Performance: 8x Parallel]
    F --> K[Performance: 4x Parallel]
    H --> L[Performance: 2x Parallel]
    I --> M[Performance: 1x Baseline]
    J --> N[Automatic Optimization]
    K --> N
    L --> N
    M --> N
```
### 🔍 Detect Once, Optimize Forever
- CPU feature detection happens at initialization
- Zero runtime overhead after selection (see the sketch below)
- Future-proof against new instruction sets
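A minimal, self-contained sketch of the detect-once idea; the detection body here is a placeholder, not the library's real logic:

```cpp
#include <cstdio>

enum class ImplType { Scalar, SSE2, AVX2, AVX512 };

// Hypothetical stand-in for the library's detection: the function-local
// static runs the lambda exactly once, on first call; every later call
// is just a cached read.
ImplType detect_best_impl_once() {
    static const ImplType cached = [] {
        std::puts("detecting CPU features...");  // prints only once
        return ImplType::AVX2;                   // placeholder result
    }();
    return cached;
}

int main() {
    detect_best_impl_once();  // detection happens here
    detect_best_impl_once();  // cached; no second detection
}
```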
### 🔄 Polymorphic Performance
- Same API, different implementations
- Template-based dispatch eliminates virtual function overhead (see the sketch below)
- Smart pointer management ensures memory safety
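To illustrate why template dispatch matters in the hot path, a self-contained sketch; the `ScalarImpl` stand-in is hypothetical, but mirrors the non-virtual `next_u64()` shape of the real implementation classes:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal implementation type.
struct ScalarImpl {
    uint64_t state;
    uint64_t next_u64() { return state = state * 6364136223846793005ULL + 1; }
};

// The hot loop is stamped out once per implementation type, so the call
// is statically bound and inlinable - no per-value virtual dispatch.
template <typename Impl>
void fill(Impl& impl, uint64_t* dest, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = impl.next_u64();
}

int main() {
    ScalarImpl impl{42};
    uint64_t buf[8];
    fill(impl, buf, 8);  // fill<ScalarImpl> instantiated at compile time
}
```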
### 📦 Batch-First Design
- Optimized for bulk generation scenarios
- SIMD implementations excel at parallel streams
- Single-value generation as optimized special case
```cpp
// Platform-agnostic feature detection
class CPUFeatures {
public:
    enum class Feature {
        SSE2, AVX, AVX2, NEON,
        AVX512F, AVX512DQ, AVX512BW, AVX512VL,
        // ... and more
    };

    bool hasFeature(Feature feature) const;
    static std::unique_ptr<CPUFeatures> detect();
};
```

| Platform | Method | Notes |
|---|---|---|
| Windows | `__cpuid` | Native MSVC intrinsic |
| Linux | `__builtin_cpu_supports` | GCC built-in |
| macOS | `sysctlbyname` | System information query |
| ARM | `/proc/cpuinfo` parsing | Feature flags |
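A usage sketch of the detection API declared above; only `detect()` and `hasFeature()` come from the excerpt, the branching is illustrative:

```cpp
// Query once at startup, then wire up the best code path.
auto features = CPUFeatures::detect();

if (features->hasFeature(CPUFeatures::Feature::AVX512F)) {
    // select the AVX-512 implementation
} else if (features->hasFeature(CPUFeatures::Feature::AVX2)) {
    // select the AVX2 implementation
} else {
    // scalar fallback
}
```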
Each candidate is guarded independently, so a build compiled with (say) AVX-512 support still falls back gracefully at runtime on CPUs that lack it:

```cpp
ImplType detect_best_impl() {
#if defined(USE_OPENCL)
    if (gpu_available()) return ImplType::OpenCL;
#endif
#if defined(USE_AVX512)
    if (has_avx512f() && has_avx512dq()) return ImplType::AVX512;
#endif
#if defined(USE_AVX2)
    if (has_avx2()) return ImplType::AVX2;
#endif
#if defined(USE_NEON)
    if (has_neon()) return ImplType::NEON;
#endif
#if defined(USE_SSE2)
    if (has_sse2()) return ImplType::SSE2;
#endif
    return ImplType::Scalar;  // Guaranteed fallback on any hardware
}
```

Every SIMD implementation follows the same architectural blueprint:
```cpp
template<typename Impl, size_t BufferSize = RNG_PARALLEL_STREAMS>
class BufferedRNG : public RNGBase {
public:
    explicit BufferedRNG(uint64_t seed) : impl_(seed), buffer_pos_(BufferSize) {}

    uint64_t next_u64() override {
        if (buffer_pos_ >= BufferSize) {
            refill_buffer();
            buffer_pos_ = 0;
        }
        return buffer_[buffer_pos_++];
    }

private:
    Impl impl_;                                // Actual SIMD implementation
    std::array<uint64_t, BufferSize> buffer_;  // Pre-generated values
    size_t buffer_pos_;                        // Current position
};
```

| Implementation | Parallel Streams | Buffer Size | Target Hardware |
|---|---|---|---|
| Scalar | 1 | 1 | Any CPU |
| SSE2 | 2 | 2 | Intel Pentium 4+ |
| AVX2 | 4 | 4 | Intel Haswell+ |
| AVX-512 | 8 | 8 | Intel Skylake-X+ |
| NEON | 2 | 2 | ARM Cortex-A |
| OpenCL | 1024+ | 10000+ | GPU |
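As an illustrative instantiation of the blueprint (the alias is hypothetical, and it assumes `Xoroshiro128ppAVX2`, shown in the next section, is constructible from a seed):

```cpp
// Hypothetical alias: four xoroshiro streams behind the buffered wrapper.
using Avx2Xoroshiro = BufferedRNG<rng::xoroshiro::Xoroshiro128ppAVX2, 4>;

Avx2Xoroshiro gen(42);
uint64_t v = gen.next_u64();  // first call triggers refill_buffer(); the
                              // next three are served straight from the buffer
```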
The architecture seamlessly supports multiple RNG algorithms with identical interfaces:
```cpp
namespace rng {
    namespace xoroshiro {
        class Xoroshiro128ppFactory : public RNGFactory { /* ... */ };
        class Xoroshiro128ppScalar  : public RNGBase    { /* ... */ };
        class Xoroshiro128ppAVX2    { /* ... */ };
        // ... all SIMD variants
    }
    namespace wyrand {
        class WyRandFactory : public RNGFactory { /* ... */ };
        class WyRandScalar  : public RNGBase    { /* ... */ };
        class WyRandAVX2    { /* ... */ };
        // ... all SIMD variants
    }
}
```

| Algorithm | Speed | Quality | Period | Use Case |
|---|---|---|---|---|
| Xoroshiro128++ | Fast | Excellent | 2^128 - 1 | C++ std library compatible |
| WyRand | Faster | Superior | 2^64 | High-quality scientific applications |
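A usage sketch under the assumption that `RNGFactory` exposes a `create(seed)` method returning `std::unique_ptr<RNGBase>`; the excerpt above elides the factory interface, so these calls are illustrative only:

```cpp
// Hypothetical factory calls; create() is assumed, not shown above.
std::unique_ptr<RNGBase> a = rng::xoroshiro::Xoroshiro128ppFactory{}.create(42);
std::unique_ptr<RNGBase> b = rng::wyrand::WyRandFactory{}.create(42);

uint64_t x = a->next_u64();  // identical interface...
uint64_t y = b->next_u64();  // ...regardless of algorithm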
```cpp
// Aligned memory allocation with automatic cleanup
template<typename T>
class AlignedBuffer {
public:
    AlignedBuffer(size_t count, size_t alignment = 64) {
#if defined(_MSC_VER)
        m_data = reinterpret_cast<T*>(_aligned_malloc(count * sizeof(T), alignment));
#else
        // posix_memalign handles aligned allocation for GCC/Clang
        if (posix_memalign(reinterpret_cast<void**>(&m_data), alignment, count * sizeof(T)) != 0)
            m_data = nullptr;
#endif
    }

    ~AlignedBuffer() {
#if defined(_MSC_VER)
        _aligned_free(m_data);
#else
        free(m_data);
#endif
    }

    T* data() { return m_data; }

private:
    T* m_data = nullptr;
};
```
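For example, a staging buffer for 64-byte AVX-512 stores:

```cpp
AlignedBuffer<uint64_t> staging(1024);   // 64-byte aligned by default
uint64_t* p = staging.data();            // suitable for aligned SIMD stores
```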
- `std::unique_ptr` - Single ownership of RNG implementations
- `std::shared_ptr` - Shared state in multi-threaded scenarios
- RAII Wrappers - Automatic cleanup of aligned memory
- Move Semantics - Zero-copy transfers between implementations
```cpp
// Prioritized GPU vendor selection
const std::vector<std::string> preferred_vendors = {
    "NVIDIA Corporation",            // First choice: CUDA cores
    "Intel(R) Corporation",          // Second choice: Integrated GPUs
    "Advanced Micro Devices, Inc."   // Third choice: Radeon
};
```

```c
// Initialization kernel
__kernel void xoroshiro128pp_init(__global ulong2* states, ulong seed, uint num_streams);

// Generation kernel
__kernel void xoroshiro128pp_generate(__global ulong2* states, __global ulong* results, uint num_streams);

// Precision conversion kernels
__kernel void convert_uint64_to_double(__global ulong* input, __global double* output, uint num_streams);
```
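A host-side sketch of dispatching the generation kernel using the standard OpenCL C API; the `program`, `queue`, buffer handles, and `num_streams` (a `cl_uint`) are assumed to be set up already, and error handling is omitted:

```cpp
cl_int err = CL_SUCCESS;
cl_kernel kernel = clCreateKernel(program, "xoroshiro128pp_generate", &err);

// Bind the kernel arguments declared above.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &states_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &results_buf);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &num_streams);

// One work-item per parallel stream.
size_t global_size = num_streams;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                       nullptr, 0, nullptr, nullptr);
```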
For bulk requests, the buffered wrapper first drains any leftover buffered values, then generates directly into the destination to skip the intermediate copy:

```cpp
void generate_batch(uint64_t* dest, size_t count) override {
    size_t pos = 0;

    // Use remaining buffer values
    while (pos < count && buffer_pos_ < BufferSize) {
        dest[pos++] = buffer_[buffer_pos_++];
    }

    // Direct generation for large batches
    if ((count - pos) >= BufferSize) {
        size_t direct_count = (count - pos) / BufferSize * BufferSize;
        impl_.generate_batch(dest + pos, direct_count);
        pos += direct_count;
    }

    // Fill buffer for remaining values
    if (pos < count) {
        refill_buffer();
        buffer_pos_ = 0;
        while (pos < count) {
            dest[pos++] = buffer_[buffer_pos_++];
        }
    }
}
```

| Technique | AVX2 | AVX-512 | OpenCL |
|---|---|---|---|
| Parallel Streams | 4 | 8 | 1024+ |
| Memory Alignment | 32-byte | 64-byte | GPU optimal |
| Rotation Optimization | `_mm256_or_si256` | `_mm512_or_si512` | Bit manipulation |
| State Updates | Vectorized | Advanced masking | Workgroup local |
- Prefetching - Strategic memory access patterns
- Alignment - SIMD-friendly data layout
- Locality - Minimize cache misses in batch generation
- False Sharing - Thread-local buffers prevent contention (see the sketch below)
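A minimal sketch of the false-sharing countermeasure, assuming one generator state per thread; the struct itself is illustrative:

```cpp
#include <cstdint>

// alignas(64) keeps each thread's RNG state on its own cache line, so
// concurrent state updates never ping-pong a shared line between cores.
struct alignas(64) PerThreadState {
    uint64_t s[2];  // xoroshiro128++-sized state
};

thread_local PerThreadState tls_state;
```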
```cpp
// High-level interface
auto rng = universal_rng_new(seed, algorithm, precision);
uint64_t value = universal_rng_next_u64(rng);

// Compiles to optimal SIMD implementation with no runtime overhead
// after initial detection phase
```

The C API provides seamless integration with existing codebases:
```c
// Pure C interface
universal_rng_t* rng = universal_rng_new(42, RNG_ALGORITHM_XOROSHIRO, RNG_PRECISION_DOUBLE);
uint64_t random = universal_rng_next_u64(rng);
universal_rng_free(rng);
```

- Architecture mirrors optimal hardware communication patterns
- Minimal abstraction overhead
- Direct mapping to SIMD capabilities
- New instruction sets require only new implementations
- API remains stable across hardware generations
- Automatic optimization without code changes
- Runtime detection eliminates guesswork
- Template dispatch avoids virtual function overhead
- SIMD implementations maximize parallel execution
- Smart pointers prevent leaks
- RAII ensures cleanup
- Zero-overhead abstractions maintain performance
Next: [⚡ SIMD Implementations](SIMD-Implementations) - Deep dive into each optimization level
There is currently data lost off the bottom of the page - a search party needs to be sent in to rescue it!
**Please bear in constant mind above all else:** in the current state of development, the C++ standard library's Mersenne Twister still outperforms this library for single-value generation on machines without SIMD support. These libraries require at least AVX2 to beat the standard generators for single-number generation tasks.
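For reference, the standard-library baseline in question; on hardware without AVX2, a plain `std::mt19937_64` draw like this remains the faster option for one-off values:

```cpp
#include <cstdint>
#include <random>

int main() {
    std::mt19937_64 mt(42);   // std Mersenne Twister, 64-bit variant
    uint64_t one_off = mt();  // a single draw - the case std still wins
    (void)one_off;
}
```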