This guide provides comprehensive information about the parallel and batch processing capabilities in libstats. The library implements a dual API approach: intelligent auto-dispatch for most users and explicit strategy control for power users who need precise performance control.
- Simplicity First: Clean, intuitive API for 95% of use cases
- Power User Access: Explicit strategy selection for advanced performance tuning
- Performance Preservation: Zero regression in optimized code paths
- Intelligent Dispatch: Automatic strategy selection based on data characteristics and system capabilities
- Thread Safety: All batch operations are fully thread-safe
Strategy selection uses a threshold hierarchy tuned per architecture at startup:
- Below SIMD threshold (~8 elements):
SCALAR— dispatch overhead exceeds benefit - Below parallel threshold (varies by distribution):
VECTORIZED— VectorOps batch path - Above parallel threshold:
PARALLELorWORK_STEALING— threading pays off
The parallel threshold is distribution-specific: Gaussian (exp/erf heavy) parallelizes sooner than Uniform (trivial arithmetic). Thresholds are refined at startup based on measured thread overhead and SIMD efficiency for the current machine.
An optional performance history layer can override the threshold decision when it has high-confidence learned data for a specific distribution and batch size.
These methods automatically select the optimal processing strategy based on your data and system capabilities:
// Probability Density Function (PDF)
void getProbability(std::span<const double> values, std::span<double> results,
const PerformanceHint& hint = {}) const;
// Log Probability Density Function (Log PDF)
void getLogProbability(std::span<const double> values, std::span<double> results,
const PerformanceHint& hint = {}) const;
// Cumulative Distribution Function (CDF)
void getCumulativeProbability(std::span<const double> values, std::span<double> results,
const PerformanceHint& hint = {}) const;You can provide optional hints to guide strategy selection:
PerformanceHint hint;
hint.strategy = PerformanceHint::PreferredStrategy::AUTO; // Default: automatic
hint.strategy = PerformanceHint::PreferredStrategy::MINIMIZE_LATENCY; // Favor speed
hint.strategy = PerformanceHint::PreferredStrategy::MAXIMIZE_THROUGHPUT; // Favor parallelism
hint.strategy = PerformanceHint::PreferredStrategy::FORCE_SCALAR; // Force scalar processing
hint.strategy = PerformanceHint::PreferredStrategy::FORCE_VECTORIZED; // Force vectorized batch path
hint.strategy = PerformanceHint::PreferredStrategy::FORCE_PARALLEL; // Force parallel processingThese methods give you direct control over the processing strategy:
// Probability Density Function with explicit strategy
void getProbabilityWithStrategy(std::span<const double> values, std::span<double> results,
stats::detail::Strategy strategy) const;
// Log Probability Density Function with explicit strategy
void getLogProbabilityWithStrategy(std::span<const double> values, std::span<double> results,
stats::detail::Strategy strategy) const;
// Cumulative Distribution Function with explicit strategy
void getCumulativeProbabilityWithStrategy(std::span<const double> values, std::span<double> results,
stats::detail::Strategy strategy) const;enum class Strategy {
SCALAR, // Element-by-element loop (batch < simd_min threshold)
VECTORIZED, // VectorOps batch path — SIMD-accelerated for Gaussian,
// Exponential, Gamma (PDF/LogPDF), Uniform (CDF), and ChiSquared
PARALLEL, // Multi-threaded via ParallelUtils::parallelFor
WORK_STEALING // Work-stealing pool for irregular or variable workloads
};#include "libstats.h"
// Create a distribution
auto result = libstats::GaussianDistribution::create(0.0, 1.0);
if (result.isOk()) {
auto gaussian = std::move(result.value);
// Prepare data
std::vector<double> input_values(10000);
std::vector<double> pdf_results(10000);
std::vector<double> cdf_results(10000);
// Fill input with test data
std::iota(input_values.begin(), input_values.end(), -5.0);
// Auto-dispatch will select optimal strategy (likely PARALLEL for 10k elements)
gaussian.getProbability(std::span<const double>(input_values),
std::span<double>(pdf_results));
gaussian.getCumulativeProbability(std::span<const double>(input_values),
std::span<double>(cdf_results));
std::cout << "Processed " << input_values.size() << " values" << std::endl;
}// For latency-sensitive applications
PerformanceHint low_latency;
low_latency.strategy = PerformanceHint::PreferredStrategy::MINIMIZE_LATENCY;
gaussian.getProbability(std::span<const double>(small_batch),
std::span<double>(results), low_latency);
// For throughput-focused applications
PerformanceHint high_throughput;
high_throughput.strategy = PerformanceHint::PreferredStrategy::MAXIMIZE_THROUGHPUT;
gaussian.getProbability(std::span<const double>(large_batch),
std::span<double>(results), high_throughput);// Power user: explicitly control the processing strategy
using Strategy = stats::detail::Strategy;
std::vector<double> input(5000);
std::vector<double> results(5000);
// Force scalar processing (useful for debugging or comparison)
gaussian.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(results), Strategy::SCALAR);
// Force parallel processing (useful when you know your data characteristics)
gaussian.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(results), Strategy::PARALLEL);
// Use work-stealing for irregular workloads
gaussian.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(results), Strategy::WORK_STEALING);#include <chrono>
// Benchmark different strategies
auto benchmark_strategy = [&](Strategy strategy, const std::string& name) {
auto start = std::chrono::high_resolution_clock::now();
gaussian.getProbabilityWithStrategy(std::span<const double>(test_data),
std::span<double>(results), strategy);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::cout << name << ": " << duration.count() << " μs" << std::endl;
};
benchmark_strategy(Strategy::SCALAR, "Scalar");
benchmark_strategy(Strategy::VECTORIZED, "Vectorized");
benchmark_strategy(Strategy::PARALLEL, "Parallel");
benchmark_strategy(Strategy::WORK_STEALING, "Work-Stealing");| Batch Size | Recommended Strategy | Rationale |
|---|---|---|
| < 8 elements | SCALAR |
Dispatch overhead exceeds any gain |
| 8 – ~1000 elements | VECTORIZED |
VectorOps batch path; SIMD-accelerated for Gaussian |
| ~1000 – 10000 elements | PARALLEL |
Threading benefit outweighs overhead |
| 10000+ elements | WORK_STEALING |
Dynamic load balancing for large/irregular workloads |
Exact thresholds are architecture-specific (Gaussian parallel threshold ~2500 on Intel AVX,
higher on machines with greater thread overhead). Query ./build/tools/system_inspector
for thresholds on your machine.
From simd_verification on MacBook Pro 9,1 (i7-3820QM, SSE2+AVX, no FMA), v1.2.0:
| Distribution | Operation | SIMD Speedup | Notes |
|---|---|---|---|
| Uniform | ~54x | Trivial bounds check; gain is batch-path amortization | |
| Uniform | CDF | ~25x | Linear interior vectorized; boundary scalar fixup |
| Gaussian | LogPDF | ~52x | Purely affine, no transcendentals |
| Gaussian | ~12x | Uses vector_exp | |
| Gaussian | CDF | ~10x | Uses vector_erf |
| Exponential | LogPDF | ~21x | Purely affine |
| Exponential | ~10x | vector_exp + scalar fixup | |
| Exponential | CDF | ~10x | vector_exp + scalar fixup |
| Gamma | ~10x | vector_log pipeline | |
| Gamma | LogPDF | ~7x | vector_log pipeline |
| Chi-squared | ~10x | Delegates to Gamma batch path | |
| Student's t | ~7x | One vector_log in the log-space pipeline | |
| Student's t | LogPDF | ~8x | Log-space SIMD pipeline |
| Beta | ~5x | Two vector_log calls plus boundary fixup | |
| Beta | LogPDF | ~5x | Two-log SIMD pipeline |
| Gamma/Poisson/Student's t/Beta | CDF | 1–2x | Dominated by scalar special functions |
| Overall | all 54 | ~2.5x | Geometric mean across the full current suite |
Higher speedups seen on AVX2/FMA (Kaby Lake: 4.45x+ overall) and different profile on NEON M1. AVX-512 (AMD Ryzen Zen 4) exercises 8-wide vectors; transcendentals currently delegate to AVX.
SIMD vectorization status: Exponential (all 3 ops), Gamma (PDF/LogPDF), and Uniform (CDF)
have full SIMD batch paths via BatchUnsafeImpl using the compute+fixup pattern. Poisson,
Discrete, and Uniform PDF/LogPDF remain scalar in their hot paths — those require vector_floor
or vector_blend primitives not yet in VectorOps.
ChiSquared delegates all batch operations to GammaDistribution (Delegation pattern), so
it inherits Gamma's SIMD performance automatically. Student's t and Beta use direct
log-space SIMD batch paths for PDF/LogPDF while keeping scalar special-function CDF implementations.
The library automatically detects and uses available system capabilities:
// Query detected capabilities (for informational purposes)
std::cout << "SIMD level: " << libstats::cpu::best_simd_level() << std::endl;
std::cout << "Vector width: " << libstats::cpu::optimal_double_width() << std::endl;
std::cout << "CPU cores: " << std::thread::hardware_concurrency() << std::endl;For optimal performance in production applications:
int main() {
// Initialize performance systems to eliminate cold-start delays
libstats::initialize_performance_systems();
// Your application code here...
auto gaussian = libstats::GaussianDistribution::create(0.0, 1.0);
// Batch operations will have optimal performance from the start
}All batch processing methods are fully thread-safe:
// Safe to call from multiple threads simultaneously
std::vector<std::thread> threads;
for (int i = 0; i < std::thread::hardware_concurrency(); ++i) {
threads.emplace_back([&gaussian, i]() {
std::vector<double> local_input(1000);
std::vector<double> local_results(1000);
// Fill with thread-specific data
std::iota(local_input.begin(), local_input.end(), i * 1000.0);
// Thread-safe batch processing
gaussian.getProbability(std::span<const double>(local_input),
std::span<double>(local_results));
});
}
for (auto& t : threads) {
t.join();
}All nine distributions support parallelBatchFit for fitting distribution parameters to
multiple independent datasets simultaneously, using the shared detail::batchFitParallel
helper from include/core/parallel_batch_fit.h.
// Fit Gaussian parameters to 100 independent datasets in parallel
std::vector<std::vector<double>> datasets(100);
for (auto& ds : datasets) {
ds.resize(1000);
// ... fill each dataset ...
}
std::vector<stats::GaussianDistribution> fitted(100);
stats::GaussianDistribution::parallelBatchFit(datasets, fitted);
// fitted[i] now holds parameters estimated from datasets[i]Available on all 9 distributions: Gaussian, Exponential, Uniform, Discrete, Poisson,
Gamma, Chi-squared, Student's t, and Beta all implement parallelBatchFit.
The implementation uses the global ThreadPool for datasets ≥ 8 (configurable via
dispatch_table::BATCH_FIT_MIN). Smaller counts run serially. Exceptions from individual
fits are captured per-chunk and re-thrown after all futures complete; a serial fallback
runs if the thread pool itself fails to initialise.
- Strengths: Extremely fast SIMD processing due to simple integer operations
- Cache Behavior: Excellent cache locality for repeated discrete values
- Optimal Strategy: VECTORIZED for most batch sizes
PoissonDistribution poisson(3.0);
std::vector<double> discrete_values = {0, 1, 2, 3, 4, 5, 1, 2, 3, 4}; // Repeated values
std::vector<double> probabilities(discrete_values.size());
// VECTORIZED will be extremely fast due to cache-friendly discrete lookups
poisson.getProbability(std::span<const double>(discrete_values),
std::span<double>(probabilities));- Mathematical Complexity: Varies by distribution type
- Memory Patterns: Sequential access patterns optimize well
- Optimal Strategy: Depends on computational complexity
// Uniform: Extremely simple (range checks) - SIMD dominates
UniformDistribution uniform(0.0, 1.0);
// Gaussian: Moderate complexity - balanced SIMD/Parallel trade-off
GaussianDistribution gaussian(0.0, 1.0);
// Gamma: High complexity - Parallel often better than SIMD
GammaDistribution gamma(2.0, 1.5);// The library validates inputs automatically
std::vector<double> input(1000);
std::vector<double> results(500); // Wrong size!
try {
gaussian.getProbability(std::span<const double>(input),
std::span<double>(results));
} catch (const std::invalid_argument& e) {
std::cout << "Error: " << e.what() << std::endl;
// "Input and output spans must have the same size"
}// Empty inputs are handled gracefully
std::vector<double> empty_input;
std::vector<double> empty_results;
// No-op, returns immediately
gaussian.getProbability(std::span<const double>(empty_input),
std::span<double>(empty_results));// Use safe factory methods for exception-free operation
auto safe_gaussian = libstats::GaussianDistribution::create(mean, std_dev);
if (safe_gaussian.isOk()) {
// Process successfully
safe_gaussian.value.getProbability(input_span, results_span);
} else {
std::cout << "Error: " << safe_gaussian.message << std::endl;
}// Avoid very small batches in loops - combine when possible
// ❌ Inefficient: Many small batch calls
for (const auto& small_batch : many_small_batches) {
distribution.getProbability(small_batch, results);
}
// ✅ Efficient: Single large batch call
std::vector<double> combined_input;
for (const auto& batch : many_small_batches) {
combined_input.insert(combined_input.end(), batch.begin(), batch.end());
}
distribution.getProbability(std::span<const double>(combined_input), combined_results);// ✅ Reuse vectors to avoid allocation overhead
std::vector<double> reusable_input(max_batch_size);
std::vector<double> reusable_results(max_batch_size);
for (size_t batch_start = 0; batch_start < total_size; batch_start += batch_size) {
size_t current_batch_size = std::min(batch_size, total_size - batch_start);
// Resize without reallocation (if capacity is sufficient)
reusable_input.resize(current_batch_size);
reusable_results.resize(current_batch_size);
// Fill input data...
distribution.getProbability(std::span<const double>(reusable_input),
std::span<double>(reusable_results));
}// For real-time applications where latency matters
PerformanceHint real_time;
real_time.strategy = PerformanceHint::PreferredStrategy::MINIMIZE_LATENCY;
// For batch processing where throughput matters
PerformanceHint batch_processing;
batch_processing.strategy = PerformanceHint::PreferredStrategy::MAXIMIZE_THROUGHPUT;// Benchmark your specific data patterns and sizes
auto profile_batch_sizes = [&](const std::vector<size_t>& sizes) {
for (size_t size : sizes) {
std::vector<double> test_input(size);
std::vector<double> test_results(size);
// Fill with representative data...
auto start = std::chrono::high_resolution_clock::now();
distribution.getProbability(std::span<const double>(test_input),
std::span<double>(test_results));
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
double throughput = static_cast<double>(size) / duration.count(); // elements per μs
std::cout << "Size " << size << ": " << throughput << " elements/μs" << std::endl;
}
};
profile_batch_sizes({100, 1000, 10000, 100000});If you're upgrading from older versions of libstats:
// ❌ Old deprecated API (removed)
// distribution.getProbabilityBatch(values, results, count);
// ✅ New auto-dispatch API
distribution.getProbability(std::span<const double>(values, count),
std::span<double>(results, count));
// ✅ Or explicit strategy API
distribution.getProbabilityWithStrategy(std::span<const double>(values, count),
std::span<double>(results, count),
stats::detail::Strategy::VECTORIZED);// ❌ Inefficient: Individual calls in a loop
std::vector<double> results(input.size());
for (size_t i = 0; i < input.size(); ++i) {
results[i] = distribution.getProbability(input[i]);
}
// ✅ Efficient: Single batch call
distribution.getProbability(std::span<const double>(input),
std::span<double>(results));// To understand which strategy was selected, use explicit strategy methods for comparison
std::vector<double> input(5000);
std::vector<double> auto_results(5000);
std::vector<double> explicit_results(5000);
// Auto-dispatch
auto auto_start = std::chrono::high_resolution_clock::now();
distribution.getProbability(std::span<const double>(input), std::span<double>(auto_results));
auto auto_end = std::chrono::high_resolution_clock::now();
auto auto_time = std::chrono::duration_cast<std::chrono::microseconds>(auto_end - auto_start);
// Try different explicit strategies to see which matches auto-dispatch performance
auto test_strategy = [&](stats::detail::Strategy strategy, const std::string& name) {
auto start = std::chrono::high_resolution_clock::now();
distribution.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(explicit_results), strategy);
auto end = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::cout << name << ": " << time.count() << " μs";
if (std::abs(time.count() - auto_time.count()) < auto_time.count() * 0.1) {
std::cout << " ← Likely auto-selected strategy";
}
std::cout << std::endl;
};
std::cout << "Auto-dispatch: " << auto_time.count() << " μs" << std::endl;
test_strategy(stats::detail::Strategy::SCALAR, "SCALAR");
test_strategy(stats::detail::Strategy::VECTORIZED, "VECTORIZED");
test_strategy(stats::detail::Strategy::PARALLEL, "PARALLEL");// Verify that all strategies produce identical results
bool verify_strategies(const std::vector<double>& input) {
std::vector<double> scalar_results(input.size());
std::vector<double> vectorized_results(input.size());
std::vector<double> parallel_results(input.size());
distribution.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(scalar_results),
stats::detail::Strategy::SCALAR);
distribution.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(vectorized_results),
stats::detail::Strategy::VECTORIZED);
distribution.getProbabilityWithStrategy(std::span<const double>(input),
std::span<double>(parallel_results),
stats::detail::Strategy::PARALLEL);
// Check for numerical differences
for (size_t i = 0; i < input.size(); ++i) {
if (std::abs(scalar_results[i] - vectorized_results[i]) > 1e-12 ||
std::abs(scalar_results[i] - parallel_results[i]) > 1e-12) {
std::cout << "Mismatch at index " << i << ": scalar=" << scalar_results[i]
<< ", vectorized=" << vectorized_results[i] << ", parallel=" << parallel_results[i] << std::endl;
return false;
}
}
return true;
}The libstats parallel and batch processing system provides:
- Simple Interface: Auto-dispatch handles optimization automatically for most users
- Power User Control: Explicit strategy selection for performance-critical applications
- Excellent Performance: Significant speedups through SIMD and parallel processing
- Thread Safety: Full concurrent access support
- Cross-Platform: Works on Windows, macOS, and Linux with automatic capability detection
For most applications, the auto-dispatch methods provide optimal performance with minimal complexity. Power users can leverage explicit strategy control for fine-tuned performance optimization in specialized scenarios.
The system is designed to scale from small embedded applications to high-performance computing workloads, automatically adapting to available system resources while maintaining mathematical accuracy and thread safety.
Document Version: 1.2
Last Updated: 2026-06-07 (v1.2.0: Phase labels removed; parallelBatchFit section added)
Covers: Complete parallel/batch processing API and usage patterns
Replaces: batch-processing-refactoring.md, deprecated_batch_cleanup_process.md