GPU implementations of signal detection algorithms using CUDA (via CuPy). Typical speedups over CPU are 10-50x for large-scale time-series analysis; the benchmarks below measured 13-18x.
- CUSUM Detection: cumulative sum for change-point detection
- Threshold Detection: adaptive baseline with parallel computation
- Streaming Processing: memory-efficient chunked processing
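As a reference point for the CUSUM feature above, here is a minimal CPU sketch of a two-sided CUSUM detector in plain NumPy. It is not the code from `cuda_signal_processor`; the function name `cusum_detect` and the parameters `target`, `drift`, and `threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cusum_detect(signal, target, drift, threshold):
    """Two-sided CUSUM sketch (hypothetical helper, not the library's API).

    Accumulates deviations from `target` minus a `drift` allowance and
    raises an alarm whenever either cumulative sum exceeds `threshold`.
    Both sums are reset to zero after each alarm.
    """
    x = np.asarray(signal, dtype=np.float64)
    pos = 0.0  # accumulates evidence for an upward shift
    neg = 0.0  # accumulates evidence for a downward shift
    alarms = []
    for i, xi in enumerate(x):
        pos = max(0.0, pos + (xi - target) - drift)
        neg = max(0.0, neg - (xi - target) - drift)
        if pos > threshold or neg > threshold:
            alarms.append(i)
            pos = 0.0
            neg = 0.0
    return np.array(alarms, dtype=np.int64)


# Usage: a mean shift from 0 to 2 at sample 100.
step = np.concatenate([np.zeros(100), np.full(100, 2.0)])
alarms = cusum_detect(step, target=0.0, drift=0.5, threshold=5.0)
print(alarms[:3])
```

The recursion is sequential in its simplest form, which is why CUSUM tends to gain less from a GPU than an elementwise threshold test.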
| Operation | CPU | GPU | Speedup |
|---|---|---|---|
| Threshold detection (10M samples) | 450ms | 25ms | 18x |
| CUSUM (10M samples) | 800ms | 60ms | 13x |
| Baseline estimation | 120ms | 8ms | 15x |
Throughput: 400M samples/sec on RTX 3090
Memory: O(n) in signal length, with memory pooling to reuse device allocations
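To illustrate how chunked processing keeps working memory bounded, the sketch below runs a threshold test chunk by chunk in NumPy. It is an assumption about the general approach, not the project's implementation: the function name `threshold_detect_chunked`, the `chunk_size` parameter, and the median/MAD baseline are all hypothetical choices. Only `threshold_sigma` comes from the configuration shown in this README.

```python
import numpy as np


def threshold_detect_chunked(signal, threshold_sigma=3.0, chunk_size=1_000_000):
    """Flag samples far from a per-chunk baseline (hypothetical sketch).

    Each chunk gets its own baseline (median) and noise estimate
    (1.4826 * MAD, which approximates sigma for Gaussian noise), so
    temporaries are bounded by `chunk_size`, not the full signal.
    """
    events = []
    for start in range(0, len(signal), chunk_size):
        chunk = np.asarray(signal[start:start + chunk_size], dtype=np.float64)
        baseline = np.median(chunk)
        sigma = 1.4826 * np.median(np.abs(chunk - baseline))
        if sigma == 0.0:
            continue  # flat chunk: no noise scale to compare against
        hits = np.flatnonzero(np.abs(chunk - baseline) > threshold_sigma * sigma)
        events.append(hits + start)  # convert to global sample indices
    if not events:
        return np.empty(0, dtype=np.int64)
    return np.concatenate(events)


# Usage: two injected spikes in Gaussian noise.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(10_000)
noisy[1234] = 50.0
noisy[7777] = -50.0
found = threshold_detect_chunked(noisy, threshold_sigma=8.0, chunk_size=1000)
print(found)
```

The output list still grows with the number of detected events, so only the temporaries are bounded by the chunk size.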
pip install cupy-cuda11x numpy
# Verify CUDA
python -c "import cupy; print(cupy.cuda.runtime.getDeviceCount())"

import numpy as np
from cuda_signal_processor import CUDASignalProcessor, DetectionConfig
config = DetectionConfig(threshold_sigma=3.0)
processor = CUDASignalProcessor(config)
# Benchmark
results = processor.benchmark(signal_length=10_000_000)
print(f"Speedup: {results['speedup']:.1f}x")
# Process signal
signal = np.random.randn(10_000_000)
events, elapsed = processor.threshold_detect_gpu(signal)

    CPU                         GPU
┌──────────┐                ┌──────────┐
│  Signal  │──transfer─────►│  Device  │
│ (NumPy)  │                │  Memory  │
└──────────┘                └──────────┘
                                  │
                            ┌─────▼──────┐
                            │  Parallel  │
                            │  Kernels   │
                            └─────┬──────┘
                                  │
┌──────────┐                ┌─────▼──────┐
│ Results  │◄───transfer────│  Results   │
│ (NumPy)  │                │   (GPU)    │
└──────────┘                └────────────┘
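The transfer, compute, transfer flow in the diagram can be sketched with CuPy as follows. This is an illustration under assumptions, not the internals of `CUDASignalProcessor`: the function name `detect_on_device` is hypothetical, and a global mean and standard deviation stand in for the adaptive baseline. The sketch falls back to NumPy when CuPy or a CUDA device is unavailable, so it also runs on a CPU-only machine.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp  # array module used for device-side work
except Exception:
    cp = None
    xp = np  # CPU fallback so the sketch still runs


def detect_on_device(signal, threshold_sigma=3.0):
    """Hypothetical sketch of the host -> device -> host round trip."""
    d_signal = xp.asarray(signal)      # host -> device transfer
    baseline = d_signal.mean()         # reductions run as parallel kernels
    sigma = d_signal.std()
    mask = xp.abs(d_signal - baseline) > threshold_sigma * sigma
    d_events = xp.flatnonzero(mask)    # event indices, still on the device
    if cp is not None:
        return cp.asnumpy(d_events)    # device -> host transfer
    return d_events


# Usage: one large spike in an otherwise flat signal.
flat = np.zeros(1000)
flat[10] = 100.0
print(detect_on_device(flat))
```

Only the event indices are copied back to the host, which keeps the return transfer small compared with the input signal.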
- Page, E.S. (1954). "Continuous Inspection Schemes"
- Basseville & Nikiforov (1993). "Detection of Abrupt Changes"
python cuda_signal_processor.py

Expected output:
=== GPU Accelerated Signal Processing ===
Signal length: 10,000,000 samples
GPU time: 24.56 ms
CPU time: 445.32 ms
Speedup: 18.1x
Throughput: 407.1 M samples/sec
GPU memory: 152.3 MB
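The speedup and throughput lines follow directly from the two timings. A small sketch of that arithmetic is below; the helper name `summarize` is hypothetical. Because the timings shown above are rounded, recomputing from them can differ from the printed values in the last digit.

```python
def summarize(n_samples, gpu_ms, cpu_ms):
    """Derive speedup and throughput (million samples/sec) from timings."""
    speedup = cpu_ms / gpu_ms
    throughput_msps = n_samples / (gpu_ms / 1000.0) / 1e6
    return speedup, throughput_msps


speedup, throughput = summarize(10_000_000, gpu_ms=24.56, cpu_ms=445.32)
print(f"Speedup: {speedup:.1f}x")
print(f"Throughput: {throughput:.1f} M samples/sec")
```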
- Multi-GPU support for distributed processing
- Custom CUDA kernels (CuPy → PyCUDA)
- INT8 quantization for 4x memory reduction
- Real-time stream processing from hardware
License: MIT | Python: 3.8+ | CUDA: 11.0+