200 changes: 200 additions & 0 deletions docs/qdp/python-api.md
@@ -0,0 +1,200 @@
# QDP Python API (qumat_qdp)

Public Python API for QDP: GPU-accelerated encoding, benchmark helpers, and a batch iterator for training or evaluation loops.

## Overview

The **qumat_qdp** package wraps the native extension `_qdp` and adds:

- **Encoding:** `QdpEngine` and `QuantumTensor` for encoding classical data into quantum states, with zero-copy DLPack integration.
- **Benchmark:** `QdpBenchmark` for throughput/latency runs (full pipeline in Rust, GIL released).
- **Data loader:** `QuantumDataLoader` for iterating encoded batches one at a time (`for qt in loader:`).

Import from the package:

```python
from qumat_qdp import (
    QdpEngine,
    QuantumTensor,
    QdpBenchmark,
    ThroughputResult,
    LatencyResult,
    QuantumDataLoader,
    run_throughput_pipeline_py,
)
```

**Requirements:** Linux with NVIDIA GPU (CUDA). Loader and pipeline helpers are stubs on other platforms and raise `RuntimeError`.

---

## Encoding API

### QdpEngine

GPU encoder. Constructor and main methods:

**`QdpEngine(device_id=0, precision="float32")`**

- `device_id` (int): CUDA device ID.
- `precision` (str): `"float32"` or `"float64"`.
- Raises `RuntimeError` on init failure or unsupported precision.

**`encode(data, num_qubits, encoding_method="amplitude") -> QuantumTensor`**

- `data`: list of floats, 1D/2D NumPy array (float64, C-contiguous), PyTorch tensor (CPU/CUDA), or file path (`.parquet`, `.arrow`, `.feather`, `.npy`, `.pt`, `.pth`, `.pb`).
- `num_qubits` (int): Number of qubits.
- `encoding_method` (str): `"amplitude"` | `"angle"` | `"basis"` | `"iqp"` | `"iqp-z"`.
- Returns a DLPack-compatible `QuantumTensor` of shape `[batch_size, 2^num_qubits]`; consume it with `torch.from_dlpack(qtensor)` (see the sketch below).
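
A minimal end-to-end sketch of `encode` (assuming a CUDA-capable GPU and PyTorch; the array shape and qubit count are illustrative):

```python
import numpy as np
import torch
from qumat_qdp import QdpEngine

engine = QdpEngine(device_id=0, precision="float32")

# Batch of 4 samples with 2^3 = 8 features each (float64, C-contiguous).
data = np.random.rand(4, 8)

qt = engine.encode(data, num_qubits=3, encoding_method="amplitude")
batch = torch.from_dlpack(qt)  # CUDA tensor of shape [4, 8]
```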

**`create_synthetic_loader(total_batches, batch_size=64, num_qubits=16, encoding_method="amplitude", seed=None)`**

- Returns an iterator that yields one `QuantumTensor` per batch. GIL is released during each encode. Linux/CUDA only.
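
A loader sketch (assuming a Linux/CUDA environment; `seed` is optional and shown only for reproducibility):

```python
import torch
from qumat_qdp import QdpEngine

engine = QdpEngine(device_id=0)
loader = engine.create_synthetic_loader(total_batches=10, batch_size=32, num_qubits=8, seed=42)
for qt in loader:
    batch = torch.from_dlpack(qt)  # one [32, 2^8] batch per iteration
```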

### QuantumTensor

DLPack wrapper for a GPU quantum state.

- **`__dlpack__(stream=None)`:** Returns a DLPack PyCapsule (single use).
- **`__dlpack_device__()`:** Returns `(device_type, device_id)`; CUDA is `(2, gpu_id)`.

If not consumed, memory is freed when the object is dropped; if consumed (e.g. by PyTorch), ownership transfers to the consumer.
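
A minimal consumption sketch (assumes `qt` is a `QuantumTensor` returned by `encode` on device 0):

```python
import torch

assert qt.__dlpack_device__() == (2, 0)  # (device_type, device_id): CUDA, GPU 0
tensor = torch.from_dlpack(qt)           # consumes the capsule; PyTorch now owns the memory
# Do not consume qt again: the DLPack capsule is single use.
```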

---

## Benchmark API

Runs the full encode pipeline in Rust (warmup plus timed loop) with the GIL released; there is no Python-side loop.

### QdpBenchmark

Builder: chain configuration methods, then call `run_throughput()` or `run_latency()`.

**Constructor:** `QdpBenchmark(device_id=0)`

**Chainable methods:**

| Method | Description |
|--------|-------------|
| `qubits(n)` | Number of qubits. |
| `encoding(method)` | `"amplitude"` \| `"angle"` \| `"basis"`. |
| `batches(total, size=64)` | Total batches and batch size. |
| `prefetch(n)` | No-op (API compatibility). |
| `warmup(n)` | Warmup batch count. |

**`run_throughput() -> ThroughputResult`**

- Requires `qubits` and `batches` to be set.
- Returns `ThroughputResult` with `duration_sec`, `vectors_per_sec`.
- Raises `ValueError` if required configuration is missing; `RuntimeError` if the pipeline is unavailable.

**`run_latency() -> LatencyResult`**

- Same pipeline; returns `LatencyResult` with `duration_sec`, `latency_ms_per_vector`.

### Result types

| Type | Fields |
|------|--------|
| `ThroughputResult` | `duration_sec`, `vectors_per_sec` |
| `LatencyResult` | `duration_sec`, `latency_ms_per_vector` |

### Example

```python
from qumat_qdp import QdpBenchmark, ThroughputResult, LatencyResult

result = (
    QdpBenchmark(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .warmup(2)
    .run_throughput()
)
print(result.vectors_per_sec)

lat = (
    QdpBenchmark(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .run_latency()
)
print(lat.latency_ms_per_vector)
```

---

## Data Loader API

Iterate over encoded batches one at a time. Each batch is a `QuantumTensor`; encoding runs in Rust with the GIL released for each batch.

### QuantumDataLoader

Builder for a synthetic-data loader. Calling `iter(loader)` (or `for qt in loader`) creates the Rust-backed iterator.

**Constructor:**
`QuantumDataLoader(device_id=0, num_qubits=16, batch_size=64, total_batches=100, encoding_method="amplitude", seed=None)`

**Chainable methods:**

| Method | Description |
|--------|-------------|
| `qubits(n)` | Number of qubits. |
| `encoding(method)` | `"amplitude"` \| `"angle"` \| `"basis"`. |
| `batches(total, size=64)` | Total batches and batch size. |
| `source_synthetic(total_batches=None)` | Synthetic data (default); optional override for total batches. |
| `seed(s)` | RNG seed for reproducibility. |

**Iteration:** `for qt in loader:` yields a `QuantumTensor` of shape `[batch_size, 2^num_qubits]`. Consume each tensor at most once, e.g. `torch.from_dlpack(qt)`.

### Example

```python
from qumat_qdp import QuantumDataLoader
import torch

loader = (
    QuantumDataLoader(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .source_synthetic()
)

for qt in loader:
    batch = torch.from_dlpack(qt)  # [batch_size, 2^16]
    # use batch ...
```

---

## Low-level: run_throughput_pipeline_py

Runs the full pipeline in Rust with the GIL released. Used by `QdpBenchmark`; it can also be called directly.

**Signature:**
`run_throughput_pipeline_py(device_id=0, num_qubits=16, batch_size=64, total_batches=100, encoding_method="amplitude", warmup_batches=0, seed=None) -> tuple[float, float, float]`

**Returns:** `(duration_sec, vectors_per_sec, latency_ms_per_vector)`.

**Raises:** `RuntimeError` on failure or when not available (e.g. non-Linux).
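
A direct-call sketch (Linux/CUDA only; the arguments mirror the defaults above except `warmup_batches`):

```python
from qumat_qdp import run_throughput_pipeline_py

duration_sec, vectors_per_sec, latency_ms_per_vector = run_throughput_pipeline_py(
    device_id=0,
    num_qubits=16,
    batch_size=64,
    total_batches=100,
    encoding_method="amplitude",
    warmup_batches=2,
)
print(f"{vectors_per_sec:.0f} vectors/s, {latency_ms_per_vector:.4f} ms/vector")
```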

---

## Backward compatibility

`benchmark/api.py` and `benchmark/loader.py` re-export from `qumat_qdp`. Prefer:

- `from qumat_qdp import QdpBenchmark, ThroughputResult, LatencyResult`
- `from qumat_qdp import QuantumDataLoader`

Benchmark scripts add the project root to `sys.path`, so from the `qdp-python` directory you can run:

```bash
uv run python benchmark/run_pipeline_baseline.py
uv run python benchmark/benchmark_loader_throughput.py
```

without setting `PYTHONPATH`.
25 changes: 24 additions & 1 deletion qdp/qdp-core/src/gpu/encodings/amplitude.rs
@@ -316,7 +316,7 @@ impl AmplitudeEncoder {
    /// streaming mechanics, while this method focuses on the amplitude
    /// encoding kernel logic.
    #[cfg(target_os = "linux")]
    fn encode_async_pipeline(
    pub(crate) fn encode_async_pipeline(
        device: &Arc<CudaDevice>,
        host_data: &[f64],
        _num_qubits: usize,
@@ -548,4 +548,27 @@ impl AmplitudeEncoder {

        Ok(inv_norm)
    }

    /// Run dual-stream pipeline for amplitude encoding (exposed for Python / benchmark).
    #[cfg(target_os = "linux")]
    pub(crate) fn run_amplitude_dual_stream_pipeline(
        device: &Arc<CudaDevice>,
        host_data: &[f64],
        num_qubits: usize,
    ) -> Result<()> {
        Preprocessor::validate_input(host_data, num_qubits)?;
        let state_len = 1 << num_qubits;
        let state_vector = GpuStateVector::new(device, num_qubits, Precision::Float64)?;
        let norm = Preprocessor::calculate_l2_norm(host_data)?;
        let inv_norm = 1.0 / norm;
        Self::encode_async_pipeline(
            device,
            host_data,
            num_qubits,
            state_len,
            inv_norm,
            &state_vector,
        )?;
        Ok(())
    }
}
51 changes: 51 additions & 0 deletions qdp/qdp-core/src/lib.rs
@@ -35,6 +35,16 @@ mod profiling;
pub use error::{MahoutError, Result, cuda_error_to_string};
pub use gpu::memory::Precision;

// Throughput/latency pipeline runner: single path using QdpEngine and encode_batch in Rust.
#[cfg(target_os = "linux")]
mod pipeline_runner;

#[cfg(target_os = "linux")]
pub use pipeline_runner::{
    DataSource, PipelineConfig, PipelineIterator, PipelineRunResult, run_latency_pipeline,
    run_throughput_pipeline,
};

use std::sync::Arc;

use crate::dlpack::DLManagedTensor;
@@ -45,6 +55,7 @@ use cudarc::driver::CudaDevice;
///
/// Manages GPU context and dispatches encoding tasks.
/// Provides unified interface for device management, memory allocation, and DLPack.
#[derive(Clone)]
pub struct QdpEngine {
    device: Arc<CudaDevice>,
    precision: Precision,
Expand Down Expand Up @@ -110,6 +121,15 @@ impl QdpEngine {
        &self.device
    }

    /// Block until all GPU work on the default stream has completed.
    /// Used by the generic pipeline and other callers that need to sync before timing.
    #[cfg(target_os = "linux")]
    pub fn synchronize(&self) -> Result<()> {
        self.device
            .synchronize()
            .map_err(|e| MahoutError::Cuda(format!("CUDA device synchronize failed: {:?}", e)))
    }

    /// Encode multiple samples in a single fused kernel (most efficient)
    ///
    /// Allocates one large GPU buffer and launches a single batch kernel.
@@ -148,6 +168,37 @@ impl QdpEngine {
        Ok(dlpack_ptr)
    }

    /// Run dual-stream pipeline for encoding (H2D + kernel overlap). Exposes gpu::pipeline::run_dual_stream_pipeline.
    /// Currently supports amplitude encoding (1D host_data). Does not return a tensor;
    /// use for throughput measurement or when the encoded state is not needed.
    ///
    /// # Arguments
    /// * `host_data` - 1D input data (e.g. single sample for amplitude)
    /// * `num_qubits` - Number of qubits
    /// * `encoding_method` - Strategy (currently only "amplitude" supported for this path)
    #[cfg(target_os = "linux")]
    pub fn run_dual_stream_encode(
        &self,
        host_data: &[f64],
        num_qubits: usize,
        encoding_method: &str,
    ) -> Result<()> {
        crate::profile_scope!("Mahout::RunDualStreamEncode");
        match encoding_method.to_lowercase().as_str() {
            "amplitude" => {
                gpu::encodings::amplitude::AmplitudeEncoder::run_amplitude_dual_stream_pipeline(
                    &self.device,
                    host_data,
                    num_qubits,
                )
            }
            _ => Err(MahoutError::InvalidInput(format!(
                "run_dual_stream_encode supports only 'amplitude' for now, got '{}'",
                encoding_method
            ))),
        }
    }

    /// Streaming Parquet encoder with multi-threaded IO
    ///
    /// Uses Producer-Consumer pattern: IO thread reads Parquet while GPU processes data.