200 changes: 200 additions & 0 deletions docs/qdp/python-api.md
@@ -0,0 +1,200 @@
# QDP Python API (qumat_qdp)

Public Python API for QDP: GPU-accelerated encoding, benchmark helpers, and a batch iterator for training or evaluation loops.

## Overview

The **qumat_qdp** package wraps the native extension `_qdp` and adds:

- **Encoding:** `QdpEngine` and `QuantumTensor` for encoding classical data into quantum states, with zero-copy DLPack integration.
- **Benchmark:** `QdpBenchmark` for throughput/latency runs (full pipeline in Rust, GIL released).
- **Data loader:** `QuantumDataLoader` for iterating encoded batches one at a time (`for qt in loader:`).

Import from the package:

```python
from qumat_qdp import (
    QdpEngine,
    QuantumTensor,
    QdpBenchmark,
    ThroughputResult,
    LatencyResult,
    QuantumDataLoader,
    run_throughput_pipeline_py,
)
```

**Requirements:** Linux with NVIDIA GPU (CUDA). Loader and pipeline helpers are stubs on other platforms and raise `RuntimeError`.

---

## Encoding API

### QdpEngine

GPU encoder. Constructor and main methods:

**`QdpEngine(device_id=0, precision="float32")`**

- `device_id` (int): CUDA device ID.
- `precision` (str): `"float32"` or `"float64"`.
- Raises `RuntimeError` on init failure or unsupported precision.

**`encode(data, num_qubits, encoding_method="amplitude") -> QuantumTensor`**

- `data`: list of floats, 1D/2D NumPy array (float64, C-contiguous), PyTorch tensor (CPU/CUDA), or file path (`.parquet`, `.arrow`, `.feather`, `.npy`, `.pt`, `.pth`, `.pb`).
- `num_qubits` (int): Number of qubits.
- `encoding_method` (str): `"amplitude"` | `"angle"` | `"basis"` | `"iqp"` | `"iqp-z"`.
- Returns a DLPack-compatible `QuantumTensor` of shape `[batch_size, 2^num_qubits]`; consume it with `torch.from_dlpack(qtensor)` (see the sketch below).
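
A minimal end-to-end sketch of `encode` (assuming a CUDA-capable GPU and PyTorch; the array shape and qubit count are illustrative):

```python
import numpy as np
import torch
from qumat_qdp import QdpEngine

engine = QdpEngine(device_id=0, precision="float32")

# Batch of 4 samples with 2^3 = 8 features each (float64, C-contiguous).
data = np.random.rand(4, 8)

qt = engine.encode(data, num_qubits=3, encoding_method="amplitude")
batch = torch.from_dlpack(qt)  # CUDA tensor of shape [4, 8]
```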

**`create_synthetic_loader(total_batches, batch_size=64, num_qubits=16, encoding_method="amplitude", seed=None)`**

- Returns an iterator that yields one `QuantumTensor` per batch. GIL is released during each encode. Linux/CUDA only.
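
A loader sketch (assuming a Linux/CUDA environment; `seed` is optional and shown only for reproducibility):

```python
import torch
from qumat_qdp import QdpEngine

engine = QdpEngine(device_id=0)
loader = engine.create_synthetic_loader(total_batches=10, batch_size=32, num_qubits=8, seed=42)
for qt in loader:
    batch = torch.from_dlpack(qt)  # one [32, 2^8] batch per iteration
```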

### QuantumTensor

DLPack wrapper for a GPU quantum state.

- **`__dlpack__(stream=None)`:** Returns a DLPack PyCapsule (single use).
- **`__dlpack_device__()`:** Returns `(device_type, device_id)`; CUDA is `(2, gpu_id)`.

If not consumed, memory is freed when the object is dropped; if consumed (e.g. by PyTorch), ownership transfers to the consumer.
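
A minimal consumption sketch (assumes `qt` is a `QuantumTensor` returned by `encode` on device 0):

```python
import torch

assert qt.__dlpack_device__() == (2, 0)  # (device_type, device_id): CUDA, GPU 0
tensor = torch.from_dlpack(qt)           # consumes the capsule; PyTorch now owns the memory
# Do not consume qt again: the DLPack capsule is single use.
```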

---

## Benchmark API

Runs the full encode pipeline in Rust (warmup plus timed loop) with the GIL released; there is no Python-side loop.

### QdpBenchmark

Builder: chain configuration methods, then call `run_throughput()` or `run_latency()`.

**Constructor:** `QdpBenchmark(device_id=0)`

**Chainable methods:**

| Method | Description |
|--------|-------------|
| `qubits(n)` | Number of qubits. |
| `encoding(method)` | `"amplitude"` \| `"angle"` \| `"basis"`. |
| `batches(total, size=64)` | Total batches and batch size. |
| `prefetch(n)` | No-op (API compatibility). |
| `warmup(n)` | Warmup batch count. |

**`run_throughput() -> ThroughputResult`**

- Requires `qubits` and `batches` to be set.
- Returns `ThroughputResult` with `duration_sec`, `vectors_per_sec`.
- Raises `ValueError` if required configuration is missing; `RuntimeError` if the pipeline is unavailable.

**`run_latency() -> LatencyResult`**

- Same pipeline; returns `LatencyResult` with `duration_sec`, `latency_ms_per_vector`.

### Result types

| Type | Fields |
|------|--------|
| `ThroughputResult` | `duration_sec`, `vectors_per_sec` |
| `LatencyResult` | `duration_sec`, `latency_ms_per_vector` |

### Example

```python
from qumat_qdp import QdpBenchmark, ThroughputResult, LatencyResult

result = (
    QdpBenchmark(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .warmup(2)
    .run_throughput()
)
print(result.vectors_per_sec)

lat = (
    QdpBenchmark(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .run_latency()
)
print(lat.latency_ms_per_vector)
```

---

## Data Loader API

Iterate over encoded batches one at a time. Each batch is a `QuantumTensor`; encoding runs in Rust with the GIL released for each batch.

### QuantumDataLoader

Builder for a synthetic-data loader. Calling `iter(loader)` (or `for qt in loader`) creates the Rust-backed iterator.

**Constructor:**
`QuantumDataLoader(device_id=0, num_qubits=16, batch_size=64, total_batches=100, encoding_method="amplitude", seed=None)`

**Chainable methods:**

| Method | Description |
|--------|-------------|
| `qubits(n)` | Number of qubits. |
| `encoding(method)` | `"amplitude"` \| `"angle"` \| `"basis"`. |
| `batches(total, size=64)` | Total batches and batch size. |
| `source_synthetic(total_batches=None)` | Synthetic data (default); optional override for total batches. |
| `seed(s)` | RNG seed for reproducibility. |

**Iteration:** `for qt in loader:` yields a `QuantumTensor` of shape `[batch_size, 2^num_qubits]`. Consume each tensor at most once, e.g. `torch.from_dlpack(qt)`.

### Example

```python
from qumat_qdp import QuantumDataLoader
import torch

loader = (
    QuantumDataLoader(device_id=0)
    .qubits(16)
    .encoding("amplitude")
    .batches(100, size=64)
    .source_synthetic()
)

for qt in loader:
    batch = torch.from_dlpack(qt)  # [batch_size, 2^16]
    # use batch ...
```

---

## Low-level: run_throughput_pipeline_py

Runs the full pipeline in Rust with the GIL released. Used by `QdpBenchmark`; it can also be called directly.

**Signature:**
`run_throughput_pipeline_py(device_id=0, num_qubits=16, batch_size=64, total_batches=100, encoding_method="amplitude", warmup_batches=0, seed=None) -> tuple[float, float, float]`

**Returns:** `(duration_sec, vectors_per_sec, latency_ms_per_vector)`.

**Raises:** `RuntimeError` on failure or when not available (e.g. non-Linux).
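
A direct-call sketch (Linux/CUDA only; the arguments mirror the defaults above except `warmup_batches`):

```python
from qumat_qdp import run_throughput_pipeline_py

duration_sec, vectors_per_sec, latency_ms_per_vector = run_throughput_pipeline_py(
    device_id=0,
    num_qubits=16,
    batch_size=64,
    total_batches=100,
    encoding_method="amplitude",
    warmup_batches=2,
)
print(f"{vectors_per_sec:.0f} vectors/s, {latency_ms_per_vector:.4f} ms/vector")
```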

---

## Backward compatibility

`benchmark/api.py` and `benchmark/loader.py` re-export from `qumat_qdp`. Prefer:

- `from qumat_qdp import QdpBenchmark, ThroughputResult, LatencyResult`
- `from qumat_qdp import QuantumDataLoader`

Benchmark scripts add the project root to `sys.path`, so from the `qdp-python` directory you can run:

```bash
uv run python benchmark/run_pipeline_baseline.py
uv run python benchmark/benchmark_loader_throughput.py
```

without setting `PYTHONPATH`.
25 changes: 24 additions & 1 deletion qdp/qdp-core/src/gpu/encodings/amplitude.rs
@@ -316,7 +316,7 @@ impl AmplitudeEncoder {
    /// streaming mechanics, while this method focuses on the amplitude
    /// encoding kernel logic.
    #[cfg(target_os = "linux")]
    fn encode_async_pipeline(
    pub(crate) fn encode_async_pipeline(
        device: &Arc<CudaDevice>,
        host_data: &[f64],
        _num_qubits: usize,
@@ -548,4 +548,27 @@ impl AmplitudeEncoder {

        Ok(inv_norm)
    }

    /// Run dual-stream pipeline for amplitude encoding (exposed for Python / benchmark).
    #[cfg(target_os = "linux")]
    pub(crate) fn run_amplitude_dual_stream_pipeline(
        device: &Arc<CudaDevice>,
        host_data: &[f64],
        num_qubits: usize,
    ) -> Result<()> {
        Preprocessor::validate_input(host_data, num_qubits)?;
        let state_len = 1 << num_qubits;
        let state_vector = GpuStateVector::new(device, num_qubits, Precision::Float64)?;
        let norm = Preprocessor::calculate_l2_norm(host_data)?;
        let inv_norm = 1.0 / norm;
        Self::encode_async_pipeline(
            device,
            host_data,
            num_qubits,
            state_len,
            inv_norm,
            &state_vector,
        )?;
        Ok(())
    }
}
51 changes: 51 additions & 0 deletions qdp/qdp-core/src/lib.rs
@@ -35,6 +35,16 @@ mod profiling;
pub use error::{MahoutError, Result, cuda_error_to_string};
pub use gpu::memory::Precision;

// Throughput/latency pipeline runner: single path using QdpEngine and encode_batch in Rust.
#[cfg(target_os = "linux")]
mod pipeline_runner;

#[cfg(target_os = "linux")]
pub use pipeline_runner::{
    DataSource, PipelineConfig, PipelineIterator, PipelineRunResult, run_latency_pipeline,
    run_throughput_pipeline,
};

use std::sync::Arc;

use crate::dlpack::DLManagedTensor;
@@ -45,6 +55,7 @@ use cudarc::driver::CudaDevice;
///
/// Manages GPU context and dispatches encoding tasks.
/// Provides unified interface for device management, memory allocation, and DLPack.
#[derive(Clone)]
pub struct QdpEngine {
    device: Arc<CudaDevice>,
    precision: Precision,
Expand Down Expand Up @@ -110,6 +121,15 @@ impl QdpEngine {
        &self.device
    }

    /// Block until all GPU work on the default stream has completed.
    /// Used by the generic pipeline and other callers that need to sync before timing.
    #[cfg(target_os = "linux")]
    pub fn synchronize(&self) -> Result<()> {
        self.device
            .synchronize()
            .map_err(|e| MahoutError::Cuda(format!("CUDA device synchronize failed: {:?}", e)))
    }

    /// Encode multiple samples in a single fused kernel (most efficient)
    ///
    /// Allocates one large GPU buffer and launches a single batch kernel.
@@ -148,6 +168,37 @@ impl QdpEngine {
        Ok(dlpack_ptr)
    }

    /// Run dual-stream pipeline for encoding (H2D + kernel overlap). Exposes gpu::pipeline::run_dual_stream_pipeline.
    /// Currently supports amplitude encoding (1D host_data). Does not return a tensor;
    /// use for throughput measurement or when the encoded state is not needed.
    ///
    /// # Arguments
    /// * `host_data` - 1D input data (e.g. single sample for amplitude)
    /// * `num_qubits` - Number of qubits
    /// * `encoding_method` - Strategy (currently only "amplitude" supported for this path)
    #[cfg(target_os = "linux")]
    pub fn run_dual_stream_encode(
        &self,
        host_data: &[f64],
        num_qubits: usize,
        encoding_method: &str,
    ) -> Result<()> {
        crate::profile_scope!("Mahout::RunDualStreamEncode");
        match encoding_method.to_lowercase().as_str() {
            "amplitude" => {
                gpu::encodings::amplitude::AmplitudeEncoder::run_amplitude_dual_stream_pipeline(
                    &self.device,
                    host_data,
                    num_qubits,
                )
            }
            _ => Err(MahoutError::InvalidInput(format!(
                "run_dual_stream_encode supports only 'amplitude' for now, got '{}'",
                encoding_method
            ))),
        }
    }

    /// Streaming Parquet encoder with multi-threaded IO
    ///
    /// Uses Producer-Consumer pattern: IO thread reads Parquet while GPU processes data.