
Core Concepts

This page explains the core concepts behind QDP (Quantum Data Plane): the architecture, the supported encoding methods, GPU memory management, DLPack zero-copy integration, and key performance characteristics.


1. Architecture overview

At a high level, QDP is organized into layered components:

Data flow (conceptual):

Python (qumat.qdp)  →  _qdp (PyO3)  →  qdp-core (Rust)  →  qdp-kernels (CUDA)
        │                        │              │                 │
        └──── torch.from_dlpack(qtensor) ◄──────┴── DLPack DLManagedTensor (GPU ptr)
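
The four layers above can be mocked end to end in plain Python to make the call chain concrete. Every function below is a hypothetical stand-in, not QDP's real API; the actual layers live in Rust and CUDA, and the result would stay on the GPU.

```python
import numpy as np

def qdp_kernels_launch(x: np.ndarray) -> np.ndarray:
    """Stand-in for a qdp-kernels CUDA launch: normalize on 'device'."""
    return (x / np.linalg.norm(x)).astype(np.complex64)

def qdp_core_encode(x: np.ndarray) -> np.ndarray:
    """Stand-in for qdp-core: validate, then dispatch to the kernel layer."""
    if x.size == 0 or not np.any(x):
        raise ValueError("input must contain at least one nonzero value")
    return qdp_kernels_launch(x)

def qdp_pyo3_bridge(features) -> np.ndarray:
    """Stand-in for the _qdp PyO3 layer: convert Python data to arrays."""
    return qdp_core_encode(np.asarray(features, dtype=np.float64))

def encode(features):
    """Stand-in for the qumat.qdp user-facing entry point."""
    # In real QDP the result is GPU-resident and handed to PyTorch via
    # torch.from_dlpack(qtensor); here it is just a NumPy array.
    return qdp_pyo3_bridge(features)

state = encode([3.0, 4.0, 0.0, 0.0])  # 2 qubits -> 4 amplitudes
```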

2. What QDP produces: a GPU-resident state vector

QDP encodes classical data into a state vector $\vert\psi\rangle$ for $n$ qubits.

QDP supports two output precisions:
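
Concretely, each encoder emits a length-$2^n$ complex vector with unit L2 norm. A CPU sketch follows; the complex64/complex128 dtypes used here are an illustrative assumption, so check the API reference for the exact precisions QDP offers.

```python
import numpy as np

def zero_state(num_qubits: int, dtype=np.complex64) -> np.ndarray:
    """Allocate |0...0>: amplitude 1 at index 0, zeros elsewhere."""
    state = np.zeros(2 ** num_qubits, dtype=dtype)
    state[0] = 1.0
    return state

psi = zero_state(10)
# 2**10 amplitudes * 8 bytes (complex64) = 8 KiB; complex128 doubles it.
```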


3. Encoder selection and validation

QDP uses a strategy pattern for encoding methods:

All encoders perform input validation (at minimum):

Note: $n=30$ is already very large; just the output state for a single sample is on the order of $2^{30}$ complex numbers.
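
A sketch of what such validation can look like. The specific rules and limits below, including the MAX_QUBITS cap, are illustrative assumptions rather than QDP's actual checks.

```python
import numpy as np

# Hypothetical cap: 2**30 complex64 amplitudes is already ~8 GiB.
MAX_QUBITS = 30

def validate(features: np.ndarray, num_qubits: int) -> None:
    """Reject inputs an encoder could not sensibly handle."""
    if not (1 <= num_qubits <= MAX_QUBITS):
        raise ValueError(f"num_qubits must be in [1, {MAX_QUBITS}]")
    if features.ndim != 1:
        raise ValueError("expected a 1-D feature vector")
    if features.size > 2 ** num_qubits:
        raise ValueError("too many features for 2**num_qubits amplitudes")
    if not np.all(np.isfinite(features)):
        raise ValueError("features must be finite (no NaN/Inf)")
```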


4. Encoding methods (amplitude, basis, angle)

4.1 Amplitude encoding

Goal: represent a real-valued feature vector $x$ as quantum amplitudes:

\[\vert\psi\rangle = \sum_{i=0}^{2^{n}-1} \frac{x_i}{\|x\|_2}\,\vert i\rangle\]
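
A minimal CPU sketch of this formula, including zero-padding when the input has fewer than $2^n$ entries. It illustrates the math only; QDP performs the equivalent work in CUDA on the GPU.

```python
import numpy as np

def amplitude_encode(x, num_qubits: int) -> np.ndarray:
    """L2-normalize x and place it in the first x.size amplitude slots."""
    x = np.asarray(x, dtype=np.float64)
    dim = 2 ** num_qubits
    if x.size > dim:
        raise ValueError(f"need at most {dim} features for {num_qubits} qubits")
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    state = np.zeros(dim, dtype=np.complex128)
    state[: x.size] = x / norm
    return state

psi = amplitude_encode([3.0, 4.0], num_qubits=2)  # -> [0.6, 0.8, 0, 0]
```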

Key properties in QDP:

When to use it:

Trade-offs:

4.2 Basis encoding

Goal: map an integer index $i$ into a computational basis state $\vert i\rangle$.

For $n$ qubits with $0 \le i < 2^n$:
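
As a sketch, basis encoding is a one-hot state vector: amplitude 1 at index $i$, 0 everywhere else. This is a CPU illustration of the definition, not QDP's implementation.

```python
import numpy as np

def basis_encode(index: int, num_qubits: int) -> np.ndarray:
    """Build |index> as a one-hot vector of length 2**num_qubits."""
    dim = 2 ** num_qubits
    if not (0 <= index < dim):
        raise ValueError(f"index must satisfy 0 <= index < {dim}")
    state = np.zeros(dim, dtype=np.complex64)
    state[index] = 1.0
    return state

psi = basis_encode(5, num_qubits=3)  # |101>: amplitude 1 at index 5
```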

Key properties in QDP:

When to use it:

4.3 Angle encoding

Goal: map one real angle per qubit to a product state. For angles $x_0, \ldots, x_{n-1}$ (one per qubit):

\[\vert\psi(x)\rangle = \bigotimes_{k=0}^{n-1} \bigl( \cos(x_k)\,\vert 0\rangle + \sin(x_k)\,\vert 1\rangle \bigr)\]

So each basis state $\vert i\rangle$ has the real amplitude $\prod_{k=0}^{n-1} \cos(x_k)^{1-i_k}\,\sin(x_k)^{i_k}$, where $i_k$ denotes the $k$-th bit of $i$ (no phase; the state is real-valued in the computational basis).
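
A CPU sketch of this tensor product via np.kron. The qubit ordering below (qubit 0 as the most significant bit) is an assumption for illustration; conventions differ between libraries.

```python
import numpy as np

def angle_encode(angles) -> np.ndarray:
    """Tensor together cos(x_k)|0> + sin(x_k)|1> for each qubit k."""
    state = np.array([1.0], dtype=np.complex128)
    for a in angles:
        qubit = np.array([np.cos(a), np.sin(a)], dtype=np.complex128)
        state = np.kron(state, qubit)  # qubit 0 ends up most significant
    return state

psi = angle_encode([0.0, np.pi / 2])  # |0> tensor |1> = |01>
```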

Key properties in QDP:

When to use it:

Trade-offs:


5. GPU memory management

QDP is designed to keep encoded states resident on the GPU and to avoid unnecessary allocations and copies.

5.1 Output state vector allocation

For each encoded sample, QDP allocates a state vector of size $2^n$. Memory usage grows exponentially:

QDP performs pre-flight checks before large allocations to fail fast with an OOM-aware message (e.g., suggesting smaller num_qubits or batch size).
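
A sketch of such a pre-flight check. Here free_bytes is a plain argument; a real implementation would query the CUDA runtime (e.g. cudaMemGetInfo) for it, and the error wording is illustrative.

```python
import numpy as np

def preflight_check(num_qubits: int, batch_size: int, free_bytes: int,
                    dtype=np.complex64) -> None:
    """Fail fast if the batch of state vectors cannot fit in free memory."""
    needed = batch_size * (2 ** num_qubits) * np.dtype(dtype).itemsize
    if needed > free_bytes:
        raise MemoryError(
            f"need {needed} bytes for batch_size={batch_size}, "
            f"num_qubits={num_qubits}; try fewer qubits or a smaller batch"
        )
```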

5.2 Pinned host memory and streaming pipelines

For high-throughput IO → GPU workflows (especially streaming from Parquet), QDP uses:

Streaming Parquet encoding is implemented as a producer/consumer pipeline:

In the current implementation, streaming Parquet supports:
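
The producer/consumer shape can be sketched with a bounded queue. This is a toy CPU pipeline: in QDP, the producer stages Parquet batches into pinned host buffers and the consumer launches GPU encodes.

```python
import queue
import threading

def run_pipeline(chunks, encode, capacity=2):
    """Read chunks on a producer thread; encode them on the caller's thread."""
    q = queue.Queue(maxsize=capacity)  # bounded: backpressure on the reader
    results = []

    def producer():
        for chunk in chunks:
            q.put(chunk)
        q.put(None)  # sentinel: end of stream

    t = threading.Thread(target=producer)
    t.start()
    while (chunk := q.get()) is not None:
        results.append(encode(chunk))
    t.join()
    return results

out = run_pipeline([[1, 2], [3, 4]], encode=lambda c: [v * v for v in c])
```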

5.3 Asynchronous copy/compute overlap (dual streams)

For large workloads, QDP uses multiple CUDA streams and CUDA events to overlap:

This reduces time spent waiting on PCIe transfers and can improve throughput substantially for large batches.
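
The overlap pattern can be sketched as a double-buffered loop: while chunk $i$ is being computed, the copy of chunk $i{+}1$ is already in flight. This is a toy analogue; QDP uses CUDA streams and events for this, not Python futures.

```python
from concurrent.futures import ThreadPoolExecutor

def overlapped(chunks, copy, compute):
    """Compute chunk i while the 'copy' of chunk i+1 runs concurrently."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy, chunks[0])
        for nxt in chunks[1:]:
            staged = pending.result()           # wait: copy of chunk i done
            pending = copier.submit(copy, nxt)  # start copy of chunk i+1
            results.append(compute(staged))     # compute overlaps that copy
        results.append(compute(pending.result()))
    return results

out = overlapped([1, 2, 3], copy=lambda c: c, compute=lambda c: c * 10)
```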


6. DLPack zero-copy integration

QDP exposes results using the DLPack protocol, which allows frameworks like PyTorch to consume GPU memory without copying.

Conceptually:

  1. Rust allocates GPU memory for the state vector.
  2. Rust wraps it into a DLPack DLManagedTensor.
  3. Python returns an object that implements __dlpack__.
  4. PyTorch calls torch.from_dlpack(qtensor) and takes ownership via DLPack’s deleter.

Important details:


7. Performance characteristics and practical guidance

7.1 What makes QDP fast

7.2 Choosing parameters wisely
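
A back-of-envelope helper for matching num_qubits and batch size to a memory budget. The helper and its defaults are illustrative, not part of QDP's API; itemsize=8 assumes complex64 amplitudes (pass 16 for complex128).

```python
def max_qubits(free_bytes: int, batch_size: int = 1, itemsize: int = 8) -> int:
    """Largest n with batch_size * 2**n * itemsize <= free_bytes."""
    amplitudes = free_bytes // (batch_size * itemsize)
    if amplitudes < 2:
        raise ValueError("budget too small for even one qubit")
    return amplitudes.bit_length() - 1  # floor(log2(amplitudes))

# e.g. an 8 GiB budget with complex64 amplitudes allows n = 30 for a
# single sample, and n = 28 for a batch of 4.
```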

7.3 Profiling

If you need to understand where time is spent (copy vs compute), QDP supports NVTX-based profiling. See Observability.