NVTX (NVIDIA Tools Extension) provides performance markers visible in Nsight Systems. This project uses zero-cost macros that compile to no-ops when the observability feature is disabled.
Default builds exclude NVTX for zero overhead. Enable profiling with:
cd mahout/qdp
cargo build -p qdp-core --example nvtx_profile --features observability --release
./target/release/examples/nvtx_profile
Expected output:
=== NVTX Profiling Example ===
✓ Engine initialized
✓ Created test data: 1024 elements
Starting encoding (NVTX markers will appear in Nsight Systems)...
Expected NVTX markers:
- Mahout::Encode
- CPU::L2Norm
- GPU::Alloc
- GPU::H2DCopy
- GPU::KernelLaunch
- GPU::Synchronize
- DLPack::Wrap
✓ Encoding succeeded
✓ DLPack pointer: 0x558114be6250
✓ Memory freed
=== Test Complete ===
nsys profile --trace=cuda,nvtx -o report ./target/release/examples/nvtx_profile
This generates report.nsys-rep and report.sqlite.
Open the report in Nsight Systems GUI:
nsys-ui report.nsys-rep
In the GUI timeline view, you will see:
CPU::L2NormGPU::Alloc, GPU::H2DCopy, GPU::KernelMahout::EncodeView summary statistics:
nsys stats report.nsys-rep
Example NVTX Range Summary output:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- ------------ ------------ ---------- ---------- ----------- -------- --------------
50.0 11,207,505 1 11,207,505.0 11,207,505.0 11,207,505 11,207,505 0.0 StartEnd Mahout::Encode
48.0 10,759,758 1 10,759,758.0 10,759,758.0 10,759,758 10,759,758 0.0 StartEnd GPU::Alloc
1.8 413,753 1 413,753.0 413,753.0 413,753 413,753 0.0 StartEnd CPU::L2Norm
0.1 15,873 1 15,873.0 15,873.0 15,873 15,873 0.0 StartEnd GPU::H2DCopy
0.0 317 1 317.0 317.0 317 317 0.0 StartEnd GPU::KernelLaunch
The output shows:
CUDA API Summary shows detailed CUDA call statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name ——– ————— ——— ———– ———– ——– ———- ———– ——————– 99.2 11,760,277 2 5,880,138.5 5,880,138.5 2,913 11,757,364 8,311,652.0 cuMemAllocAsync 0.4 45,979 2 22,989.5 22,989.5 7,938 38,041 21,286.0 cuMemcpyHtoDAsync_v2 0.1 14,722 1 14,722.0 14,722.0 14,722 14,722 0.0 cuEventCreate 0.1 13,100 3 4,366.7 3,512.0 861 8,727 4,002.0 cuStreamSynchronize 0.1 9,468 11 860.7 250.0 114 4,671 1,453.3 cuCtxSetCurrent 0.1 6,479 1 6,479.0 6,479.0 6,479 6,479 0.0 cuEventDestroy_v2 0.0 4,599 2 2,299.5 2,299.5 1,773 2,826 744.6 cuMemFreeAsync
cuMemAllocAsync)cuMemcpyHtoDAsync_v2)cuStreamSynchronize)The following markers are tracked:
Mahout::Encode - Complete encoding workflowCPU::L2Norm - L2 normalization on CPUGPU::Alloc - GPU memory allocationGPU::H2DCopy - Host-to-device memory copyGPU::KernelLaunch - CPU-side kernel launchGPU::Synchronize - CPU waiting for GPU completionDLPack::Wrap - Conversion to DLPack pointerThe project provides zero-cost macros in qdp-core/src/profiling.rs:
// Profile a scope (automatically pops on exit)
crate::profile_scope!("MyOperation");
// Mark a point in time
crate::profile_mark!("Checkpoint");
When observability feature is disabled, these macros compile to no-ops with zero runtime cost.
Source code: qdp-core/examples/nvtx_profile.rs
NVTX markers not appearing:
--features observability is used during buildnsys warnings: Warnings about CPU sampling are normal and can be ignored. They do not affect NVTX marker recording.