
NVTX Profiling Guide

Overview

NVTX (NVIDIA Tools Extension) provides performance markers visible in Nsight Systems. This project uses zero-cost macros that compile to no-ops when the observability feature is disabled.
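
One common way to get this behaviour in Rust is to define each macro twice behind cfg attributes, so that the disabled variant expands to nothing. The sketch below illustrates that general technique and is not the project's actual macro definitions: sketch_mark! and run_with_marks are hypothetical names, and a Vec stands in for the NVTX call.

```rust
// Hypothetical sketch, not the definitions in qdp-core/src/profiling.rs.

// Variant compiled when the feature is enabled: records the marker name.
#[cfg(feature = "observability")]
macro_rules! sketch_mark {
    ($log:expr, $name:expr) => {
        // A real build would emit an NVTX marker here instead.
        $log.push($name);
    };
}

// Variant compiled when the feature is disabled: expands to nothing,
// so the arguments are never evaluated and no code is generated.
#[cfg(not(feature = "observability"))]
macro_rules! sketch_mark {
    ($log:expr, $name:expr) => {};
}

/// Returns the marker names recorded during one call.
#[allow(unused_mut)]
pub fn run_with_marks() -> Vec<&'static str> {
    let mut log: Vec<&'static str> = Vec::new();
    sketch_mark!(log, "Checkpoint");
    log
}
```

Built without the feature, run_with_marks returns an empty vector because the macro call leaves no code behind; built with it, the vector holds the single name "Checkpoint".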

Build with NVTX

Default builds exclude NVTX for zero overhead. Enable profiling with:

cd mahout/qdp
cargo build -p qdp-core --example nvtx_profile --features observability --release

Run Example

./target/release/examples/nvtx_profile

Expected output:

=== NVTX Profiling Example ===

✓ Engine initialized
✓ Created test data: 1024 elements

Starting encoding (NVTX markers will appear in Nsight Systems)...
Expected NVTX markers:
  - Mahout::Encode
  - CPU::L2Norm
  - GPU::Alloc
  - GPU::H2DCopy
  - GPU::KernelLaunch
  - GPU::Synchronize
  - DLPack::Wrap

✓ Encoding succeeded
✓ DLPack pointer: 0x558114be6250
✓ Memory freed

=== Test Complete ===

Profile with Nsight Systems

nsys profile --trace=cuda,nvtx -o report ./target/release/examples/nvtx_profile

This generates report.nsys-rep. The report.sqlite export is created alongside it when nsys stats first processes the report.

Viewing Results

GUI View (Nsight Systems)

Open the report in the Nsight Systems GUI:

nsys-ui report.nsys-rep

In the GUI timeline view, you will see the NVTX ranges as labeled bars next to the traced CUDA API calls.

Command Line Statistics

View summary statistics:

nsys stats report.nsys-rep

Example NVTX Range Summary output:

Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)   Style        Range
--------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  --------  --------------
   50.0       11,207,505          1  11,207,505.0  11,207,505.0  11,207,505  11,207,505          0.0  StartEnd  Mahout::Encode
   48.0       10,759,758          1  10,759,758.0  10,759,758.0  10,759,758  10,759,758          0.0  StartEnd  GPU::Alloc
    1.8          413,753          1     413,753.0     413,753.0     413,753     413,753          0.0  StartEnd  CPU::L2Norm
    0.1           15,873          1      15,873.0      15,873.0      15,873      15,873          0.0  StartEnd  GPU::H2DCopy
    0.0              317          1         317.0         317.0         317         317          0.0  StartEnd  GPU::KernelLaunch

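The output shows that GPU::Alloc accounts for nearly all of the Mahout::Encode time (about 10.8 ms of 11.2 ms), while CPU::L2Norm, GPU::H2DCopy, and GPU::KernelLaunch together take under 0.5 ms.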
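
To put the table in proportion, a small helper can compute what share of one range another range occupies. This is a sketch with illustrative function names that are not part of qdp-core. It assumes the GPU::* and CPU::* ranges are nested inside Mahout::Encode, and that the Time (%) column is taken against the sum of all listed ranges, which is what the example numbers suggest.

```rust
/// Percentage of `parent_ns` spent in `child_ns` (both in nanoseconds).
/// Returns 0.0 when the parent duration is zero to avoid dividing by zero.
pub fn share_of_parent(child_ns: u64, parent_ns: u64) -> f64 {
    if parent_ns == 0 {
        return 0.0;
    }
    child_ns as f64 / parent_ns as f64 * 100.0
}

/// Percentage of the summed range times taken by one range.
/// Nested ranges are counted once per range, so an enclosing range
/// can show about 50% even though it spans the whole operation.
pub fn share_of_total(range_ns: u64, all_ranges_ns: &[u64]) -> f64 {
    let total: u64 = all_ranges_ns.iter().sum();
    share_of_parent(range_ns, total)
}
```

With the example numbers, share_of_parent(10_759_758, 11_207_505) comes to roughly 96, meaning allocation takes about 96% of the encode range, while share_of_total for Mahout::Encode over the five listed ranges comes to roughly 50, in line with the table.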

CUDA API Summary shows detailed CUDA call statistics:

Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)          Name
--------  ---------------  ---------  -----------  -----------  --------  ----------  -----------  --------------------
    99.2       11,760,277          2  5,880,138.5  5,880,138.5     2,913  11,757,364  8,311,652.0  cuMemAllocAsync
     0.4           45,979          2     22,989.5     22,989.5     7,938      38,041     21,286.0  cuMemcpyHtoDAsync_v2
     0.1           14,722          1     14,722.0     14,722.0    14,722      14,722          0.0  cuEventCreate
     0.1           13,100          3      4,366.7      3,512.0       861       8,727      4,002.0  cuStreamSynchronize
     0.1            9,468         11        860.7        250.0       114       4,671      1,453.3  cuCtxSetCurrent
     0.1            6,479          1      6,479.0      6,479.0     6,479       6,479          0.0  cuEventDestroy_v2
     0.0            4,599          2      2,299.5      2,299.5     1,773       2,826        744.6  cuMemFreeAsync

NVTX Markers

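The tracked markers are the ones listed in the example's expected output: Mahout::Encode, CPU::L2Norm, GPU::Alloc, GPU::H2DCopy, GPU::KernelLaunch, GPU::Synchronize, and DLPack::Wrap.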

Using Profiling Macros

The project provides zero-cost macros in qdp-core/src/profiling.rs:

// Profile a scope (automatically pops on exit)
crate::profile_scope!("MyOperation");

// Mark a point in time
crate::profile_mark!("Checkpoint");

When the observability feature is disabled, these macros compile to no-ops with zero runtime cost.
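
The "automatically pops on exit" behaviour of a scope macro is typically built on a guard value whose Drop implementation ends the range. The following is a minimal sketch of that pattern, not the code in qdp-core/src/profiling.rs: ScopeGuard, sketch_profile_scope!, and encode_sketch are hypothetical names, and a thread-local event list stands in for the NVTX push and pop calls.

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for the NVTX range stack; a real build would call NVTX instead.
    static EVENTS: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Hypothetical guard: records a push on creation and a pop when dropped.
pub struct ScopeGuard {
    name: &'static str,
}

impl ScopeGuard {
    pub fn new(name: &'static str) -> Self {
        EVENTS.with(|e| e.borrow_mut().push(format!("push {}", name)));
        ScopeGuard { name }
    }
}

impl Drop for ScopeGuard {
    fn drop(&mut self) {
        EVENTS.with(|e| e.borrow_mut().push(format!("pop {}", self.name)));
    }
}

/// Hypothetical macro with the same call shape as profile_scope!.
/// The guard is bound in the caller's block, so it lives until that block ends.
macro_rules! sketch_profile_scope {
    ($name:expr) => {
        let _guard = ScopeGuard::new($name);
    };
}

/// Runs two nested scopes and returns the recorded events in order.
pub fn encode_sketch() -> Vec<String> {
    EVENTS.with(|e| e.borrow_mut().clear());
    {
        sketch_profile_scope!("Mahout::Encode");
        {
            sketch_profile_scope!("CPU::L2Norm");
            // Work for the inner step would go here.
        } // inner guard dropped here
    } // outer guard dropped here
    EVENTS.with(|e| e.borrow().clone())
}
```

Because the guards are dropped in reverse order of creation, the inner range is closed before the outer one, which is what produces properly nested ranges on a timeline.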

Example Location

Source code: qdp-core/examples/nvtx_profile.rs

Troubleshooting

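NVTX markers not appearing: Confirm that the example was built with --features observability and that nsys was run with --trace=cuda,nvtx.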

nsys warnings: Warnings about CPU sampling are normal and can be ignored. They do not affect NVTX marker recording.