High-performance CUDA kernel generation and benchmarking framework
Repository: jasonlarkin/cuda-stencil-benchmark
CPU Baseline Summary
Scope
Benchmark and verify the reference implementation, ref/code.cpp, on the CPU prior to the CUDA work.

Artifacts (paths)
- cpu_bench/sweep_small.csv, cpu_bench/sweep_moderate.csv, cpu_bench/sweep_T.csv
- cpu_bench/sweep_small.png, cpu_bench/sweep_small.multi.png, cpu_bench/sweep_T.png
- roofline/stream_triad (measured BW ≈ 18 GB/s on an i5-1235U under WSL)
- cpu_bench/sweep_small.roofline.png (defaults: 33 FLOPs/pt, 64 B/pt → I ≈ 0.52 flop/B; see the sketch below)
- analysis/ → stencil_axes.png, time_ring.png, memory_layout.png, sources_trilinear.png, stencil_neighbors_3d.png
- cpu_bench/stencil_convergence_cpp.png plus the sweep cpu_bench/stencil_convergence_sweep.png
- cpu_bench/stencil_convergence_cpp_fp64.png and variants (e.g., *_ax02_y0_z0.png)
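The roofline numbers above follow directly from the quoted per-point counts and machine peaks. The sketch below just redoes that arithmetic; 33 FLOPs/pt and 64 B/pt are the plot defaults, and 18 GB/s / 120 GFLOP/s are the peaks passed to plot_roofline.py in the reproduction steps further down.

```cpp
// roofline_bound.cpp — back-of-envelope check of the numbers quoted above.
#include <algorithm>
#include <cstdio>

int main() {
    const double flops_per_pt = 33.0, bytes_per_pt = 64.0;   // plot defaults
    const double peak_bw_gbs  = 18.0, peak_gflops  = 120.0;  // machine peaks used below
    const double I = flops_per_pt / bytes_per_pt;            // arithmetic intensity ≈ 0.52 flop/B
    const double bound = std::min(peak_gflops, peak_bw_gbs * I);  // roofline: min(compute, BW*I)
    std::printf("I = %.2f flop/B, attainable ~ %.1f GFLOP/s (%s-bound)\n",
                I, bound, peak_bw_gbs * I < peak_gflops ? "memory" : "compute");
    return 0;
}
```

With these inputs the attainable rate is roughly 9 GFLOP/s against a 120 GFLOP/s compute peak, so the CPU baseline is expected to be bandwidth-bound on this machine.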
Key results and interpretation
- Order verification: we instrument ref/code.cpp and extract the discrete Laplacian via the update identity with the time terms canceled: setting u[t1] = 2·u[t0] leaves L_h(u) = u[t2]/dt^2 after a single step (see the sketch after this list).
- Measured convergence slopes are nearly flat over the tested h range. This is expected: the discrete operator scales like 1/h^2, and the single-step extraction plus finite precision yields an error floor ∝ ε/h^2 that overwhelms truncation error over these meshes/frequencies.
- Increasing h pushes toward the truncation-error regime; slopes move more negative. In FP64 and at sufficiently low frequencies, the scheme approaches the theoretical 4th-order behavior.
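A minimal, self-contained illustration of the extraction idea (a hypothetical 1D analogue, not the project's verify_order harness; standard 4th-order central-difference coefficients are assumed, and ref/code.cpp's actual stencil and update may differ):

```cpp
// order_check_sketch.cpp — 1D toy of the single-step Laplacian extraction.
#include <cmath>
#include <cstdio>
#include <vector>

// Standard 4th-order central second derivative (placeholder coefficients).
static double lap4(const std::vector<double>& u, int i, double h) {
    return (-u[i-2] + 16.0*u[i-1] - 30.0*u[i] + 16.0*u[i+1] - u[i+2]) / (12.0*h*h);
}

int main() {
    const double k  = 2.0 * std::acos(-1.0);  // one sine period on [0, 1]
    const double dt = 1e-3;                   // arbitrary; it cancels below
    const int ns[] = {64, 128, 256, 512};
    for (int n : ns) {
        const double h = 1.0 / n;
        std::vector<double> u0(n + 1), u1(n + 1), u2(n + 1);
        for (int i = 0; i <= n; ++i) u0[i] = std::sin(k * i * h);
        for (int i = 0; i <= n; ++i) u1[i] = 2.0 * u0[i];   // cancels the time terms
        double err = 0.0;
        for (int i = 2; i <= n - 2; ++i) {
            // One step of u[t2] = 2*u[t0] - u[t1] + dt^2 * L_h(u[t0]) ...
            u2[i] = 2.0 * u0[i] - u1[i] + dt * dt * lap4(u0, i, h);
            // ... leaves L_h(u) = u[t2] / dt^2, compared against the exact -k^2 sin(kx).
            const double Lh = u2[i] / (dt * dt);
            err = std::fmax(err, std::fabs(Lh + k * k * std::sin(k * i * h)));
        }
        std::printf("n=%4d  h=%.4e  max|L_h u + k^2 sin| = %.3e\n", n, h, err);
    }
    return 0;
}
```

In double precision this toy shows the expected 4th-order slope; rebuilding it in float should reproduce the ε/h² rounding floor described above, with slopes flattening as h shrinks.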
How to reproduce

1) Bench + plots
cd cpu_bench
bash run_sweep.sh → sweep_small.csv
python plot_sweep.py sweep_small.csv
bash run_t_sensitivity.sh 48 → python plot_T.py sweep_T.csv
2) Roofline
cd roofline && make && ./stream_triad 8000000 50
cd ../cpu_bench && python ../roofline/plot_roofline.py sweep_small.csv --peak-bw-gbs 18 --peak-gflops 120
3) Order (original C)
cd cpu_bench && make verify_order
python order_sweep.py --exe ./verify_order
make verify_order_fp64 && python order_sweep.py --exe ./verify_order_fp64

Implications for CUDA
Map threadIdx.x → z for coalesced global loads; tile (x,y) with a +2 halo; maintain a 5-plane sliding window in shared memory (one plausible kernel structure is sketched below).
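The sketch below is one plausible realization of that plan, not the project's kernel. Assumptions: z is the unit-stride axis (idx = (ix*ny + iy)*nz + iz), the stencil has radius 2 (hence the +2 halo and 5-plane window), and the coefficients are the standard 4th-order central-difference Laplacian placeholders; the shared tile here spans the two faster axes (y, z) and the window slides along x, which may differ from the exact axis naming above.

```cuda
// stencil3d_sliding_sketch.cu — illustrative sliding-window kernel (assumptions above).
#include <cuda_runtime.h>

#define R    2    // stencil radius → the "+2 halo"
#define BDZ 32    // threadIdx.x spans z (unit stride) → coalesced loads
#define BDY  8    // threadIdx.y spans y

__global__ void laplacian_sliding(const float* __restrict__ u,
                                  float* __restrict__ out,
                                  int nx, int ny, int nz, float inv_h2)
{
    // 5-plane window: x-planes ix-2 .. ix+2, each with a +2 halo in (y, z).
    __shared__ float s[2*R + 1][BDY + 2*R][BDZ + 2*R];

    const int iz = blockIdx.x * BDZ + threadIdx.x;
    const int iy = blockIdx.y * BDY + threadIdx.y;
    const int lz = threadIdx.x + R, ly = threadIdx.y + R;   // tile coords incl. halo

    const int tid  = threadIdx.y * BDZ + threadIdx.x;
    const int nthr = BDZ * BDY;
    const int tz_n = BDZ + 2*R, ty_n = BDY + 2*R;
    const int bz0  = blockIdx.x * BDZ - R;                   // tile origin incl. halo
    const int by0  = blockIdx.y * BDY - R;

    // Cooperatively load one x-plane (with its y/z halo) into window slot `slot`;
    // out-of-domain halo points are zero-padded to keep the sketch short.
    auto load_plane = [&](int ix, int slot) {
        for (int t = tid; t < ty_n * tz_n; t += nthr) {
            int ty = t / tz_n, tz = t % tz_n;
            int gy = by0 + ty, gz = bz0 + tz;
            bool ok = (gy >= 0 && gy < ny && gz >= 0 && gz < nz);
            s[slot][ty][tz] = ok ? u[((size_t)ix * ny + gy) * nz + gz] : 0.0f;
        }
    };

    // Placeholder 4th-order Laplacian coefficients (the benchmark's real ones live in ref/code.cpp).
    const float c0 = -7.5f, c1 = 4.0f / 3.0f, c2 = -1.0f / 12.0f;

    // Preload x-planes 0 .. 2R-1 (assumes nx >= 2R+1), then march along x,
    // reusing 4 of the 5 planes each step and fetching only the leading one.
    for (int p = 0; p < 2*R; ++p) load_plane(p, p);

    for (int ix = R; ix < nx - R; ++ix) {
        load_plane(ix + R, (ix + R) % (2*R + 1));            // fetch the leading plane
        __syncthreads();

        if (iy >= R && iy < ny - R && iz >= R && iz < nz - R) {
            int m2 = (ix-2) % 5, m1 = (ix-1) % 5, p0 = ix % 5,
                p1 = (ix+1) % 5, p2 = (ix+2) % 5;
            float v = c0 *  s[p0][ly][lz]
                    + c1 * (s[m1][ly][lz]   + s[p1][ly][lz]      // x±1
                          + s[p0][ly-1][lz] + s[p0][ly+1][lz]    // y±1
                          + s[p0][ly][lz-1] + s[p0][ly][lz+1])   // z±1
                    + c2 * (s[m2][ly][lz]   + s[p2][ly][lz]      // x±2
                          + s[p0][ly-2][lz] + s[p0][ly+2][lz]    // y±2
                          + s[p0][ly][lz-2] + s[p0][ly][lz+2]);  // z±2
            out[((size_t)ix * ny + iy) * nz + iz] = v * inv_h2;
        }
        __syncthreads();   // finish reading the oldest plane before it is overwritten
    }
}
```

A launch would use dim3 block(BDZ, BDY) and dim3 grid((nz+BDZ-1)/BDZ, (ny+BDY-1)/BDY); the zero-padded halo and interior-only writes are simplifications, and the full wave update (time terms, source injection) would be layered on top of this structure.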