High-performance CUDA kernel generation and benchmarking framework
Repository: jasonlarkin/cuda-stencil-benchmark
CPU Baseline Summary
Scope
Benchmark and verify the reference implementation, ref/code.cpp, on the CPU prior to the CUDA work.

Artifacts (paths)
- cpu_bench/sweep_small.csv, cpu_bench/sweep_moderate.csv, cpu_bench/sweep_T.csv
- cpu_bench/sweep_small.png, cpu_bench/sweep_small.multi.png, cpu_bench/sweep_T.png
- roofline/stream_triad (measured BW ≈ 18 GB/s on an i5-1235U under WSL)
- cpu_bench/sweep_small.roofline.png (defaults: 33 FLOPs/pt, 64 B/pt → I ≈ 0.52 flop/B; see the sketch below)
- analysis/ → stencil_axes.png, time_ring.png, memory_layout.png, sources_trilinear.png, stencil_neighbors_3d.png
- cpu_bench/stencil_convergence_cpp.png plus the sweep cpu_bench/stencil_convergence_sweep.png
- cpu_bench/stencil_convergence_cpp_fp64.png and variants (e.g., *_ax02_y0_z0.png)
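The roofline numbers above follow directly from the quoted per-point counts and machine peaks. The sketch below just redoes that arithmetic; 33 FLOPs/pt and 64 B/pt are the plot defaults, and 18 GB/s / 120 GFLOP/s are the peaks passed to plot_roofline.py in the reproduction steps further down.

```cpp
// roofline_bound.cpp — back-of-envelope check of the numbers quoted above.
#include <algorithm>
#include <cstdio>

int main() {
    const double flops_per_pt = 33.0, bytes_per_pt = 64.0;   // plot defaults
    const double peak_bw_gbs  = 18.0, peak_gflops  = 120.0;  // machine peaks used below
    const double I = flops_per_pt / bytes_per_pt;            // arithmetic intensity ≈ 0.52 flop/B
    const double bound = std::min(peak_gflops, peak_bw_gbs * I);  // roofline: min(compute, BW*I)
    std::printf("I = %.2f flop/B, attainable ~ %.1f GFLOP/s (%s-bound)\n",
                I, bound, peak_bw_gbs * I < peak_gflops ? "memory" : "compute");
    return 0;
}
```

With these inputs the attainable rate is roughly 9 GFLOP/s against a 120 GFLOP/s compute peak, so the CPU baseline is expected to be bandwidth-bound on this machine.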
Key results and interpretation
- Order verification: we instrument ref/code.cpp and extract the discrete Laplacian via the update identity with the time terms canceled: setting u[t1] = 2·u[t0] leaves L_h(u) = u[t2]/dt^2 after a single step (see the sketch after this list).
- Measured convergence slopes are nearly flat over the tested h range. This is expected: the discrete operator scales like 1/h^2, and the single-step extraction plus finite precision yields an error floor ∝ ε/h^2 that overwhelms truncation error over these meshes/frequencies.
- Increasing h pushes toward the truncation-error regime; slopes move more negative. In FP64 and at sufficiently low frequencies, the scheme approaches the theoretical 4th-order behavior.
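A minimal, self-contained illustration of the extraction idea (a hypothetical 1D analogue, not the project's verify_order harness; standard 4th-order central-difference coefficients are assumed, and ref/code.cpp's actual stencil and update may differ):

```cpp
// order_check_sketch.cpp — 1D toy of the single-step Laplacian extraction.
#include <cmath>
#include <cstdio>
#include <vector>

// Standard 4th-order central second derivative (placeholder coefficients).
static double lap4(const std::vector<double>& u, int i, double h) {
    return (-u[i-2] + 16.0*u[i-1] - 30.0*u[i] + 16.0*u[i+1] - u[i+2]) / (12.0*h*h);
}

int main() {
    const double k  = 2.0 * std::acos(-1.0);  // one sine period on [0, 1]
    const double dt = 1e-3;                   // arbitrary; it cancels below
    const int ns[] = {64, 128, 256, 512};
    for (int n : ns) {
        const double h = 1.0 / n;
        std::vector<double> u0(n + 1), u1(n + 1), u2(n + 1);
        for (int i = 0; i <= n; ++i) u0[i] = std::sin(k * i * h);
        for (int i = 0; i <= n; ++i) u1[i] = 2.0 * u0[i];   // cancels the time terms
        double err = 0.0;
        for (int i = 2; i <= n - 2; ++i) {
            // One step of u[t2] = 2*u[t0] - u[t1] + dt^2 * L_h(u[t0]) ...
            u2[i] = 2.0 * u0[i] - u1[i] + dt * dt * lap4(u0, i, h);
            // ... leaves L_h(u) = u[t2] / dt^2, compared against the exact -k^2 sin(kx).
            const double Lh = u2[i] / (dt * dt);
            err = std::fmax(err, std::fabs(Lh + k * k * std::sin(k * i * h)));
        }
        std::printf("n=%4d  h=%.4e  max|L_h u + k^2 sin| = %.3e\n", n, h, err);
    }
    return 0;
}
```

In double precision this toy shows the expected 4th-order slope; rebuilding it in float should reproduce the ε/h² rounding floor described above, with slopes flattening as h shrinks.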
How to reproduce

1) Bench + plots
cd cpu_bench
bash run_sweep.sh → sweep_small.csv
python plot_sweep.py sweep_small.csv
bash run_t_sensitivity.sh 48 → python plot_T.py sweep_T.csv
2) Roofline
cd roofline && make && ./stream_triad 8000000 50
cd ../cpu_bench && python ../roofline/plot_roofline.py sweep_small.csv --peak-bw-gbs 18 --peak-gflops 120
3) Order (original C)
cd cpu_bench && make verify_order
python order_sweep.py --exe ./verify_order
make verify_order_fp64 && python order_sweep.py --exe ./verify_order_fp64

Implications for CUDA
Map threadIdx.x → z for coalesced global loads; tile (x,y) with a +2 halo; maintain a 5-plane sliding window in shared memory (one plausible kernel structure is sketched below).
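The sketch below is one plausible realization of that plan, not the project's kernel. Assumptions: z is the unit-stride axis (idx = (ix*ny + iy)*nz + iz), the stencil has radius 2 (hence the +2 halo and 5-plane window), and the coefficients are the standard 4th-order central-difference Laplacian placeholders; the shared tile here spans the two faster axes (y, z) and the window slides along x, which may differ from the exact axis naming above.

```cuda
// stencil3d_sliding_sketch.cu — illustrative sliding-window kernel (assumptions above).
#include <cuda_runtime.h>

#define R    2    // stencil radius → the "+2 halo"
#define BDZ 32    // threadIdx.x spans z (unit stride) → coalesced loads
#define BDY  8    // threadIdx.y spans y

__global__ void laplacian_sliding(const float* __restrict__ u,
                                  float* __restrict__ out,
                                  int nx, int ny, int nz, float inv_h2)
{
    // 5-plane window: x-planes ix-2 .. ix+2, each with a +2 halo in (y, z).
    __shared__ float s[2*R + 1][BDY + 2*R][BDZ + 2*R];

    const int iz = blockIdx.x * BDZ + threadIdx.x;
    const int iy = blockIdx.y * BDY + threadIdx.y;
    const int lz = threadIdx.x + R, ly = threadIdx.y + R;   // tile coords incl. halo

    const int tid  = threadIdx.y * BDZ + threadIdx.x;
    const int nthr = BDZ * BDY;
    const int tz_n = BDZ + 2*R, ty_n = BDY + 2*R;
    const int bz0  = blockIdx.x * BDZ - R;                   // tile origin incl. halo
    const int by0  = blockIdx.y * BDY - R;

    // Cooperatively load one x-plane (with its y/z halo) into window slot `slot`;
    // out-of-domain halo points are zero-padded to keep the sketch short.
    auto load_plane = [&](int ix, int slot) {
        for (int t = tid; t < ty_n * tz_n; t += nthr) {
            int ty = t / tz_n, tz = t % tz_n;
            int gy = by0 + ty, gz = bz0 + tz;
            bool ok = (gy >= 0 && gy < ny && gz >= 0 && gz < nz);
            s[slot][ty][tz] = ok ? u[((size_t)ix * ny + gy) * nz + gz] : 0.0f;
        }
    };

    // Placeholder 4th-order Laplacian coefficients (the benchmark's real ones live in ref/code.cpp).
    const float c0 = -7.5f, c1 = 4.0f / 3.0f, c2 = -1.0f / 12.0f;

    // Preload x-planes 0 .. 2R-1 (assumes nx >= 2R+1), then march along x,
    // reusing 4 of the 5 planes each step and fetching only the leading one.
    for (int p = 0; p < 2*R; ++p) load_plane(p, p);

    for (int ix = R; ix < nx - R; ++ix) {
        load_plane(ix + R, (ix + R) % (2*R + 1));            // fetch the leading plane
        __syncthreads();

        if (iy >= R && iy < ny - R && iz >= R && iz < nz - R) {
            int m2 = (ix-2) % 5, m1 = (ix-1) % 5, p0 = ix % 5,
                p1 = (ix+1) % 5, p2 = (ix+2) % 5;
            float v = c0 *  s[p0][ly][lz]
                    + c1 * (s[m1][ly][lz]   + s[p1][ly][lz]      // x±1
                          + s[p0][ly-1][lz] + s[p0][ly+1][lz]    // y±1
                          + s[p0][ly][lz-1] + s[p0][ly][lz+1])   // z±1
                    + c2 * (s[m2][ly][lz]   + s[p2][ly][lz]      // x±2
                          + s[p0][ly-2][lz] + s[p0][ly+2][lz]    // y±2
                          + s[p0][ly][lz-2] + s[p0][ly][lz+2]);  // z±2
            out[((size_t)ix * ny + iy) * nz + iz] = v * inv_h2;
        }
        __syncthreads();   // finish reading the oldest plane before it is overwritten
    }
}
```

A launch would use dim3 block(BDZ, BDY) and dim3 grid((nz+BDZ-1)/BDZ, (ny+BDY-1)/BDY); the zero-padded halo and interior-only writes are simplifications, and the full wave update (time terms, source injection) would be layered on top of this structure.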