High-performance CUDA kernel generation and benchmarking framework
View the Project on GitHub jasonlarkin/cuda-stencil-benchmark
Roofline Method (CPU baseline)
Goal
Assumptions for this stencil
Measure bandwidth ceiling
1) Build triad:
cd roofline && make
2) Run a few sizes, take the best:
./stream_triad 8000000 50
(adjust N to sweep working set; report BW)
Estimate compute ceiling
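A common first-order estimate of the CPU compute ceiling multiplies core count, sustained clock, SIMD lanes per vector, and FLOPs per lane per cycle. The sketch below assumes that model. The helper name `peak_gflops` and the example machine (8 cores, 3.0 GHz, AVX2 with two FMA units per core) are hypothetical and are not taken from this project; substitute your own hardware figures or a measured value.

```python
def peak_gflops(cores, ghz, simd_lanes, fma_units=2, flops_per_fma=2):
    """First-order theoretical peak in GFLOP/s (hypothetical helper).

    cores          physical cores used by the benchmark
    ghz            sustained clock in GHz under vector load
    simd_lanes     elements per vector (e.g. 4 doubles for 256-bit AVX2)
    fma_units      FMA pipes per core (assumed 2 here)
    flops_per_fma  one multiply + one add = 2 FLOPs
    """
    return cores * ghz * simd_lanes * fma_units * flops_per_fma


# Hypothetical machine, NOT a measured value from this repository.
example_peak = peak_gflops(cores=8, ghz=3.0, simd_lanes=4)
print(f"estimated compute ceiling: {example_peak:.0f} GFLOP/s")
```

The result is an upper bound; real kernels rarely reach it, so a measured compute microbenchmark gives a tighter ceiling where one is available.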
Produce roofline from bench CSV
1) Generate CSV via safe sweep:
cd cpu_bench && bash run_sweep.sh
2) Plot roofline (replace the ceilings with your measured values):
python ../roofline/plot_roofline.py sweep_small.csv --peak-bw-gbs 40 --peak-gflops 200
3) Output: cpu_bench/sweep_small.roofline.png
Interpretation
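In the roofline model, attainable performance is bounded by min(peak compute, arithmetic intensity × peak bandwidth), and the two ceilings meet at the ridge point. The sketch below applies that bound using the placeholder ceilings from the plot command (40 GB/s and 200 GFLOP/s). The function names and the sample intensities are illustrative assumptions, not part of `plot_roofline.py`.

```python
def attainable_gflops(ai, peak_bw_gbs, peak_gflops):
    """Roofline bound in GFLOP/s for arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_gflops, ai * peak_bw_gbs)


def ridge_point(peak_bw_gbs, peak_gflops):
    """Arithmetic intensity (FLOP/byte) where the two ceilings intersect."""
    return peak_gflops / peak_bw_gbs


# Placeholder ceilings from the plot command; replace with measured values.
PEAK_BW_GBS = 40.0
PEAK_GFLOPS = 200.0

ridge = ridge_point(PEAK_BW_GBS, PEAK_GFLOPS)
print(f"ridge point: {ridge:.1f} FLOP/byte")

# Sample intensities chosen for illustration only.
for ai in (0.5, 5.0, 20.0):
    bound = attainable_gflops(ai, PEAK_BW_GBS, PEAK_GFLOPS)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:5.1f} FLOP/byte -> at most {bound:6.1f} GFLOP/s ({regime})")
```

A point that sits well below its bound suggests inefficiency beyond the bandwidth or compute limit itself, such as poor vectorization or cache misses.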
Next (GPU roofline)
Measure the GPU bandwidth ceiling (with cudaMemcpy or a device triad), set the SM compute peak (from nvidia-smi --query-gpu=clocks,fp32), and replot using the same arithmetic intensity (or a revised one if the CUDA kernel changes bytes/update via shared-memory tiling).
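The two GPU-side quantities can be sketched as follows. The FP32 peak uses the usual first-order formula (SMs × FP32 lanes per SM × 2 FLOPs per FMA × clock), and arithmetic intensity is FLOPs per update divided by bytes moved per update. Every number below (SM count, lanes per SM, clock, FLOPs and bytes per update) is a hypothetical placeholder, not a property of this project's kernel or of any specific GPU.

```python
def gpu_peak_fp32_gflops(sm_count, fp32_lanes_per_sm, sm_clock_ghz):
    """First-order FP32 peak in GFLOP/s, counting one FMA as 2 FLOPs."""
    return sm_count * fp32_lanes_per_sm * 2 * sm_clock_ghz


def arithmetic_intensity(flops_per_update, bytes_per_update):
    """FLOP/byte for one grid-point update."""
    return flops_per_update / bytes_per_update


# Hypothetical device: 80 SMs, 64 FP32 lanes per SM, 1.5 GHz SM clock.
peak = gpu_peak_fp32_gflops(sm_count=80, fp32_lanes_per_sm=64, sm_clock_ghz=1.5)
print(f"GPU FP32 ceiling: {peak:.0f} GFLOP/s")

# Hypothetical stencil: 8 FLOPs per update.
# Without reuse, assume 32 bytes of DRAM traffic per update;
# with shared-memory tiling, assume this drops to 8 bytes per update.
ai_naive = arithmetic_intensity(flops_per_update=8, bytes_per_update=32)
ai_tiled = arithmetic_intensity(flops_per_update=8, bytes_per_update=8)
print(f"AI without tiling: {ai_naive:.2f} FLOP/byte")
print(f"AI with tiling:    {ai_tiled:.2f} FLOP/byte")
```

If tiling changes the bytes moved per update, the kernel's point shifts right on the roofline, so the replot should use the revised intensity rather than the CPU value.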