CUDA Stencil Benchmark

High-performance CUDA kernel generation and benchmarking framework

GitHub repository: jasonlarkin/cuda-stencil-benchmark

Performance Results and Analysis

3D Finite-Difference Stencil Kernels

Test Configuration

Kernel Performance Comparison

000_baseline (Z-Coalesced)

001_tiled (X-Y Shared Tile)

002_sliding (X-Y Shared + 5-Plane Z Sliding Window)

Key Insights

  1. Memory-Bound Behavior: The kernel's arithmetic intensity is ≈ 0.52 flop/byte, well below the T4's roofline ridge point, so its throughput is limited by memory bandwidth rather than compute.

  2. Coalescing Is Critical: The z-coalesced baseline outperforms the tiled variants because z is the unit-stride dimension of the memory layout u[t][x][y][z], so consecutive threads mapped along z read consecutive addresses.

  3. Optimization Tradeoffs:
    • Full (x,y) tiling gains data reuse in x and y, but at the cost of bandwidth-efficient access along z, the unit-stride dimension
    • The sliding window adds synchronization overhead that its extra reuse does not repay
    • The baseline’s simple z-coalesced pattern is the best of the three variants for this data layout

  4. Correctness-First Validation: The workflow identified correctness issues in 001_tiled before any performance tuning began, so no effort was spent optimizing an incorrect kernel.

  5. Roofline Methodology: Roofline analysis compares each kernel's throughput against the GPU's bandwidth and compute ceilings, which quantifies whether the kernel is memory-bound or compute-bound.
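
The memory-bound claim in insight 1 can be checked with a few lines of roofline arithmetic. The following is a minimal host-side C++ sketch, not the project's own analysis code. The function names are hypothetical, and the T4 constants are nominal datasheet figures (about 8.1 TFLOP/s FP32 and 320 GB/s) assumed here for illustration; the project's roofline may use measured peaks or a different precision.

```cpp
#include <algorithm>
#include <cassert>

// Nominal NVIDIA T4 figures, assumed for illustration only; the project's
// own roofline may be built from measured peaks instead.
constexpr double kT4PeakGflopsFp32 = 8100.0;  // ~8.1 TFLOP/s FP32
constexpr double kT4PeakGBps       = 320.0;   // GDDR6 memory bandwidth

// Ridge point: the arithmetic intensity (flop/byte) at which the bandwidth
// roof meets the compute roof.
double ridge_point(double peak_gflops, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return peak_gflops / peak_gbps;
}

// Attainable throughput under the roofline model, in GFLOP/s:
// the lower of the compute roof and the bandwidth roof at intensity `ai`.
double attainable_gflops(double ai, double peak_gflops, double peak_gbps) {
    return std::min(peak_gflops, ai * peak_gbps);
}

// A kernel is memory-bound when its intensity lies left of the ridge point.
bool is_memory_bound(double ai, double peak_gflops, double peak_gbps) {
    return ai < ridge_point(peak_gflops, peak_gbps);
}
```

Under these assumed peaks, the ridge point is about 25.3 flop/byte. At 0.52 flop/byte the bandwidth roof caps the kernel at about 166 GFLOP/s, roughly 2% of the nominal FP32 peak, so memory traffic rather than arithmetic sets the ceiling.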
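
The coalescing argument in insight 2 comes down to index arithmetic on the layout u[t][x][y][z]. The host-side C++ sketch below computes the linear offset for that layout and counts how many 128-byte segments a 32-thread warp touches for a given per-thread stride. It is an illustration under stated assumptions: the `Grid` struct and function names are hypothetical, elements are assumed to be 4-byte floats, and the 128-byte segment count is a simplified model of how the hardware groups accesses.

```cpp
#include <cassert>
#include <cstddef>

// Spatial extents of one time slice (hypothetical helper type).
struct Grid { std::size_t nx, ny, nz; };

// Linear element offset of u[t][x][y][z] in row-major order:
// z is unit-stride, y has stride nz, x has stride ny*nz.
std::size_t idx(const Grid& g, std::size_t t, std::size_t x,
                std::size_t y, std::size_t z) {
    return ((t * g.nx + x) * g.ny + y) * g.nz + z;
}

// Number of distinct 128-byte segments touched when 32 consecutive threads
// each read one 4-byte element, thread i reading element base + i * stride.
// Offsets never decrease with i, so counting segment changes counts
// distinct segments.
std::size_t segments_per_warp(std::size_t base_elem, std::size_t stride_elems) {
    constexpr std::size_t kWarpSize = 32, kElemBytes = 4, kSegmentBytes = 128;
    std::size_t count = 0;
    std::size_t last = static_cast<std::size_t>(-1);  // sentinel: no segment yet
    for (std::size_t i = 0; i < kWarpSize; ++i) {
        std::size_t seg = (base_elem + i * stride_elems) * kElemBytes / kSegmentBytes;
        if (seg != last) { ++count; last = seg; }
    }
    return count;
}
```

In this model, a warp whose threads advance along z (stride 1) from an aligned base touches a single segment, while a warp whose threads advance along y or x (stride nz or ny*nz) touches a separate segment per thread once the stride reaches 32 elements. That gap is the bandwidth cost that a tiling scheme must make up through reuse.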

Future Optimization Directions

Matrix Multiplication Baseline

Task Configuration

Status

Analysis

The matmul task is a test case for workflow generality: it demonstrates that the LLM-guided kernel generation framework applies to computational patterns beyond stencils.