3D Finite-Difference Stencil Kernels
Test Configuration
- Hardware: T4 GPU (sm_75)
- Problem Size: 48×48×48 grid, T=10 timesteps
- Total Points: 1,105,920 point updates (110,592 points per timestep × 10 timesteps)
- Device Bandwidth: ~124.3 GB/s (DtoD memcpy)
- Roofline Ceiling: ~64.6 GF/s (AI ≈ 0.52 flop/byte)
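The point count and roofline ceiling follow directly from the configuration values. A minimal sketch of that arithmetic, using only the figures listed in this section (variable names are illustrative):

```python
# Roofline arithmetic from the test configuration.
NX = NY = NZ = 48          # grid extent per dimension
T = 10                     # timesteps
BANDWIDTH_GBS = 124.3      # measured DtoD memcpy bandwidth, GB/s
AI = 0.52                  # arithmetic intensity, flop/byte

points_per_step = NX * NY * NZ
point_updates = points_per_step * T
# Memory-bound ceiling: (bytes per second) * (flops per byte).
ceiling_gflops = BANDWIDTH_GBS * AI

print(points_per_step)             # 110592
print(point_updates)               # 1105920
print(round(ceiling_gflops, 1))    # 64.6
```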
000_baseline (Z-Coalesced)
- Update Time: 0.00052s
- Performance: ~70.4 GF/s (~0.47 ns/point)
- Correctness: PASS (checksum diff: 0.000001)
- Status: Best performing kernel
- Analysis:
- Z-coalesced access pattern maximizes memory bandwidth utilization
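- Measured throughput is ~109% of the memcpy-derived roofline ceiling (70.4 vs. 64.6 GF/s), so the DtoD memcpy figure likely understates the bandwidth available to the kernel
- Memory-bound kernel, already at the estimated ceiling on T4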
001_tiled (X-Y Shared Tile)
- Update Time: 0.00065s
- Performance: ~56.3 GF/s (~0.59 ns/point)
- Correctness: FAIL (checksum diff: 0.002, source injection parity issue)
- Status: Slower than baseline, correctness issue
- Analysis:
- Full (x,y) plane tiling degrades coalescing when z is unit-stride
- Checksum validation flagged a source injection parity issue
- Shows why correctness is checked before performance is compared
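A checksum comparison of the kind reported for each kernel can be sketched as follows. The tolerance of 1e-4 and the helper names `checksum` and `checksum_ok` are assumptions for illustration; the source reports the diffs but not the threshold or how the checksum is computed.

```python
import math

# Hypothetical tolerance: the source reports diffs but not the threshold.
TOLERANCE = 1e-4

def checksum(values):
    """Sum of all grid values, accumulated with math.fsum for stability."""
    return math.fsum(values)

def checksum_ok(reference, candidate, tol=TOLERANCE):
    """Return (passed, diff) for two flattened grids of equal length."""
    diff = abs(checksum(reference) - checksum(candidate))
    return diff <= tol, diff

# The reported diffs classify the same way under this assumed tolerance.
for name, diff in [("000_baseline", 0.000001), ("001_tiled", 0.002)]:
    print(name, "PASS" if diff <= TOLERANCE else "FAIL")
```

A single summed checksum is cheap but coarse: errors of opposite sign can cancel, so a per-point maximum difference would be a stricter complement.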
002_sliding (X-Y Shared + 5-Plane Z Sliding Window)
- Update Time: 0.00058s
- Performance: ~63.1 GF/s (~0.52 ns/point)
- Correctness: PASS (checksum diff: 0.000001)
- Status: Correct but slower than baseline
- Analysis:
- Passes the checksum check that 001_tiled fails
- Sliding window adds synchronization overhead
- Strided memory access reduces effective bandwidth
- Tradeoff: correctness is restored, but throughput stays ~10% below baseline (63.1 vs. 70.4 GF/s)
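The ns/point figures for the three kernels can be re-derived from the update times. The sketch below assumes each update time covers all 1,105,920 point updates (48×48×48 points × 10 timesteps); that reading is inferred from the reported numbers rather than stated in the source.

```python
# Update times (seconds) as reported for each kernel.
UPDATE_TIME_S = {
    "000_baseline": 0.00052,
    "001_tiled": 0.00065,
    "002_sliding": 0.00058,
}
POINT_UPDATES = 48 * 48 * 48 * 10   # assumed: time spans all 10 timesteps

def ns_per_point(seconds, updates=POINT_UPDATES):
    """Average time per point update in nanoseconds."""
    return seconds / updates * 1e9

for name, t in UPDATE_TIME_S.items():
    slowdown = t / UPDATE_TIME_S["000_baseline"]
    print(f"{name}: {ns_per_point(t):.2f} ns/point, {slowdown:.2f}x baseline time")
```

Under that assumption the computed values round to the reported 0.47, 0.59, and 0.52 ns/point.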
Key Insights
- Memory-Bound Behavior: The kernel operates at arithmetic intensity ≈ 0.52 flop/byte, making it memory-bound on T4.
- Coalescing Critical: Z-coalesced baseline outperforms tiled variants because z is the unit-stride dimension in the memory layout u[t][x][y][z].
- Optimization Tradeoffs:
- Full (x,y) tiling trades bandwidth for reuse in the wrong dimension
- Sliding window adds synchronization overhead without sufficient benefit
- Baseline’s simple z-coalesced pattern is the fastest of the three variants tested for this data layout
- Correctness-First Validation: The workflow successfully identified correctness issues (001_tiled) before performance optimization, preventing incorrect optimizations.
- Roofline Methodology: Performance characterization using roofline analysis provides quantitative understanding of memory-bound vs compute-bound behavior.
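To make the coalescing argument concrete, the following sketch computes flat element offsets for the u[t][x][y][z] layout. The helper `flat_index` and the absence of halo padding are assumptions for illustration.

```python
NX = NY = NZ = 48  # grid extents from the test configuration

def flat_index(t, x, y, z, nx=NX, ny=NY, nz=NZ):
    """Row-major offset (in elements) into u[t][x][y][z]."""
    return ((t * nx + x) * ny + y) * nz + z

base = flat_index(0, 10, 10, 10)
stride_z = flat_index(0, 10, 10, 11) - base   # neighbor in z
stride_y = flat_index(0, 10, 11, 10) - base   # neighbor in y
stride_x = flat_index(0, 11, 10, 10) - base   # neighbor in x

print(stride_z, stride_y, stride_x)   # 1 48 2304
```

Threads mapped to consecutive z therefore touch adjacent elements, while threads mapped along y or x are 48 or 2304 elements apart, which is why tiling over (x,y) works against coalescing in this layout.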
Future Optimization Directions
- 003_zcoalesce_xycache: Combine z-coalesced access with lightweight x/y halo caching
- Vectorized Loads: Explore float2/float4 vectorized z loads
- Launch Tuning: Optimize block geometry for occupancy and memory throughput
- Multi-Architecture: Port to L4/A100 and repeat roofline analysis
Matrix Multiplication Baseline
Task Configuration
- Dimensions: M=256, N=256, K=524288 (large K dimension)
- Purpose: Explore workflow generality with simpler computational kernel
Status
- CPU Reference: Implemented (naive triple loop)
- CUDA Baseline: Implemented (simple thread-per-element)
- Benchmarking: Infrastructure complete
- Results: Pending performance characterization
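A naive triple-loop reference of the kind listed above might look like the sketch below. It is not the project's code: the function names, the row-major list-of-lists representation, and the tiny example matrices are assumptions. The operation count uses the standard 2·M·N·K estimate (one multiply and one add per inner iteration).

```python
def matmul_naive(a, b):
    """C = A @ B for row-major lists of lists; A is MxK, B is KxN."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c

def matmul_flops(m, n, k):
    """One multiply and one add per inner-loop iteration."""
    return 2 * m * n * k

print(matmul_naive([[1.0, 2.0]], [[3.0], [4.0]]))   # [[11.0]]
print(matmul_flops(256, 256, 524288))                # 68719476736
```

For the configured dimensions this is about 68.7 GFLOP per multiplication, with each output element accumulating over a very long K dimension.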
Analysis
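The matmul task serves as a test case for workflow generality. With the CPU reference, CUDA baseline, and benchmarking infrastructure in place, it shows that the LLM-guided kernel generation framework can be set up for computational patterns beyond stencils; performance characterization is still pending.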