CUDA Stencil Benchmark

A framework for generating and benchmarking high-performance CUDA kernels with LLM-guided code generation. It combines correctness validation with performance characterization to support GPU optimization.

Core Documentation

GOAL.md: Project objectives and success criteria
DESIGN.md: System architecture and design decisions
WORKFLOW.md: LLM-guided kernel generation workflow
RESULTS.md: Performance results and analysis

Technical Documentation

MATHEMATICAL_FOUNDATIONS.md: Mathematical foundations, equation identification, stability analysis, and correctness testing methodology
MODIFIED_WAVENUMBER_ANALYSIS.md: Detailed modified wavenumber analysis of the 4th-order finite difference scheme
cpu_baseline.md: CPU baseline characterization and validation
roofline.md: Roofline methodology for performance analysis
stencil_order.md: Stencil discretization order verification
GIT_WORKFLOW.md: Git branching strategy and development workflow

Overview

This framework demonstrates a systematic approach to generating and optimizing CUDA kernels using language models, with emphasis on:

Correctness-First Validation: Automated numerical parity verification
Performance Analysis: Roofline methodology for quantitative characterization
Iterative Optimization: Feedback-driven workflow for continuous improvement
Extensible Design: Support for multiple computational patterns
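As an illustration of the correctness-first idea, a numerical parity check can be sketched as an element-wise comparison between a trusted reference output (for example, a CPU baseline) and the output of a candidate kernel. This is a minimal sketch, not the framework's actual validator: the names `parity_ok` and `first_mismatch` and the default tolerances are assumptions chosen for illustration.

```python
import math

# Hypothetical tolerances; the project's real thresholds are not stated here.
DEFAULT_REL_TOL = 1e-5
DEFAULT_ABS_TOL = 1e-6


def first_mismatch(reference, candidate,
                   rel_tol=DEFAULT_REL_TOL, abs_tol=DEFAULT_ABS_TOL):
    """Return the index of the first element outside tolerance, or None."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    for i, (r, c) in enumerate(zip(reference, candidate)):
        # math.isclose returns False when either value is NaN,
        # so NaNs produced by a faulty kernel are reported as mismatches.
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None


def parity_ok(reference, candidate,
              rel_tol=DEFAULT_REL_TOL, abs_tol=DEFAULT_ABS_TOL):
    """True when every candidate value matches the reference within tolerance."""
    if len(reference) != len(candidate):
        return False
    return first_mismatch(reference, candidate, rel_tol, abs_tol) is None
```

Running a check like this before any timing measurement means a kernel that is fast but numerically wrong is rejected before it reaches the performance analysis.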

Key Results

Baseline Performance: ~70.4 GFLOP/s on an NVIDIA T4 GPU, reaching ~92% of the roofline ceiling
Correctness Validation: Automated parity testing identified optimization issues
Iterative Development: Multiple kernel attempts demonstrate the systematic workflow
Workflow Generality: Extended to matrix multiplication to explore broader applicability

For detailed results and analysis, see RESULTS.md.
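To make the roofline figures concrete, the sketch below shows the standard roofline bound (the lesser of peak compute and bandwidth times arithmetic intensity) and the ceiling implied by the two numbers quoted above. The function names are hypothetical, and the hardware values in the comments are placeholders rather than measured T4 parameters; the parameters actually used are documented in roofline.md.

```python
def roofline_ceiling(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the roofline model.

    arithmetic_intensity is in FLOP per byte of memory traffic, so
    bandwidth_gbs * arithmetic_intensity is the memory-bound limit.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def fraction_of_ceiling(measured_gflops, ceiling_gflops):
    """Measured throughput as a fraction of the roofline ceiling."""
    return measured_gflops / ceiling_gflops


# Ceiling implied by the reported figures: ~70.4 GFLOP/s at ~92% of ceiling.
implied_ceiling = 70.4 / 0.92  # roughly 76.5 GFLOP/s

# Placeholder inputs (assumptions, not T4 specifications): a kernel at
# 0.25 FLOP/byte on a device with 300 GB/s bandwidth is memory-bound.
example_ceiling = roofline_ceiling(8000.0, 300.0, 0.25)
```

Because both reported figures are approximate, the implied ceiling of about 76.5 GFLOP/s is an estimate, not a measured value.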

Getting Started

See the main README.md for build instructions and the quick start guide.