CUDA Stencil Benchmark

High-performance CUDA kernel generation and benchmarking framework

Project repository: github.com/jasonlarkin/cuda-stencil-benchmark

System Design

Architecture Overview

The framework follows a three-phase workflow:

  1. Specification Phase: Define task requirements, interface contracts, and correctness criteria
  2. Generation Phase: Use LLM prompts to generate CUDA kernel implementations
  3. Validation Phase: Test correctness and measure performance, feeding results back to generation

Core Components

1. Task Specification (tasks/): problem definitions, shapes, and correctness criteria for each task

2. LLM Prompts (prompts/): prompt templates used to generate kernel implementations

3. Kernel Interface (include/): the common interface contract that every generated kernel implements

4. CPU Benchmark (cpu_bench/): the CPU reference implementation and performance baseline

5. CUDA Infrastructure (cuda/): build and launch scaffolding for generated kernels

6. Testing Framework (tests/): compilation, correctness, and performance tests

7. Analysis Tools (analysis/): scripts that turn benchmark output into reports and feedback

Data Flow

Task Spec → LLM Prompt → Generated Kernel
                                    ↓
                            Compilation Test
                                    ↓
                            Correctness Test (vs CPU)
                                    ↓
                            Performance Benchmark
                                    ↓
                            Analysis & Feedback
                                    ↓
                            Next Iteration

Key Design Decisions

1. Correctness-First Approach: a kernel must pass correctness tests against the CPU reference before its performance is measured

2. Modular Kernel Attempts: each generated kernel lives in its own directory, so attempts can be compared and iterated on independently

3. Automated Testing: compilation, correctness, and performance checks run automatically on every attempt

4. Roofline Methodology: measured performance is judged against hardware limits (peak compute and memory bandwidth) rather than in isolation

5. Pure C++/CUDA (No PyTorch): no Python framework dependency; kernels build and run with a standard C++/CUDA toolchain

Extension Points

Adding New Tasks

  1. Create task directory in tasks/
  2. Define specification and shapes
  3. Update prompts if needed
  4. Add task-specific tests

Adding New Kernels

  1. Create kernel directory in cuda/kernels/ (once CUDA infrastructure is migrated)
  2. Implement the Kernel_cuda() function
  3. Add to build system
  4. Run correctness and performance tests

Custom Analysis

  1. Add scripts to analysis/
  2. Integrate with benchmark output
  3. Generate visualizations or reports