CUDA Stencil Benchmark

High-performance CUDA kernel generation and benchmarking framework

Project on GitHub: jasonlarkin/cuda-stencil-benchmark

LLM-Guided Kernel Generation Workflow

Overview

This document describes the iterative workflow for generating and optimizing CUDA kernels with large language models (LLMs). The workflow validates correctness first, then optimizes performance systematically.

Workflow Diagram

[Figure: LLM-Guided CUDA Kernel Generation Workflow]

Workflow Phases

1. Specification

Inputs:

Output: Complete task definition with correctness criteria and performance targets.

2. Prompt Construction

Components:

Output: Complete prompt for language model.

3. Kernel Generation

Process:

  1. Send prompt to language model
  2. Receive generated CUDA kernel code
  3. Save to cuda/kernels/XXX_name/kernel.cu, where XXX_name is the attempt identifier later passed to make as ATTEMPT

Output: CUDA kernel source file.
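
The save step can be sketched in Python as follows. This is a minimal illustration, not the project's tooling: the helper names extract_cuda_source and save_kernel, the attempt name 001_example, and the assumption that the model wraps its code in a Markdown fence are all hypothetical. Only the target path layout cuda/kernels/XXX_name/kernel.cu comes from the step above.

```python
import re
import tempfile
from pathlib import Path

# Markdown code-fence marker, built from single backticks so this sketch
# can itself be embedded inside a fenced block.
FENCE = "`" * 3


def extract_cuda_source(response: str) -> str:
    """Return the first fenced code block in the response, or the whole response."""
    pattern = re.escape(FENCE) + r"[\w+]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1) if match else response


def save_kernel(source: str, attempt: str, repo_root: Path) -> Path:
    """Write the kernel to <repo_root>/cuda/kernels/<attempt>/kernel.cu."""
    kernel_dir = repo_root / "cuda" / "kernels" / attempt
    kernel_dir.mkdir(parents=True, exist_ok=True)
    kernel_path = kernel_dir / "kernel.cu"
    kernel_path.write_text(source)
    return kernel_path


# Hypothetical model response and attempt name, written to a throwaway directory.
response = "Here is the kernel:\n" + FENCE + "cuda\n__global__ void k() {}\n" + FENCE + "\n"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_kernel(extract_cuda_source(response), "001_example", Path(tmp))
    relative_path = saved.relative_to(tmp).as_posix()
    saved_text = saved.read_text()
```

Keeping extraction separate from saving means a response without a fence is still written unchanged, which makes any stray prose visible at the compilation step.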

4. Compilation Test

Process:

cd cuda
make ATTEMPT=XXX_name ARCH=sm_75
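
When the compilation test is driven from a script, the make invocation above could be wrapped roughly as below. This is a sketch under assumptions: the function names are hypothetical, and the only facts taken from the document are the working directory cuda and the ATTEMPT and ARCH variables.

```python
from __future__ import annotations

import subprocess


def build_command(attempt: str, arch: str = "sm_75") -> list[str]:
    """Assemble the make command line shown above for one kernel attempt."""
    return ["make", f"ATTEMPT={attempt}", f"ARCH={arch}"]


def compile_attempt(attempt: str, arch: str = "sm_75", cwd: str = "cuda") -> tuple[bool, str]:
    """Run make in the cuda directory and return (succeeded, compiler stderr)."""
    result = subprocess.run(
        build_command(attempt, arch),
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr


# Hypothetical attempt name; only the command is built here, nothing is executed.
command = build_command("001_example")
```

Returning the captured stderr alongside the success flag keeps compiler diagnostics available for a later regeneration prompt.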

Outcomes:

5. Correctness Test

Process:

VERIFY=1 ./bench_cuda_XXX_name
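
The document does not say how the VERIFY run compares outputs. As an illustration only, a common way to test numerical parity between a reference result and a kernel result is a mixed absolute and relative tolerance. The function name and the tolerance values below are assumptions, not the benchmark's actual criteria.

```python
import math


def outputs_match(reference, candidate, rtol=1e-5, atol=1e-6):
    """Check |candidate - reference| <= atol + rtol * |reference| element-wise.

    Non-finite candidate values (NaN or infinity) always count as a mismatch.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isfinite(c) and abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Illustrative values, not benchmark data.
reference = [1.0, 2.0, 3.0]
close_enough = outputs_match(reference, [1.0, 2.00001, 3.0])
too_far = outputs_match(reference, [1.0, 2.1, 3.0])
has_nan = outputs_match(reference, [1.0, float("nan"), 3.0])
```

The absolute term keeps the check meaningful where the reference is near zero, and the relative term scales the allowance for large values.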

Method:

Criteria:

Outcomes:

6. Performance Benchmark

Process:

CSV=1 ./bench_cuda_XXX_name > results.csv
./dev_bw  # Measure device bandwidth

Metrics:

7. Performance Analysis

Roofline Analysis:

  1. Calculate arithmetic intensity (AI = FLOPs / bytes moved to and from device memory)
  2. Measure device bandwidth
  3. Compute roofline ceiling: min(peak compute throughput, AI * measured bandwidth)
  4. Plot measured performance vs roofline
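
The first three steps reduce to a few lines of arithmetic, sketched here. All numbers are illustrative placeholders, not measurements from this benchmark or specifications of any device, and the function names are hypothetical. Plotting is omitted to keep the sketch dependency-free.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """AI in FLOPs per byte of device-memory traffic."""
    return flops / bytes_moved


def roofline_ceiling(ai: float, bandwidth_gbs: float, peak_gflops: float) -> float:
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, ai * bandwidth_gbs)


# Illustrative per-grid-point costs: 8 FLOPs and 8 bytes (one 4-byte read and
# one 4-byte write, assuming neighbor loads are served from cache).
ai = arithmetic_intensity(flops=8.0, bytes_moved=8.0)

# Placeholder device numbers; measured bandwidth would come from ./dev_bw.
ceiling_gflops = roofline_ceiling(ai, bandwidth_gbs=300.0, peak_gflops=8000.0)

# Fraction of the roofline reached by a placeholder measured throughput.
measured_gflops = 150.0
fraction_of_roofline = measured_gflops / ceiling_gflops
```

With these placeholder values the memory roof (AI times bandwidth) is far below the compute roof, so the ceiling is set by bandwidth.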

Interpretation:

8. Feedback Loop

Analysis:

  1. Compare performance vs baseline
  2. Identify bottlenecks (memory vs compute)
  3. Document optimization insights
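
The first two analysis steps can be sketched as below. The function names and all numbers are hypothetical. The classification uses the standard roofline ridge point (peak compute divided by bandwidth): a kernel whose arithmetic intensity falls below it is limited by memory traffic.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of the candidate over the baseline; values above 1.0 are faster."""
    return baseline_ms / candidate_ms


def classify_bottleneck(ai: float, bandwidth_gbs: float, peak_gflops: float) -> str:
    """Label a kernel by comparing its arithmetic intensity with the ridge point."""
    ridge_point = peak_gflops / bandwidth_gbs  # FLOPs per byte where the roofs meet
    return "memory-bound" if ai < ridge_point else "compute-bound"


# Placeholder timings and device numbers, not benchmark results.
candidate_speedup = speedup(baseline_ms=4.0, candidate_ms=2.0)
low_ai_label = classify_bottleneck(ai=1.0, bandwidth_gbs=300.0, peak_gflops=8000.0)
high_ai_label = classify_bottleneck(ai=50.0, bandwidth_gbs=300.0, peak_gflops=8000.0)
```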

Prompt Updates:

  1. Add successful optimization patterns
  2. Document failure modes to avoid
  3. Update few-shot examples
  4. Refine task instructions

Next Iteration: Generate new kernel variant or refine existing kernel.
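
One way to keep the four kinds of prompt update organized between iterations is a small record that is re-rendered into a prompt each time. This is a sketch, not the project's prompt format: the class name, field names, section labels, and example strings are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptState:
    """Accumulated prompt material, updated by hand after each iteration."""

    task_instructions: str
    optimization_patterns: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)
    few_shot_examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the non-empty sections into a single prompt string."""
        sections = [self.task_instructions]
        if self.optimization_patterns:
            sections.append(
                "Patterns that worked:\n"
                + "\n".join(f"- {p}" for p in self.optimization_patterns)
            )
        if self.failure_modes:
            sections.append(
                "Avoid:\n" + "\n".join(f"- {m}" for m in self.failure_modes)
            )
        if self.few_shot_examples:
            sections.append("Examples:\n" + "\n\n".join(self.few_shot_examples))
        return "\n\n".join(sections)


# Hypothetical content, for illustration only.
state = PromptState(task_instructions="Write a CUDA stencil kernel.")
state.failure_modes.append("reading past the grid boundary")
prompt = state.render()
```

Rendering from structured state rather than editing one long prompt string makes it easier to see which lesson was added in which iteration.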

Example: Iterative Kernel Development

Initial Generation

Compilation

Correctness Test

Regeneration

Correctness Test (Retry)

Performance Benchmark

Feedback

Implementation Status

Implemented Components

Phase 1: Specification - COMPLETE

Phase 2: Prompt Construction - COMPLETE

Phase 5: Correctness Test - COMPLETE

Phase 6: Performance Benchmark - COMPLETE

Phase 7: Performance Analysis - COMPLETE

Manual Components

Phase 3: Kernel Generation - MANUAL

Phase 4: Compilation Test - MANUAL

Phase 8: Feedback Loop - MANUAL

Pending Components

CUDA Infrastructure - PENDING

Key Principles

  1. Correctness-First: Verify numerical parity before performance optimization
  2. Systematic Testing: Automated correctness and performance validation
  3. Iterative Improvement: Use test results to inform prompt updates (manual)
  4. Performance Characterization: Roofline analysis guides optimization strategy
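
The correctness-first principle amounts to gating: a kernel is benchmarked only after it compiles and passes verification. A minimal sketch of that control flow follows, with the four stages passed in as callables. The function name, the callables, and the result dictionary are hypothetical stand-ins, not an existing driver in the project.

```python
def run_iteration(generate, compile_kernel, verify, benchmark):
    """Run one pass of the workflow, stopping at the first failing gate."""
    source = generate()
    if not compile_kernel(source):
        return {"stage": "compile", "passed": False}
    if not verify(source):
        return {"stage": "correctness", "passed": False}
    # Only a compiled, verified kernel reaches the benchmark.
    return {"stage": "benchmark", "passed": True, "metrics": benchmark(source)}


# Stub stages: this kernel compiles but fails verification.
benchmark_calls = []
result = run_iteration(
    generate=lambda: "// kernel",
    compile_kernel=lambda src: True,
    verify=lambda src: False,
    benchmark=lambda src: benchmark_calls.append(src) or {"ms": 1.0},
)
```

In this stubbed run the benchmark callable is never invoked, which is the behavior the principle asks for.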

Success Metrics