CUDA Stencil Benchmark

High-performance CUDA kernel generation and benchmarking framework

Project repository: github.com/jasonlarkin/cuda-stencil-benchmark

System Design

Architecture Overview

The framework follows a three-phase workflow:

  1. Specification Phase: Define task requirements, interface contracts, and correctness criteria
  2. Generation Phase: Use LLM prompts to generate CUDA kernel implementations
  3. Validation Phase: Test correctness and measure performance, feeding results back to generation

Core Components

1. Task Specification (tasks/): problem definitions, shapes, and correctness criteria for each task

2. LLM Prompts (prompts/): prompt templates used to generate kernel implementations

3. Kernel Interface (include/): the common interface contract that every generated kernel implements

4. CPU Benchmark (cpu_bench/): the CPU reference implementation and performance baseline

5. CUDA Infrastructure (cuda/): build and launch scaffolding for generated kernels

6. Testing Framework (tests/): compilation, correctness, and performance tests

7. Analysis Tools (analysis/): scripts that turn benchmark output into reports and feedback

Data Flow

Task Spec → LLM Prompt → Generated Kernel
                                    ↓
                            Compilation Test
                                    ↓
                            Correctness Test (vs CPU)
                                    ↓
                            Performance Benchmark
                                    ↓
                            Analysis & Feedback
                                    ↓
                            Next Iteration

Key Design Decisions

1. Correctness-First Approach: a kernel must pass correctness tests against the CPU reference before its performance is measured

2. Modular Kernel Attempts: each generated kernel lives in its own directory, so attempts can be compared and iterated on independently

3. Automated Testing: compilation, correctness, and performance checks run automatically on every attempt

4. Roofline Methodology: measured performance is judged against hardware limits (peak compute and memory bandwidth) rather than in isolation

5. Pure C++/CUDA (No PyTorch): no Python framework dependency; kernels build and run with a standard C++/CUDA toolchain

Extension Points

Adding New Tasks

  1. Create task directory in tasks/
  2. Define specification and shapes
  3. Update prompts if needed
  4. Add task-specific tests

Adding New Kernels

  1. Create kernel directory in cuda/kernels/ (once CUDA infrastructure is migrated)
  2. Implement the Kernel_cuda() function
  3. Add to build system
  4. Run correctness and performance tests

Custom Analysis

  1. Add scripts to analysis/
  2. Integrate with benchmark output
  3. Generate visualizations or reports