CUDA Stencil Benchmark

High-performance CUDA kernel generation and benchmarking framework

View the Project on GitHub jasonlarkin/cuda-stencil-benchmark

Mathematical Foundations and Correctness Testing

Overview

This document describes the mathematical foundations of the 3D finite-difference stencil computation, the methods used to identify and characterize the underlying partial differential equation, and how these foundations inform our definition of correctness and testing methodology.

Equation Identification

Underlying PDE: 3D Wave Equation

The reference implementation solves the 3D wave equation:

\[\frac{\partial^2 u}{\partial t^2} = c^2 \nabla^2 u + f(x,y,z,t)\]

Where:

Numerical Scheme

Spatial Discretization: 4th-order finite difference

Temporal Discretization: 2nd-order finite difference

Evidence from Code Analysis

The update equation in ref/code.cpp implements:

u[t2] = dt² * (spatial_terms - temporal_terms) / m

Where:

Stability Analysis

Von Neumann Stability Method

Von Neumann stability analysis determines whether a finite difference scheme remains bounded over time by:

  1. Fourier Decomposition: Express the solution as a sum of Fourier modes
  2. Amplification Factor: Determine how each mode evolves: $u^{n+1} = G(k) u^n$
  3. Stability Criterion: Ensure $ G(k) \leq 1$ for all wavenumbers $k$

Stability Results

Critical CFL Number: 0.5 (empirically determined)

Key Finding: The scheme follows standard 4th-order finite difference stability limits, consistent with theoretical expectations for the 3D wave equation.

Implications for Testing

  1. CFL Constraint: All correctness tests must use CFL < 0.5
  2. Stability Validation: Long-time integration tests verify boundedness
  3. Parameter Space: Testing must cover stable and near-unstable regimes

Accuracy Analysis

Modified Wavenumber Analysis

The finite difference scheme introduces numerical dispersion:

Discretization Order Verification

Empirical convergence analysis confirms:

This is verified through:

Defining Correctness

Levels of Correctness

  1. Numerical Parity: CUDA kernel produces identical results to CPU reference
    • Tolerance: Checksum difference < 1e-5
    • Method: Per-voxel comparison after identical timesteps
    • Rationale: Same mathematical scheme should produce same results
  2. Physical Correctness: Solution satisfies wave equation properties
    • Energy Conservation: Total energy remains bounded (for closed systems)
    • Wave Propagation: Solutions propagate at correct speed
    • Boundary Behavior: Appropriate boundary condition handling
  3. Stability: Solution remains bounded for stable CFL numbers
    • Long-time Integration: No exponential growth for CFL < 0.5
    • Error Accumulation: Errors grow at expected rate (not catastrophically)
  4. Convergence: Solution converges to analytical solution as grid refines
    • Spatial Convergence: 4th-order rate
    • Temporal Convergence: 2nd-order rate

Correctness Testing Methodology

1. Parity Testing

Purpose: Verify CUDA kernel implements the same numerical scheme as CPU reference.

Method:

Criteria:

Rationale: If the numerical scheme is identical, floating-point differences should be minimal (within roundoff).

2. Analytical Solution Validation

Purpose: Verify the scheme solves the correct PDE with expected accuracy.

Method:

Criteria:

Rationale: Correct PDE + correct scheme → correct solution (up to discretization error).

3. Stability Testing

Purpose: Verify the scheme remains stable for appropriate parameter ranges.

Method:

Criteria:

Rationale: Unstable schemes produce unphysical results, violating correctness.

4. Order Verification

Purpose: Confirm the scheme achieves expected discretization order.

Method:

Criteria:

Rationale: Order verification confirms the scheme matches theoretical expectations.

Testing Philosophy

Correctness-First Approach

Principle: Verify numerical correctness before optimizing performance.

Rationale:

  1. Correctness is Non-Negotiable: An incorrect but fast kernel is useless
  2. Early Detection: Catch numerical errors before investing in optimization
  3. Baseline Establishment: Correct baseline enables meaningful performance comparison

Workflow:

  1. Generate kernel → Compile → Correctness Test → Performance Benchmark
  2. If correctness fails: Fix numerical issues, regenerate, retest
  3. If correctness passes: Proceed to performance analysis

Iterative Refinement

Principle: Use test results to refine kernel generation and validation.

Process:

  1. Initial Test: Basic parity check
  2. Failure Analysis: Identify root cause (memory access, boundary conditions, source injection)
  3. Prompt Update: Incorporate lessons learned into LLM prompts
  4. Regeneration: Generate improved kernel variant
  5. Re-test: Verify fix and check for regressions

Example: Kernel 001 (tiled) failed correctness due to source injection parity issue. Analysis revealed shared memory tiling broke atomic source updates. Prompt updated to emphasize source injection semantics, leading to corrected Kernel 002.

Relationship to Performance

Correctness Enables Performance Analysis

Valid Performance Metrics Require Correctness:

Example: Kernel 001 was faster than baseline but failed correctness. The “performance gain” was invalid because the kernel computed the wrong result.

Performance-Correctness Trade-offs

No Trade-offs Allowed:

Architecture-Specific Considerations:

Analysis Tools

The analysis/ directory contains Python scripts for performing mathematical analysis:

Von Neumann Stability Analysis

Script: analysis/von_neumann_stability.py

Performs theoretical von Neumann stability analysis to determine the stability condition for the finite difference scheme.

Usage:

python analysis/von_neumann_stability.py --c 1.0 --h 0.01

Outputs:

Key Results:

Modified Wavenumber Analysis

Script: analysis/modified_wavenumber.py

Analyzes the accuracy of the finite difference scheme in wavenumber space, determining phase and group velocity errors.

Usage:

python analysis/modified_wavenumber.py --h 0.01 0.02 0.05

Outputs:

Key Results:

Order Verification

Script: analysis/verify_order.py

Verifies the discretization order through grid refinement studies.

Usage:

python analysis/verify_order.py

Outputs:

Example Analysis Results

Example plots and documentation from the analysis process are available:

Von Neumann Stability Analysis Plots:

CFL = 0.1 (stable) CFL = 0.1 (stable regime)

CFL = 0.5 (critical) CFL = 0.5 (critical CFL)

CFL = 0.577 (theoretical limit) CFL = 0.577 (theoretical 2nd-order 3D limit)

CFL = 1.0 (unstable) CFL = 1.0 (unstable regime)

Stability Region Analysis Stability Region Analysis

Modified Wavenumber Analysis:

Fine grid (h = 0.01) Fine grid (h = 0.01)

Medium grid (h = 0.05) Medium grid (h = 0.05)

Detailed Analysis Documentation

Note: Some initial theoretical analysis results were later found to be suspect and were corrected through real numerical testing. The final validated results are documented in this file and used in the correctness testing methodology.

Summary

The mathematical foundations provide:

  1. Equation Identification: Confirms we’re solving the 3D wave equation with 4th-order spatial, 2nd-order temporal discretization
  2. Stability Analysis: Establishes CFL < 0.5 constraint for all testing
  3. Accuracy Characterization: Defines expected convergence rates and resolution requirements
  4. Correctness Definition: Multi-level approach (parity, physical, stability, convergence)
  5. Testing Methodology: Systematic validation at each level
  6. Workflow Integration: Correctness-first approach ensures valid performance analysis
  7. Analysis Tools: Python scripts for theoretical and numerical analysis

These foundations ensure that CUDA kernel optimization proceeds from a solid mathematical and numerical basis, with correctness validation at every step.