Add initial end-to-end CUDA FGMRES solver path #2825

Draft
LwhJesse wants to merge 5 commits into su2code:develop from LwhJesse:gpu/initial-cuda-fgmres

LwhJesse commented Jun 1, 2026

Proposed Changes

This PR adds an initial end-to-end CUDA FGMRES linear solve path on top of the existing CUDA BSR SpMV path.

It intentionally bundles the minimal pieces required for a reviewable GPU linear-solve slice, rather than sending the intermediate infrastructure-only pieces separately. The scope is limited to one GPU Krylov solver path (FGMRES), one simple GPU preconditioner path (JACOBI), and the vector operations and dispatch/lifecycle changes strictly required to make that path run.

Concretely, this PR:

  • caches the cuSPARSE SpMV resources needed by the solver path
  • adds CUDA FGMRES scaffolding and internal dispatch while keeping the public solver entry point unchanged
  • adds the CUDA vector primitives needed by the solver path
  • implements an initial CUDA FGMRES solve path
  • adds a simple CUDA Jacobi preconditioner path
  • keeps cuSPARSE for SpMV
  • keeps cuBLAS for dot / norm
  • uses custom CUDA kernels for vector-vector operations
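For orientation, the pieces listed above fit together roughly as in the following host-only C++ sketch of flexible GMRES with Givens rotations. This is not the code in this PR: `Fgmres`, `Dot`, `Axpy`, `Vec`, and `Op` are hypothetical names, the sketch has no restarts, and it does not handle a singular Hessenberg column. The comments mark which step the description above assigns to cuSPARSE, cuBLAS, or custom kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op = std::function<void(const Vec&, Vec&)>;  // y = Op(x)

// Dot product (cuBLAS in the GPU path described above).
double Dot(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// y += alpha * x (a custom vector kernel in the GPU path described above).
void Axpy(double alpha, const Vec& x, Vec& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// One FGMRES cycle of at most m iterations; returns the residual norm estimate.
double Fgmres(const Op& matVec, const Op& precond, const Vec& b, Vec& x,
              int m, double tol) {
  const std::size_t n = b.size();
  Vec r(n);
  matVec(x, r);  // SpMV (cuSPARSE in the GPU path)
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];
  const double beta = std::sqrt(Dot(r, r));
  if (beta <= tol) return beta;

  std::vector<Vec> V(m + 1, Vec(n, 0.0)), Z(m, Vec(n, 0.0));
  std::vector<Vec> H(m + 1, Vec(m, 0.0));  // Hessenberg matrix, H[row][col]
  Vec cs(m, 0.0), sn(m, 0.0), g(m + 1, 0.0);
  for (std::size_t i = 0; i < n; ++i) V[0][i] = r[i] / beta;
  g[0] = beta;

  int k = 0;
  for (; k < m && std::abs(g[k]) > tol; ++k) {  // host-driven control flow
    precond(V[k], Z[k]);  // z_k = M^{-1} v_k, stored because M may vary
    Vec w(n);
    matVec(Z[k], w);      // w = A z_k
    for (int i = 0; i <= k; ++i) {  // modified Gram-Schmidt
      H[i][k] = Dot(w, V[i]);
      Axpy(-H[i][k], V[i], w);
    }
    H[k + 1][k] = std::sqrt(Dot(w, w));
    if (H[k + 1][k] > 0.0)
      for (std::size_t i = 0; i < n; ++i) V[k + 1][i] = w[i] / H[k + 1][k];
    for (int i = 0; i < k; ++i) {  // apply earlier Givens rotations
      const double t = cs[i] * H[i][k] + sn[i] * H[i + 1][k];
      H[i + 1][k] = -sn[i] * H[i][k] + cs[i] * H[i + 1][k];
      H[i][k] = t;
    }
    const double d = std::hypot(H[k][k], H[k + 1][k]);  // assumed nonzero
    cs[k] = H[k][k] / d;
    sn[k] = H[k + 1][k] / d;
    H[k][k] = d;
    H[k + 1][k] = 0.0;
    g[k + 1] = -sn[k] * g[k];
    g[k] = cs[k] * g[k];
  }

  Vec y(k, 0.0);  // back substitution on the k-by-k triangular system
  for (int i = k - 1; i >= 0; --i) {
    double s = g[i];
    for (int j = i + 1; j < k; ++j) s -= H[i][j] * y[j];
    y[i] = s / H[i][i];
  }
  for (int j = 0; j < k; ++j) Axpy(y[j], Z[j], x);  // x += Z y
  return std::abs(g[k]);
}
```

The sketch keeps the small Hessenberg and rotation bookkeeping on the host and only calls out for SpMV, dot products, and vector updates, which is the kind of host-driven control flow this PR leaves in place.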

This PR does not attempt to add more GPU Krylov solvers or more advanced GPU preconditioners, remove the current host-driven Krylov control flow, or perform broader cache / portability / cleanup work beyond this minimal slice.

Related Work

This PR follows the review direction discussed in #2822, where the request was to show a working end-to-end GPU linear solve path before splitting out additional infrastructure work.

It also follows the implementation preferences discussed in #2816:

  • cuSPARSE for SpMV
  • cuBLAS for dot / norm
  • custom CUDA kernels for vector-vector operations

Suggested review order:

  1. 53bacf193f Cache CUDA SpMV cuSPARSE resources
  2. 08fde80e1e Add CUDA FGMRES and Jacobi scaffolding
  3. fde2c145cf Implement CUDA vector primitives
  4. 2b4f9d8716 Implement CUDA FGMRES solve path
  5. 9c344ee793 Implement CUDA Jacobi preconditioner

Validation

Validated locally with:

  • python3.12 -m pre_commit run --all-files
  • serial CUDA build compilation
  • mixed-precision CUDA build compilation
  • serial CPU build compilation
  • OpenMP CPU build compilation
  • CPU/GPU numerical comparison on 6 representative cases, each tested with LINEAR_SOLVER_PREC=NONE and LINEAR_SOLVER_PREC=JACOBI
  • nsys profiling
  • ncu profiling

Representative cases used for validation:

  • periodic2d_sector
  • udf_lam_flatplate_s
  • udf_lam_flatplate_m
  • udf_lam_flatplate_l
  • udf_test_11_probes_s
  • udf_test_11_probes_m

In short: this branch compiles, the end-to-end CUDA FGMRES path runs successfully on the tested cases, and the GPU-side results are numerically consistent with the CPU-side results. Across the tested cases, the CPU and GPU residual histories either match exactly or differ only at floating-point roundoff level.
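A comparison of this kind can be expressed as a small check on two residual histories. The following is a hedged sketch, not the script used for the validation above: `MaxRelDiff` and `HistoriesMatch` are hypothetical helpers, and the tolerance is an assumption chosen by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest relative difference between two residual histories of equal length.
double MaxRelDiff(const std::vector<double>& cpu, const std::vector<double>& gpu) {
  assert(cpu.size() == gpu.size());
  const double tiny = std::numeric_limits<double>::min();  // avoids 0/0
  double worst = 0.0;
  for (std::size_t i = 0; i < cpu.size(); ++i) {
    const double scale = std::max({std::abs(cpu[i]), std::abs(gpu[i]), tiny});
    worst = std::max(worst, std::abs(cpu[i] - gpu[i]) / scale);
  }
  return worst;
}

// True when every entry agrees to within the caller-supplied relative tolerance.
bool HistoriesMatch(const std::vector<double>& cpu, const std::vector<double>& gpu,
                    double relTol) {
  return MaxRelDiff(cpu, gpu) <= relTol;
}
```

A relative measure is used here so that late iterations with very small residuals are held to the same standard as early ones.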

Performance was also checked on the same representative cases against both a serial CPU build and a 20-thread OpenMP CPU build. The GPU path is faster than the serial CPU baseline on the medium and large cases. Against the 20-thread OpenMP CPU baseline, it is not beneficial on the smallest cases but still shows a clear speedup on the medium and large cases.

The simple Jacobi path is numerically valid, but is not yet a net performance win on these cases.
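As a reminder of what a Jacobi preconditioner computes, here is a minimal host-side sketch. It assumes a scalar CSR matrix for brevity, whereas this PR works on top of a BSR SpMV path, so a block version would invert each diagonal block instead of a single entry. `CsrMatrix`, `BuildInverseDiagonal`, and `ApplyJacobi` are hypothetical names, not identifiers from this branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical; the PR itself uses BSR storage).
struct CsrMatrix {
  std::vector<int> rowPtr;  // size n + 1
  std::vector<int> colIdx;  // column index of each stored entry
  std::vector<double> val;  // value of each stored entry
};

// Setup step: store 1 / a_ii for every row. Done once per matrix.
std::vector<double> BuildInverseDiagonal(const CsrMatrix& A) {
  assert(!A.rowPtr.empty());
  const std::size_t n = A.rowPtr.size() - 1;
  std::vector<double> invDiag(n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      if (A.colIdx[k] == static_cast<int>(i)) {
        assert(A.val[k] != 0.0);  // Jacobi needs a nonzero diagonal
        invDiag[i] = 1.0 / A.val[k];
      }
    }
  }
  return invDiag;
}

// Apply step: z = D^{-1} r, one independent multiply per entry.
void ApplyJacobi(const std::vector<double>& invDiag, const std::vector<double>& r,
                 std::vector<double>& z) {
  assert(invDiag.size() == r.size());
  z.resize(r.size());
  for (std::size_t i = 0; i < r.size(); ++i) z[i] = invDiag[i] * r[i];
}
```

Each entry of the apply step is independent of the others, which is why this preconditioner maps naturally onto a simple elementwise GPU kernel.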

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.
