Add initial end-to-end CUDA FGMRES solver path #2825
Draft
LwhJesse wants to merge 5 commits into
Proposed Changes
This PR adds an initial end-to-end CUDA FGMRES linear solve path on top of the existing CUDA BSR SpMV path.
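For readers less familiar with the block storage this path builds on, a CPU reference of a BSR matrix-vector product can be sketched as below. This is only an illustration: the `BsrMatrix` struct, its field names, and the `BsrSpMV` function are hypothetical and are not taken from this PR or from the project's matrix classes. It assumes square blocks stored row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical BSR container for illustration (not the project's class).
struct BsrMatrix {
  std::size_t blockRows;            // number of block rows
  std::size_t blockSize;            // each block is blockSize x blockSize
  std::vector<std::size_t> rowPtr;  // size blockRows + 1
  std::vector<std::size_t> colInd;  // block-column index of each stored block
  std::vector<double> values;       // row-major blocks, stored back to back
};

// CPU reference for y = A * x on a BSR matrix.
std::vector<double> BsrSpMV(const BsrMatrix& A, const std::vector<double>& x) {
  const std::size_t b = A.blockSize;
  assert(A.rowPtr.size() == A.blockRows + 1);
  assert(A.values.size() == A.colInd.size() * b * b);
  std::vector<double> y(A.blockRows * b, 0.0);
  for (std::size_t i = 0; i < A.blockRows; ++i) {
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const double* blk = &A.values[k * b * b];  // start of the k-th block
      const std::size_t j = A.colInd[k];         // block column of that block
      for (std::size_t r = 0; r < b; ++r)
        for (std::size_t c = 0; c < b; ++c)
          y[i * b + r] += blk[r * b + c] * x[j * b + c];
    }
  }
  return y;
}
```

A host-side reference of this kind is one way to cross-check a GPU SpMV result on small inputs.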
It intentionally bundles the minimal pieces required for a reviewable GPU linear-solve slice, rather than sending the intermediate infrastructure-only pieces separately. The scope is limited to one GPU Krylov solver path (FGMRES), one simple GPU preconditioner path (JACOBI), and the vector operations and dispatch/lifecycle changes strictly required to make that path run.
Concretely, this PR uses:
- cuSPARSE for SpMV
- cuBLAS for dot/norm
This PR does not attempt to add more GPU Krylov solvers or more advanced GPU preconditioners, to remove the current host-driven Krylov control flow, or to perform broader cache / portability / cleanup work beyond this minimal slice.
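To make the shape of a host-driven FGMRES loop with a Jacobi preconditioner concrete, here is a minimal CPU-only sketch of a single cycle without restart. It is not the implementation in this PR: the names `Vec`, `Op`, `Dot`, `MakeJacobi`, and `FgmresCycle` are hypothetical, the operator and preconditioner are passed as plain callbacks, and a scalar (point) Jacobi diagonal is assumed. In a GPU path the callbacks and the dot/norm calls would presumably be the device-side pieces, while the control flow below stays on the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op = std::function<Vec(const Vec&)>;  // y = Op(x); illustrative type only

double Dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Point-Jacobi preconditioner: z = D^{-1} v, assuming a nonzero diagonal.
Op MakeJacobi(const Vec& diag) {
  return [diag](const Vec& v) {
    Vec z(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) z[i] = v[i] / diag[i];
    return z;
  };
}

// One FGMRES cycle of at most m iterations (no restart). Updates x in place
// and returns the final residual-norm estimate.
double FgmresCycle(const Op& A, const Op& precond, const Vec& b, Vec& x,
                   std::size_t m, double tol) {
  assert(x.size() == b.size());
  const std::size_t n = b.size();
  Vec r = A(x);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];
  const double beta = std::sqrt(Dot(r, r));
  if (beta <= tol || m == 0) return beta;

  for (double& v : r) v /= beta;
  std::vector<Vec> V(1, r);  // orthonormal Krylov basis
  std::vector<Vec> Z;        // preconditioned directions, kept for the update
  std::vector<Vec> H;        // H[j]: column j of the rotated Hessenberg matrix
  Vec cs, sn, g(1, beta);    // Givens rotations and rotated right-hand side

  std::size_t cols = 0;
  while (cols < m) {
    const std::size_t k = cols;
    Z.push_back(precond(V[k]));
    Vec w = A(Z[k]);
    Vec h(k + 2, 0.0);
    for (std::size_t i = 0; i <= k; ++i) {  // modified Gram-Schmidt
      h[i] = Dot(w, V[i]);
      for (std::size_t l = 0; l < n; ++l) w[l] -= h[i] * V[i][l];
    }
    const double hNext = std::sqrt(Dot(w, w));
    h[k + 1] = hNext;
    for (std::size_t i = 0; i < k; ++i) {  // apply earlier rotations
      const double t = cs[i] * h[i] + sn[i] * h[i + 1];
      h[i + 1] = -sn[i] * h[i] + cs[i] * h[i + 1];
      h[i] = t;
    }
    const double d = std::hypot(h[k], h[k + 1]);
    if (d == 0.0) break;  // breakdown: keep the columns built so far
    cs.push_back(h[k] / d);
    sn.push_back(h[k + 1] / d);
    h[k] = d;                    // new rotation zeroes the subdiagonal entry
    g.push_back(-sn[k] * g[k]);  // |g[k + 1]| is the new residual estimate
    g[k] *= cs[k];
    H.push_back(h);
    ++cols;
    if (std::fabs(g[cols]) <= tol || hNext == 0.0) break;
    for (double& v : w) v /= hNext;
    V.push_back(w);
  }

  Vec y(cols, 0.0);
  for (std::size_t i = cols; i-- > 0;) {  // back substitution
    double s = g[i];
    for (std::size_t j = i + 1; j < cols; ++j) s -= H[j][i] * y[j];
    y[i] = s / H[i][i];
  }
  for (std::size_t j = 0; j < cols; ++j)  // x += Z * y (flexible update)
    for (std::size_t l = 0; l < n; ++l) x[l] += y[j] * Z[j][l];
  return std::fabs(g[cols]);
}
```

The sketch stores the preconditioned vectors `Z` separately from the basis `V`, which is what distinguishes the flexible variant from standard right-preconditioned GMRES. With a fixed preconditioner such as Jacobi the two coincide, but keeping `Z` leaves room for a preconditioner that varies between iterations.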
Related Work
This PR follows the review direction discussed in #2822, where the request was to show a working end-to-end GPU linear solve path before splitting out additional infrastructure work.
It also follows the implementation preferences discussed in #2816:
- cuSPARSE for SpMV
- cuBLAS for dot/norm
Suggested review order:
1. 53bacf193f Cache CUDA SpMV cuSPARSE resources
2. 08fde80e1e Add CUDA FGMRES and Jacobi scaffolding
3. fde2c145cf Implement CUDA vector primitives
4. 2b4f9d8716 Implement CUDA FGMRES solve path
5. 9c344ee793 Implement CUDA Jacobi preconditioner
Validation
Validated locally with:
- python3.12 -m pre_commit run --all-files
- LINEAR_SOLVER_PREC=NONE and LINEAR_SOLVER_PREC=JACOBI
- nsys profiling
- ncu profiling
Representative cases used for validation:
- periodic2d_sector
- udf_lam_flatplate_s
- udf_lam_flatplate_m
- udf_lam_flatplate_l
- udf_test_11_probes_s
- udf_test_11_probes_m
In short: this branch compiles, the end-to-end CUDA FGMRES path runs successfully on the tested cases, and the GPU-side results are numerically consistent with the CPU-side results. Across the tested cases, the CPU and GPU residual histories either match exactly or differ only at floating-point roundoff level.
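One way a "match exactly or differ only at roundoff level" check on residual histories could be automated is sketched below. The helper names `MaxRelDiff` and `HistoriesMatch` are hypothetical, and the relative tolerance is an assumption left to the caller. The PR does not state which tolerance or comparison method was actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative difference between two residual histories,
// or -1 if the histories have different lengths.
double MaxRelDiff(const std::vector<double>& cpu, const std::vector<double>& gpu) {
  if (cpu.size() != gpu.size()) return -1.0;
  double worst = 0.0;
  for (std::size_t i = 0; i < cpu.size(); ++i) {
    const double scale = std::max(std::fabs(cpu[i]), std::fabs(gpu[i]));
    if (scale > 0.0) worst = std::max(worst, std::fabs(cpu[i] - gpu[i]) / scale);
  }
  return worst;
}

// True when both histories have the same length and agree to within relTol.
bool HistoriesMatch(const std::vector<double>& cpu, const std::vector<double>& gpu,
                    double relTol) {
  assert(relTol >= 0.0);
  const double d = MaxRelDiff(cpu, gpu);
  return d >= 0.0 && d <= relTol;
}
```

A length mismatch is reported as a failure, since a different iteration count is itself a sign that the two solves diverged.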
Performance was also checked on the same representative cases against both a serial CPU build and a 20-thread OpenMP CPU build. The GPU path is faster than the serial CPU baseline on the medium and large cases tested here. Against the 20-thread OpenMP CPU baseline, it is not beneficial on the smallest cases, but it still shows a clear speedup on those same medium and large cases.
The simple Jacobi path is numerically valid, but is not yet a net performance win on these cases.
PR Checklist
pre-commit run --all to format old commits.