English | 简体中文
This repository is a CUDA SGEMM case study presented as a technical whitepaper and kernel academy. It starts from readable FP32 baselines, climbs through tiled, bank-conflict-aware, double-buffered, and guarded Tensor Core WMMA paths, and frames every performance claim with explicit validation boundaries.
- Readable optimization ladder: every kernel stage exists to expose one bottleneck shift.
- Evidence-first public story: correctness policy, benchmark scope, and local-versus-CI trust boundaries stay attached to every claim.
- Interview-grade positioning: the Pages site is written so the project can be explained, defended, and audited under technical pressure.
- Bilingual mirrored docs: English and Chinese routes stay structurally aligned across the full public site.
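
To make the first rung of the ladder concrete, here is a minimal host-side C++ sketch of the step from a naive baseline to a tiled loop nest. It is a CPU analogue written for illustration, not the repository's kernels: the function names `sgemm_naive` and `sgemm_tiled`, the row-major layout, and the `tile` parameter are all assumptions. On a GPU the tile would typically be staged in shared memory. Here the tiling only changes the order in which memory is visited, so the arithmetic per output element stays the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical baseline: C = A * B with row-major A (M x K), B (K x N), C (M x N).
void sgemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// Hypothetical tiled variant: the same products are accumulated in the same
// k order for each output element, but work is grouped into tile x tile
// blocks so that each block of A and B is reused while it is still "hot".
void sgemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t N, std::size_t K,
                 std::size_t tile) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        for (std::size_t j0 = 0; j0 < N; j0 += tile) {
            for (std::size_t k0 = 0; k0 < K; k0 += tile) {
                // Clamp tile edges so sizes need not be multiples of `tile`.
                const std::size_t i1 = std::min(i0 + tile, M);
                const std::size_t j1 = std::min(j0 + tile, N);
                const std::size_t k1 = std::min(k0 + tile, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t j = j0; j < j1; ++j) {
                        float acc = C[i * N + j];  // partial sum from earlier k tiles
                        for (std::size_t k = k0; k < k1; ++k) {
                            acc += A[i * K + k] * B[k * N + j];
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
}
```

Calling both functions on the same inputs and comparing the outputs is the smallest form of the "one bottleneck shift per stage" idea: the result must not change between stages, only the memory access pattern.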
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

Runtime tests and benchmarks require a local CUDA-capable machine. Hosted CI covers repository integrity, documentation, OpenSpec validation, and Pages buildability.
The README is the executive summary. The long-form technical narrative lives on Pages.
| Goal | Entry point |
|---|---|
| Open English home | English Home |
| Open Chinese home | 中文首页 |
| Get oriented quickly | Project Guide |
| Inspect system structure | Architecture |
| Study the kernel ladder | Academy |
| Check what the evidence proves | Validation |
| Trace papers and related repos | Research Desk |
| Read normative repository requirements | OpenSpec Specs |

| Environment | What it can prove |
|---|---|
| Hosted CI | Docs structure, route integrity, OpenSpec consistency, Pages buildability |
| Local CUDA GPU | Runtime correctness, fallback behavior, benchmark performance |
This split is deliberate. CI keeps the repository coherent, but only local GPU execution can validate runtime behavior and speed claims.
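
As an illustration of what "runtime correctness" means in practice, the sketch below compares a kernel's output against a trusted reference using a combined absolute and relative tolerance. It is host-side C++ under stated assumptions: the names `VerifyResult` and `verify_against_reference` are hypothetical, and the tolerance values are left to the caller. The repository's actual correctness policy is the one documented under Validation. Reduced-precision paths such as Tensor Core WMMA with FP16 inputs generally need looser tolerances than pure FP32 kernels.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result record for a reference comparison.
struct VerifyResult {
    bool pass;                // true if every element is within tolerance
    float max_abs_err;        // largest absolute error observed
    std::size_t worst_index;  // index at which max_abs_err occurred
};

// Element-wise check: |got - ref| <= atol + rtol * |ref|.
// `ref` would come from a trusted implementation (for example a vendor
// library run on the same inputs); here it is simply a host-side vector.
VerifyResult verify_against_reference(const std::vector<float>& got,
                                      const std::vector<float>& ref,
                                      float rtol, float atol) {
    VerifyResult r{got.size() == ref.size(), 0.0f, 0};
    if (!r.pass) {
        return r;  // a size mismatch is an immediate failure
    }
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float err = std::fabs(got[i] - ref[i]);
        if (err > r.max_abs_err) {
            r.max_abs_err = err;
            r.worst_index = i;
        }
        // Written as !(err <= bound) so that a NaN error also fails,
        // since every ordered comparison with NaN is false.
        if (!(err <= atol + rtol * std::fabs(ref[i]))) {
            r.pass = false;
        }
    }
    return r;
}
```

A check like this can only run where the kernel itself runs, which is why it belongs to the local CUDA GPU row of the table and not to hosted CI.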
| Path | Contents |
|---|---|
| `src/kernels/` | CUDA SGEMM implementations |
| `src/utils/` | CUDA RAII, verification, benchmark helpers |
| `src/main.cu` | Benchmark CLI |
| `tests/` | Google Test coverage against cuBLAS |
| `docs/` | VitePress whitepaper and academy, mirrored under `/en` and `/zh` |
| `openspec/` | Stable specs and change workflow |

MIT. See LICENSE.md.