SGEMM Optimization

This repository is a CUDA SGEMM case study presented as a technical whitepaper and kernel academy. It starts from readable FP32 baselines, climbs through tiled, bank-conflict-aware, and double-buffered kernels to a guarded Tensor Core WMMA path, and frames every performance claim with explicit validation boundaries.
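
To make the first rung of the ladder concrete, the sketch below contrasts a naive triple loop with a tiled loop on the host. It is not code from this repository: `sgemm_naive`, `sgemm_tiled`, and the `tile` parameter are hypothetical names, and the CPU cache stands in for the shared memory a CUDA tiled kernel would use. The only point is the loop restructuring, which lets each tile of A and B be reused many times once it has been loaded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side illustration; not the repository's kernels.
// Naive row-major SGEMM: C = A * B with A (M x K), B (K x N), C (M x N).
void sgemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// Tiled variant: the iteration space is cut into tile x tile x tile blocks so
// that a small working set of A and B is reused before moving on.
// Edge tiles are clamped with std::min, so sizes need not divide evenly.
void sgemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t N, std::size_t K,
                 std::size_t tile = 32) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i_end = std::min(i0 + tile, M);
        for (std::size_t j0 = 0; j0 < N; j0 += tile) {
            const std::size_t j_end = std::min(j0 + tile, N);
            for (std::size_t k0 = 0; k0 < K; k0 += tile) {
                const std::size_t k_end = std::min(k0 + tile, K);
                // Accumulate the partial product of one A tile and one B tile.
                for (std::size_t i = i0; i < i_end; ++i) {
                    for (std::size_t k = k0; k < k_end; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j_end; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both variants accumulate along k in the same order for every output element, so they are expected to agree for any tile size, including sizes that do not divide M, N, or K.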

Why it stands out

  • Readable optimization ladder: every kernel stage exists to expose one bottleneck shift.
  • Evidence-first public story: correctness policy, benchmark scope, and local-versus-CI trust boundaries stay attached to every claim.
  • Interview-grade positioning: the Pages site is written so the project can be explained, defended, and audited under technical pressure.
  • Bilingual mirrored docs: English and Chinese routes stay structurally aligned across the full public site.

Quick start

git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

Runtime tests and benchmarks require a local CUDA-capable machine. Hosted CI covers repository integrity, documentation, OpenSpec validation, and Pages buildability.

GitHub Pages entry points

The README is the executive summary. The long-form technical narrative lives on Pages.

Goal                                     Entry point
Open English home                        English Home
Open Chinese home                        中文首页 (Chinese Home)
Get oriented quickly                     Project Guide
Inspect system structure                 Architecture
Study the kernel ladder                  Academy
Check what the evidence proves           Validation
Trace papers and related repos           Research Desk
Read normative repository requirements   OpenSpec Specs

Validation boundary

Environment      What it can prove
Hosted CI        Docs structure, route integrity, OpenSpec consistency, Pages buildability
Local CUDA GPU   Runtime correctness, fallback behavior, benchmark performance

This split is deliberate. CI keeps the repository coherent, but only local GPU execution can validate runtime behavior and speed claims.
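
One way to picture the guarded Tensor Core path and its fallback is as a dispatch predicate evaluated on the host before launch. The sketch below is an illustration under stated assumptions, not the repository's actual guard: `KernelPath`, `DeviceInfo`, and `choose_path` are hypothetical names, and the rule that every dimension must be a multiple of 16 is assumed here for simplicity (a real implementation might pad instead). What is not assumed is that the CUDA WMMA API requires compute capability 7.0 or higher and offers a 16x16x16 half-precision fragment shape.

```cpp
#include <cassert>

// Hypothetical names throughout; not taken from the repository.
enum class KernelPath { TensorCoreWmma, Fp32Fallback };

// Compute capability as reported by the CUDA runtime (e.g. 8.6 -> {8, 6}).
struct DeviceInfo {
    int cc_major;
    int cc_minor;
};

// Pick the Tensor Core path only when both the hardware and the problem
// shape allow it; otherwise fall back to a plain FP32 kernel.
KernelPath choose_path(const DeviceInfo& dev, int m, int n, int k) {
    assert(m > 0 && n > 0 && k > 0);
    constexpr int kWmmaTile = 16;  // 16x16x16 WMMA fragment shape

    // WMMA is available from compute capability 7.0 (Volta) onward.
    const bool has_tensor_cores = dev.cc_major >= 7;

    // Assumption for this sketch: unaligned shapes are rejected, not padded.
    const bool aligned = (m % kWmmaTile == 0) &&
                         (n % kWmmaTile == 0) &&
                         (k % kWmmaTile == 0);

    return (has_tensor_cores && aligned) ? KernelPath::TensorCoreWmma
                                         : KernelPath::Fp32Fallback;
}
```

A predicate like this can be unit-tested without a GPU, but whether the chosen kernel then produces correct results at the claimed speed is exactly what only a local CUDA run can show.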

Source map

src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
docs/          VitePress whitepaper and academy, mirrored under /en and /zh
openspec/      stable specs and change workflow
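
Comparing a kernel's output against a reference such as cuBLAS typically comes down to an element-wise tolerance check. The following is a minimal host-side sketch of such a check, assuming a combined relative and absolute tolerance. `all_close` and its parameters are hypothetical and are not the repository's verification helpers, and the tolerance values in the usage note are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper; not the repository's verification code.
// Returns true when every element satisfies
//     |got - ref| <= atol + rtol * |ref|
// The negated comparison also rejects NaN in either input, because any
// comparison involving NaN evaluates to false.
bool all_close(const std::vector<float>& got, const std::vector<float>& ref,
               float rtol, float atol) {
    assert(got.size() == ref.size());
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float diff = std::fabs(got[i] - ref[i]);
        const float bound = atol + rtol * std::fabs(ref[i]);
        if (!(diff <= bound)) {
            return false;
        }
    }
    return true;
}
```

A caller might pass a tight tolerance for FP32 kernels, for example `all_close(c_kernel, c_ref, 1e-4f, 1e-6f)`, and a looser one for a Tensor Core path whose inputs are rounded to half precision.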

License

MIT. See LICENSE.md.