
perf: speed up SVD (~4.7x) and symmetric EVD (~1.7x) via transposed internal storage #167

Open: lpatiny wants to merge 9 commits into main from speed


@lpatiny lpatiny commented Sep 29, 2023

Member

Summary

The decomposition algorithms iterate their hot inner loops down columns (the row index varies), but Matrix stores data row-major (data[row][col]). Those column walks thrash the CPU cache. This PR makes the hot loops scan memory sequentially by storing the worked-on matrices transposed internally and transposing them back before returning — so the public API and the numerical results are unchanged.

This supersedes the original exploratory commit (which globally swapped get/set). A global swap is the wrong fix: it speeds SVD/EVD but makes LU ~1.6× slower, and breaks non-square matrices and every consumer that assumes row-major data. The win is captured per-algorithm instead.
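
To illustrate the idea on plain arrays: the sketch below computes column norms once by walking down the columns of a row-major `data[row][col]` array, and once by scanning the rows of a transposed copy. The function names and the column-norm workload are assumptions chosen for illustration; they are not taken from `src/dc/svd.js`.

```javascript
// Row-major storage: data[row][col]. Walking down a column touches a
// different inner array on every step (strided access).
function columnNormsRowMajor(data) {
  const rows = data.length;
  const cols = data[0].length;
  const norms = new Array(cols).fill(0);
  for (let j = 0; j < cols; j++) {
    let s = 0;
    for (let i = 0; i < rows; i++) {
      s += data[i][j] * data[i][j];
    }
    norms[j] = Math.sqrt(s);
  }
  return norms;
}

// Allocating transpose: t[col][row] = data[row][col].
function transpose(data) {
  const rows = data.length;
  const cols = data[0].length;
  const t = [];
  for (let j = 0; j < cols; j++) {
    const row = new Array(rows);
    for (let i = 0; i < rows; i++) row[i] = data[i][j];
    t.push(row);
  }
  return t;
}

// Same computation on the transposed copy: each logical column is now one
// contiguous inner array, so the inner loop scans memory sequentially.
function columnNormsTransposed(at) {
  const norms = new Array(at.length);
  for (let j = 0; j < at.length; j++) {
    const col = at[j];
    let s = 0;
    for (let i = 0; i < col.length; i++) s += col[i] * col[i];
    norms[j] = Math.sqrt(s);
  }
  return norms;
}

const a = [
  [1, 2, 3],
  [4, 5, 6],
];
const viaColumns = columnNormsRowMajor(a);
const viaRows = columnNormsTransposed(transpose(a));
```

Both variants add the same terms in the same order, so `viaColumns` and `viaRows` hold identical doubles; only the memory access pattern differs.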

What was done

  • SVD (src/dc/svd.js): work on at = value.transpose() and accumulate the U/V singular vectors in transposed storage, then transpose back. The Householder and QR-rotation inner loops become sequential.
  • Symmetric EVD (src/dc/evd.js, tred2/tql2): accumulate the eigenvectors V transposed, then transpose back.
  • Non-symmetric EVD (orthes/hqr2): left row-major on purpose. Its two O(n³) phases (the QR sweep vs. the eigenvector back-transform Σ V(i,k)·H(k,j)) have opposite layout preferences, so a single static transpose cancels out — not worth the large, fragile rewrite of the complex-eigenvector code.
  • LU: untouched. Its only O(n³) loop is already a sequential row scan, so it is cache-optimal as-is.
  • Tests: the EVD tests previously consisted of a single 2×2 example. Added a reconstruction oracle (A·V = V·D, eigenvector orthonormality, complex eigenvalue pairs) for symmetric (4×4, 12×12) and non-symmetric matrices.
  • Benchmark: scripts/benchmark.js is now deterministic (seeded inputs), warmed up, and reports both EVD paths.
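
For readers unfamiliar with the reconstruction oracle mentioned in the Tests item, here is a minimal sketch on plain arrays. The helper names (`multiply`, `checkEigenReconstruction`, etc.) and the tolerance are hypothetical and do not reflect the actual test code in this PR; the orthonormality check only applies to the symmetric case.

```javascript
// Dense matrix product on arrays of arrays.
function multiply(a, b) {
  const out = [];
  for (let i = 0; i < a.length; i++) {
    const row = new Array(b[0].length).fill(0);
    for (let p = 0; p < b.length; p++) {
      for (let j = 0; j < b[0].length; j++) row[j] += a[i][p] * b[p][j];
    }
    out.push(row);
  }
  return out;
}

function transposeOf(a) {
  return a[0].map((_, j) => a.map((row) => row[j]));
}

// Largest absolute element-wise difference between two same-shaped matrices.
function maxAbsDiff(a, b) {
  let max = 0;
  for (let i = 0; i < a.length; i++) {
    for (let j = 0; j < a[0].length; j++) {
      max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
    }
  }
  return max;
}

// Oracle for a real symmetric matrix A with eigenvectors in the columns of V:
// A·V must equal V·D, and V must be orthonormal (Vᵀ·V = I).
function checkEigenReconstruction(A, V, eigenvalues, tol = 1e-10) {
  const n = eigenvalues.length;
  const D = eigenvalues.map((lambda, i) =>
    eigenvalues.map((_, j) => (i === j ? lambda : 0)),
  );
  const identity = D.map((_, i) => D.map((__, j) => (i === j ? 1 : 0)));
  const residual = maxAbsDiff(multiply(A, V), multiply(V, D));
  const orthonormality = maxAbsDiff(multiply(transposeOf(V), V), identity);
  return {
    residual,
    orthonormality,
    ok: n === A.length && residual <= tol && orthonormality <= tol,
  };
}

// Known case: [[2, 1], [1, 2]] has eigenvalues 1 and 3.
const s = Math.SQRT1_2;
const A = [
  [2, 1],
  [1, 2],
];
const V = [
  [s, s],
  [-s, s],
];
const report = checkEigenReconstruction(A, V, [1, 3]);
```

An oracle of this kind does not depend on the sign or ordering conventions of the eigenvectors, which makes it a reasonable guard for a storage-layout refactor.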

Speed (1000×1000 unless noted, deterministic, warmed up)

| Decomposition | before | after | speedup |
| --- | --- | --- | --- |
| SVD | 17764 ms | 3804 ms | ~4.7× |
| EVD, symmetric (600×600) | 666 ms | 386 ms | ~1.7× |
| EVD, symmetric (300×300) | 78 ms | 51 ms | ~1.5× |
| LU | 232 ms | 238 ms | unchanged (intentional) |
| EVD, non-symmetric | left row-major (see above) | | |

(Numbers from an M-series laptop; ratios are what matter.)
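
A "deterministic, warmed up" measurement of this kind can be sketched as follows. This is not the content of scripts/benchmark.js: the mulberry32 generator stands in for whichever seeded PRNG is used, and `randomMatrix`, `timeIt`, the warm-up count, and reporting the median are all assumptions made for illustration.

```javascript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function random() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fill a rows x cols array-of-arrays from the given generator, so the same
// seed always produces the same input.
function randomMatrix(rows, cols, random) {
  const data = [];
  for (let i = 0; i < rows; i++) {
    const row = new Array(cols);
    for (let j = 0; j < cols; j++) row[j] = random();
    data.push(row);
  }
  return data;
}

// Run fn a few times untimed so the JIT can settle, then return the median
// of the timed runs in milliseconds.
function timeIt(fn, { warmup = 2, runs = 5 } = {}) {
  for (let i = 0; i < warmup; i++) fn();
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  times.sort((x, y) => x - y);
  return times[Math.floor(times.length / 2)];
}

const m1 = randomMatrix(3, 3, mulberry32(42));
const m2 = randomMatrix(3, 3, mulberry32(42));
const elapsedMs = timeIt(() => randomMatrix(50, 50, mulberry32(1)));
```

Seeding makes before/after runs operate on identical inputs, which is also what allows the outputs of two branches to be compared byte for byte.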

Results are bit-identical, not just close

Because only storage location changes and never the order of arithmetic, every IEEE-754 operation sees the same operands in the same sequence. Verified by dumping full outputs (singular/eigen values + complete vector matrices, full float precision) from this branch and from main, then byte-comparing with cmp:

  • SVD: byte-for-byte identical across square / tall / wide shapes × autoTranspose on/off.
  • EVD: byte-for-byte identical across n = 3, 5, 10, 25, 40, 80 × {symmetric, symmetric auto-detected, general}.

Zero differing bytes in both.
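
The distinction between changing the storage location and changing the order of arithmetic can be shown on a tiny example. The sketch below is illustrative only (the `sameBytes` helper is hypothetical and stands in for the `cmp` comparison of dump files): summing the same values in the same order from two layouts gives identical bytes, while summing them in reverse order does not.

```javascript
// Byte-level equality of two arrays of doubles.
function sameBytes(x, y) {
  const bx = new Uint8Array(Float64Array.from(x).buffer);
  const by = new Uint8Array(Float64Array.from(y).buffer);
  if (bx.length !== by.length) return false;
  for (let i = 0; i < bx.length; i++) {
    if (bx[i] !== by[i]) return false;
  }
  return true;
}

const values = [0.1, 0.2, 0.3];

// Layout A: the values form a column of a row-major matrix.
const asColumn = values.map((v) => [v]);
// Layout B: the same values form a single row (the transposed layout).
const asRow = [values.slice()];

// Same operands, same order, different storage location.
let sumColumn = 0;
for (let i = 0; i < asColumn.length; i++) sumColumn += asColumn[i][0];
let sumRow = 0;
for (let i = 0; i < asRow[0].length; i++) sumRow += asRow[0][i];

// Same operands, different order of arithmetic.
let sumReversed = 0;
for (let i = values.length - 1; i >= 0; i--) sumReversed += values[i];

const layoutOnlyIdentical = sameBytes([sumColumn], [sumRow]); // true
const reorderedIdentical = sameBytes([sumColumn], [sumReversed]); // false
```

Floating-point addition is not associative, so a refactor that reordered the accumulation could change the last bits of the result; a pure layout change cannot.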

Follow-up

The non-symmetric EVD path could still be sped up (~1.5×) with a phase-split transpose (transpose V between the QR and back-transform phases). Left out of this PR to keep it low-risk; the new test oracle is in place to support it later.


codecov Bot commented Sep 29, 2023


Codecov Report

❌ Patch coverage is 64.97175% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.46%. Comparing base (67cda77) to head (93bdb7e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| scripts/benchmark.js | 0.00% | 56 Missing and 1 partial ⚠️ |
| src/dc/svd.js | 95.38% | 3 Missing ⚠️ |
| src/dc/evd.js | 94.28% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   64.83%   68.46%   +3.62%     
==========================================
  Files          47       48       +1     
  Lines        5625     5727     +102     
  Branches      954     1013      +59     
==========================================
+ Hits         3647     3921     +274     
+ Misses       1967     1794     -173     
- Partials       11       12       +1     


lpatiny added 4 commits June 20, 2026 11:04
The SVD hot loops iterate over rows for a fixed column. With the row-major
backing store (data[row][col]) these are column walks that thrash the cache.

Work on the transpose of the input and store the U/V singular vectors
transposed during the computation, then transpose them back before returning.
The inner loops then scan memory sequentially. The public API and results are
unchanged.

Measured (1000x1000, deterministic input): 17764 ms -> 3804 ms (~4.7x).
LU and EVD are unaffected. All tests pass.

Also remove the dead exploratory get/set comments in matrix.js and make
scripts/benchmark.js deterministic (seeded) and warmed up.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…airs)

The eigenvalue decomposition only had a single 2x2 example. Add reconstruction
tests (A·V = V·D) for symmetric (4x4 and 12x12), non-symmetric real-eigenvalue,
and complex-eigenvalue-pair matrices, plus orthonormality of symmetric
eigenvectors. These exercise tred2/tql2 and orthes/hqr2 and guard the upcoming
performance refactor.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tred2/tql2 (the symmetric eigenproblem) accumulate the eigenvectors in V with
the row index varying in the hot loops, i.e. column walks of the row-major
backing store. Store V transposed during the reduction and transpose it back
before returning; the inner loops then scan memory sequentially.

Measured (symmetric, deterministic input): 600x600 666 ms -> 386 ms (~1.7x),
300x300 ~1.5x. Results unchanged (guarded by the new reconstruction tests).

The non-symmetric path (orthes/hqr2) is deliberately left row-major: its two
O(n^3) phases (QR sweep vs eigenvector back-transform) have opposite layout
preferences, so a single transposed storage cannot help both.

Add an "EVD (symmetric)" column to scripts/benchmark.js.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lpatiny lpatiny changed the title from "feat: optimize SVD ?" to "perf: speed up SVD (~4.7x) and symmetric EVD (~1.7x) via transposed internal storage" Jun 20, 2026
@lpatiny lpatiny marked this pull request as ready for review June 20, 2026 09:52
lpatiny added 4 commits June 20, 2026 12:03
The transposed-storage optimization restored the logical layout of the output
matrices with `M.transpose()`, which allocates a second full matrix while the
old one is still live (transient ~1.5x peak memory).

These outputs are square (SVD's V and EVD's V are always n x n; SVD's U is
square whenever the input is), so transpose them in place via a new
`transposeSquareInPlace` helper. No allocation in the common square case, so
the optimization is now memory-neutral versus the original implementation. The
working copy `at = value.transpose()` already replaced the original
`value.clone()`, so it is not an extra allocation.

Results remain bit-identical (verified by byte comparison against main) and the
speedups are unchanged. A non-square SVD U falls back to an allocating transpose.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
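
For reference, an in-place transpose of a square matrix only needs to swap each element above the diagonal with its mirror below it. The sketch below works on a plain array of arrays and is a hypothetical illustration of the technique; it is not the `transposeSquareInPlace` helper added by the commit, whose signature and storage details are not shown here.

```javascript
// Hypothetical sketch: transpose a square array-of-arrays without allocating
// a second matrix. Throws if the input is not square.
function transposeSquareInPlaceSketch(data) {
  const n = data.length;
  for (const row of data) {
    if (row.length !== n) throw new RangeError('matrix must be square');
  }
  for (let i = 0; i < n; i++) {
    // Only visit the strict upper triangle so each pair is swapped once.
    for (let j = i + 1; j < n; j++) {
      const tmp = data[i][j];
      data[i][j] = data[j][i];
      data[j][i] = tmp;
    }
  }
  return data;
}

const m = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
];
const result = transposeSquareInPlaceSketch(m);
```

Because the swap touches n(n-1)/2 pairs and reuses the existing rows, peak memory stays at one matrix, which matches the memory-neutral goal described above.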
Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled mulberry32 PRNG with the ecosystem's ml-xsadd
(XORSHIFT-ADD) generator via `new XSadd(seed).random`, in the EVD reconstruction
tests and scripts/benchmark.js. Add ml-xsadd to devDependencies.

Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lpatiny lpatiny requested review from Copilot and targos and removed request for Copilot June 20, 2026 11:54