summa: 3-d (proc_h) process grid for batched general products #565
Merged
evaleev merged 4 commits into evaleev/feature/general-product-expr on Jun 12, 2026
Infrastructure for a 3-d (proc_h x proc_r x proc_c) batched-Summa grid that distributes the fused/batch (h, slab) dimension of a general product across process planes:
- ProcGrid gains a rank-subset constructor (tagged rank_subset to avoid colliding with the same-arity test-only ctor) that builds a 2-d grid over a contiguous interval [rank_offset, rank_offset + nprocs) of the world's ranks; map_row/map_col and the row/col group factories emit world-correct ranks via the offset. The legacy full-world ctor is unchanged (offset 0).
- SlabbedPmap gains a 3-d variant (proc_h, proc_h_stride): slab h belongs to plane h % proc_h of proc_h_stride contiguous ranks, and the per-slab base map's plane-local owners are offset by the slab's plane. The original 3-argument form (proc_h == 1, slab-replicated) is unchanged.
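The two rank mappings described above can be sketched with plain arithmetic. This is an illustration only, not the TiledArray `ProcGrid` or `SlabbedPmap` API: the helper names `subset_grid_rank` and `slab_tile_owner` are hypothetical, and the row-major layout of the 2-d grid is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the ProcGrid API): world rank of grid cell
// (row, col) in a 2-d grid laid over the contiguous rank interval
// [rank_offset, rank_offset + rows * cols). Row-major layout is assumed.
inline std::size_t subset_grid_rank(std::size_t rank_offset, std::size_t rows,
                                    std::size_t cols, std::size_t row,
                                    std::size_t col) {
  assert(row < rows && col < cols);
  return rank_offset + row * cols + col;
}

// Hypothetical helper (not the SlabbedPmap API): world-rank owner of a tile
// in slab h, given the plane-local owner reported by the per-slab base map.
inline std::size_t slab_tile_owner(std::size_t h, std::size_t proc_h,
                                   std::size_t proc_h_stride,
                                   std::size_t plane_local_owner) {
  assert(proc_h > 0);
  if (proc_h == 1) return plane_local_owner;  // ungrouped case: no shift
  assert(plane_local_owner < proc_h_stride);
  const std::size_t plane = h % proc_h;  // plane that evaluates slab h
  return plane * proc_h_stride + plane_local_owner;  // shift into the plane
}
```

With `proc_h = 2` and `proc_h_stride = 4`, slab 5 falls on plane 1, so a tile whose plane-local owner is 3 lands on world rank 7.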
Distribute the fused/batch (h, slab) dimension of a general product over a third process-grid axis proc_h, recovering parallelism when the result is small (M*N result tiles < P ranks) -- most acutely no-external products (M=N=1, e.g. the PNO-CCSD PPL intermediate), where the 2-d grid otherwise degenerates to a single rank.

The world's first proc_h * proc_h_stride ranks form proc_h h-planes of proc_h_stride = P/proc_h ranks; slab h is evaluated on plane h % proc_h, which runs an ordinary 2-d SUMMA over its own (offset) process grid. Slabs are communication-free (independent), so the surplus of ranks beyond one result-tile-per-rank is spent on this axis.

Summa carries per-plane state (first_slab_, my_slabs_), restricts its slab iteration to the plane (next_step), indexes reduce tasks by plane-local slab ordinal (slab_ord), uses plane-unique dense broadcast keys, and computes the result-tile owner (result_tile_owner) as the within-plane cyclic owner shifted by the plane's world-rank offset -- matching set_tile's pmap-routed destination (a disagreement between the two was a get_tile/set_tile owner mismatch that deadlocked cross-plane result transfers). proc_h == 1 reproduces the 2-d path exactly.

ContEngine::init_distribution_general sizes proc_h by a greedy heuristic (spread ranks beyond min(P, M*N) over the slab axis, bounded by n_slabs) and builds the plane-local grid + 3-d operand/result pmaps. A TODO marks the principled co-optimization of proc_h with the 2-d aspect ratio from the h/left-external/right-external element extents and a memory bound.
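The per-plane bookkeeping described above can be sketched as follows. This is not the Summa code: `plane_slabs`, `slab_ordinal`, and `result_owner` are hypothetical names, the ordinal `h / proc_h` is an assumption that follows from the `h % proc_h` assignment, and the row-major 2-d cyclic owner is likewise assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: slabs evaluated by one plane when slab h is assigned
// to plane h % proc_h. The first entry is the plane index itself.
inline std::vector<std::size_t> plane_slabs(std::size_t plane,
                                            std::size_t proc_h,
                                            std::size_t n_slabs) {
  assert(proc_h > 0 && plane < proc_h);
  std::vector<std::size_t> slabs;
  for (std::size_t h = plane; h < n_slabs; h += proc_h) slabs.push_back(h);
  return slabs;
}

// Hypothetical helper: position of slab h among the slabs of its own plane
// (assumed to be h / proc_h under the cyclic assignment above).
inline std::size_t slab_ordinal(std::size_t h, std::size_t proc_h) {
  assert(proc_h > 0);
  return h / proc_h;
}

// Hypothetical helper: owner of result tile (i, j) as the within-plane 2-d
// cyclic owner (row-major, assumed) shifted by the plane's first world rank.
inline std::size_t result_owner(std::size_t i, std::size_t j,
                                std::size_t proc_rows, std::size_t proc_cols,
                                std::size_t plane_rank_offset) {
  assert(proc_rows > 0 && proc_cols > 0);
  const std::size_t local = (i % proc_rows) * proc_cols + (j % proc_cols);
  return plane_rank_offset + local;
}
```

For `proc_h = 3` and 8 slabs, plane 1 evaluates slabs 1, 4, and 7, and slab 7 has plane-local ordinal 2. Computing the owner without the `plane_rank_offset` shift would send every plane's result tiles to the first plane's ranks, which is the kind of owner mismatch the commit message describes.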
Adds general_product_distributed_suite (UNLABELED, so the CI harness runs it at both np=1 and np=2; the existing general_product_suite is serial-labeled and never exercised the batched Summa across ranks). Seven differential cases vs the legacy sub-World einsum oracle: dense, sparse, mixed T x ToT, no-external (dense + ToT), the one-expression THC reconstruction, and dist_no_externals_3d_grid -- which engages the 3-d (proc_h > 1) grid and asserts the no-external result distributes across the h-planes rather than piling on one rank.
2ae292f to 3f00dae
32c1b77 to 7b4319c
Base automatically changed from evaleev/feature/mixed-t-tot-trees to evaleev/feature/general-product-expr June 12, 2026 15:06
Pull request overview
This PR extends TiledArray’s batched SUMMA implementation for general products by adding a third, slab-parallel process-grid axis (proc_h). This addresses poor rank utilization in degenerate cases (notably M=N=1 no-external products) by distributing independent fused-index slabs across process planes, while preserving the prior 2‑D behavior when proc_h == 1.
Changes:
- Add a rank-subset `ProcGrid` constructor to build 2‑D grids over contiguous world-rank intervals (supporting per-plane 2‑D SUMMA).
- Introduce an h-grouped (3‑D) SlabbedPmap that offsets per-slab owners by the slab’s plane, distributing slabs across process planes.
- Update Contraction/Summa and ContEngine to compute `proc_h`, run per-plane SUMMA, use plane-unique broadcast keys, and fix result-tile owner computation to avoid cross-plane deadlocks; add new distributed tests for general products.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/general_product.cpp | Adds an unlabeled distributed general-product test suite (runs at np=1 and np=2) including a proc_h>1 coverage case. |
| src/TiledArray/proc_grid.h | Adds rank_subset tagging + rank-offset support so row/col groups map to correct world ranks for rank-subset grids. |
| src/TiledArray/pmap/slabbed_pmap.h | Adds a 3‑D (h-grouped) SlabbedPmap variant that distributes slabs across process planes. |
| src/TiledArray/expressions/cont_engine.h | Chooses proc_h via a heuristic and wires per-plane grids/pmaps into general-product distribution and evaluation. |
| src/TiledArray/dist_eval/contraction_eval.h | Extends SUMMA to per-plane execution (step filtering, per-group reduce task indexing, unique broadcast IDs) and corrects result-tile ownership across planes. |
Comment on lines 1604 to +1608

      DenseStepTask(const std::shared_ptr<Summa_>& owner,
                    const ordinal_type depth)
    -     : StepTask(owner, owner->nsteps_ + 1ul), k_(0) {
    +     : StepTask(owner, owner->my_steps() + 1ul), k_(owner->next_step(0ul)) {
        StepTask::make_next_step_tasks(this, depth);
    -   StepTask::spawn_get_row_col_tasks(k_);
    +   if (k_ < owner_->nsteps_) StepTask::spawn_get_row_col_tasks(k_);

Comment on lines +1348 to +1354

    + // Distributed (np > 1) coverage of general products. Unlike
    + // general_product_suite (serial-labeled: classification/optimizer unit tests
    + // + np=1 evaluation), this suite carries NO label, so the CI harness runs it
    + // at BOTH np=1 and np=2 (see tests/CMakeLists.txt: np-1 excludes @distributed,
    + // np-2 excludes @serial). It exercises the batched Summa across ranks -- a
    + // path the serial suite never covered. Each case differential-tests the
    + // expression route against the legacy sub-World einsum oracle.

Comment on lines +870 to +878

    + const size_type P = world->size();
    + proc_h_ = 1ul;
    + if (n_slabs_ > 1ul && P > 1ul) {
    +   const size_type p2d_cap = std::min<size_type>(P, M * N);
    +   proc_h_ = std::min<size_type>(n_slabs_,
    +                                 std::max<size_type>(1ul, P / p2d_cap));
    + }
    + proc_h_stride_ = P / proc_h_;

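The sizing snippet above can be exercised as a free function to see what it yields for a few shapes. This is a sketch only: `HGrid` and `size_proc_h` are hypothetical names, `std::size_t` stands in for `size_type`, and the stride follows the snippet as reviewed (`P / proc_h`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical result type: number of h-planes and ranks per plane.
struct HGrid {
  std::size_t proc_h;
  std::size_t proc_h_stride;
};

// Greedy sizing mirroring the reviewed snippet: ranks beyond
// min(P, M * N) are spread over the slab axis, bounded by n_slabs.
inline HGrid size_proc_h(std::size_t P, std::size_t M, std::size_t N,
                         std::size_t n_slabs) {
  assert(P > 0 && M > 0 && N > 0);  // guards the division by p2d_cap below
  std::size_t proc_h = 1;
  if (n_slabs > 1 && P > 1) {
    const std::size_t p2d_cap = std::min(P, M * N);
    proc_h = std::min(n_slabs, std::max<std::size_t>(1, P / p2d_cap));
  }
  return {proc_h, P / proc_h};
}
```

For a no-external product (`M = N = 1`) with 16 slabs on 8 ranks this gives 8 planes of 1 rank each; with `M = N = 2` and 3 slabs on 8 ranks it gives 2 planes of 4 ranks. When `P` is not divisible by `proc_h` (for example 3 slabs on 8 ranks with `M = N = 1`), the integer division leaves ranks beyond the first `proc_h * proc_h_stride` without a plane.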
- contraction_eval: clamp the SUMMA step-task pipeline depth to my_steps() (this rank's group's step count) instead of nsteps_. In the 3-d (proc_h_ > 1) case my_steps() < nsteps_, so clamping to nsteps_ pre-spawned surplus step tasks that all resolved to the terminating step (k_ == nsteps_). No-op for the 2-d path (my_slabs_ == nh_).
- cont_engine: keep proc_h_stride_ == 0 for the ungrouped 2-d case (proc_h_ == 1), matching the field's documented invariant; only the grouped (proc_h_ > 1) grid uses P / proc_h_.
- general_product test: correct the distributed suite header comment -- dist_inner_node_thc validates against explicit binary intermediates, not the legacy einsum oracle.
355f9c8 into evaleev/feature/general-product-expr
9 checks passed
Stacked on #564 (→ #563 → #562). Adds a third process-grid axis `proc_h` to the batched Summa so the fused/batch (h, slab) dimension of a general product is distributed across process planes.

Why

A general product is evaluated as `n_slabs` independent 2-d SUMMAs (one per fused-index tile) over a shared grid. When the result is small (`M·N` result tiles `< P` ranks), most acutely for no-external products (`M=N=1`, e.g. the PNO-CCSD PPL intermediate `W=gCC·gCC`), that shared 2-d grid degenerates to a single rank and the other `P-1` sit idle. Slabs are communication-free (fully independent), so the surplus parallelism belongs on the slab axis.

What

- The world's first `proc_h · proc_h_stride` ranks form `proc_h` h-planes of `proc_h_stride = P/proc_h` ranks; slab `h` runs an ordinary 2-d SUMMA on plane `h % proc_h` over its own offset sub-grid. `proc_h == 1` is exactly the prior 2-d path; `n_slabs == 1` (ordinary contraction) forces `proc_h == 1`; `M=N=1` gives an effectively 1-d grid over the slabs.
- `ProcGrid` gains a rank-subset constructor (tagged `rank_subset`) that builds a 2-d grid over `[rank_offset, rank_offset+nprocs)` with world-correct group/map ranks.
- The result-tile owner is the within-plane cyclic owner shifted by the plane's world-rank offset, matching `set_tile`'s pmap-routed destination (the two disagreeing was the one real bug here: a `get_tile`/`set_tile` owner mismatch that deadlocked cross-plane transfers).

Sizing (placeholder, see TODO)

`proc_h` is chosen by a greedy heuristic: spread ranks beyond `min(P, M·N)` over the slab axis, bounded by `n_slabs`. This is correct and handles both degenerate ends, but it uses tile counts only. A `TODO` marks the principled co-optimization of `proc_h` with the 2-d `(proc_r, proc_c)` aspect ratio from the h-/left-external-/right-external-mode element extents plus a per-rank memory bound.

Validation

- … `assign_subblock_block_base1` failures).
- `general_product_distributed_suite` (unlabeled → CI runs it at np=1 and np=2; the batched Summa had no np>1 coverage before). 7 cases incl. `dist_no_externals_3d_grid`, which engages `proc_h>1` and asserts the result distributes across planes.

Not yet validated at np>2; that's the next step on a larger machine.