summa: 3-d (proc_h) process grid for batched general products #565

Merged
evaleev merged 4 commits into evaleev/feature/general-product-expr from evaleev/feature/summa-3d-grid
Jun 12, 2026

@evaleev evaleev commented Jun 12, 2026

Member

Stacked on #564 (→ #563 → #562). Adds a third process-grid axis proc_h to the batched Summa so the fused/batch (h, slab) dimension of a general product is distributed across process planes.

Why

A general product is evaluated as n_slabs independent 2-d SUMMAs (one per fused-index tile) over a shared grid. When the result is small — M·N result tiles < P ranks, most acutely no-external products (M=N=1, e.g. the PNO-CCSD PPL intermediate W=gCC·gCC) — that shared 2-d grid degenerates to a single rank and the other P-1 sit idle. Slabs are communication-free (fully independent), so the surplus parallelism belongs on the slab axis.

What

  • Grid: the world's first proc_h · proc_h_stride ranks form proc_h h-planes of proc_h_stride = P/proc_h ranks; slab h runs an ordinary 2-d SUMMA on plane h % proc_h over its own offset sub-grid. proc_h == 1 is exactly the prior 2-d path; n_slabs == 1 (ordinary contraction) forces proc_h == 1; M=N=1 gives an effectively 1-d grid over the slabs.
  • ProcGrid rank-subset ctor (tagged rank_subset) builds a 2-d grid over [rank_offset, rank_offset+nprocs) with world-correct group/map ranks.
  • SlabbedPmap 3-d variant: plane-local base owners shifted by the slab's plane.
  • Summa carries per-plane state, restricts slab iteration to the plane, indexes reduce tasks by plane-local slab ordinal, uses plane-unique broadcast keys, and computes the result-tile owner as the within-plane cyclic owner plus the plane's world-rank offset — matching set_tile's pmap-routed destination (the two disagreeing was the one real bug here: a get_tile/set_tile owner mismatch that deadlocked cross-plane transfers).
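The plane arithmetic described in the list above can be sketched as follows. This is a minimal illustration, not TiledArray's code: `PlaneLayout` and its member names are hypothetical, and the plane-local owner is taken as an input rather than computed from a real pmap.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of TiledArray); names mirror the PR text.
struct PlaneLayout {
  std::size_t proc_h;         // number of h-planes
  std::size_t proc_h_stride;  // ranks per plane, P / proc_h

  // Slabs are dealt to planes round-robin: slab h runs on plane h % proc_h.
  std::size_t plane_of_slab(std::size_t h) const { return h % proc_h; }

  // Planes occupy contiguous world-rank intervals, so plane p starts here.
  std::size_t rank_offset(std::size_t plane) const {
    return plane * proc_h_stride;
  }

  // World rank that owns a result tile of slab h, given the tile's
  // plane-local owner in [0, proc_h_stride). The sender and the receiver
  // must both use this same formula, otherwise a transfer is never matched.
  std::size_t result_tile_owner(std::size_t h, std::size_t local_owner) const {
    assert(local_owner < proc_h_stride);
    return rank_offset(plane_of_slab(h)) + local_owner;
  }
};
```

With P = 4 ranks split as `PlaneLayout{2, 2}`, slab 3 lands on plane 1, whose ranks are 2 and 3; a tile with plane-local owner 1 is therefore owned by world rank 3. Setting `proc_h = 1` makes every offset zero, which is the prior 2-d behavior.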

Sizing (placeholder, see TODO)

proc_h is chosen by a greedy heuristic — spread ranks beyond min(P, M·N) over the slab axis, bounded by n_slabs. This is correct and handles both degenerate ends, but it uses tile counts only. A TODO marks the principled co-optimization of proc_h with the 2-d (proc_r, proc_c) aspect ratio from the h-/left-external-/right-external-mode element extents plus a per-rank memory bound.
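A standalone sketch of the greedy rule, following the snippet quoted in the review further down, makes the two degenerate ends easy to check by hand. The free function `choose_proc_h` is a hypothetical wrapper introduced here for illustration; the actual logic lives in ContEngine::init_distribution_general and operates on member variables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical free-function form of the sizing heuristic.
// P: world size, M x N: result tile grid, n_slabs: fused-index tile count.
inline std::size_t choose_proc_h(std::size_t P, std::size_t M, std::size_t N,
                                 std::size_t n_slabs) {
  assert(P >= 1 && M >= 1 && N >= 1 && n_slabs >= 1);
  std::size_t proc_h = 1;
  if (n_slabs > 1 && P > 1) {
    // Ranks a single 2-d grid can keep busy: at most one per result tile.
    const std::size_t p2d_cap = std::min(P, M * N);
    // Spend the surplus on the slab axis, but never exceed n_slabs planes.
    proc_h = std::min(n_slabs, std::max<std::size_t>(1, P / p2d_cap));
  }
  return proc_h;
}
```

For P = 8: a no-external product (M = N = 1) with 16 slabs gives proc_h = 8, one rank per plane; a 4 x 4 result already saturates the 2-d grid and gives proc_h = 1; and M = N = 1 with only 5 slabs gives proc_h = 5, so with proc_h_stride = 8 / 5 = 1 three ranks are left unused, one of the cases the tile-count-only rule does not balance.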

Validation

  • np=1: full regression clean (general_product, einsum, expressions, sparse_shape, proc_grid; only the 2 pre-existing assign_subblock_block_base1 failures).
  • np=2: new general_product_distributed_suite (unlabeled → CI runs it at np=1 and np=2; the batched Summa had no np>1 coverage before). 7 cases incl. dist_no_externals_3d_grid, which engages proc_h>1 and asserts the result distributes across planes.
  • mpqc PNO-CCSD (c6h14/cc-pVDZ): np=1 unchanged; np=2 correct (−0.96207945064, agrees with np=1 to ~1e-12 FP-reorder noise), the 3-d grid active.

Not yet validated at np>2 — that's the next step on a larger machine.

evaleev added 3 commits June 12, 2026 10:04
Infrastructure for a 3-d (proc_h x proc_r x proc_c) batched-Summa grid
that distributes the fused/batch (h, slab) dimension of a general product
across process planes:

- ProcGrid gains a rank-subset constructor (tagged rank_subset to avoid
  colliding with the same-arity test-only ctor) that builds a 2-d grid
  over a contiguous interval [rank_offset, rank_offset + nprocs) of the
  world's ranks; map_row/map_col and the row/col group factories emit
  world-correct ranks via the offset. The legacy full-world ctor is
  unchanged (offset 0).
- SlabbedPmap gains a 3-d variant (proc_h, proc_h_stride): slab h belongs
  to plane h % proc_h of proc_h_stride contiguous ranks, and the per-slab
  base map's plane-local owners are offset by the slab's plane. The
  original 3-argument form (proc_h == 1, slab-replicated) is unchanged.

Distribute the fused/batch (h, slab) dimension of a general product over a
third process-grid axis proc_h, recovering parallelism when the result is
small (M*N result tiles < P ranks) -- most acutely no-external products
(M=N=1, e.g. the PNO-CCSD PPL intermediate), where the 2-d grid otherwise
degenerates to a single rank.

The world's first proc_h * proc_h_stride ranks form proc_h h-planes of
proc_h_stride = P/proc_h ranks; slab h is evaluated on plane h % proc_h,
which runs an ordinary 2-d SUMMA over its own (offset) process grid. Slabs
are communication-free (independent), so the surplus of ranks beyond one
result-tile-per-rank is spent on this axis. Summa carries per-plane state
(first_slab_, my_slabs_), restricts its slab iteration to the plane
(next_step), indexes reduce tasks by plane-local slab ordinal (slab_ord),
uses plane-unique dense broadcast keys, and computes the result-tile owner
(result_tile_owner) as the within-plane cyclic owner shifted by the plane's
world-rank offset -- matching set_tile's pmap-routed destination (the two
disagreeing was a get_tile/set_tile owner mismatch that deadlocked
cross-plane result transfers). proc_h == 1 reproduces the 2-d path exactly.

ContEngine::init_distribution_general sizes proc_h by a greedy heuristic
(spread ranks beyond min(P, M*N) over the slab axis, bounded by n_slabs)
and builds the plane-local grid + 3-d operand/result pmaps. A TODO marks
the principled co-optimization of proc_h with the 2-d aspect ratio from the
h/left-external/right-external element extents and a memory bound.

Adds general_product_distributed_suite (UNLABELED, so the CI harness runs
it at both np=1 and np=2; the existing general_product_suite is
serial-labeled and never exercised the batched Summa across ranks). Seven
differential cases vs the legacy sub-World einsum oracle: dense, sparse,
mixed T x ToT, no-external (dense + ToT), the one-expression THC
reconstruction, and dist_no_externals_3d_grid -- which engages the 3-d
(proc_h > 1) grid and asserts the no-external result distributes across the
h-planes rather than piling on one rank.
@evaleev evaleev force-pushed the evaleev/feature/mixed-t-tot-trees branch from 2ae292f to 3f00dae Compare June 12, 2026 14:08
@evaleev evaleev force-pushed the evaleev/feature/summa-3d-grid branch from 32c1b77 to 7b4319c Compare June 12, 2026 14:08
Base automatically changed from evaleev/feature/mixed-t-tot-trees to evaleev/feature/general-product-expr June 12, 2026 15:06
@evaleev evaleev requested a review from Copilot June 12, 2026 16:05

Copilot AI left a comment

Pull request overview

This PR extends TiledArray’s batched SUMMA implementation for general products by adding a third, slab-parallel process-grid axis (proc_h). This addresses poor rank utilization in degenerate cases (notably M=N=1 no-external products) by distributing independent fused-index slabs across process planes, while preserving the prior 2‑D behavior when proc_h == 1.

Changes:

  • Add a rank-subset ProcGrid constructor to build 2‑D grids over contiguous world-rank intervals (supporting per-plane 2‑D SUMMA).
  • Introduce an h-grouped (3‑D) SlabbedPmap that offsets per-slab owners by the slab’s plane, distributing slabs across process planes.
  • Update Contraction/Summa and ContEngine to compute proc_h, run per-plane SUMMA, use plane-unique broadcast keys, and fix result-tile owner computation to avoid cross-plane deadlocks; add new distributed tests for general products.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/general_product.cpp | Adds an unlabeled distributed general-product test suite (runs at np=1 and np=2) including a proc_h>1 coverage case. |
| src/TiledArray/proc_grid.h | Adds rank_subset tagging + rank-offset support so row/col groups map to correct world ranks for rank-subset grids. |
| src/TiledArray/pmap/slabbed_pmap.h | Adds a 3‑D (h-grouped) SlabbedPmap variant that distributes slabs across process planes. |
| src/TiledArray/expressions/cont_engine.h | Chooses proc_h via a heuristic and wires per-plane grids/pmaps into general-product distribution and evaluation. |
| src/TiledArray/dist_eval/contraction_eval.h | Extends SUMMA to per-plane execution (step filtering, per-group reduce task indexing, unique broadcast IDs) and corrects result-tile ownership across planes. |

Comment on lines 1604 to +1608
  DenseStepTask(const std::shared_ptr<Summa_>& owner,
                const ordinal_type depth)
-     : StepTask(owner, owner->nsteps_ + 1ul), k_(0) {
+     : StepTask(owner, owner->my_steps() + 1ul), k_(owner->next_step(0ul)) {
    StepTask::make_next_step_tasks(this, depth);
-   StepTask::spawn_get_row_col_tasks(k_);
+   if (k_ < owner_->nsteps_) StepTask::spawn_get_row_col_tasks(k_);

Comment thread tests/general_product.cpp Outdated
Comment on lines +1348 to +1354
// Distributed (np > 1) coverage of general products. Unlike
// general_product_suite (serial-labeled: classification/optimizer unit tests
// + np=1 evaluation), this suite carries NO label, so the CI harness runs it
// at BOTH np=1 and np=2 (see tests/CMakeLists.txt: np-1 excludes @distributed,
// np-2 excludes @serial). It exercises the batched Summa across ranks -- a
// path the serial suite never covered. Each case differential-tests the
// expression route against the legacy sub-World einsum oracle.

Comment on lines +870 to +878
const size_type P = world->size();
proc_h_ = 1ul;
if (n_slabs_ > 1ul && P > 1ul) {
  const size_type p2d_cap = std::min<size_type>(P, M * N);
  proc_h_ = std::min<size_type>(n_slabs_,
                                std::max<size_type>(1ul, P / p2d_cap));
}
proc_h_stride_ = P / proc_h_;

- contraction_eval: clamp the SUMMA step-task pipeline depth to my_steps()
  (this rank's group's step count) instead of nsteps_. In the 3-d
  (proc_h_ > 1) case my_steps() < nsteps_, so clamping to nsteps_
  pre-spawned surplus step tasks that all resolved to the terminating
  step (k_ == nsteps_). No-op for the 2-d path (my_slabs_ == nh_).

- cont_engine: keep proc_h_stride_ == 0 for the ungrouped 2-d case
  (proc_h_ == 1), matching the field's documented invariant; only the
  grouped (proc_h_ > 1) grid uses P / proc_h_.

- general_product test: correct the distributed suite header comment --
  dist_inner_node_thc validates against explicit binary intermediates,
  not the legacy einsum oracle.
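
The relation between my_steps() and nsteps_ in the first item can be illustrated with a toy model. Everything here is an assumption made for the sketch: the `StepFilter` type is hypothetical, and the step layout (each slab contributing K consecutive steps, so step k belongs to slab k / K) is a simplification that may not match how Summa actually numbers its steps.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of plane-restricted step iteration (not TiledArray code).
struct StepFilter {
  std::size_t n_slabs;   // number of slabs in the batched product
  std::size_t K;         // assumed SUMMA steps per slab
  std::size_t proc_h;    // number of h-planes
  std::size_t my_plane;  // plane this rank belongs to

  std::size_t nsteps() const { return n_slabs * K; }

  // A step is ours if its slab is assigned to our plane (slab % proc_h).
  bool mine(std::size_t k) const { return (k / K) % proc_h == my_plane; }

  // First step >= k on this plane; nsteps() acts as the terminating step.
  std::size_t next_step(std::size_t k) const {
    while (k < nsteps() && !mine(k)) ++k;
    return k;
  }

  // Number of steps this plane actually executes.
  std::size_t my_steps() const {
    std::size_t n = 0;
    for (std::size_t k = 0; k < nsteps(); ++k) n += mine(k) ? 1 : 0;
    return n;
  }
};
```

Under this model, `StepFilter{5, 3, 2, 1}` has nsteps() = 15 but my_steps() = 6, since plane 1 owns only slabs 1 and 3. A pipeline sized by nsteps() would therefore hold step tasks for which next_step() has nothing left to return but the terminating value 15. With proc_h = 1 the two counts coincide, consistent with the change being a no-op on the 2-d path.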
@evaleev evaleev merged commit 355f9c8 into evaleev/feature/general-product-expr Jun 12, 2026
9 checks passed
@evaleev evaleev deleted the evaleev/feature/summa-3d-grid branch June 12, 2026 19:49