summa: 3-d (proc_h) process grid for batched general products by evaleev · Pull Request #565 · ValeevGroup/tiledarray

evaleev · 2026-06-12T13:36:12Z

Stacked on #564 (→ #563 → #562). Adds a third process-grid axis proc_h to the batched Summa so the fused/batch (h, slab) dimension of a general product is distributed across process planes.

Why

A general product is evaluated as n_slabs independent 2-d SUMMAs (one per fused-index tile) over a shared grid. When the result is small — M·N result tiles < P ranks, most acutely no-external products (M=N=1, e.g. the PNO-CCSD PPL intermediate W=gCC·gCC) — that shared 2-d grid degenerates to a single rank and the other P-1 sit idle. Slabs are communication-free (fully independent), so the surplus parallelism belongs on the slab axis.

What

Grid: the world's first proc_h · proc_h_stride ranks form proc_h h-planes of proc_h_stride = P/proc_h ranks; slab h runs an ordinary 2-d SUMMA on plane h % proc_h over its own offset sub-grid. proc_h == 1 is exactly the prior 2-d path; n_slabs == 1 (ordinary contraction) forces proc_h == 1; M=N=1 gives an effectively 1-d grid over the slabs.
ProcGrid rank-subset ctor (tagged rank_subset) builds a 2-d grid over [rank_offset, rank_offset+nprocs) with world-correct group/map ranks.
SlabbedPmap 3-d variant: plane-local base owners shifted by the slab's plane.
Summa carries per-plane state, restricts slab iteration to the plane, indexes reduce tasks by plane-local slab ordinal, and uses plane-unique broadcast keys. It computes the result-tile owner as the within-plane cyclic owner plus the plane's world-rank offset, matching set_tile's pmap-routed destination. A mismatch between the two was the one real bug here: get_tile and set_tile disagreed on the owner, which deadlocked cross-plane transfers.
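The plane arithmetic described in this list can be sketched as follows. This is a minimal illustration, not TiledArray code: `PlaneLayout`, `slab_plane`, `plane_rank_offset`, and this free-function `result_tile_owner` are hypothetical names, and the row-major 2-d cyclic formula for the within-plane owner is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout descriptor (not a TiledArray type).
struct PlaneLayout {
  std::size_t proc_h;         // number of h-planes
  std::size_t proc_h_stride;  // ranks per plane (P / proc_h)
};

// Plane that evaluates slab h: slabs are dealt round-robin over planes.
inline std::size_t slab_plane(const PlaneLayout& g, std::size_t h) {
  assert(g.proc_h > 0);
  return h % g.proc_h;
}

// World rank of the first process in a plane.
inline std::size_t plane_rank_offset(const PlaneLayout& g, std::size_t plane) {
  assert(plane < g.proc_h);
  return plane * g.proc_h_stride;
}

// Owner of result tile (i, j) of slab h: the within-plane cyclic owner on a
// proc_r x proc_c grid (ASSUMED row-major), shifted by the plane's offset.
inline std::size_t result_tile_owner(const PlaneLayout& g, std::size_t proc_r,
                                     std::size_t proc_c, std::size_t h,
                                     std::size_t i, std::size_t j) {
  assert(proc_r > 0 && proc_c > 0 && proc_r * proc_c <= g.proc_h_stride);
  const std::size_t local = (i % proc_r) * proc_c + (j % proc_c);
  return plane_rank_offset(g, slab_plane(g, h)) + local;
}
```

With 8 ranks split into 2 planes of 4 (each a 2 x 2 grid), tile (1, 0) of slab 3 lands on plane 1 at local rank 2, i.e. world rank 6. The point of the sketch is that both the producer and the consumer of a result tile must evaluate the same function, which is the invariant the get_tile/set_tile fix restores.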

Sizing (placeholder, see TODO)

proc_h is chosen by a greedy heuristic — spread ranks beyond min(P, M·N) over the slab axis, bounded by n_slabs. This is correct and handles both degenerate ends, but it uses tile counts only. A TODO marks the principled co-optimization of proc_h with the 2-d (proc_r, proc_c) aspect ratio from the h-/left-external-/right-external-mode element extents plus a per-rank memory bound.
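As a standalone illustration, the greedy rule can be written as a free function. The arithmetic mirrors the diff hunk quoted in this PR (`p2d_cap`, `proc_h_`, `proc_h_stride_`); the `HGrid` struct and the `choose_proc_h` name are hypothetical. Note that a follow-up commit in this PR keeps `proc_h_stride_ == 0` when `proc_h_ == 1`, which this sketch does not model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical result type (not a TiledArray type).
struct HGrid {
  std::size_t proc_h;         // number of h-planes
  std::size_t proc_h_stride;  // ranks per plane
};

// Greedy sizing: ranks beyond min(P, M*N) go to the slab axis,
// bounded by the number of slabs. Uses tile counts only.
inline HGrid choose_proc_h(std::size_t P, std::size_t M, std::size_t N,
                           std::size_t n_slabs) {
  assert(P > 0 && M > 0 && N > 0 && n_slabs > 0);
  std::size_t proc_h = 1;
  if (n_slabs > 1 && P > 1) {
    // Most ranks a single 2-d SUMMA can use: one result tile per rank.
    const std::size_t p2d_cap = std::min(P, M * N);
    proc_h = std::min(n_slabs, std::max<std::size_t>(1, P / p2d_cap));
  }
  return {proc_h, P / proc_h};
}
```

For example, P=8 with M=N=1 and 5 slabs gives proc_h=5 with 1 rank per plane (3 ranks unused), while P=8 with M·N=4 and 10 slabs gives 2 planes of 4 ranks. When M·N ≥ P the rule returns proc_h=1, i.e. the prior 2-d path.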

Validation

np=1: full regression clean (general_product, einsum, expressions, sparse_shape, proc_grid; only the 2 pre-existing assign_subblock_block_base1 failures).
np=2: new general_product_distributed_suite (unlabeled → CI runs it at np=1 and np=2; the batched Summa had no np>1 coverage before). 7 cases incl. dist_no_externals_3d_grid, which engages proc_h>1 and asserts the result distributes across planes.
mpqc PNO-CCSD (c6h14/cc-pVDZ): np=1 unchanged; np=2 correct with the 3-d grid active (−0.96207945064, agreeing with np=1 to ~1e-12, i.e. FP-reordering noise).

Not yet validated at np>2 — that's the next step on a larger machine.

Infrastructure for a 3-d (proc_h x proc_r x proc_c) batched-Summa grid that distributes the fused/batch (h, slab) dimension of a general product across process planes:
- ProcGrid gains a rank-subset constructor (tagged rank_subset to avoid colliding with the same-arity test-only ctor) that builds a 2-d grid over a contiguous interval [rank_offset, rank_offset + nprocs) of the world's ranks; map_row/map_col and the row/col group factories emit world-correct ranks via the offset. The legacy full-world ctor is unchanged (offset 0).
- SlabbedPmap gains a 3-d variant (proc_h, proc_h_stride): slab h belongs to plane h % proc_h of proc_h_stride contiguous ranks, and the per-slab base map's plane-local owners are offset by the slab's plane. The original 3-argument form (proc_h == 1, slab-replicated) is unchanged.
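To make "world-correct ranks via the offset" concrete, here is a small sketch of a 2-d grid laid over a contiguous rank interval. It is not the ProcGrid implementation: `SubGrid`, `world_rank`, `row_group`, and `col_group` are hypothetical names, and the row-major placement of ranks within the interval is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-d grid over world ranks [rank_offset, rank_offset + rows*cols).
struct SubGrid {
  std::size_t rank_offset;  // first world rank of the interval
  std::size_t rows;         // process rows (proc_r)
  std::size_t cols;         // process columns (proc_c)
};

// World rank of grid position (r, c); ASSUMES row-major placement.
inline std::size_t world_rank(const SubGrid& g, std::size_t r, std::size_t c) {
  assert(r < g.rows && c < g.cols);
  return g.rank_offset + r * g.cols + c;
}

// World ranks forming process row r (the row broadcast group).
inline std::vector<std::size_t> row_group(const SubGrid& g, std::size_t r) {
  std::vector<std::size_t> ranks;
  for (std::size_t c = 0; c < g.cols; ++c) ranks.push_back(world_rank(g, r, c));
  return ranks;
}

// World ranks forming process column c (the column broadcast group).
inline std::vector<std::size_t> col_group(const SubGrid& g, std::size_t c) {
  std::vector<std::size_t> ranks;
  for (std::size_t r = 0; r < g.rows; ++r) ranks.push_back(world_rank(g, r, c));
  return ranks;
}
```

A 2 x 2 grid at offset 4 covers world ranks 4..7; its second row group is {6, 7} and its second column group is {5, 7}. With offset 0 the same functions reduce to the full-world layout.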

Distribute the fused/batch (h, slab) dimension of a general product over a third process-grid axis proc_h, recovering parallelism when the result is small (M*N result tiles < P ranks) -- most acutely no-external products (M=N=1, e.g. the PNO-CCSD PPL intermediate), where the 2-d grid otherwise degenerates to a single rank.

The world's first proc_h * proc_h_stride ranks form proc_h h-planes of proc_h_stride = P/proc_h ranks; slab h is evaluated on plane h % proc_h, which runs an ordinary 2-d SUMMA over its own (offset) process grid. Slabs are communication-free (independent), so the surplus of ranks beyond one result-tile-per-rank is spent on this axis.

Summa carries per-plane state (first_slab_, my_slabs_), restricts its slab iteration to the plane (next_step), indexes reduce tasks by plane-local slab ordinal (slab_ord), uses plane-unique dense broadcast keys, and computes the result-tile owner (result_tile_owner) as the within-plane cyclic owner shifted by the plane's world-rank offset -- matching set_tile's pmap-routed destination (the two disagreeing was a get_tile/set_tile owner mismatch that deadlocked cross-plane result transfers). proc_h == 1 reproduces the 2-d path exactly.

ContEngine::init_distribution_general sizes proc_h by a greedy heuristic (spread ranks beyond min(P, M*N) over the slab axis, bounded by n_slabs) and builds the plane-local grid + 3-d operand/result pmaps. A TODO marks the principled co-optimization of proc_h with the 2-d aspect ratio from the h/left-external/right-external element extents and a memory bound.

Adds general_product_distributed_suite (UNLABELED, so the CI harness runs it at both np=1 and np=2; the existing general_product_suite is serial-labeled and never exercised the batched Summa across ranks). Seven differential cases vs the legacy sub-World einsum oracle: dense, sparse, mixed T x ToT, no-external (dense + ToT), the one-expression THC reconstruction, and dist_no_externals_3d_grid -- which engages the 3-d (proc_h > 1) grid and asserts the no-external result distributes across the h-planes rather than piling on one rank.

Copilot

Pull request overview

This PR extends TiledArray’s batched SUMMA implementation for general products by adding a third, slab-parallel process-grid axis (proc_h). This addresses poor rank utilization in degenerate cases (notably M=N=1 no-external products) by distributing independent fused-index slabs across process planes, while preserving the prior 2‑D behavior when proc_h == 1.

Changes:

Add a rank-subset ProcGrid constructor to build 2‑D grids over contiguous world-rank intervals (supporting per-plane 2‑D SUMMA).
Introduce an h-grouped (3‑D) SlabbedPmap that offsets per-slab owners by the slab’s plane, distributing slabs across process planes.
Update Contraction/Summa and ContEngine to compute proc_h, run per-plane SUMMA, use plane-unique broadcast keys, and fix result-tile owner computation to avoid cross-plane deadlocks; add new distributed tests for general products.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/general_product.cpp | Adds an unlabeled distributed general-product test suite (runs at np=1 and np=2) including a `proc_h>1` coverage case. |
| src/TiledArray/proc_grid.h | Adds `rank_subset` tagging + rank-offset support so row/col groups map to correct world ranks for rank-subset grids. |
| src/TiledArray/pmap/slabbed_pmap.h | Adds a 3‑D (h-grouped) `SlabbedPmap` variant that distributes slabs across process planes. |
| src/TiledArray/expressions/cont_engine.h | Chooses `proc_h` via a heuristic and wires per-plane grids/pmaps into general-product distribution and evaluation. |
| src/TiledArray/dist_eval/contraction_eval.h | Extends SUMMA to per-plane execution (step filtering, per-group reduce task indexing, unique broadcast IDs) and corrects result-tile ownership across planes. |

    DenseStepTask(const std::shared_ptr<Summa_>& owner,
                  const ordinal_type depth)
-        : StepTask(owner, owner->nsteps_ + 1ul), k_(0) {
+        : StepTask(owner, owner->my_steps() + 1ul), k_(owner->next_step(0ul)) {
      StepTask::make_next_step_tasks(this, depth);
-      StepTask::spawn_get_row_col_tasks(k_);
+      if (k_ < owner_->nsteps_) StepTask::spawn_get_row_col_tasks(k_);


+// Distributed (np > 1) coverage of general products. Unlike
+// general_product_suite (serial-labeled: classification/optimizer unit tests
+// + np=1 evaluation), this suite carries NO label, so the CI harness runs it
+// at BOTH np=1 and np=2 (see tests/CMakeLists.txt: np-1 excludes @distributed,
+// np-2 excludes @serial). It exercises the batched Summa across ranks -- a
+// path the serial suite never covered. Each case differential-tests the
+// expression route against the legacy sub-World einsum oracle.


+      const size_type P = world->size();
+      proc_h_ = 1ul;
+      if (n_slabs_ > 1ul && P > 1ul) {
+        const size_type p2d_cap = std::min<size_type>(P, M * N);
+        proc_h_ = std::min<size_type>(n_slabs_,
+                                      std::max<size_type>(1ul, P / p2d_cap));
+      }
+      proc_h_stride_ = P / proc_h_;
+


- contraction_eval: clamp the SUMMA step-task pipeline depth to my_steps() (this rank's group's step count) instead of nsteps_. In the 3-d (proc_h_ > 1) case my_steps() < nsteps_, so clamping to nsteps_ pre-spawned surplus step tasks that all resolved to the terminating step (k_ == nsteps_). No-op for the 2-d path (my_slabs_ == nh_).
- cont_engine: keep proc_h_stride_ == 0 for the ungrouped 2-d case (proc_h_ == 1), matching the field's documented invariant; only the grouped (proc_h_ > 1) grid uses P / proc_h_.
- general_product test: correct the distributed suite header comment -- dist_inner_node_thc validates against explicit binary intermediates, not the legacy einsum oracle.
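The relationship between my_steps(), nsteps_, and next_step() can be illustrated with a toy model. Everything here is an assumption made for illustration: the `PlaneSteps` struct, the uniform number of SUMMA steps per slab, and the slab-major step numbering are hypothetical and may not match how Summa actually enumerates steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model (hypothetical): steps are numbered slab-major, with a uniform
// k_per_slab steps per slab, and slab h belongs to plane h % proc_h.
struct PlaneSteps {
  std::size_t n_slabs;     // total slabs
  std::size_t k_per_slab;  // SUMMA steps per slab (ASSUMED uniform)
  std::size_t proc_h;      // number of h-planes
  std::size_t plane;       // this rank's plane

  std::size_t nsteps() const { return n_slabs * k_per_slab; }

  // Slabs owned by this plane: plane, plane + proc_h, plane + 2*proc_h, ...
  std::size_t my_slabs() const {
    return plane < n_slabs ? (n_slabs - plane + proc_h - 1) / proc_h : 0;
  }

  // Steps this plane actually executes; equals nsteps() when proc_h == 1.
  std::size_t my_steps() const { return my_slabs() * k_per_slab; }

  // First step >= k that belongs to this plane, or nsteps() (the
  // terminating step) if there is none.
  std::size_t next_step(std::size_t k) const {
    assert(k_per_slab > 0 && proc_h > 0);
    while (k < nsteps() && (k / k_per_slab) % proc_h != plane) {
      k = (k / k_per_slab + 1) * k_per_slab;  // skip to the next slab
    }
    return std::min(k, nsteps());
  }
};
```

In this model, with 5 slabs of 3 steps each and 2 planes, plane 1 owns slabs 1 and 3, so my_steps() is 6 while nsteps() is 15. A pipeline depth clamped to 15 would spawn tasks whose next_step() resolves to 15, the terminating step, which is the surplus the first bullet describes.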

evaleev added 3 commits June 12, 2026 10:04

evaleev force-pushed the evaleev/feature/mixed-t-tot-trees branch from 2ae292f to 3f00dae Compare June 12, 2026 14:08

evaleev force-pushed the evaleev/feature/summa-3d-grid branch from 32c1b77 to 7b4319c Compare June 12, 2026 14:08

Base automatically changed from evaleev/feature/mixed-t-tot-trees to evaleev/feature/general-product-expr June 12, 2026 15:06

evaleev requested a review from Copilot June 12, 2026 16:05

Copilot started reviewing on behalf of evaleev June 12, 2026 16:05

Copilot AI reviewed Jun 12, 2026

evaleev merged commit 355f9c8 into evaleev/feature/general-product-expr Jun 12, 2026
9 checks passed

evaleev deleted the evaleev/feature/summa-3d-grid branch June 12, 2026 19:49

evaleev merged 4 commits into evaleev/feature/general-product-expr from evaleev/feature/summa-3d-grid
