[STRESS / DO NOT MERGE] Validate Fabric GPU interop necessity on multi-GPU CI by hujc7 · Pull Request #6300 · isaac-sim/IsaacLab

hujc7 · 2026-06-30T06:22:59Z

1. Summary

Diagnostic-only PR; do not merge.
Validates whether --/physics/fabricUseGPUInterop=false is necessary for multi-GPU CI stability.
Reuses the production multi-GPU container, shard runner, queue reconciliation, and test selection.

2. Experiment

Run five negative trials first with the Fabric interop argument completely omitted.
Run five controls with --/physics/fabricUseGPUInterop=false.
Run all 17 currently eligible multi-GPU test files in every trial; ignore all MULTI_GPU_SKIP_REASON exclusions. The former 18th file, test_newton_model_utils.py, no longer has multi-GPU device parametrization after "Propagate Newton shape colors before cloning" (#6194).
Hold the three renderer pinning arguments and all other workflow settings constant.
Log the resolved Carb setting and preserve launcher logs, JUnit XML, aggregate summaries, and queue reconciliation for every trial.
Gate unrelated Docker and installation test jobs so this PR only consumes test capacity for this diagnostic.

3. Decision criteria

Evidence that the flag is needed: at least one negative trial reproduces the target crash, timeout, or queue-orphan failure while controls complete cleanly.
An invalid setting resolution, incomplete file set, unclaimed test, or orphaned test invalidates that trial.
If the first 5 + 5 trials are inconclusive, run another 5 + 5 before drawing a conclusion.
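The criteria above can be read as a small decision function. The following is a minimal sketch only: the `Trial` record, its field names, and the verdict strings are hypothetical, not taken from the PR. How a queue orphan that is itself the target failure interacts with the invalidation rule is not spelled out above, so the sketch assumes the caller has already separated the two when filling in `target_failure` and `orphaned`.

```python
from dataclasses import dataclass

EXPECTED_FILES = 17  # eligible multi-GPU test files per trial, per the description


@dataclass
class Trial:
    """Hypothetical per-trial record; field names are illustrative."""
    condition: str        # "negative" (argument omitted) or "control" (flag set to false)
    setting_ok: bool      # resolved Carb setting matched the requested one
    files_run: int
    unclaimed: int
    orphaned: int
    target_failure: bool  # crash, timeout, or queue-orphan failure observed


def is_valid(trial):
    """A trial counts only if its bookkeeping is complete and consistent."""
    return (
        trial.setting_ok
        and trial.files_run == EXPECTED_FILES
        and trial.unclaimed == 0
        and trial.orphaned == 0
    )


def verdict(trials):
    """Return "flag-needed" or "inconclusive" for one batch of trials."""
    valid = [t for t in trials if is_valid(t)]
    negatives = [t for t in valid if t.condition == "negative"]
    controls = [t for t in valid if t.condition == "control"]
    reproduced = any(t.target_failure for t in negatives)
    controls_clean = bool(controls) and not any(t.target_failure for t in controls)
    return "flag-needed" if reproduced and controls_clean else "inconclusive"
```

An "inconclusive" verdict corresponds to the case where another 5 + 5 batch is run before drawing a conclusion.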

Most test callers pass both ``sim_cfg=`` and ``device=`` to :func:`isaaclab.sim.build_simulation_context`, implicitly expecting the ``device`` kwarg to win. The helper previously dropped the kwarg silently when ``sim_cfg`` was provided, causing warp kernel-launch device mismatches on non-default GPUs: the test fixture allocated ``env_ids`` on the requested device while the articulation's ``self.device`` resolved from the untouched ``sim_cfg`` default (``cuda:0``), and ``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index', trying to launch on device='cuda:0', but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an override after sim_cfg construction in both branches. The one test that asserted the old "sim_cfg overrides everything" contract is updated to cover the new override semantics.
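The override semantics described in this commit message can be illustrated with a minimal sketch. `SimCfg` and `resolve_sim_cfg` are hypothetical stand-ins, not the real Isaac Lab classes or the actual body of `build_simulation_context`; only the `None` sentinel and the "apply the device last in both branches" idea come from the message.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SimCfg:
    """Hypothetical stand-in for the real simulation config."""
    device: str = "cuda:0"
    dt: float = 1.0 / 60.0


def resolve_sim_cfg(sim_cfg: Optional[SimCfg] = None, device: Optional[str] = None) -> SimCfg:
    # Branch 1: the caller supplied a config. Branch 2: build a default one.
    cfg = sim_cfg if sim_cfg is not None else SimCfg()
    # Apply the explicit device after construction, in both branches, so it wins.
    # None is the sentinel meaning "caller did not ask for a device".
    if device is not None:
        cfg = replace(cfg, device=device)
    return cfg
```

With this shape, passing `sim_cfg=SimCfg()` together with `device="cuda:2"` yields a config on `cuda:2`, while omitting `device` leaves the config's own device untouched.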

Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it appends Kit command-line overrides that pin the renderer to a single GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1) and disable the fabric GPU-interop path (physics.fabricUseGPUInterop=false), so each Kit process touches only its assigned GPU instead of enumerating every visible GPU at startup. Used by the multi-GPU CI workflow to avoid a shared GPU-interop context across concurrent sibling shards, which otherwise surfaces as "Stage X already attached" errors and SimulationApp.close hangs (see isaac-sim#3475). Off by default; single-GPU and user-facing rendering paths are unchanged.
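A rough sketch of how such an opt-in could translate into Kit arguments follows. The function name `kit_pin_args` and the set of truthy strings are assumptions, and the three renderer arguments are written in the same `--/path/to/setting=value` form as the `--/physics/fabricUseGPUInterop=false` argument quoted in the PR description; the real AppLauncher code may differ.

```python
import os

# Assumption: which strings count as "truthy" is not stated in the commit message.
_TRUTHY = {"1", "true", "yes", "on"}


def kit_pin_args(environ=None):
    """Return extra Kit CLI overrides when ISAACLAB_PIN_KIT_GPU is truthy, else []."""
    env = os.environ if environ is None else environ
    if env.get("ISAACLAB_PIN_KIT_GPU", "").strip().lower() not in _TRUTHY:
        return []  # off by default: single-GPU and user-facing paths are unchanged
    return [
        # Pin the renderer to a single GPU.
        "--/renderer/multiGpu/enabled=false",
        "--/renderer/multiGpu/autoEnable=false",
        "--/renderer/multiGpu/maxGpuCount=1",
        # Disable the Fabric GPU-interop path.
        "--/physics/fabricUseGPUInterop=false",
    ]
```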

…test tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.
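The claim/inflight/done queue can be sketched with atomic renames. This is an illustration under assumptions, not the code in tools/conftest.py: it uses one small file per test file, moved between three directories, and all function names and the entry naming scheme are hypothetical.

```python
import os
import tempfile

STATES = ("claim", "inflight", "done")


def seed(root, test_files):
    """Create the three state directories and one claimable entry per test file."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)
    for index, test_file in enumerate(test_files):
        with open(os.path.join(root, "claim", f"{index:04d}"), "w") as fh:
            fh.write(test_file)


def claim_next(root, shard):
    """Move one entry from claim/ to inflight/; return (inflight_path, test_file) or None."""
    for name in sorted(os.listdir(os.path.join(root, "claim"))):
        src = os.path.join(root, "claim", name)
        dst = os.path.join(root, "inflight", f"{name}.shard{shard}")
        try:
            # A rename within one filesystem is atomic, so only one shard wins an entry.
            os.rename(src, dst)
        except FileNotFoundError:
            continue  # another shard claimed it first; try the next entry
        with open(dst) as fh:
            return dst, fh.read()
    return None


def mark_done(root, inflight_path):
    """Move a finished entry from inflight/ to done/."""
    os.rename(inflight_path, os.path.join(root, "done", os.path.basename(inflight_path)))


def reconcile(root):
    """Count entries per state; leftovers in claim/ are unclaimed, in inflight/ orphaned."""
    return {state: len(os.listdir(os.path.join(root, state))) for state in STATES}


def demo():
    with tempfile.TemporaryDirectory() as root:
        seed(root, ["test_a.py", "test_b.py", "test_c.py"])
        first = claim_next(root, shard=1)
        second = claim_next(root, shard=2)
        mark_done(root, first[0])
        return first[1], second[1], reconcile(root)
```

Because a faster shard simply claims the next entry when it finishes, work is balanced without a central scheduler, and a reconciliation pass after the run can report anything left behind.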

…cher When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.
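The fallback order described here can be sketched as follows. `resolve_device` is a hypothetical helper, and the `cuda:0` default is an assumption based on the default device mentioned elsewhere in this PR; the real AppLauncher argument handling is more involved.

```python
import os


def resolve_device(explicit=None, environ=None, default="cuda:0"):
    """Pick the device: explicit argument, then ISAACLAB_SIM_DEVICE, then the default."""
    env = os.environ if environ is None else environ
    if explicit is not None:
        return explicit  # an explicit caller choice always wins
    # An unset or empty variable falls through to the default.
    return env.get("ISAACLAB_SIM_DEVICE") or default
```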

The test job's runs-on carried the `gpu` label, which in the self-hosted fleet tags single-GPU runners. Requiring both `gpu` and `multi-gpu` routed the job onto a single-GPU box, so the shard runner aborted with "Need at least 2 visible devices; found 1"; the over-constrained label also left the job queued for hours. Drop `gpu` so the job targets the multi-GPU pool.

Move the host-orchestration bash (symlink, work-queue seed, MIG detection, docker run, reconciler) and the JUnit-XML aggregation Python out of the workflow's inline run: blocks into version-controlled, lint-able scripts under .github/actions/multi-gpu/, alongside the existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181 lines and each step becomes a one-line call. Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT, $GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the YAML-only ${{ github.run_id }} templating. Also documents why the test job must not carry the gpu label.
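As an illustration of what a JUnit-XML aggregation step typically does, here is a minimal sketch that sums the standard counters across reports. It is not the script added under .github/actions/multi-gpu/; the function name is hypothetical, and it takes XML strings rather than file paths to stay self-contained.

```python
import xml.etree.ElementTree as ET

COUNTERS = ("tests", "failures", "errors", "skipped")


def aggregate(xml_texts):
    """Sum the standard JUnit counters over every <testsuite> in the given reports."""
    totals = {key: 0 for key in COUNTERS}
    for text in xml_texts:
        root = ET.fromstring(text)
        # A report root is either a bare <testsuite> or a <testsuites> wrapper.
        suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
        for suite in suites:
            for key in COUNTERS:
                totals[key] += int(suite.get(key, 0))
    return totals
```

A summary line for `$GITHUB_STEP_SUMMARY` could then be formatted from the returned dictionary.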

The ECR cache repo is resolved per runner pool (single-GPU runners -> gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the multi-GPU runner inside its pull step. Build on the multi-GPU pool so the image lands in multigpu-docker-cache and the test job's pull hits.

Move non-default-shard device selection out of a repo-root conftest.py into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py injects per file only on multi-GPU shards. The plugin keys off ISAACLAB_TEST_DEVICES so keep/drop matches each test's device parametrization, deselects out-of-scope variants (cpu, cuda:0, other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to OK. Confines the behavior to the lane instead of a global root conftest.
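The keep/drop rule and the exit-code mapping can be sketched as a small pytest plugin module. This is not the actual mgpu_shard_select plugin: the comma-separated format of ISAACLAB_TEST_DEVICES, the helper names, and the way device tokens are read from parametrization ids are all assumptions. The two hook names are standard pytest hooks, and the exit codes 5 and 0 correspond to `pytest.ExitCode.NO_TESTS_COLLECTED` and `pytest.ExitCode.OK`.

```python
import os
import re

_CUDA = re.compile(r"cuda:\d+")


def allowed_devices(environ=None):
    """Parse ISAACLAB_TEST_DEVICES (assumed comma-separated, e.g. "cuda:1,cuda:2")."""
    env = os.environ if environ is None else environ
    return {d.strip() for d in env.get("ISAACLAB_TEST_DEVICES", "").split(",") if d.strip()}


def param_devices(nodeid):
    """Return device tokens found in a node id's parametrization suffix, e.g. [cuda:1-4]."""
    if not nodeid.endswith("]") or "[" not in nodeid:
        return []
    params = nodeid[nodeid.rindex("[") + 1 : -1]
    return [p for p in params.split("-") if p == "cpu" or _CUDA.fullmatch(p)]


def keep(nodeid, allowed):
    """Keep only device-parametrized variants whose devices are all in scope."""
    devices = param_devices(nodeid)
    return bool(devices) and all(d in allowed for d in devices)


def pytest_collection_modifyitems(config, items):
    allowed = allowed_devices()
    if not allowed:
        return  # not on a multi-GPU shard: leave collection untouched
    dropped = [item for item in items if not keep(item.nodeid, allowed)]
    if dropped:
        config.hook.pytest_deselected(items=dropped)
        items[:] = [item for item in items if keep(item.nodeid, allowed)]


def pytest_sessionfinish(session, exitstatus):
    # 5 is NO_TESTS_COLLECTED; a file with every variant deselected is not a failure.
    if exitstatus == 5 and allowed_devices():
        session.exitstatus = 0
```

Keeping the selection in a plugin that is loaded only on multi-GPU shards means single-GPU runs never see these hooks.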

Annotate the uncommon bash idioms in the two multi-GPU orchestration scripts for non-bash readers. Remove the now-dead py-spy/gdb hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip install -- along with the matching changelog bullet, since the conftest-side capture they supported was removed.

# Conflicts:
#	source/isaaclab_ovphysx/test/assets/test_articulation.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py
#	tools/conftest.py

Run the complete multi-GPU suite five times without the Fabric GPU interop override, followed by five controls with it disabled. Preserve per-trial settings, queue state, logs, and JUnit reports so the flag's necessity can be evaluated on the target CI hardware.

hujc7 · 2026-06-30T09:24:01Z

Fabric GPU interop diagnostic result

The negative condition reproduced the multi-GPU CI instability.

Attempt 1 (5 omitted + 5 false): all 10 final trials passed with 17/17 files, zero JUnit failures, zero unclaimed files, and zero orphans. Negative trial 4 had a 120 s test_simulation_context.py startup hang on its first attempt and passed only after the built-in retry. All five explicit-false controls were clean. Requested/resolved setting markers were correct in every trial.
Attempt 2 (same SHA/image/runner): the first negative trial again omitted /physics/fabricUseGPUInterop, seeded all 17 files, then the entire A/B step was terminated after 526 s with exit 143 (SIGTERM). Runner cleanup found and terminated the orphaned Docker process. There was no workflow cancellation, timeout, or lost-runner message. The termination prevented reconciliation, artifact upload, and the remaining trials.

Conclusion: keep --/physics/fabricUseGPUInterop=false in the multi-GPU CI lane. The exact negative condition produced one recovered startup hang and then one fatal signal termination, while the five same-SHA controls completed cleanly. This validates the flag as a necessary practical CI mitigation; it does not establish that Fabric interop is the sole root cause or justify disabling it globally.

Runs: attempt 1, attempt 2.

hujc7 added 24 commits June 5, 2026 08:55

[Tests] Add test_devices helper for device-parametrized unit tests

f1c0d18

Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.
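The "scope intersect runtime" idea can be sketched as below. `select_devices` is a hypothetical stand-in for `test_devices`, whose real signature is not shown in this PR; the glob-style scope patterns, the comma-separated format of ISAACLAB_TEST_DEVICES, and the default runtime set are assumptions.

```python
import os
from fnmatch import fnmatchcase

# Assumption: without ISAACLAB_TEST_DEVICES, tests run on the usual single-GPU set.
DEFAULT_RUNTIME = ("cpu", "cuda:0")


def select_devices(scope=("cpu", "cuda:*"), environ=None):
    """Return the runtime devices that fall inside the scope a test declares."""
    env = os.environ if environ is None else environ
    raw = env.get("ISAACLAB_TEST_DEVICES", "")
    runtime = [d.strip() for d in raw.split(",") if d.strip()] or list(DEFAULT_RUNTIME)
    # Intersection: keep a runtime device only if some scope pattern matches it.
    return [d for d in runtime if any(fnmatchcase(d, pattern) for pattern in scope)]
```

A test would then parametrize over the returned list, for example `@pytest.mark.parametrize("device", select_devices(("cuda:*",)))`, so the multi-GPU lane narrows the devices without any change to test logic.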

[Newton] Fix Newton/Warp init on non-default CUDA devices

878990a

Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).

[CI] Forward extra env vars through run-tests / run-package-tests actions

6cef107

Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.

[CI] Add multi-GPU pytest workflow

71658df

Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.

[Tests] Parametrize unit tests over multi-GPU device scope

cef4569

Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.

TEMP: skip docker/docs/install-ci CI while iterating isaac-sim#5823

c514199

Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.

Revert "[CI] Forward extra env vars through run-tests / run-package-tests actions"

1e710b1

This reverts commit 6cef107.

Revert "TEMP: skip docker/docs/install-ci CI while iterating isaac-sim#5823"

4441f68

This reverts commit c514199.

Merge remote-tracking branch 'origin/develop' into jichuanh/multi-gpu-ci

3694e7f

Merge remote-tracking branch 'origin/develop' into jichuanh/multi-gpu-ci

2a18923

Refine multi-GPU test device selection

c799821

Unify multi-GPU test device selection

52a23ac

Make DeviceScope composable while preserving raw mask support. Derive Kit boot and shard reporting from ISAACLAB_TEST_DEVICES so test device selection has one source of truth.

Fix OVPhysX test merge resolution

14593ea

github-actions bot added the isaac-lab and infrastructure labels Jun 30, 2026

hujc7 mentioned this pull request Jun 30, 2026

[MGPU] App: make Kit renderer multi-GPU opt-in #5933

Open

5 tasks

hujc7 wants to merge 24 commits into isaac-sim:develop from hujc7:jichuanh/mgpu-integration-diagnostic
