[STRESS / DO NOT MERGE] Validate Fabric GPU interop necessity on multi-GPU CI #6300

Draft
hujc7 wants to merge 24 commits into isaac-sim:develop from hujc7:jichuanh/mgpu-integration-diagnostic

@hujc7 hujc7 commented Jun 30, 2026


1. Summary

  • Diagnostic-only PR; do not merge.
  • Validates whether --/physics/fabricUseGPUInterop=false is necessary for multi-GPU CI stability.
  • Reuses the production multi-GPU container, shard runner, queue reconciliation, and test selection.

2. Experiment

  • Run five negative trials first with the Fabric interop argument completely omitted.
  • Run five controls with --/physics/fabricUseGPUInterop=false.
  • Run all 17 currently eligible multi-GPU test files in every trial; ignore all MULTI_GPU_SKIP_REASON exclusions. The former 18th file, test_newton_model_utils.py, lost its multi-GPU device parametrization in #6194 ("Propagate Newton shape colors before cloning") and is no longer eligible.
  • Hold the three renderer pinning arguments and all other workflow settings constant.
  • Log the resolved Carb setting and preserve launcher logs, JUnit XML, aggregate summaries, and queue reconciliation for every trial.
  • Gate unrelated Docker and installation test jobs so this PR only consumes test capacity for this diagnostic.

3. Decision criteria

  • Evidence that the flag is needed: at least one negative trial reproduces the target crash, timeout, or queue-orphan failure while controls complete cleanly.
  • An invalid setting resolution, incomplete file set, unclaimed test, or orphaned test invalidates that trial.
  • If the first 5 + 5 trials are inconclusive, run another 5 + 5 before drawing a conclusion.
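The decision criteria above can be sketched as a small classifier. This is a minimal illustration, not the workflow's actual aggregation code: the `Trial` fields, the `verdict` function, and the returned strings are all hypothetical names. It also assumes the invalidation rule takes precedence, so a trial with unclaimed or orphaned tests is dropped before any evidence is counted.

```python
from dataclasses import dataclass

EXPECTED_FILES = 17  # eligible multi-GPU test files per trial (from the experiment plan)


@dataclass
class Trial:
    """Hypothetical per-trial summary; field names are assumptions."""
    condition: str             # "omitted" (negative) or "false" (control)
    setting_resolved_ok: bool  # resolved Carb setting matched the requested one
    files_run: int
    unclaimed: int
    orphaned: int
    target_failure: bool       # reproduced the target crash or timeout


def is_valid(trial: Trial) -> bool:
    """A trial counts only if its setting and queue bookkeeping are intact."""
    return (
        trial.setting_resolved_ok
        and trial.files_run == EXPECTED_FILES
        and trial.unclaimed == 0
        and trial.orphaned == 0
    )


def verdict(trials: list[Trial]) -> str:
    """Return "flag needed" or "inconclusive" for a batch of trials."""
    valid = [t for t in trials if is_valid(t)]
    negatives = [t for t in valid if t.condition == "omitted"]
    controls = [t for t in valid if t.condition == "false"]
    if not negatives or not controls:
        return "inconclusive"
    negative_failed = any(t.target_failure for t in negatives)
    controls_clean = not any(t.target_failure for t in controls)
    if negative_failed and controls_clean:
        return "flag needed"
    # Anything else triggers another 5 + 5 round under the plan above.
    return "inconclusive"
```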

hujc7 added 24 commits June 5, 2026 08:55
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
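The override semantics described in this commit can be sketched as follows. This is a simplified stand-in, not the real helper: `SimCfg` and `resolve_sim_cfg` are hypothetical names, and the real ``build_simulation_context`` does considerably more than resolve a device.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SimCfg:
    """Hypothetical stand-in for the simulation config."""
    device: str = "cuda:0"
    dt: float = 0.01


def resolve_sim_cfg(sim_cfg: Optional[SimCfg] = None, device: Optional[str] = None) -> SimCfg:
    """Build or reuse a config, then apply ``device`` as an override.

    ``device=None`` is the sentinel meaning "caller did not ask for a device",
    so an explicit value always wins, whether or not ``sim_cfg`` was passed.
    """
    cfg = sim_cfg if sim_cfg is not None else SimCfg()
    if device is not None:
        cfg = replace(cfg, device=device)
    return cfg
```

With this shape, passing both ``sim_cfg=SimCfg()`` and ``device="cuda:2"`` yields a config on ``cuda:2``, while omitting ``device`` leaves the config's own device untouched.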
Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it
appends Kit command-line overrides that pin the renderer to a single
GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1)
and disable the fabric GPU-interop path (physics.fabricUseGPUInterop=
false), so each Kit process touches only its assigned GPU instead of
enumerating every visible GPU at startup.

Used by the multi-GPU CI workflow to avoid a shared GPU-interop context
across concurrent sibling shards, which otherwise surfaces as
"Stage X already attached" errors and SimulationApp.close hangs (see
isaac-sim#3475). Off by default;
single-GPU and user-facing rendering paths are unchanged.
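A rough sketch of the opt-in behaviour described above. The function name, the set of values treated as truthy, and the slash-separated spelling of the three renderer settings are assumptions for illustration; only ``--/physics/fabricUseGPUInterop=false`` appears verbatim in this PR.

```python
import os
from typing import Mapping, Optional, Sequence

# Assumed Kit-style spellings of the settings named in the commit message.
PIN_ARGS = [
    "--/renderer/multiGpu/enabled=false",
    "--/renderer/multiGpu/autoEnable=false",
    "--/renderer/multiGpu/maxGpuCount=1",
    "--/physics/fabricUseGPUInterop=false",
]

# Which strings count as "truthy" is an assumption of this sketch.
TRUTHY = {"1", "true", "yes", "on"}


def kit_args_with_pinning(
    base_args: Sequence[str], environ: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Append the pinning overrides only when ISAACLAB_PIN_KIT_GPU is truthy."""
    env = os.environ if environ is None else environ
    value = env.get("ISAACLAB_PIN_KIT_GPU", "").strip().lower()
    args = list(base_args)
    if value in TRUTHY:
        args.extend(PIN_ARGS)
    return args
```

Because the variable is unset by default, callers outside the multi-GPU lane get their arguments back unchanged.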
Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.
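One way to read "scope-intersect-runtime" is as a set intersection that preserves the test's declared order. The sketch below is an assumption about that idea, not the real test_devices helper: `select_devices` is a hypothetical name, and the comma-separated format for ISAACLAB_TEST_DEVICES is assumed.

```python
from typing import Optional, Sequence


def select_devices(
    scope: Sequence[str],
    available: Sequence[str],
    env_value: Optional[str] = None,
) -> list[str]:
    """Return the devices a test should be parametrized over.

    scope:      devices the test declares itself valid on
    available:  devices present at runtime
    env_value:  optional narrowing list, e.g. the value of ISAACLAB_TEST_DEVICES
    """
    selected = [d for d in scope if d in available]
    if env_value:
        requested = {d.strip() for d in env_value.split(",") if d.strip()}
        selected = [d for d in selected if d in requested]
    return selected
```

An empty result means the test has nothing to run on the current shard, which is the case the multi-GPU lane has to treat as a skip rather than a failure.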
Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).
…test

tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.
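The directory-rename queue relies on rename being atomic on a single filesystem: whichever shard renames an entry first owns it. The sketch below illustrates that idea only. The function names, the one-directory-per-test-file layout, and the `<entry>.<shard>` naming are assumptions, not the layout used by tools/conftest.py.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def seed_queue(queue: Path, entries: Iterable[str]) -> None:
    """Create the three state directories and one claimable entry per test file."""
    for state in ("claim", "inflight", "done"):
        (queue / state).mkdir(parents=True, exist_ok=True)
    for name in entries:
        (queue / "claim" / name).mkdir()


def claim_next(queue: Path, shard: str) -> Optional[Path]:
    """Atomically move one entry from claim/ to inflight/, or return None."""
    for entry in sorted((queue / "claim").iterdir()):
        target = queue / "inflight" / f"{entry.name}.{shard}"
        try:
            os.rename(entry, target)
        except FileNotFoundError:
            # Another shard renamed this entry first; try the next one.
            continue
        return target
    return None


def mark_done(queue: Path, inflight_entry: Path) -> None:
    """Move a finished entry from inflight/ to done/."""
    os.rename(inflight_entry, queue / "done" / inflight_entry.name)


def unfinished(queue: Path) -> list[str]:
    """Reconciliation helper: entries never claimed or never completed."""
    leftover = list((queue / "claim").iterdir()) + list((queue / "inflight").iterdir())
    return sorted(p.name for p in leftover)
```

After all shards exit, anything left in claim/ was never picked up and anything left in inflight/ was orphaned by a shard that died mid-file, which are the two conditions the queue reconciliation reports.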
…ions

Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.

Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.

Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.
…cher

When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.
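The precedence described above (explicit argument first, then the environment variable) might look like the following. `resolve_device` is a hypothetical name, and the `cuda:0` fallback is an assumption based on the default mentioned earlier in this PR.

```python
import os
from typing import Mapping, Optional


def resolve_device(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    default: str = "cuda:0",
) -> str:
    """Pick the simulation device: explicit argument, then env var, then default."""
    if explicit is not None:
        return explicit
    env = os.environ if environ is None else environ
    # An empty ISAACLAB_SIM_DEVICE is treated the same as an unset one.
    return env.get("ISAACLAB_SIM_DEVICE") or default
```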
Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.

The test job's runs-on carried the `gpu` label, which in the
self-hosted fleet tags single-GPU runners. Requiring both `gpu` and
`multi-gpu` routed the job onto a single-GPU box, so the shard runner
aborted with "Need at least 2 visible devices; found 1"; the
over-constrained label also left the job queued for hours. Drop `gpu`
so the job targets the multi-GPU pool.

Move the host-orchestration bash (symlink, work-queue seed, MIG
detection, docker run, reconciler) and the JUnit-XML aggregation Python
out of the workflow's inline run: blocks into version-controlled,
lint-able scripts under .github/actions/multi-gpu/, alongside the
existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181
lines and each step becomes a one-line call.

Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT,
$GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses
the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the
YAML-only ${{ github.run_id }} templating. Also documents why the test
job must not carry the gpu label.

The ECR cache repo is resolved per runner pool (single-GPU runners ->
gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building
on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the
multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the
multi-GPU runner inside its pull step. Build on the multi-GPU pool so the
image lands in multigpu-docker-cache and the test job's pull hits.

Move non-default-shard device selection out of a repo-root conftest.py
into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py
injects per file only on multi-GPU shards. The plugin keys off
ISAACLAB_TEST_DEVICES so keep/drop matches each test's device
parametrization, deselects out-of-scope variants (cpu, cuda:0,
other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to
OK. Confines the behavior to the lane instead of a global root conftest.

Annotate the uncommon bash idioms in the two multi-GPU orchestration
scripts for non-bash readers. Remove the now-dead py-spy/gdb
hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip
install -- along with the matching changelog bullet, since the
conftest-side capture they supported was removed.
# Conflicts:
#	source/isaaclab_ovphysx/test/assets/test_articulation.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py
#	tools/conftest.py
Make DeviceScope composable while preserving raw mask support. Derive Kit boot and shard reporting from ISAACLAB_TEST_DEVICES so test device selection has one source of truth.

Run the complete multi-GPU suite five times without the Fabric GPU
interop override, followed by five controls with it disabled. Preserve
per-trial settings, queue state, logs, and JUnit reports so the flag's
necessity can be evaluated on the target CI hardware.
github-actions bot added the isaac-lab (Related to Isaac Lab team) and infrastructure labels Jun 30, 2026
hujc7 commented Jun 30, 2026


Fabric GPU interop diagnostic result

The negative condition reproduced the multi-GPU CI instability.

  • Attempt 1 (5 omitted + 5 false): all 10 final trials passed with 17/17 files, zero JUnit failures, zero unclaimed files, and zero orphans. Negative trial 4 had a 120 s test_simulation_context.py startup hang on its first attempt and passed only after the built-in retry. All five explicit-false controls were clean. Requested/resolved setting markers were correct in every trial.
  • Attempt 2 (same SHA/image/runner): the first negative trial again omitted /physics/fabricUseGPUInterop, seeded all 17 files, then the entire A/B step was terminated after 526 s with exit 143 (SIGTERM). Runner cleanup found and terminated the orphaned Docker process. There was no workflow cancellation, timeout, or lost-runner message. The termination prevented reconciliation, artifact upload, and the remaining trials.

Conclusion: keep --/physics/fabricUseGPUInterop=false in the multi-GPU CI lane. The exact negative condition produced one recovered startup hang and then one fatal signal termination, while the five same-SHA controls completed cleanly. This validates the flag as a necessary practical CI mitigation; it does not establish that Fabric interop is the sole root cause or justify disabling it globally.

Runs: attempt 1, attempt 2.
