[STRESS / DO NOT MERGE] Validate Fabric GPU interop necessity on multi-GPU CI#6300
Draft
hujc7 wants to merge 24 commits into
Draft
[STRESS / DO NOT MERGE] Validate Fabric GPU interop necessity on multi-GPU CI#6300hujc7 wants to merge 24 commits into
hujc7 wants to merge 24 commits into
Conversation
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::
RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
trying to launch on device='cuda:0',
but input array for argument 'env_ids' is on device=cuda:2.
Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it appends Kit command-line overrides that pin the renderer to a single GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1) and disable the fabric GPU-interop path (physics.fabricUseGPUInterop= false), so each Kit process touches only its assigned GPU instead of enumerating every visible GPU at startup. Used by the multi-GPU CI workflow to avoid a shared GPU-interop context across concurrent sibling shards, which otherwise surfaces as "Stage X already attached" errors and SimulationApp.close hangs (see isaac-sim#3475). Off by default; single-GPU and user-facing rendering paths are unchanged.
Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.
Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).
…test tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.
…ions Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.
Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.
Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.
…cher When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.
Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.
The test job's runs-on carried the `gpu` label, which in the self-hosted fleet tags single-GPU runners. Requiring both `gpu` and `multi-gpu` routed the job onto a single-GPU box, so the shard runner aborted with "Need at least 2 visible devices; found 1"; the over-constrained label also left the job queued for hours. Drop `gpu` so the job targets the multi-GPU pool.
Move the host-orchestration bash (symlink, work-queue seed, MIG
detection, docker run, reconciler) and the JUnit-XML aggregation Python
out of the workflow's inline run: blocks into version-controlled,
lint-able scripts under .github/actions/multi-gpu/, alongside the
existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181
lines and each step becomes a one-line call.
Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT,
$GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses
the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the
YAML-only ${{ github.run_id }} templating. Also documents why the test
job must not carry the gpu label.
The ECR cache repo is resolved per runner pool (single-GPU runners -> gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the multi-GPU runner inside its pull step. Build on the multi-GPU pool so the image lands in multigpu-docker-cache and the test job's pull hits.
Move non-default-shard device selection out of a repo-root conftest.py into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py injects per file only on multi-GPU shards. The plugin keys off ISAACLAB_TEST_DEVICES so keep/drop matches each test's device parametrization, deselects out-of-scope variants (cpu, cuda:0, other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to OK. Confines the behavior to the lane instead of a global root conftest.
Annotate the uncommon bash idioms in the two multi-GPU orchestration scripts for non-bash readers. Remove the now-dead py-spy/gdb hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip install -- along with the matching changelog bullet, since the conftest-side capture they supported was removed.
…ests actions" This reverts commit 6cef107.
# Conflicts: # source/isaaclab_ovphysx/test/assets/test_articulation.py # source/isaaclab_ovphysx/test/assets/test_rigid_object.py # source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py # tools/conftest.py
Make DeviceScope composable while preserving raw mask support. Derive Kit boot and shard reporting from ISAACLAB_TEST_DEVICES so test device selection has one source of truth.
Run the complete multi-GPU suite five times without the Fabric GPU\ninterop override, followed by five controls with it disabled. Preserve\nper-trial settings, queue state, logs, and JUnit reports so the flag's\nnecessity can be evaluated on the target CI hardware.
Collaborator
Author
Fabric GPU interop diagnostic resultThe negative condition reproduced the multi-GPU CI instability.
Conclusion: keep |
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
1. Summary
--/physics/fabricUseGPUInterop=falseis necessary for multi-GPU CI stability.2. Experiment
--/physics/fabricUseGPUInterop=false.MULTI_GPU_SKIP_REASONexclusions. The former 18th file,test_newton_model_utils.py, no longer has multi-GPU device parametrization after Propagate Newton shape colors before cloning #6194.3. Decision criteria