Skip to content

bdev/nvme: abort in-flight qpair I/O when destructing a controller#55

Open
schmidt-scaled wants to merge 1 commit into
nvmf-blocking-from-basefrom
fix/destruct-abort-inflight-qpair-io
Open

bdev/nvme: abort in-flight qpair I/O when destructing a controller#55
schmidt-scaled wants to merge 1 commit into
nvmf-blocking-from-basefrom
fix/destruct-abort-inflight-qpair-io

Conversation

@schmidt-scaled

Copy link
Copy Markdown

Problem

Once a controller enters destruct, both bdev_nvme_reset_ctrlr_unsafe() and bdev_nvme_failover_ctrlr_unsafe() return -ENXIO (if (nvme_ctrlr->destruct) return -ENXIO;). So after a bdev_nvme_delete, the reset/failover path can no longer tear down the I/O qpairs.

If the transport peer is unreachable (remote target stopped), I/O already submitted to a qpair is:

  • never completed by a wire response, and
  • never aborted by a reset (suppressed above).

Consumers keep their channels open until that I/O completes, which pins the controller ref indefinitely and blocks the destruct from ever finishing.

Field evidence

Seen on a remote_jm bdev whose journal-client held a perpetually-outstanding heartbeat read:

  1. 09:21:08 — graceful delete → destruct = true, SPDK_BDEV_EVENT_REMOVE delivered; JC keeps its channel open waiting on the in-flight read. Sibling remote_alceml devices on the same node tore down cleanly (no long-outstanding read).
  2. Peer vanished. Admin-queue failure (poll_adminq → failover) and the per-I/O timeout (timeout_cb → reset) both requested recovery and both returned -ENXIO because destruct was already set. No reset, no qpair teardown.
  3. Controller sat as a ghost — Submitting Keep Alive failed every ~5s — for ~10 minutes, until an unrelated leadership change made the JC release the read and finally close the channel.

Root cause: destruct disables the only mechanism that aborts in-flight qpair I/O, but performs no qpair teardown of its own.

Fix

On destruct, proactively disconnect the I/O qpairs (unless a reset is already doing so), reusing the existing bdev_nvme_reset_destroy_qpair iteration. dont_retry is set so aborted I/O is failed (DNR) rather than retried. This lets consumers complete their I/O and close their channels so the destruct can complete.

  • bdev_nvme_destruct_abort_qpairs() — guarded (skips if resetting), takes a ref, iterates channels.
  • bdev_nvme_destruct_abort_qpairs_done() — drops the ref after the async iteration.
  • _nvme_ctrlr_destruct() — calls it between depopulate_namespaces and put_ref.

Test

test_destruct_abort_inflight_io (in bdev_nvme_ut.c): two consumers hold channels with connected qpairs, then bdev_nvme_delete triggers destruct with no reset. Asserts resetting == false yet both ctrlr_ch->qpair->qpair == NULL after polling — i.e. destruct disconnected the qpairs on its own. Before the fix these stay non-NULL (the deadlock).

Not yet verified

⚠️ Not compiled/run in this environment (no CUnit / unpopulated submodules / not ./configured). Needs a local ./test/unit/unittest.sh (or targeted make -C test/unit/lib/bdev/nvme) pass before merge. Note the SPDK-fork CI does not currently run unit tests.

🤖 Generated with Claude Code

Once a controller is being destructed, bdev_nvme_reset_ctrlr_unsafe() and
bdev_nvme_failover_ctrlr_unsafe() both return -ENXIO, so the reset/failover
path can no longer tear down the I/O qpairs. If the transport peer is
unreachable (e.g. the remote target was stopped), I/O already submitted to a
qpair is never completed by a wire response and never aborted by a reset.
Consumers keep their channels open until that I/O finishes, which pins the
controller indefinitely and blocks the destruct from ever completing.

This was observed in the field on a remote_jm bdev whose journal-client held a
perpetually-outstanding heartbeat read: a graceful delete set destruct=true, the
peer then vanished, the admin-queue-failure and per-I/O-timeout both requested a
reset that returned -ENXIO, and the controller sat as a ghost (Keep Alive failed
every ~5s) for ~10 minutes until an unrelated leadership change made the client
release the read.

Fix: on destruct, proactively disconnect the I/O qpairs (unless a reset is
already doing so), reusing the existing bdev_nvme_reset_destroy_qpair path. This
aborts the outstanding I/O with DNR set so it is failed rather than retried,
letting consumers complete their I/O and close their channels so the destruct
can finish.

Adds test_destruct_abort_inflight_io to verify that destruct disconnects the
qpairs on its own, with no reset in progress and channels still held.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant