bdev/nvme: abort in-flight qpair I/O when destructing a controller#55
Open
schmidt-scaled wants to merge 1 commit into
Open
bdev/nvme: abort in-flight qpair I/O when destructing a controller#55schmidt-scaled wants to merge 1 commit into
schmidt-scaled wants to merge 1 commit into
Conversation
Once a controller is being destructed, bdev_nvme_reset_ctrlr_unsafe() and bdev_nvme_failover_ctrlr_unsafe() both return -ENXIO, so the reset/failover path can no longer tear down the I/O qpairs. If the transport peer is unreachable (e.g. the remote target was stopped), I/O already submitted to a qpair is never completed by a wire response and never aborted by a reset. Consumers keep their channels open until that I/O finishes, which pins the controller indefinitely and blocks the destruct from ever completing. This was observed in the field on a remote_jm bdev whose journal-client held a perpetually-outstanding heartbeat read: a graceful delete set destruct=true, the peer then vanished, the admin-queue-failure and per-I/O-timeout both requested a reset that returned -ENXIO, and the controller sat as a ghost (Keep Alive failed every ~5s) for ~10 minutes until an unrelated leadership change made the client release the read. Fix: on destruct, proactively disconnect the I/O qpairs (unless a reset is already doing so), reusing the existing bdev_nvme_reset_destroy_qpair path. This aborts the outstanding I/O with DNR set so it is failed rather than retried, letting consumers complete their I/O and close their channels so the destruct can finish. Adds test_destruct_abort_inflight_io to verify that destruct disconnects the qpairs on its own, with no reset in progress and channels still held. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Once a controller enters destruct, both
bdev_nvme_reset_ctrlr_unsafe()andbdev_nvme_failover_ctrlr_unsafe()return-ENXIO(if (nvme_ctrlr->destruct) return -ENXIO;). So after abdev_nvme_delete, the reset/failover path can no longer tear down the I/O qpairs.If the transport peer is unreachable (remote target stopped), I/O already submitted to a qpair is:
Consumers keep their channels open until that I/O completes, which pins the controller ref indefinitely and blocks the destruct from ever finishing.
Field evidence
Seen on a
remote_jmbdev whose journal-client held a perpetually-outstanding heartbeat read:09:21:08— graceful delete →destruct = true,SPDK_BDEV_EVENT_REMOVEdelivered; JC keeps its channel open waiting on the in-flight read. Siblingremote_alcemldevices on the same node tore down cleanly (no long-outstanding read).poll_adminq→ failover) and the per-I/O timeout (timeout_cb→ reset) both requested recovery and both returned-ENXIObecausedestructwas already set. No reset, no qpair teardown.Submitting Keep Alive failedevery ~5s — for ~10 minutes, until an unrelated leadership change made the JC release the read and finally close the channel.Root cause: destruct disables the only mechanism that aborts in-flight qpair I/O, but performs no qpair teardown of its own.
Fix
On destruct, proactively disconnect the I/O qpairs (unless a reset is already doing so), reusing the existing
bdev_nvme_reset_destroy_qpairiteration.dont_retryis set so aborted I/O is failed (DNR) rather than retried. This lets consumers complete their I/O and close their channels so the destruct can complete.bdev_nvme_destruct_abort_qpairs()— guarded (skips ifresetting), takes a ref, iterates channels.bdev_nvme_destruct_abort_qpairs_done()— drops the ref after the async iteration._nvme_ctrlr_destruct()— calls it betweendepopulate_namespacesandput_ref.Test
test_destruct_abort_inflight_io(inbdev_nvme_ut.c): two consumers hold channels with connected qpairs, thenbdev_nvme_deletetriggers destruct with no reset. Assertsresetting == falseyet bothctrlr_ch->qpair->qpair == NULLafter polling — i.e. destruct disconnected the qpairs on its own. Before the fix these stay non-NULL (the deadlock).Not yet verified
./configured). Needs a local./test/unit/unittest.sh(or targetedmake -C test/unit/lib/bdev/nvme) pass before merge. Note the SPDK-fork CI does not currently run unit tests.🤖 Generated with Claude Code