[Deepin-Kernel-SIG] [linux 6.6.y] [Upstream] Add support for multishot reads Add io_uring support for waitid#1815
Conversation
Reviewer's GuideAdds io_uring multishot read support via a new IORING_OP_READ_MULTISHOT opcode and introduces asynchronous waitid support via a new IORING_OP_WAITID opcode, including core wait/waitid refactoring in kernel/exit.c to share logic with io_uring and to support safe cancellation and wakeup races. Sequence diagram for io_uring WAITID async lifecycle and cancellationsequenceDiagram
actor App
participant IoUring as io_uring
participant WaitidPrep as io_waitid_prep
participant WaitidIssue as io_waitid
participant KWPrep as kernel_waitid_prepare
participant DoWait as __do_wait
participant WQ as wait_chldexit_queue
participant WQCb as io_waitid_wait
participant TW as io_waitid_cb
participant Cancel as io_waitid_cancel
App->>IoUring: submit SQE (opcode IORING_OP_WAITID)
IoUring->>WaitidPrep: io_waitid_prep(sqe)
WaitidPrep-->>IoUring: set io_waitid fields
IoUring->>WaitidIssue: io_waitid(req, issue_flags)
WaitidIssue->>KWPrep: kernel_waitid_prepare(&iwa.wo, which, upid, &info, options, NULL)
KWPrep-->>WaitidIssue: 0 or error
alt prepare error
WaitidIssue-->>IoUring: io_req_set_res(error)
else prepared
WaitidIssue->>WQ: add_wait_queue(wait_chldexit_queue, &iwa.wo.child_wait)
WaitidIssue->>DoWait: __do_wait(&iwa.wo)
alt child already has matching state
DoWait-->>WaitidIssue: ret >= 0 or < 0
WaitidIssue->>IoUring: io_waitid_finish(ret)
else need to sleep
DoWait-->>WaitidIssue: -ERESTARTSYS
Note over WaitidIssue,WQ: armed for async completion
end
end
par child_state_change
WQ->>WQCb: io_waitid_wait(child_wait, mode, sync, task)
alt pid_child_should_wake
WQCb->>TW: io_req_task_work_add(req)
end
and async_task_work
TW->>DoWait: __do_wait(&iwa.wo)
DoWait-->>TW: ret
TW->>WaitidIssue: io_waitid_complete(req, ret)
WaitidIssue->>IoUring: io_req_task_complete(req)
and cancellation
App->>Cancel: io_waitid_cancel(ctx, &cd, issue_flags)
Cancel->>WaitidIssue: __io_waitid_cancel(ctx, req)
alt first owner
WaitidIssue->>WQ: list_del_init(&iwa.wo.child_wait.entry)
WaitidIssue->>IoUring: io_waitid_complete(req, -ECANCELED)
end
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
kernel_waitid_prepare()you initializef_flagsto 0 and never assign to it, so theO_NONBLOCK-based WNOHANG behavior is effectively dead; either removef_flagsentirely or plumb the intended file flags in so that pidfd/non-blocking semantics are preserved. - The new
io_uring/waitid.hheader lacks an include guard, which can lead to multiple-definition issues if it is included more than once; add a conventional#ifndef/#define/#endifguard around its contents. - In
struct io_waitidthestruct file *filemember is never used and can be removed to avoid confusion and keep the structure minimal.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `kernel_waitid_prepare()` you initialize `f_flags` to 0 and never assign to it, so the `O_NONBLOCK`-based WNOHANG behavior is effectively dead; either remove `f_flags` entirely or plumb the intended file flags in so that pidfd/non-blocking semantics are preserved.
- The new `io_uring/waitid.h` header lacks an include guard, which can lead to multiple-definition issues if it is included more than once; add a conventional `#ifndef/#define/#endif` guard around its contents.
- In `struct io_waitid` the `struct file *file` member is never used and can be removed to avoid confusion and keep the structure minimal.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Pull request overview
This pull request backports upstream io_uring enhancements by adding a new multishot read opcode and adding async waitid(2) support, including necessary refactoring of the core wait logic in kernel/exit.c so it can be reused from io_uring.
Changes:
- Add
IORING_OP_READ_MULTISHOTto support multishot reads using provided buffers on pollable file types. - Add
IORING_OP_WAITIDand a newio_uring/waitid.cimplementation for asynchronouswaitid(2)notifications, with integration into io_uring cancellation and teardown. - Refactor wait/waitid internals by introducing shared helpers (
__do_wait(),kernel_waitid_prepare(),pid_child_should_wake()) and adjusting rw vectored handling to use per-op metadata.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| kernel/exit.h | New internal header exposing wait helpers/structs for reuse by io_uring. |
| kernel/exit.c | Refactors do_wait() logic into __do_wait() and adds shared helper functions. |
| io_uring/waitid.h | Declares io_uring async waitid interfaces and async state container. |
| io_uring/waitid.c | Implements the IORING_OP_WAITID opcode and cancellation/removal support. |
| io_uring/rw.h | Exposes new multishot read prep/issue helpers. |
| io_uring/rw.c | Adds multishot read implementation and updates vectored detection logic. |
| io_uring/opdef.h | Adds a per-op vectored metadata bit to io_issue_def. |
| io_uring/opdef.c | Registers READ_MULTISHOT and WAITID opcodes and marks vectored ops. |
| io_uring/Makefile | Adds waitid.o to the io_uring build. |
| io_uring/io_uring.c | Initializes ctx->waitid_list and cancels waitid requests on teardown. |
| io_uring/cancel.c | Hooks async cancel into WAITID cancellation handling. |
| include/uapi/linux/io_uring.h | Extends SQE union and adds opcode IDs for READ_MULTISHOT and WAITID. |
| include/linux/io_uring_types.h | Adds waitid_list to io_ring_ctx for tracking outstanding waitid requests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| read_lock(&tasklist_lock); | ||
|
|
||
| if (wo->wo_type == PIDTYPE_PID) { | ||
| retval = do_wait_pid(wo); | ||
| if (retval) |
| static long do_wait(struct wait_opts *wo) | ||
| { | ||
| int retval; | ||
|
|
| // SPDX-License-Identifier: GPL-2.0-only | ||
| #ifndef LINUX_WAITID_H | ||
| #define LINUX_WAITID_H | ||
|
|
mainline inclusion
from mainline-v6.7-rc1
category: performance
This is cleaner than gating on the opcode type, particularly as more
read/write type opcodes may be added.
Then we can use that for the data import, and for __io_read() on
whether or not we need to copy state.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
io_uring/rw.c
[because of ("io_uring: add io_file_can_poll() helper") merged before]
(cherry picked from commit d2d778f)
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance This behaves like IORING_OP_READ, except: 1) It only supports pollable files (eg pipes, sockets, etc). Note that for sockets, you probably want to use recv/recvmsg with multishot instead. 2) It supports multishot mode, meaning it will repeatedly trigger a read and fill a buffer when data is available. This allows similar use to recv/recvmsg but on non-sockets, where a single request will repeatedly post a CQE whenever data is read from it. 3) Because of #2, it must be used with provided buffers. This is uniformly true across any request type that supports multishot and transfers data, with the reason being that it's obviously not possible to pass in a single buffer for the data, as multiple reads may very well trigger before an application has a chance to process previous CQEs and the data passed from them. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: io_uring/rw.c (cherry picked from commit fc68fcd) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: other Abstract out the helper that decides if we should wake up following a wake_up() callback on our internal waitqueue. No functional changes intended in this patch. Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 9d900d4) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: other Rather than have a maze of gotos, put the actual logic in __do_wait() and have do_wait() loop deal with waitqueue setup/teardown and whether to call __do_wait() again. No functional changes intended in this patch. Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 06a101c) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: other Move the setup logic out of kernel_waitid(), and into a separate helper. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit eda7e9d) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance Move struct wait_opts and waitid_info into kernel/exit.h, and include function declarations for the recently added helpers. Make them non-static as well. This is in preparation for adding a waitid operation through io_uring. With the abtracted helpers, this is now possible. Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2e521a2) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance This adds support for an async version of waitid(2), in a fully async version. If an event isn't immediately available, wait for a callback to trigger a retry. The format of the sqe is as follows: sqe->len The 'which', the idtype being queried/waited for. sqe->fd The 'pid' (or id) being waited for. sqe->file_index The 'options' being set. sqe->addr2 A pointer to siginfo_t, if any, being filled in. buf_index, add3, and waitid_flags are reserved/unused for now. waitid_flags will be used for options for this request type. One interesting use case may be to add multi-shot support, so that the request stays armed and posts a notification every time a monitored process state change occurs. Note that this does not support rusage, on Arnd's recommendation. See the waitid(2) man page for details on the arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f31ecf6) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: performance Retain top 8bits of uring_cmd flags for kernel internal use, so that we can move IORING_URING_CMD_POLLED out of uapi header. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 528ce67) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: feature uring_cmd may never complete, such as ublk, in which uring cmd isn't completed until one new block request is coming from ublk block device. Add cancelable uring_cmd to provide mechanism to driver for cancelling pending commands in its own way. Add API of io_uring_cmd_mark_cancelable() for driver to mark one command as cancelable, then io_uring will cancel this command in io_uring_cancel_generic(). ->uring_cmd() callback is reused for canceling command in driver's way, then driver gets notified with the cancelling from io_uring. Add API of io_uring_cmd_get_task() to help driver cancel handler deal with the canceling. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: include/linux/io_uring.h (cherry picked from commit 93b8cc6) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
b6407e6 to
4edc681
Compare
mainline inclusion from mainline-v7.1-rc5 category: bugfix IORING_OP_WAITID stores its result fields in struct io_waitid::info and later copies them to userspace siginfo. The prep path initializes the request arguments, but it does not initialize info itself. If the wait operation completes without reporting a child event, the common wait code can return without writing wo_info. In that case io_waitid_finish() still copies iw->info to userspace, exposing stale bytes from the reused io_kiocb command storage. Clear the result storage during prep so the io_uring path matches the regular waitid syscall, which uses a zero-initialized struct waitid_info. Fixes: f31ecf6 ("io_uring: add IORING_OP_WAITID support") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Heechan Kang <gganji11@naver.com> Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 93d93f5f8da791e98159795c6ef683f45bd95d13) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
…eed it mainline inclusion from mainline-v6.7-rc1 category: bugfix The new read multishot method doesn't need to allocate async data ever, as it doesn't do vectored IO and it must only be used with provided buffers. While it doesn't have ->prep_async() set, it also sets ->async_size to 0, which is different from any other read/write type we otherwise support. If it's used on a file type that isn't pollable, we do try and allocate this async data, and then try and use that data. But since we passed in a size of 0 for the data, we get a NULL back on data allocation. We then proceed to dereference that to copy state, and that obviously won't end well. Add a check in io_setup_async_rw() for this condition, and avoid copying state. Also add a check for whether or not buffer selection is specified in prep while at it. Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218101 Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 0df96fb) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: bugfix For addr: this field is not used, since buffer select is forced. But by forcing it to be zero it leaves open future uses of the field. len is actually usable, you could imagine that you want to receive multishot up to a certain length. However right now this is not how it is implemented, and it seems safer to force this to be zero. Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-3-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 49fbe99) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.7-rc1 category: bugfix When doing a multishot read, the code path reuses the old read paths. However this breaks an assumption built into those paths, namely that struct io_rw::len is available for reuse by __io_import_iovec. For multishot this results in len being set for the first receive call, and then subsequent calls are clamped to that buffer length incorrectly. Instead keep len as zero after recycling buffers, to reuse the full buffer size of the next selected buffer. Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-4-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit e537592) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.8-rc3 category: bugfix io_read_mshot() always relies on poll triggering retries, and this works fine as long as we do a retry per size of the buffer being read. The buffer size is given by the size of the buffer(s) in the given buffer group ID. But if we're reading less than what is available, then we don't always get to read everything that is available. For example, if the buffers available are 32 bytes and we have 64 bytes to read, then we'll correctly read the first 32 bytes and then wait for another poll trigger before we attempt the next read. This next poll trigger may never happen, in which case we just sit forever and never make progress, or it may trigger at some point in the future, and now we're just delivering the available data much later than we should have. io_read_mshot() could do retries itself, but that is wasteful as we'll be going through all of __io_read() again, and most likely in vain. Rather than do that, bump our poll reference count and have io_poll_check_events() do one more loop and check with vfs_poll() if we have more data to read. If we do, io_read_mshot() will get invoked again directly and we'll read the next chunk. io_poll_multishot_retry() must only get called from inside io_poll_issue(), which is our multishot retry handler, as we know we already "own" the request at this point. Cc: stable@vger.kernel.org Link: axboe/liburing#1041 Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit c79f52f) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.9-rc1 category: bugfix We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions are handled but aux should be explicitly disallowed by opcode handlers. Cc: stable@vger.kernel.org Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 70581dc) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.9-rc1 category: bugfix If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again. Cc: stable@vger.kernel.org Reported-by: Sargun Dhillon <sargun@meta.com> Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 0a3737d) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.9-rc1 category: bugfix We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our waitid list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Remove the dead check in cancelation as well for the hash_node being empty or not. We already have a waitid reference check for ownership, so we don't need to check the list too. Cc: stable@vger.kernel.org Fixes: f31ecf6 ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2b35b8b) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.9-rc3 category: bugfix Supporting multishot reads requires support for NOWAIT, as the alternative would be always having io-wq execute the work item whenever the poll readiness triggered. Any fast file type will have NOWAIT support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the given file type does not, then simply resort to single shot execution. Cc: stable@vger.kernel.org Fixes: fc68fcd ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2a975d4) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
mainline inclusion from mainline-v6.18-rc1 category: bugfix For a successful return, always remove our entry from the wait queue entry list. Previously this was skipped if a cancelation was in progress, but this can race with another invocation of the wait queue entry callback. Cc: stable@vger.kernel.org Fixes: f31ecf6 ("io_uring: add IORING_OP_WAITID support") Reported-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Tested-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/68e5195e.050a0220.256323.001f.GAE@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2f8229d) Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
https://lore.kernel.org/all/20230911204021.1479172-1-axboe@kernel.dk/
Hi,
We support multishot for other request types, generally in the shape of
a flag for the request. Doing a flag based approach with reads isn't
straightforward, as the read/write flags are in the RWF_ space. Instead,
add a separate opcode for this, IORING_OP_READ_MULTISHOT.
This can only be used provided buffers, like other multishot request
types that read/receive data.
It can also only be used for pollable file types, like a tun device or
pipes, for example. File types that are always readable (or seekable),
like regular files, cannot be used with multishot reads.
--
Jens Axboe
https://lore.kernel.org/all/20230909151124.1229695-1-axboe@kernel.dk/
Hi,
This adds support for IORING_OP_WAITID, which is an async variant of
the waitid(2) syscall. Rather than have a parent need to block waiting
on a child task state change, it can now simply get an async notication
when the requested state change has occured.
Patches 1..4 are purely prep patches, and should not have functional
changes. They split out parts of do_wait() into __do_wait(), so that
the prepare-to-wait and sleep parts are contained within do_wait().
Patch 5 adds io_uring support.
I wrote a few basic tests for this, which can be found in the
'waitid' branch of liburing:
https://git.kernel.dk/cgit/liburing/log/?h=waitid
Also spun a custom kernel for someone to test it, and no issues reported
so far.
The code can also be found here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-waitid
include/linux/io_uring_types.h | 2 +
include/uapi/linux/io_uring.h | 2 +
io_uring/Makefile | 3 +-
io_uring/cancel.c | 5 +
io_uring/io_uring.c | 3 +
io_uring/opdef.c | 10 +-
io_uring/waitid.c | 372 +++++++++++++++++++++++++++++++++
io_uring/waitid.h | 15 ++
kernel/exit.c | 131 ++++++------
kernel/exit.h | 30 +++
10 files changed, 512 insertions(+), 61 deletions(-)
Changes since v3:
this means that the opcode values have changed. The liburing repo
has been rebased as a result as well, you'll want to update that too.
liburing also has update test cases.
scheme similar to the internal poll io_uring handling
--
Jens Axboe
Summary by Sourcery
Add io_uring support for multishot reads and async waitid, along with the necessary wait/waitid refactoring and integration.
New Features:
Enhancements:
Build: