[24.04-linux-nvidia-6.17] Backport nexthop removal performance optimization by gobenji · Pull Request #478 · NVIDIA/NV-Kernels

gobenji · 2026-06-30T13:55:16Z

Backport the following series to the linux-nvidia kernel:

Series merged in bb3402f Merge branch 'ipv4-flush-the-fib-once-on-multiple-nexthop-removal'
31c777b ipv4: Provide a FIB flushing signal from nexthop removal functions
35ce551 ipv4: Flush the FIB once on multiple nexthop removal
5dcbd64 ipv4: Add __must_check to nexthop removal functions

The incentive is to get the performance optimization described in the log of commit 35ce551.

LP: https://bugs.launchpad.net/bugs/2158449
RM: 5088664

Ignore: yes Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2158201 Properties: no-test-build Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2157912 Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected. This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation. Cc: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (backported from commit b353637) [fenghuay: Differences from upstream b353637: - Retains NV long-counter read paths, reset_on_next_read, and T241 now *= 64. - Applies T241 overflow correction * 64 inline in __ris_msmon_read() (moved to mpam_msmon_overflow_val() wrapper in the 9e5afb7 follow-up commit). - Initialize overflow to false. ] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2157912 Now that the larger counter sizes are probed, make use of them. Callers of mpam_msmon_read() may not know (or care!) about the different counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the driver pick the counter to use. Only 32bit accesses to the MSC are required to be supported by the spec, but these registers are 64bits. The lower half may overflow into the higher half between two 32bit reads. To avoid this, use a helper that reads the top half multiple times to check for overflow. Signed-off-by: Rohit Mathew <rohit.mathew@arm.com> [morse: merged multiple patches from Rohit, added explicit counter selection ] Signed-off-by: James Morse <james.morse@arm.com> Cc: Peter Newman <peternewman@google.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (backported from commit 9e5afb7) [fenghuay: Partial port; this tree already has long-counter read paths, mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround. This commit ports per-width overflow correction, OFLOW_STATUS_L overflow detection, and T241 overflow correction scaling via mpam_msmon_overflow_val(). Apply T241-MPAM-6 * 64 scaling to 31/44-bit counters only; skip 63-bit counters for both the counter read and overflow correction. ] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2157922 This patch fixes a few issues: 1. To enable MBW total bytes, continue MBWU counter selection after CSU setup. This follows upstream 7.1. 2. Use mbm_cntr_assignable for assignment enablement so occupancy and MBWU sharing a class do not break mbm_L3_assignments. This follows commit 4766d7e https://gitlab.arm.com/linux-arm/linux-bh.git mpam_abmc_v4 3. Always bind picked MBWU to mbm_total_bytes. This follows commit b2bbf3b https://gitlab.arm.com/linux-arm/linux-bh.git mpam_abmc_v4 4. Rebind mbm_total_bytes at monitor init to the resource that owns the MBWU class via resctrl_mon_event_set_resource(), so memory-level MSC platforms expose mbm_total_bytes under mon_MB_* and mbm_MB_assignments. This fix is necessary when only MBW total bytes (i.e. no MBW local bytes) is supported. Signed-off-by: Fenghua Yu <fenghuay@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

… events BugLink: https://bugs.launchpad.net/bugs/2157922 When mbm_total_bytes is exposed on MBA for memory-level MSC monitors, resctrl_arch_mon_capable() must reflect mon_capable on the backing resource, not only L3. Also gate MBM counter auto-assign and unassign on each enabled event's resource so MBA-backed mbm_total_bytes gets ABMC setup on group create and cleanup on group delete. Signed-off-by: Fenghua Yu <fenghuay@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2158038 When running in interval mode every third count of a time event isn't showing properly: ``` $ perf stat -e duration_time -a -I 1000 1.001082862 1,002,290,425 duration_time 2.004264262 1,003,183,516 duration_time 3.007381401 <not counted> duration_time 4.011160141 1,003,705,631 duration_time 5.014515385 1,003,290,110 duration_time 6.018539680 <not counted> duration_time 7.022065321 1,003,591,720 duration_time ``` The regression came in with a different fix, found through bisection, commit 68cb156 ("perf tool_pmu: Fix aggregation on duration_time"). The issue is caused by the enabled and running time of the event matching the old_count's and creating a delta of 0, which is indicative of an error. Fixes: 68cb156 ("perf tool_pmu: Fix aggregation on duration_time") Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> (backported from commit bdf96c4) [mochs: Minor context adjustment due to absent definitions] Signed-off-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Acked-by: Nirmoy Das <nirmoyd@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2156928 This patch updates the lan743x driver to prevent the use of netdev-based logging APIs (such as netdev_dbg) before the network device has been successfully registered. Using netdev-based logging prior to registration results in log messages referencing "(unnamed net_device) (uninitialized)", which can be confusing and less informative. The driver must use netif_msg_ APIs and device-based logging (e.g. dev_dbg) until netdev registration is complete. This ensures log entries are associated with the correct device context and improves log clarity. After registration, netdev-based logging APIs can be used safely. Signed-off-by: David Thompson <davthompson@nvidia.com> Link: https://patch.msgid.link/20260528165017.421576-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (cherry picked from commit e3c6508) Signed-off-by: David Thompson <davthompson@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2156928 VLAN-tagged interfaces on lan743x devices were previously unreachable via SSH and failed to respond to large ping packets (e.g. "ping -s 1469" given MTU=1500). In these scenarios, "ethtool -S" reports non-zero "RX Oversize Frame Errors". According to Microchip AN2948, the MAC_RX FSE (VLAN field size enforcement) bit determines whether frames with VLAN tags exceeding the base MTU plus tag length are discarded. The driver must set the MAC_RX.FSE bit before setting MAC_RX.RXEN to allow VLAN-tagged frames up to the interface MTU, preventing them from being treated as oversized. As a result, both the base and VLAN-tagged interfaces can use the same MTU without receive errors. Fixes: 23f0703 ("lan743x: Add main source files for new lan743x driver") Signed-off-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Thangaraj Samynathan <Thangaraj.s@microchip.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> # lan7430 on arm64 (RevPi Link: https://patch.msgid.link/20260529210300.433135-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (cherry picked from commit 8173d22) Signed-off-by: David Thompson <davthompson@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2143032 cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint port registration links the memdev to a real cxl_port. Treat NULL and error pointers as "endpoint not linked" before dereferencing cxlmd->endpoint in CXL helper paths. Fixes: eb61834 ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation") Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com> Acked-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Acked-by: Seth Forshee <sforshee@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

… messages BugLink: https://bugs.launchpad.net/bugs/2158328 The current MSI-X restoration path assumes the Command register Memory bit is enabled when writing MSI-X messages. But its possible the last saved and restored state of device may not have the Memory bit enabled, even if a device driver later enables Memory bit and MSI-X. Attempting to access Memory space without Memory bit enabled can lead to Unsupported Request (UR) from the device. Fix this by enabling Memory bit and restore it afterwards. Signed-off-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> (cherry picked from https://lore.kernel.org/all/20260622171840.1618-5-alifm@linux.ibm.com/) Signed-off-by: Matthew R. Ochs <mochs@nvidia.com> Acked-by: Carol L Soto <csoto@nvidia.com> Acked-by: Seth Forshee <sforshee@nvidia.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2158449 Plumb a bool value throughout the various nexthop removal functions, determined in the innermost __remove_nexthop_fib() (which still does the FIB flushing) and propagated up all callers. The next patch will make use of this signal to optimize the removal of multiple nexthops by moving the FIB flushing up the call hierarchy. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260507075606.322405-2-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (backported from commit 31c777b) [bpoirier: Conflicts: - net/ipv4/nexthop.c Due to the absence of b2662e7 net: nexthop: fix percpu use-after-free in remove_nh_grp_entry (v7.0-rc4) -> Adjust context, remove deferred_free argument] Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Omer Barak <obarak@nvidia.com> Acked-by: Aya Levin <ayal@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2158449 When a device is going down or when a net namespace is deleted, all nexthops on it are removed, and for each nexthop being removed the FIB table is flushed, which does a full trie traversal looking for entries marked RTNH_F_DEAD and removing them. This is O(N x R), with N being number of dev nexthops and R being number of IPv4 routes. The RTNL is held the entire time. When there are many nexthops to be removed and many routing entries, this can result in the RTNL being held for multiple minutes, which causes unhappiness in other processes trying to acquire the RTNL (e.g. systemd-networkd for DHCP renewals). In a complicated deployment with multiple vxlan devices, each having 16K nexthops and a total of 128K ipv4 routes, this is exactly what happens: nexthop_flush_dev() # loops over 16K nexthops -> remove_nexthop() -> __remove_nexthop() -> __remove_nexthop_fib() # marks fi->fib_flags |= RTNH_F_DEAD -> fib_flush() # for EACH nexthop! -> fib_table_flush() # walks the ENTIRE FIB, 128K entries This patch makes use of the previously added FIB flushing signal to only do a single FIB flush after all nexthops to be removed are marked as RTNH_F_DEAD: - __remove_nexthop_fib() no longer flushes the FIB. - nexthop_flush_dev() and flush_all_nexthops() now keep track whether any nexthop was removed and trigger a FIB flush at the end. - a new wrapper is defined, remove_one_nexthop() which calls remove_nexthop() and flushes if necessary. This is intended for places which must remove a single nexthop and shouldn't worry about the need to trigger a FIB flush. For now, the only caller is rtm_del_nexthop(). - The two direct callers of __remove_nexthop() get a WARN_ON_ONCE, since the nh about to be removed should not have any FIB entries referencing it when replacing or inserting a new one. This dramatically improves performance from O(N x R) to O(N + R). Releasing a nexthop reference in remove_nexthop() now no longer frees it. Instead, it is deleted when the last fib_info pointing to it gets freed via free_fib_info_rcu(). All routing code is already careful not to take into consideration routes marked with RTNH_F_DEAD. Tested with: DEV=eth2 ip link set up dev $DEV ip link add testnh0 link $DEV type macvlan mode bridge ip addr add 198.51.100.1/24 dev testnh0 ip link set testnh0 up seq 1 65536 | \ sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \ ip -batch - i=1 for a in $(seq 0 255); do for b in $(seq 0 255); do echo "route add 10.${a}.${b}.0/32 nhid $i" i=$((i + 1)) done done | ip -batch - time ip link set testnh0 down ip link del testnh0 Without this patch: real 0m32.601s user 0m0.000s sys 0m32.511s With this patch: real 0m0.209s user 0m0.000s sys 0m0.153s Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260507075606.322405-3-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (cherry picked from commit 35ce551) Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Omer Barak <obarak@nvidia.com> Acked-by: Aya Levin <ayal@nvidia.com>

BugLink: https://bugs.launchpad.net/bugs/2158449 These functions return a signal whether FIB flushing is required which must not be ignored. Use the compiler to help with enforcing this requirement in the future. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260507075606.322405-4-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (backported from commit 5dcbd64) [bpoirier: Conflicts: - net/ipv4/nexthop.c Due to the absence of b2662e7 net: nexthop: fix percpu use-after-free in remove_nh_grp_entry (v7.0-rc4) -> Adjust context, remove deferred_free argument] Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Omer Barak <obarak@nvidia.com> Acked-by: Aya Levin <ayal@nvidia.com>

github-actions · 2026-06-30T14:07:09Z

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details

Checking 3 commits...

Cherry-pick digest:
E: d0c571f1ee92 ("ipv4: Add __must_check to nexthop remova"): patch-ID mismatch with upstream 5dcbd64e66ba
E: ae425db79a06 ("ipv4: Provide a FIB flushing signal from"): patch-ID mismatch with upstream 31c777be2a2e
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d0c571f1ee92 │ 5dcbd64e66ba ipv4: Add __must_check to nexthop removal functions │ MISMATCH   │ match   │ preserved + bpoirier adde │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 681844d48a36 │ 35ce55100c61 ipv4: Flush the FIB once on multiple nexthop remova │ match      │ match   │ preserved + bpoirier adde │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ ae425db79a06 │ 31c777be2a2e ipv4: Provide a FIB flushing signal from nexthop re │ MISMATCH   │ match   │ preserved + bpoirier adde │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

nirmoy · 2026-06-30T16:36:21Z

BaseOS Kernel Review

Summary

No issues found across the reviewed commits.

Findings: no problems found

Latest watcher review: open review

Generated test plan: open test plan

Kernel deb build: successful (download debs, 4 files)

Head: d0c571f1ee92

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

nvmochs · 2026-06-30T19:42:39Z

@gobenji Can you provide more context on these patches? The 6.17-HWE kernel is going EOL

jacobmartin0 and others added 15 commits June 24, 2026 18:21

UBUNTU: Start new release

1126af9

Ignore: yes Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

UBUNTU: link-to-tracker: update tracking bug

08c9781

BugLink: https://bugs.launchpad.net/bugs/2158201 Properties: no-test-build Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

UBUNTU: Ubuntu-nvidia-6.17-6.17.0-1027.27

069676f

Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

nirmoy added the help wanted Extra attention is needed label Jun 30, 2026

nirmoy added the pending_review_comment label Jun 30, 2026

nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from 362b68d to aea4c7d Compare July 1, 2026 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[24.04-linux-nvidia-6.17] Backport nexthop removal performance optimization#478

[24.04-linux-nvidia-6.17] Backport nexthop removal performance optimization#478
gobenji wants to merge 15 commits into
NVIDIA:24.04_linux-nvidia-6.17-nextfrom
gobenji:nexthop-removal-performance

gobenji commented Jun 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

nirmoy commented Jun 30, 2026

Uh oh!

nvmochs commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Uh oh!

Conversation

gobenji commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 30, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

PR Lint ❌ Errors found

Uh oh!

nirmoy commented Jun 30, 2026

BaseOS Kernel Review

Summary

Uh oh!

nvmochs commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

gobenji commented Jun 30, 2026 •

edited

Loading