Skip to content

[24.04-linux-nvidia-6.17] Backport nexthop removal performance optimization#478

Open
gobenji wants to merge 15 commits into
NVIDIA:24.04_linux-nvidia-6.17-nextfrom
gobenji:nexthop-removal-performance
Open

[24.04-linux-nvidia-6.17] Backport nexthop removal performance optimization#478
gobenji wants to merge 15 commits into
NVIDIA:24.04_linux-nvidia-6.17-nextfrom
gobenji:nexthop-removal-performance

Conversation

@gobenji

@gobenji gobenji commented Jun 30, 2026

Copy link
Copy Markdown

Backport the following series to the linux-nvidia kernel:

Series merged in bb3402f Merge branch 'ipv4-flush-the-fib-once-on-multiple-nexthop-removal'
31c777b ipv4: Provide a FIB flushing signal from nexthop removal functions
35ce551 ipv4: Flush the FIB once on multiple nexthop removal
5dcbd64 ipv4: Add __must_check to nexthop removal functions

The incentive is to get the performance optimization described in the log of commit 35ce551.

LP: https://bugs.launchpad.net/bugs/2158449
RM: 5088664

jacobmartin0 and others added 15 commits June 24, 2026 18:21
Ignore: yes
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2158201
Properties: no-test-build
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2157912

Use the overflow status bit to track overflow on each bandwidth counter
read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read
of the counter. Overflow interrupts, on hardware that supports them could
be used to remove this limitation.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit b353637)
[fenghuay: Differences from upstream b353637:
 - Retains NV long-counter read paths, reset_on_next_read, and T241 now *= 64.
 - Applies T241 overflow correction * 64 inline in __ris_msmon_read() (moved
   to mpam_msmon_overflow_val() wrapper in the 9e5afb7 follow-up commit).
 - Initialize overflow to false.
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2157912

Now that the larger counter sizes are probed, make use of them.

Callers of mpam_msmon_read() may not know (or care!) about the different
counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the
driver pick the counter to use.

Only 32bit accesses to the MSC are required to be supported by the
spec, but these registers are 64bits. The lower half may overflow
into the higher half between two 32bit reads. To avoid this, use
a helper that reads the top half multiple times to check for overflow.

Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[morse: merged multiple patches from Rohit, added explicit counter selection ]
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Peter Newman <peternewman@google.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit 9e5afb7)
[fenghuay: Partial port; this tree already has long-counter read paths,
 mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround.
 This commit ports per-width overflow correction, OFLOW_STATUS_L overflow
 detection, and T241 overflow correction scaling via mpam_msmon_overflow_val().
 Apply T241-MPAM-6 * 64 scaling to 31/44-bit counters only; skip 63-bit
 counters for both the counter read and overflow correction.
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2157922

This patch fixes a few issues:

1. To enable MBW total bytes, continue MBWU counter selection after
CSU setup. This follows upstream 7.1.

2. Use mbm_cntr_assignable for assignment enablement
so occupancy and MBWU sharing a class do not break mbm_L3_assignments.
This follows commit 4766d7e https://gitlab.arm.com/linux-arm/linux-bh.git mpam_abmc_v4

3. Always bind picked MBWU to mbm_total_bytes.
This follows commit b2bbf3b https://gitlab.arm.com/linux-arm/linux-bh.git mpam_abmc_v4

4. Rebind mbm_total_bytes at monitor init to the resource that owns the
MBWU class via resctrl_mon_event_set_resource(), so memory-level MSC
platforms expose mbm_total_bytes under mon_MB_* and mbm_MB_assignments.
This fix is necessary when only MBW total bytes (i.e. no MBW local
bytes) is supported.

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… events

BugLink: https://bugs.launchpad.net/bugs/2157922

When mbm_total_bytes is exposed on MBA for memory-level MSC monitors,
resctrl_arch_mon_capable() must reflect mon_capable on the backing
resource, not only L3. Also gate MBM counter auto-assign and unassign
on each enabled event's resource so MBA-backed mbm_total_bytes gets
ABMC setup on group create and cleanup on group delete.

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2158038

When running in interval mode every third count of a time event isn't
showing properly:
```
$ perf stat -e duration_time -a -I 1000
     1.001082862      1,002,290,425      duration_time
     2.004264262      1,003,183,516      duration_time
     3.007381401      <not counted>      duration_time
     4.011160141      1,003,705,631      duration_time
     5.014515385      1,003,290,110      duration_time
     6.018539680      <not counted>      duration_time
     7.022065321      1,003,591,720      duration_time
```
The regression came in with a different fix, found through bisection,
commit 68cb156 ("perf tool_pmu: Fix aggregation on
duration_time"). The issue is caused by the enabled and running time
of the event matching the old_count's and creating a delta of 0, which
is indicative of an error.

Fixes: 68cb156 ("perf tool_pmu: Fix aggregation on duration_time")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(backported from commit bdf96c4)
[mochs: Minor context adjustment due to absent definitions]
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2156928

This patch updates the lan743x driver to prevent the use of netdev-based
logging APIs (such as netdev_dbg) before the network device has been
successfully registered. Using netdev-based logging prior to registration
results in log messages referencing "(unnamed net_device) (uninitialized)",
which can be confusing and less informative.

The driver must use netif_msg_ APIs and device-based logging (e.g. dev_dbg)
until netdev registration is complete. This ensures log entries are
associated with the correct device context and improves log clarity. After
registration, netdev-based logging APIs can be used safely.

Signed-off-by: David Thompson <davthompson@nvidia.com>
Link: https://patch.msgid.link/20260528165017.421576-1-davthompson@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit e3c6508)
Signed-off-by: David Thompson <davthompson@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2156928

VLAN-tagged interfaces on lan743x devices were previously unreachable via
SSH and failed to respond to large ping packets (e.g. "ping -s 1469" given
MTU=1500). In these scenarios, "ethtool -S" reports non-zero "RX Oversize
Frame Errors". According to Microchip AN2948, the MAC_RX FSE (VLAN field
size enforcement) bit determines whether frames with VLAN tags exceeding
the base MTU plus tag length are discarded.

The driver must set the MAC_RX.FSE bit before setting MAC_RX.RXEN to allow
VLAN-tagged frames up to the interface MTU, preventing them from being
treated as oversized. As a result, both the base and VLAN-tagged interfaces
can use the same MTU without receive errors.

Fixes: 23f0703 ("lan743x: Add main source files for new lan743x driver")
Signed-off-by: David Thompson <davthompson@nvidia.com>
Reviewed-by: Thangaraj Samynathan <Thangaraj.s@microchip.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Tested-by: Nicolai Buchwitz <nb@tipi-net.de> # lan7430 on arm64 (RevPi
Link: https://patch.msgid.link/20260529210300.433135-1-davthompson@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 8173d22)
Signed-off-by: David Thompson <davthompson@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint port registration
links the memdev to a real cxl_port.

Treat NULL and error pointers as "endpoint not linked" before dereferencing
cxlmd->endpoint in CXL helper paths.

Fixes: eb61834 ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Seth Forshee <sforshee@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… messages

BugLink: https://bugs.launchpad.net/bugs/2158328

The current MSI-X restoration path assumes the Command register Memory bit
is enabled when writing MSI-X messages. But its possible the last saved and
restored state of device may not have the Memory bit enabled, even if a
device driver later enables Memory bit and MSI-X. Attempting to access
Memory space without Memory bit enabled can lead to Unsupported Request
(UR) from the device. Fix this by enabling Memory bit and restore
it afterwards.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
(cherry picked from https://lore.kernel.org/all/20260622171840.1618-5-alifm@linux.ibm.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Seth Forshee <sforshee@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2158449

Plumb a bool value throughout the various nexthop removal functions,
determined in the innermost __remove_nexthop_fib() (which still does the
FIB flushing) and propagated up all callers.

The next patch will make use of this signal to optimize the removal of
multiple nexthops by moving the FIB flushing up the call hierarchy.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-2-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(backported from commit 31c777b)
[bpoirier:
 Conflicts:
 - net/ipv4/nexthop.c
 	Due to the absence of
 	b2662e7 net: nexthop: fix percpu use-after-free in remove_nh_grp_entry (v7.0-rc4)
 	-> Adjust context, remove deferred_free argument]
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Acked-by: Omer Barak <obarak@nvidia.com>
Acked-by: Aya Levin <ayal@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2158449

When a device is going down or when a net namespace is deleted, all
nexthops on it are removed, and for each nexthop being removed the FIB
table is flushed, which does a full trie traversal looking for entries
marked RTNH_F_DEAD and removing them. This is O(N x R), with N being
number of dev nexthops and R being number of IPv4 routes.

The RTNL is held the entire time.

When there are many nexthops to be removed and many routing entries,
this can result in the RTNL being held for multiple minutes, which
causes unhappiness in other processes trying to acquire the RTNL (e.g.
systemd-networkd for DHCP renewals).

In a complicated deployment with multiple vxlan devices, each having
16K nexthops and a total of 128K ipv4 routes, this is exactly what
happens:

nexthop_flush_dev()                # loops over 16K nexthops
  -> remove_nexthop()
    -> __remove_nexthop()
      -> __remove_nexthop_fib()    # marks fi->fib_flags |= RTNH_F_DEAD
        -> fib_flush()             # for EACH nexthop!
	  -> fib_table_flush()     # walks the ENTIRE FIB, 128K entries

This patch makes use of the previously added FIB flushing signal to only
do a single FIB flush after all nexthops to be removed are marked as
RTNH_F_DEAD:
- __remove_nexthop_fib() no longer flushes the FIB.
- nexthop_flush_dev() and flush_all_nexthops() now keep track whether
  any nexthop was removed and trigger a FIB flush at the end.
- a new wrapper is defined, remove_one_nexthop() which calls
  remove_nexthop() and flushes if necessary. This is intended for places
  which must remove a single nexthop and shouldn't worry about the need
  to trigger a FIB flush. For now, the only caller is rtm_del_nexthop().
- The two direct callers of __remove_nexthop() get a WARN_ON_ONCE, since
  the nh about to be removed should not have any FIB entries referencing
  it when replacing or inserting a new one.

This dramatically improves performance from O(N x R) to O(N + R).

Releasing a nexthop reference in remove_nexthop() now no longer frees
it. Instead, it is deleted when the last fib_info pointing to it gets
freed via free_fib_info_rcu(). All routing code is already careful not
to take into consideration routes marked with RTNH_F_DEAD.

Tested with:
DEV=eth2
ip link set up dev $DEV
ip link add testnh0 link $DEV type macvlan mode bridge
ip addr add 198.51.100.1/24 dev testnh0
ip link set testnh0 up

seq 1 65536 | \
sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
ip -batch -

i=1
for a in $(seq 0 255); do
  for b in $(seq 0 255); do
    echo "route add 10.${a}.${b}.0/32 nhid $i"
    i=$((i + 1))
  done
done | ip -batch -

time ip link set testnh0 down
ip link del testnh0

Without this patch:
real	0m32.601s
user	0m0.000s
sys	0m32.511s

With this patch:
real	0m0.209s
user	0m0.000s
sys	0m0.153s

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-3-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 35ce551)
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Acked-by: Omer Barak <obarak@nvidia.com>
Acked-by: Aya Levin <ayal@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2158449

These functions return a signal whether FIB flushing is required which
must not be ignored. Use the compiler to help with enforcing this
requirement in the future.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-4-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(backported from commit 5dcbd64)
[bpoirier:
 Conflicts:
 - net/ipv4/nexthop.c
 	Due to the absence of
 	b2662e7 net: nexthop: fix percpu use-after-free in remove_nh_grp_entry (v7.0-rc4)
 	-> Adjust context, remove deferred_free argument]
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Acked-by: Omer Barak <obarak@nvidia.com>
Acked-by: Aya Levin <ayal@nvidia.com>
@nirmoy nirmoy added the help wanted Extra attention is needed label Jun 30, 2026
@github-actions

Copy link
Copy Markdown
Contributor

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 3 commits...

Cherry-pick digest:
E: d0c571f1ee92 ("ipv4: Add __must_check to nexthop remova"): patch-ID mismatch with upstream 5dcbd64e66ba
E: ae425db79a06 ("ipv4: Provide a FIB flushing signal from"): patch-ID mismatch with upstream 31c777be2a2e
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d0c571f1ee92 │ 5dcbd64e66ba ipv4: Add __must_check to nexthop removal functions │ MISMATCH   │ match   │ preserved + bpoirier adde │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 681844d48a36 │ 35ce55100c61 ipv4: Flush the FIB once on multiple nexthop remova │ match      │ match   │ preserved + bpoirier adde │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ ae425db79a06 │ 31c777be2a2e ipv4: Provide a FIB flushing signal from nexthop re │ MISMATCH   │ match   │ preserved + bpoirier adde │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@nirmoy

nirmoy commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

BaseOS Kernel Review

Summary

No issues found across the reviewed commits.

Findings: no problems found

Latest watcher review: open review

Generated test plan: open test plan

Kernel deb build: successful (download debs, 4 files)

Head: d0c571f1ee92

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

@nvmochs

nvmochs commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

@gobenji Can you provide more context on these patches? The 6.17-HWE kernel is going EOL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

help wanted Extra attention is needed pending_review_comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants