nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL by mdboom · Pull Request #2118 · NVIDIA/cuda-python

mdboom · 2026-05-20T19:20:07Z

Summary

cuda.core.system.get_process_name(pid) raises UnicodeDecodeError under
WSL whenever the calling process has a non-C locale (which is the usual
state for a CPython process, since the interpreter sets the LC_CTYPE locale
from the environment at startup). This is reproducible by running the
cuda_core test suite with any seed that schedules
tests/system/test_system_device.py::test_compute_running_processes before
tests/system/test_system_system.py::test_get_process_name.

The underlying defect is in NVML's WSL implementation (see Root cause
below). This PR adds a scoped, defensive workaround in cuda_core so the
public API returns a correct value on WSL. It also fixes a latent issue
where get_process_name was effectively unusable from a fresh process
because it never primed NVML's per-PID name cache.

Root cause: the WSL mojibake

NVML's nvmlSystemGetProcessName on WSL takes a different code path
depending on the process's current locale. With the default "C" locale,
the function returns the basename portion of /proc/<pid>/exe correctly.
With any other locale (including the typical en_US.UTF-8), it instead
walks an internal UTF-16LE buffer holding the executable path but uses a
4-byte stride (as if the buffer were UTF-32LE). Each "code point" it
pulls is therefore two adjacent ASCII characters packed into the low and
high bytes of a single 24-bit value, with a zero byte between them. That
value is then emitted as an extended 5-byte UTF-8 sequence (the obsolete
0xF8-prefixed form that covers values from 0x200000 to 0x3FFFFFF, which
modern UTF-8 decoders reject; hence the UnicodeDecodeError).
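
The corruption described above can be reproduced in pure Python. The following is a minimal sketch, not NVML's actual code: the function names are hypothetical, and it only models the 4-byte stride over a UTF-16LE buffer followed by the 5-byte encoding.

```python
def encode_extended_utf8_5(value: int) -> bytes:
    """Encode a value using the obsolete 5-byte UTF-8 form (0xF8 prefix)."""
    if not 0x200000 <= value <= 0x3FFFFFF:
        raise ValueError(f"value {value:#x} does not need the 5-byte form")
    return bytes([
        0xF8 | (value >> 24),            # 2 payload bits
        0x80 | ((value >> 18) & 0x3F),   # 6 payload bits per continuation byte
        0x80 | ((value >> 12) & 0x3F),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def simulate_mojibake(text: str) -> bytes:
    """Walk a UTF-16LE buffer with a 4-byte stride, as if it were UTF-32LE."""
    utf16 = text.encode("utf-16-le")
    out = bytearray()
    for offset in range(0, len(utf16) - 3, 4):
        # Two UTF-16 code units are read as one little-endian 32-bit value.
        value = int.from_bytes(utf16[offset:offset + 4], "little")
        out += encode_extended_utf8_5(value)
    return bytes(out)


print(simulate_mojibake("/home/md").hex(" "))
```

For printable ASCII pairs the second character always lands in bits 16 to 23, so the value is at least 0x200000 and needs the 5-byte form.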

The net result for, say, a Python process whose /proc/<pid>/exe resolves
to:

/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/python3.14

is that the returned buffer looks like ~180 bytes of f8 … chunks
followed by the correctly-encoded trailing basename, e.g.:

f8 9a 80 80 af  f8 9b 90 81 af  f8 8b b0 81 a5  f8 99 80 81 ad  …  /python3.14\0

Decoding the first chunk illustrates the pattern:

* Bytes f8 9a 80 80 af decode as the extended-UTF-8 code point 0x68002F.
* That 24-bit value packs 'h' (0x68) in the high byte and '/' (0x2F)
  in the low byte, i.e. the UTF-16LE code units for "/h" (2f 00 68 00)
  read as a single little-endian 32-bit value.

Every chunk has this structure; together they spell out the prefix
/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/
two characters at a time. The trailing /python3.14 is unaffected because
of where the buggy stride leaves the cursor.
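
To check the chunk structure by hand, the 5-byte sequences can be decoded back into character pairs. This is an illustrative sketch with hypothetical helper names; it is not part of the PR.

```python
def decode_extended_utf8_5(chunk: bytes) -> int:
    """Decode one obsolete 5-byte UTF-8 sequence into its integer value."""
    if len(chunk) != 5 or (chunk[0] & 0xFC) != 0xF8:
        raise ValueError("not a 5-byte 0xF8-prefixed sequence")
    if any((b & 0xC0) != 0x80 for b in chunk[1:]):
        raise ValueError("bad continuation byte")
    value = chunk[0] & 0x03
    for b in chunk[1:]:
        value = (value << 6) | (b & 0x3F)
    return value


def recover_prefix(buffer: bytes) -> str:
    """Undo the mojibake for the leading run of 5-byte chunks."""
    chars = []
    pos = 0
    while pos + 5 <= len(buffer) and (buffer[pos] & 0xFC) == 0xF8:
        value = decode_extended_utf8_5(buffer[pos:pos + 5])
        chars.append(chr(value & 0xFFFF))  # first character (low half)
        chars.append(chr(value >> 16))     # second character (high half)
        pos += 5
    return "".join(chars)


sample = bytes.fromhex("f8 9a 80 80 af f8 9b 90 81 af f8 8b b0 81 a5 f8 99 80 81 ad")
print(recover_prefix(sample))  # /home/md
```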

Why the workaround needs to "re-prime"

nvmlSystemGetProcessName is cache-driven: the per-PID name is populated
the first time NVML enumerates compute processes that include the PID
(typically via nvmlDeviceGetComputeRunningProcesses_v3). Critically:

* The mojibake is produced during the prime call, not during the read.
* Once a buggy entry is in NVML's cache, switching to the "C" locale and
  re-reading does not unscramble it — the cache survives the locale
  flip.
* Re-running the prime call under the "C" locale overwrites the cached
  entry with the correct UTF-8 string. Subsequent reads (in any locale)
  then return correctly.

So the workaround must do prime + read together under "C".
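
The shape of "prime + read together under C" can be sketched as follows. This is a simplified illustration, not the PR's implementation: it flips the process-wide locale with locale.setlocale behind a lock, whereas the review below notes the PR uses the per-thread POSIX newlocale/uselocale APIs. The prime and read callables are hypothetical stand-ins for the NVML calls.

```python
import contextlib
import locale
import threading

# locale.setlocale() changes process-global state, so this simplified
# sketch serializes every locale flip behind one lock.
_LOCALE_LOCK = threading.Lock()


@contextlib.contextmanager
def c_locale():
    """Temporarily switch the whole process to the "C" locale."""
    with _LOCALE_LOCK:
        saved = locale.setlocale(locale.LC_ALL)  # query only, no change
        locale.setlocale(locale.LC_ALL, "C")
        try:
            yield
        finally:
            locale.setlocale(locale.LC_ALL, saved)


def get_process_name_under_c(pid, prime, read):
    """Run the cache-priming call and the read inside one "C" locale scope."""
    with c_locale():
        prime()           # stand-in for enumerating compute processes
        return read(pid)  # stand-in for the per-PID name lookup
```

Keeping both calls inside one scope matters because a read under "C" alone would still return whatever an earlier prime stored in the cache.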

Behaviour after this PR

* Native Linux / Windows: behaviour is identical to before, except that
  get_process_name now primes the NVML cache automatically. This makes
  it usable from a fresh process (previously a caller had to have
  manually queried device.compute_running_processes first or accept
  NotFoundError).
* WSL: the locale flip is applied around the prime + read sequence,
  so the returned name is the correct UTF-8 string regardless of the
  caller's locale.

Discussion

* Should we instead try to fix this at the cuda_bindings layer and fix it
  for all cuda_bindings users? In that case I guess cuda_core should raise
  an exception from get_process_name if on WSL and the cuda_bindings
  installed is too old?
* I do plan to file the underlying bug with NVML. Locale-sensitive APIs should
  generally be avoided.

rwgk

I used Cursor GPT-5.5 1M Extra High

GPT Findings

* Medium: cuda_bindings/docs/source/release/13.2.0-notes.rst and
  cuda_bindings/docs/source/release/13.3.0-notes.rst give an incomplete
  raw-NVML workaround. The PR and nvbug context indicate the corruption happens
  when nvmlDeviceGetComputeRunningProcesses_v3 first primes the PID cache, so
  setting the locale to "C" only before nvml.system_get_process_name can be
  too late if the cache was already populated. The notes should say to prime and
  read under "C", matching the cuda.core workaround.
* Medium: cuda_core/cuda/core/system/_system.pyx now makes
  get_process_name() depend on successfully enumerating compute processes on
  every device. Any unrelated device-level NVML failure now breaks a per-PID
  lookup, including on non-WSL where this is a behavior change. Consider trying
  the direct read first on non-WSL, or making priming failures narrower.
* Low: cuda_core/docs/source/release/1.1.0-notes.rst says the WSL workaround
  may hold a global lock, but the implementation uses POSIX per-thread locale
  APIs and no global lock is present. That note looks stale or misleading.
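
The "direct read first on non-WSL" suggestion in the second finding could look roughly like this. It is a sketch under assumptions: NotFoundError here is a local stand-in class, the read and prime callables are hypothetical, and the WSL branch omits the locale handling for brevity.

```python
class NotFoundError(Exception):
    """Stand-in for the error raised when NVML has no cached name for a PID."""


def get_process_name(pid, read, prime, on_wsl):
    """Try the cheap per-PID read first; prime the cache only when needed."""
    if not on_wsl:
        try:
            return read(pid)
        except NotFoundError:
            pass  # cache not primed yet: fall through to prime + read
    # On WSL the prime step is always repeated (under the "C" locale in the
    # real workaround), because a cached entry may already be corrupted.
    prime()
    return read(pid)
```

With this ordering, an unrelated device-level failure in the prime step only affects callers whose PID was not already cached.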

Lightweight Thread-Safety Recommendation

The new locale switching implementation is reasonably thread-safe because it
uses POSIX newlocale/uselocale/freelocale, which scope the "C" locale to
the calling OS thread rather than mutating process-global locale state.

The remaining race is around NVML's process-name cache, which appears to be
process/global driver state. On WSL, another thread could call a cache-priming
path such as nvmlDeviceGetComputeRunningProcesses_v3 under a non-"C" locale
between get_process_name()'s prime and read steps, reintroducing corrupted
cached data.

A lightweight improvement would be to add a module-private Python
threading.RLock shared by the cuda.core paths most likely to touch this
cache. Hold it around:

* system.get_process_name()'s WSL c_locale_guard() + prime + read sequence.
* Device.compute_running_processes on WSL, since that is the main in-package
  path that can prime the process-name cache.

This would not protect raw cuda.bindings.nvml.* calls or external users, but
it would cover the most likely cuda.core race without globally serializing all
NVML access. A Python RLock is sufficient here: even if Cython releases the GIL
inside NVML wrappers, the lock remains held until the surrounding Python with
block exits.
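
A minimal sketch of the recommended locking, assuming hypothetical function and lock names and callables in place of the real NVML wrappers:

```python
import threading

# Module-private lock shared by the paths that can touch NVML's
# process-name cache. The name is illustrative only.
_PROCESS_NAME_CACHE_LOCK = threading.RLock()


def compute_running_processes(enumerate_fn):
    """Cache-priming path: hold the lock while enumerating."""
    with _PROCESS_NAME_CACHE_LOCK:
        return enumerate_fn()


def get_process_name(pid, enumerate_fn, read_fn):
    """Hold the lock across prime + read so no other prime can interleave."""
    with _PROCESS_NAME_CACHE_LOCK:
        # Re-entering the lock here is why an RLock is used, not a Lock.
        compute_running_processes(enumerate_fn)
        return read_fn(pid)
```

A plain threading.Lock would deadlock in get_process_name, because the same thread acquires the lock a second time inside compute_running_processes.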

kkraus14 · 2026-05-21T02:42:07Z

 ------------

 * Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling ``pip uninstall -y cuda-python`` followed by installing ``pip install cuda-python``.
+* ``nvml.system_get_process_name`` on WSL can return incorrect values.  To work around this, set the locale to "C" before calling ``nvml.device_get_compute_running_processes_v3`` (which sets the process names) and before calling ``nvml.system_get_process_name``. ``cuda_core`` does this automatically, but users of the raw NVML API will need to do this manually.


Why is this in both 13.2 and 13.3 release notes?

It's a compatibility note that's in line with the existing v12.6.2.post1 and below note, which we've been carrying forward since 12.8.0, I believe in every single release since then:

$ git grep 'v12.6.2.post1 and below' | cut -d/ -f-4 | uniq -c
     16 cuda_bindings/docs/source/release
     15 cuda_python/docs/source/release

I was working on the assumption that we put "known issues" on previous releases. (Since the issue applies to all of them).

mdboom · 2026-05-21T11:58:57Z

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

kkraus14 · 2026-05-21T15:03:30Z

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

My 2c: We very intentionally don't do any magic in cuda_bindings: if the underlying C library has this behavior / issue, then I think it's fine for cuda_bindings to have it as well. Having a workaround in cuda_core feels correct to me.

rwgk · 2026-05-21T15:16:08Z

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

My 2c: We very intentionally don't do any magic in cuda_bindings: if the underlying C library has this behavior / issue, then I think it's fine for cuda_bindings to have it as well. Having a workaround in cuda_core feels correct to me.

That's the conclusion I'm coming to, too. I discussed this rationale with GPT-5.5:

* evidence is strong that there is a bug in NVML
* it's not the purpose of the cuda_bindings layer to add sophisticated workarounds for bugs in the wrapped libraries
* cuda_core is by design adding a higher-level layer, sophisticated workarounds are more suitable and feasible in that layer (and this PR looks great to me)
* the compatibility note in this PR is helpful, but it's not even really a compatibility note, but more a heads-up: we know there is a bug in NVML, we know how to work around it, and we added the workaround in cuda_core, but if you use cuda_bindings directly you need to be aware of it and work around it yourself
* ultimately we expect that the NVML bug will be fixed, and no workaround is needed in the future

Additional idea, to be helpful to our users:

Separately from the cuda_core workaround, I still think cuda_bindings would benefit from routing C-string decoding through a tiny helper, to make failures actionable. The current UnicodeDecodeError tells us only the first bad byte and position; in this case the useful information was the raw buffer contents and which NVML call produced them. A helper could keep the success path identical, but on decode failure raise an exception/message that includes the source API name plus a bounded repr/hex dump of the bytes. That would have made the nvbug much easier to diagnose and should be generally useful for future driver/library string issues. — I would avoid special system_get_process_name behavior; it's too niche.
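
One possible shape for such a helper is sketched below. The name decode_c_str is borrowed from the title of #2128 listed in this PR's references; the signature and body are assumptions for illustration, not the code from that PR.

```python
def decode_c_str(raw: bytes, api_name: str, max_dump: int = 64) -> str:
    """Decode a C string as UTF-8, adding context to any decode failure."""
    try:
        return raw.decode("utf-8")  # success path is a plain decode
    except UnicodeDecodeError as exc:
        dump = raw[:max_dump].hex(" ")  # bounded hex dump of the buffer
        suffix = " ..." if len(raw) > max_dump else ""
        reason = (
            f"{exc.reason} (returned by {api_name}; "
            f"{len(raw)} bytes, starting with: {dump}{suffix})"
        )
        # Re-raise the same exception type so existing handlers still match.
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end, reason
        ) from exc
```

Raising the same exception type with a longer reason keeps existing `except UnicodeDecodeError` handlers working while putting the API name and raw bytes into the message.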

rwgk · 2026-05-21T15:27:26Z

I cancelled the CI manually after seeing that Python 3.12 with 12.9 is hanging again (two jobs). I'll report the details on the tracking bug we have already. I triggered a rerun.

rwgk · 2026-05-21T15:42:35Z

The hanging jobs from the previous attempt are now tracked here:

#2004 (comment)

rwgk · 2026-05-21T15:52:38Z

It looks like the jobs tracked under #2004 (comment) are hanging again. Maybe the driver update made the issue a lot worse than before? Each time this happens, two runners are blocked for 4 hours, and of course merging the PR is blocked, too.

Interestingly, I just see this:

tests/memory_ipc/test_send_buffers.py::TestIpcReexport::test_main[DeviceMR] RERUN [ 12%]

But after that it's still hanging.

rwgk · 2026-05-21T16:08:34Z

Motivated by the experience here, I created #2122 — [ENH]: Make cuda_bindings UnicodeDecodeError more actionable

rwgk · 2026-05-21T16:36:24Z

The two jobs were hanging again in the 4th attempt. I cancelled them to unblock the runners. In the meantime @aryanputta sent PR #2121, I just triggered the CI there. I'll come back here to try again after PR #2121 is merged.

github-actions · 2026-05-22T00:40:45Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL

06a2bc8

mdboom added this to the cuda.core next milestone May 20, 2026

mdboom self-assigned this May 20, 2026

mdboom added the bug (Something isn't working), P1 (Medium priority - Should do), and cuda.core (Everything related to the cuda.core module) labels May 20, 2026

github-actions Bot added the cuda.bindings (Everything related to the cuda.bindings module) label May 20, 2026

mdboom added 2 commits May 20, 2026 15:23

Merge remote-tracking branch 'upstream/main' into fix-get-process-name-on-wsl

aa31400

Re-enable test

bfb518e

Move POSIX-only functionality to a separate module

22fd00c

rwgk reviewed May 20, 2026

Address comments in the PR

960659c

mdboom requested a review from rwgk May 20, 2026 23:47

rwgk approved these changes May 21, 2026

kkraus14 reviewed May 21, 2026

kkraus14 approved these changes May 21, 2026

rwgk enabled auto-merge (squash) May 21, 2026 15:27

rwgk mentioned this pull request May 21, 2026

[ENH]: Make cuda_bindings UnicodeDecodeError more actionable #2122

Open

This was referenced May 21, 2026

tests: kill zombie IPC child processes after join timeout #2121

Closed

tests: layered defense against IPC child-process hangs (#2004) #2124

Merged

Merge branch 'main' into fix-get-process-name-on-wsl

7fd04fe

rwgk merged commit 2957595 into NVIDIA:main May 21, 2026
184 of 186 checks passed

rwgk mentioned this pull request May 22, 2026

Maintenance: small CI and test-helper cleanups #2126

Merged

aryanputta mentioned this pull request May 22, 2026

cuda.bindings: add decode_c_str helper for actionable UnicodeDecodeError (#2122) #2128

Open
