nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL #2118

Merged
rwgk merged 6 commits into NVIDIA:main from mdboom:fix-get-process-name-on-wsl
May 21, 2026

Conversation

@mdboom
Contributor

@mdboom mdboom commented May 20, 2026

Summary

cuda.core.system.get_process_name(pid) raises UnicodeDecodeError under
WSL whenever the calling process has a non-C locale (which is the usual
state for a CPython process, since the interpreter calls
setlocale(LC_CTYPE, "") at startup). This is reproducible by running the
cuda_core test suite with any seed that schedules
tests/system/test_system_device.py::test_compute_running_processes before
tests/system/test_system_system.py::test_get_process_name.

The underlying defect is in NVML's WSL implementation (see Root cause
below). This PR adds a scoped, defensive workaround in cuda_core so the
public API returns a correct value on WSL. It also fixes a latent issue
where get_process_name was effectively unusable from a fresh process
because it never primed NVML's per-PID name cache.

Root cause: the WSL mojibake

NVML's nvmlSystemGetProcessName on WSL takes a different code path
depending on the process's current locale. With the default "C" locale,
the function returns the basename portion of /proc/<pid>/exe correctly.
With any other locale (including the typical en_US.UTF-8), it instead
walks an internal UTF-16LE buffer holding the executable path but uses a
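4-byte stride (as if the buffer were UTF-32LE). Each "code point" it
pulls therefore combines two adjacent ASCII characters in a single 24-bit
value: the first character in bits 0-7 and the second in bits 16-23.
That value is then emitted as an extended 5-byte UTF-8 sequence (the
0xF8-prefixed form for values from 0x200000 upward, far beyond the
U+10FFFF limit). Standard UTF-8 has no 5-byte form, so a strict decoder
rejects the 0xF8 lead byte, which is what surfaces as UnicodeDecodeError.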

The net result for, say, a Python process whose /proc/<pid>/exe resolves
to:

/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/python3.14

is that the returned buffer looks like ~180 bytes of f8 … chunks
followed by the correctly-encoded trailing basename, e.g.:

f8 9a 80 80 af  f8 9b 90 81 af  f8 8b b0 81 a5  f8 99 80 81 ad  …  /python3.14\0

Decoding the first chunk illustrates the pattern:

  • bytes f8 9a 80 80 af decode as the extended-UTF-8 code point 0x68002F
  • that 24-bit value packs 'h' (0x68) in the high byte and '/' (0x2F)
    in the low byte, i.e. the UTF-16LE bytes 2f 00 68 00 for /h read as
    a single little-endian 32-bit value (0x0068002F)

Every chunk has this structure; together they spell out the prefix
/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/
two characters at a time. The trailing /python3.14 is unaffected because
of where the buggy stride leaves the cursor.
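
The chunk structure described above can be reproduced in a few lines of Python. This is only a model of the observed bytes, not NVML's actual code: the function names are hypothetical, and the encoder always emits the 5-byte form, which is the correct length only for values in the range 0x200000 to 0x3FFFFFF (pairs of printable ASCII characters fall in that range).

```python
import struct


def encode_extended_utf8_5(value: int) -> bytes:
    """Encode `value` as a 5-byte sequence of the original (pre-RFC 3629) UTF-8 scheme."""
    if not 0 <= value < (1 << 26):
        raise ValueError("value does not fit in a 5-byte sequence")
    return bytes([
        0xF8 | (value >> 24),            # lead byte carries the top 2 bits
        0x80 | ((value >> 18) & 0x3F),   # four continuation bytes, 6 bits each
        0x80 | ((value >> 12) & 0x3F),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def simulate_mojibake(text: str) -> bytes:
    """Walk a UTF-16LE buffer with a 4-byte stride, as if it were UTF-32LE."""
    raw = text.encode("utf-16-le")
    raw += b"\x00" * (-len(raw) % 4)     # pad an odd trailing character
    out = bytearray()
    for (value,) in struct.iter_unpack("<I", raw):
        out += encode_extended_utf8_5(value)
    return bytes(out)


def decode_chunk(chunk: bytes) -> str:
    """Recover the two source characters from one 5-byte chunk."""
    value = chunk[0] & 0x03
    for byte in chunk[1:]:
        value = (value << 6) | (byte & 0x3F)
    return chr(value & 0xFFFF) + chr(value >> 16)


print(simulate_mojibake("/home/md").hex(" "))
# f8 9a 80 80 af f8 9b 90 81 af f8 8b b0 81 a5 f8 99 80 81 ad
print(decode_chunk(bytes.fromhex("f89a8080af")))
# /h
```

The first four simulated chunks match the bytes quoted from the returned buffer, which supports the "UTF-16LE read with a UTF-32LE stride" reading of the bug.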

Why the workaround needs to "re-prime"

nvmlSystemGetProcessName is cache-driven: the per-PID name is populated
the first time NVML enumerates compute processes that include the PID
(typically via nvmlDeviceGetComputeRunningProcesses_v3). Critically:

  • The mojibake is produced during the prime call, not during the read.
  • Once a buggy entry is in NVML's cache, switching to the "C" locale and
    re-reading does not unscramble it — the cache survives the locale
    flip.
  • Re-running the prime call under the "C" locale overwrites the cached
    entry with the correct UTF-8 string. Subsequent reads (in any locale)
    then return correctly.

So the workaround must do prime + read together under "C".
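
A minimal sketch of that ordering is shown here, under stated assumptions. `prime` and `read` are hypothetical callables standing in for the NVML enumeration and name-lookup calls, and `c_locale_sketch` uses Python's process-wide `locale.setlocale`, so unlike a per-thread locale switch it is not thread-safe. It is meant to illustrate the sequence, not to mirror the code in this PR.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale_sketch():
    """Temporarily switch the whole process to the "C" locale.

    Assumption: a process-wide switch is acceptable for illustration only.
    """
    saved = locale.setlocale(locale.LC_ALL)   # query the current setting
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the caller's locale


def get_process_name_under_c(pid, prime, read):
    """Run the cache-priming call and the read inside one "C"-locale scope."""
    with c_locale_sketch():
        prime()           # hypothetical: enumerate compute processes
        return read(pid)  # hypothetical: look up the cached name
```

Keeping both calls inside one scope matters because a correct read cannot repair an entry that was primed under another locale.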

Behaviour after this PR

  • Native Linux / Windows: behaviour is identical to before, except that
    get_process_name now primes the NVML cache automatically. This makes
    it usable from a fresh process (previously a caller had to have
    manually queried device.compute_running_processes first or accept
    NotFoundError).
  • WSL: the locale flip is applied around the prime + read sequence,
    so the returned name is the correct UTF-8 string regardless of the
    caller's locale.

Discussion

  • Should we instead try to fix this at the cuda_bindings layer and fix it
    for all cuda_bindings users? In that case I guess cuda_core should raise
    an exception from get_process_name if on WSL and the cuda_bindings
    installed is too old?
  • I do plan to file the underlying bug with NVML. Locale-sensitive APIs should
    generally be avoided.

@mdboom mdboom added this to the cuda.core next milestone May 20, 2026
@mdboom mdboom self-assigned this May 20, 2026
@mdboom mdboom added bug Something isn't working P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels May 20, 2026
@github-actions github-actions Bot added the cuda.bindings Everything related to the cuda.bindings module label May 20, 2026
Contributor

@rwgk rwgk left a comment

I used Cursor GPT-5.5 1M Extra High


GPT Findings

  1. Medium: cuda_bindings/docs/source/release/13.2.0-notes.rst and
    cuda_bindings/docs/source/release/13.3.0-notes.rst give an incomplete
    raw-NVML workaround. The PR and nvbug context indicate the corruption happens
    when nvmlDeviceGetComputeRunningProcesses_v3 first primes the PID cache, so
    setting the locale to "C" only before nvml.system_get_process_name can be
    too late if the cache was already populated. The notes should say to prime and
    read under "C", matching the cuda.core workaround.

  2. Medium: cuda_core/cuda/core/system/_system.pyx now makes
    get_process_name() depend on successfully enumerating compute processes on
    every device. Any unrelated device-level NVML failure now breaks a per-PID
    lookup, including on non-WSL where this is a behavior change. Consider trying
    the direct read first on non-WSL, or making priming failures narrower.

  3. Low: cuda_core/docs/source/release/1.1.0-notes.rst says the WSL workaround
    may hold a global lock, but the implementation uses POSIX per-thread locale
    APIs and no global lock is present. That note looks stale or misleading.
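
The "direct read first" idea from finding 2 could look roughly like the sketch below for the non-WSL path. All names are hypothetical: `NotFoundError` stands in for whatever error a cache miss raises, and `read` / `prime` are placeholders for the NVML lookup and enumeration calls. On WSL a read-first strategy could return an already-corrupted cached entry, so this only illustrates the non-WSL suggestion.

```python
class NotFoundError(LookupError):
    """Hypothetical stand-in for the error raised on a cache miss."""


def get_process_name_read_first(pid, read, prime):
    """Try the cached name first; prime the cache only on a miss."""
    try:
        return read(pid)
    except NotFoundError:
        prime()           # may touch every device, so only do it when needed
        return read(pid)  # a second miss propagates to the caller
```

With this shape, an unrelated device-level failure during priming can only affect callers whose PID was not already cached.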

Lightweight Thread-Safety Recommendation

The new locale switching implementation is reasonably thread-safe because it
uses POSIX newlocale/uselocale/freelocale, which scope the "C" locale to
the calling OS thread rather than mutating process-global locale state.

The remaining race is around NVML's process-name cache, which appears to be
process/global driver state. On WSL, another thread could call a cache-priming
path such as nvmlDeviceGetComputeRunningProcesses_v3 under a non-"C" locale
between get_process_name()'s prime and read steps, reintroducing corrupted
cached data.

A lightweight improvement would be to add a module-private Python
threading.RLock shared by the cuda.core paths most likely to touch this
cache. Hold it around:

  • system.get_process_name()'s WSL c_locale_guard() + prime + read sequence.
  • Device.compute_running_processes on WSL, since that is the main in-package
    path that can prime the process-name cache.

This would not protect raw cuda.bindings.nvml.* calls or external users, but
it would cover the most likely cuda.core race without globally serializing all
NVML access. A Python RLock is sufficient here: even if Cython releases the GIL
inside NVML wrappers, the lock remains held until the surrounding Python with
block exits.

@mdboom mdboom requested a review from rwgk May 20, 2026 23:47
------------

* Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling ``pip uninstall -y cuda-python`` followed by installing ``pip install cuda-python``.
* ``nvml.system_get_process_name`` on WSL can return incorrect values. To work around this, set the locale to "C" before calling ``nvml.device_get_compute_running_processes_v3`` (which sets the process names) and before calling ``nvml.system_get_process_name``. ``cuda_core`` does this automatically, but users of the raw NVML API will need to do this manually.
Collaborator

Why is this in both 13.2 and 13.3 release notes?

Contributor

It's a compatibility note that's in line with the existing v12.6.2.post1 and below note, which we've been carrying forward since 12.8.0, I believe in every single release since then:

$ git grep 'v12.6.2.post1 and below' | cut -d/ -f-4 | uniq -c
     16 cuda_bindings/docs/source/release
     15 cuda_python/docs/source/release

Contributor Author

I was working on the assumption that we put "known issues" on previous releases. (Since the issue applies to all of them).

Comment thread cuda_core/cuda/core/system/_system.pyx
@mdboom
Contributor Author

mdboom commented May 21, 2026

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

@kkraus14
Collaborator

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

My 2c: We very intentionally don't do any magic in cuda_bindings where if the underlying C library has this behavior / issue then I think it's fine for cuda_bindings to have it as well. Having a workaround in cuda_core feels correct to me.

@rwgk
Contributor

rwgk commented May 21, 2026

@kkraus14, @rwgk: How do you feel about my question?

Should we instead try to fix this at the cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?

My 2c: We very intentionally don't do any magic in cuda_bindings where if the underlying C library has this behavior / issue then I think it's fine for cuda_bindings to have it as well. Having a workaround in cuda_core feels correct to me.

That's the conclusion I'm coming to, too. I discussed this rationale with GPT-5.5:

  • evidence is strong that there is a bug in NVML
  • it's not the purpose of the cuda_bindings layer to add sophisticated workarounds for bugs in the wrapped libraries
  • cuda_core is by design adding a higher-level layer, sophisticated workarounds are more suitable and feasible in that layer (and this PR looks great to me)
  • the compatibility note in this PR is helpful, but it's not even really a compatibility note, but more a heads-up: we know there is a bug in NVML, we know how to work around it, and we added the workaround in cuda_core, but if you use cuda_bindings directly you need to be aware of it and work around it yourself
  • ultimately we expect that the NVML bug will be fixed, and no workaround is needed in the future

Additional idea, to be helpful to our users:

Separately from the cuda_core workaround, I still think cuda_bindings would benefit from routing C-string decoding through a tiny helper, to make failures actionable. The current UnicodeDecodeError tells us only the first bad byte and position; in this case the useful information was the raw buffer contents and which NVML call produced them. A helper could keep the success path identical, but on decode failure raise an exception/message that includes the source API name plus a bounded repr/hex dump of the bytes. That would have made the nvbug much easier to diagnose and should be generally useful for future driver/library string issues. — I would avoid special system_get_process_name behavior; it's too niche.
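
One possible shape for such a helper is sketched below. The function name, the `api_name` parameter, and the 64-byte dump limit are all assumptions made for illustration; nothing here is an existing cuda_bindings API.

```python
def decode_c_string(raw: bytes, api_name: str, max_dump: int = 64) -> str:
    """Decode a C string as UTF-8, adding context to any decode failure."""
    try:
        return raw.decode("utf-8")  # success path is a plain decode
    except UnicodeDecodeError as exc:
        dump = raw[:max_dump].hex(" ")  # bounded hex dump of the buffer
        if len(raw) > max_dump:
            dump += " ..."
        reason = f"{exc.reason} (from {api_name}; {len(raw)} bytes: {dump})"
        # Re-raise the same exception type so existing handlers still match.
        raise UnicodeDecodeError(
            exc.encoding, raw, exc.start, exc.end, reason
        ) from exc
```

Raising `UnicodeDecodeError` again, rather than a new type, keeps the change invisible to callers that already catch it while putting the source API and raw bytes in the message.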

@rwgk
Contributor

rwgk commented May 21, 2026

I cancelled the CI manually after seeing that Python 3.12 with 12.9 is hanging again (two jobs). I'll report the details on the tracking bug we have already. I triggered a rerun.

@rwgk rwgk enabled auto-merge (squash) May 21, 2026 15:27
@rwgk
Contributor

rwgk commented May 21, 2026

The hanging jobs from the previous attempt are now tracked here:

#2004 (comment)

@rwgk
Contributor

rwgk commented May 21, 2026

It looks like the jobs tracked under #2004 (comment) are hanging again. Maybe the driver update made the issue a lot worse than before? Each time this happens, two runners are blocked for 4 hours, and of course merging the PR is blocked, too.

Interestingly, I just see this:

tests/memory_ipc/test_send_buffers.py::TestIpcReexport::test_main[DeviceMR] RERUN [ 12%]

But after that it's still hanging.

@rwgk
Contributor

rwgk commented May 21, 2026

Motivated by the experience here, I created #2122 — [ENH]: Make cuda_bindings UnicodeDecodeError more actionable

@rwgk
Contributor

rwgk commented May 21, 2026

The two jobs were hanging again in the 4th attempt. I cancelled them to unblock the runners. In the meantime @aryanputta sent PR #2121, I just triggered the CI there. I'll come back here to try again after PR #2121 is merged.

@rwgk rwgk merged commit 2957595 into NVIDIA:main May 21, 2026
184 of 186 checks passed
Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module cuda.core Everything related to the cuda.core module P1 Medium priority - Should do

3 participants