Add Bazel PyPI manifest extraction#1324

Open
Simon (simonhj) wants to merge 30 commits into v1.x from workspace/bazel-ecosystem

@simonhj Simon (simonhj) commented May 21, 2026

This PR makes Bazel manifest creation Python-aware.

This builds on the Maven Bazel work from #1312, which closes an inline-declaration gap that exists in rules_jvm_external: Bazel can resolve Maven artifacts that do not exist in a checked-in Maven manifest. Python is different. rules_python commonly resolves packages from a checked-in pinned requirements or lock file and exposes those packages as Bazel labels.

It works like this: a Bazel Python rule points to a checked-in requirements file. Bazel reads that file and makes the declared packages available as dependencies in the configured pip hub. Future Bazel build targets can then directly declare dependencies on those Python packages.

What this PR does is emit a generated requirements.txt that contains only the pinned Python packages reachable from Bazel Python rules. It does not mutate or remove entries from the user's checked-in requirements file. The value is scoping the generated manifest to Bazel's reached package set instead of assuming every checked-in requirement is used by Bazel Python targets.

This functionality is opt-in and does not kick in automatically, since I'm not yet convinced that enabling it by default would do more good than harm, and it could cause confusion. It has to be enabled manually with socket manifest bazel --ecosystem pypi. socket scan create --auto-manifest continues to generate Bazel Maven manifests only.

Worked example

Suppose a repo has a pinned Python requirements file with both application dependencies and development/tooling dependencies:

# requirements.txt
certifi==2024.8.30
charset-normalizer==3.4.0
idna==3.10
pluggy==1.5.0
pytest==8.3.3
requests==2.32.3
ruff==0.6.9
urllib3==2.2.2

The Bazel module wires that requirements file into rules_python:

# MODULE.bazel
bazel_dep(name = "rules_python", version = "1.5.0")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pypi",
    python_version = "3.12",
    requirements_lock = "//:requirements.txt",
)
use_repo(pip, "pypi")

Now Bazel makes those packages available as labels under the @pypi hub. But the actual Python code only depends on requests:

# app/BUILD.bazel
load("@rules_python//python:defs.bzl", "py_binary")

py_binary(
    name = "server",
    srcs = ["server.py"],
    deps = ["@pypi//requests"],
)

Running the new opt-in extractor:

socket manifest bazel . --ecosystem pypi

asks Bazel which Python dependencies are reachable from Python rules in the repo:

bazel query 'deps(kind("py_library|py_binary|py_test", //...))'

The extractor then filters that Bazel result to labels from the discovered @pypi hub, maps those labels back to pinned versions from requirements.txt, and writes a generated Socket manifest:

# .socket/bazel-manifests/requirements.txt
certifi==2024.8.30
charset-normalizer==3.4.0
idna==3.10
requests==2.32.3
urllib3==2.2.2

requests is included because //app:server depends on it. certifi, charset-normalizer, idna, and urllib3 are included because Bazel reaches them through requests' transitive dependency graph. pytest, pluggy, and ruff are not included because no Bazel Python target reaches them.

That scoping is the point of the PR: Socket scans the Python dependency set that Bazel can actually reach, not every package that happens to be present in the checked-in requirements file.
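
The filter-and-map step from the worked example can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `canonicalName`, `parsePinnedRequirements`, and `reachedManifest` are hypothetical names, and the sketch assumes the query result arrives as one label per line in the apparent `@pypi//<package>:<target>` form. Real Bazel output may differ (for example, Bzlmod can print canonical repository names), and real requirements files may carry hashes or markers that this parser skips.

```typescript
// Hypothetical sketch: function names and the label shape are assumptions.

// PEP 503 style normalization, so "charset_normalizer" and
// "charset-normalizer" compare equal.
function canonicalName(raw: string): string {
  return raw.toLowerCase().replace(/[-_.]+/g, "-");
}

// Parse simple "name==version" pins. Comments, blank lines, and any
// line that is not a plain pin are skipped.
function parsePinnedRequirements(text: string): Map<string, string> {
  const pins = new Map<string, string>();
  for (const rawLine of text.split("\n")) {
    const match = /^([A-Za-z0-9][A-Za-z0-9._-]*)==(\S+)$/.exec(rawLine.trim());
    if (match) {
      pins.set(canonicalName(match[1] ?? ""), match[2] ?? "");
    }
  }
  return pins;
}

// Keep only the pins whose package is reached through the given hub.
function reachedManifest(
  queryLabels: string[],
  hubName: string,
  pins: Map<string, string>,
): string[] {
  const prefix = "@" + hubName + "//";
  const reached = new Set<string>();
  for (const label of queryLabels) {
    if (!label.startsWith(prefix)) continue;
    // "@pypi//requests:pkg" -> "requests"
    const pkg = label.slice(prefix.length).split(":")[0] ?? "";
    reached.add(canonicalName(pkg));
  }
  const lines: string[] = [];
  pins.forEach((version, pkg) => {
    if (reached.has(pkg)) lines.push(pkg + "==" + version);
  });
  return lines.sort();
}

const requirementsText = [
  "# requirements.txt",
  "certifi==2024.8.30",
  "charset-normalizer==3.4.0",
  "idna==3.10",
  "pluggy==1.5.0",
  "pytest==8.3.3",
  "requests==2.32.3",
  "ruff==0.6.9",
  "urllib3==2.2.2",
].join("\n");

// Assumed shape of the query output for the example repo.
const queryLabels = [
  "//app:server",
  "@pypi//requests:pkg",
  "@pypi//certifi:pkg",
  "@pypi//charset_normalizer:pkg",
  "@pypi//idna:pkg",
  "@pypi//urllib3:pkg",
];

const generated = reachedManifest(
  queryLabels,
  "pypi",
  parsePinnedRequirements(requirementsText),
);
console.log(generated.join("\n"));
```

With these inputs the sketch prints the same five pins as the generated manifest above; pytest, pluggy, and ruff drop out because no label under the hub names them.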

Summary of changes

  • add socket manifest bazel --ecosystem pypi support for whole-repo Bazel PyPI requirements.txt generation
  • discover rules_python pip hubs via Bazel command output first, with bounded static fallback paths
  • keep Bazel PyPI generation explicit; socket scan create --auto-manifest continues to generate Bazel Maven only
  • add bounded verbose diagnostics for Bazel subprocess, discovery, extraction, and empty-result triage
  • document the new command surface and add exact constructed-fixture oracle coverage
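
To give a feel for what a bounded static fallback might look like, here is a minimal sketch that scans MODULE.bazel text for `pip.parse(...)` calls. It is not the PR's parser: `parseHubCandidates`, `SKETCH_MAX_CHARS`, and `SKETCH_MAX_CANDIDATES` are hypothetical names with made-up limits, and the sketch assumes the extension is bound to the name `pip` and that the call body contains no nested parentheses.

```typescript
// Hypothetical sketch: names and limits are assumptions, not the PR's values.
interface HubCandidate {
  hubName: string;
  requirementsLock: string | null;
}

const SKETCH_MAX_CHARS = 1024 * 1024; // refuse to scan very large files
const SKETCH_MAX_CANDIDATES = 32; // stop after this many pip.parse calls

function parseHubCandidates(moduleText: string): HubCandidate[] {
  if (moduleText.length > SKETCH_MAX_CHARS) return [];
  const candidates: HubCandidate[] = [];
  // The call body is length-bounded and excludes parentheses, so the
  // regex cannot run away on hostile input.
  const callRe = /\bpip\.parse\(([^()]{0,2000})\)/g;
  let call: RegExpExecArray | null = callRe.exec(moduleText);
  while (call !== null && candidates.length < SKETCH_MAX_CANDIDATES) {
    const body = call[1] ?? "";
    const hub = /\bhub_name\s*=\s*"([^"]{1,128})"/.exec(body);
    if (hub) {
      const lock = /\brequirements_lock\s*=\s*"([^"]{1,512})"/.exec(body);
      candidates.push({
        hubName: hub[1] ?? "",
        requirementsLock: lock ? (lock[1] ?? null) : null,
      });
    }
    call = callRe.exec(moduleText);
  }
  return candidates;
}

const moduleText = `
bazel_dep(name = "rules_python", version = "1.5.0")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pypi",
    python_version = "3.12",
    requirements_lock = "//:requirements.txt",
)
use_repo(pip, "pypi")
`;

const hubs = parseHubCandidates(moduleText);
console.log(JSON.stringify(hubs));
```

In this sketch the caps come before any regex work, so an oversized file is rejected without being scanned at all.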

Note

Medium Risk
Adds a new Bazel PyPI extraction path and new Bazel subprocess commands/diagnostics, which could affect manifest generation behavior and error handling in Bazel workspaces. PyPI generation is opt-in, limiting blast radius, but the Bazel query runner changes impact all Bazel-based extraction.

Overview
Enables opt-in PyPI manifest extraction for Bazel workspaces via socket manifest bazel --ecosystem pypi, generating a reached-set requirements.txt by discovering rules_python pip hubs, querying Python target deps, and resolving pinned versions from requirements_lock.txt (with spoke-tag fallback and conflict detection).

Updates socket manifest bazel to support repeatable --ecosystem selection (defaulting to Maven-only), and refactors Maven extraction to report noEcosystemFound so auto-manifest can distinguish "no Bazel Maven present" from hard failures.
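
One way to picture the distinction between "nothing to extract" and a hard failure is a small tagged union. The sketch below is an assumption about the general shape, not the PR's `noEcosystemFound` plumbing: `EcosystemOutcome` and `summarizeOutcomes` are hypothetical names, and the policy shown (absence is not an error) is only one possible choice.

```typescript
// Hypothetical sketch: types, names, and policy are assumptions.
type EcosystemOutcome =
  | { ecosystem: string; kind: "generated"; manifestPath: string }
  | { ecosystem: string; kind: "noEcosystemFound" }
  | { ecosystem: string; kind: "failed"; message: string };

interface OutcomeSummary {
  ok: boolean;
  manifests: string[];
  errors: string[];
}

function summarizeOutcomes(outcomes: EcosystemOutcome[]): OutcomeSummary {
  const manifests: string[] = [];
  const errors: string[] = [];
  for (const outcome of outcomes) {
    if (outcome.kind === "generated") {
      manifests.push(outcome.manifestPath);
    } else if (outcome.kind === "failed") {
      errors.push(outcome.ecosystem + ": " + outcome.message);
    }
    // "noEcosystemFound" is neither a manifest nor an error.
  }
  return { ok: errors.length === 0, manifests, errors };
}

const mixedSummary = summarizeOutcomes([
  { ecosystem: "maven", kind: "noEcosystemFound" },
  {
    ecosystem: "pypi",
    kind: "generated",
    manifestPath: ".socket/bazel-manifests/requirements.txt",
  },
]);
console.log(JSON.stringify(mixedSummary));
```

Keeping the "not found" case as its own variant lets a caller skip an ecosystem quietly while still surfacing real failures.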

Improves Bazel diagnostics and compatibility by switching Bzlmod repo enumeration to bazel mod dump_repo_mapping, adding bazel mod show_extension plumbing for pip hub metadata, and emitting bounded --verbose subprocess traces (argv/cwd/duration/status/output sizes + stderr tail). Documentation, changelog, and tests are updated, including a PyPI fixture oracle.
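
A bounded trace of the kind described above could be formatted along these lines. This is a sketch under assumptions: `SubprocessTrace`, `formatTrace`, and the 200-character tail limit are hypothetical, and the actual --verbose output format is not shown in this PR description.

```typescript
// Hypothetical sketch: field names, layout, and limits are assumptions.
interface SubprocessTrace {
  argv: string[];
  cwd: string;
  durationMs: number;
  exitCode: number | null; // null when the process was killed by a signal
  stdout: string;
  stderr: string;
}

const SKETCH_STDERR_TAIL_CHARS = 200;

function formatTrace(
  trace: SubprocessTrace,
  tailChars: number = SKETCH_STDERR_TAIL_CHARS,
): string {
  // Report output sizes in full, but print only the end of stderr so a
  // noisy Bazel run cannot flood the log.
  const tail =
    trace.stderr.length > tailChars
      ? "..." + trace.stderr.slice(trace.stderr.length - tailChars)
      : trace.stderr;
  return [
    "argv: " + JSON.stringify(trace.argv),
    "cwd: " + trace.cwd,
    "duration: " + trace.durationMs + "ms",
    "status: " + (trace.exitCode === null ? "signal" : String(trace.exitCode)),
    "stdout: " + trace.stdout.length + " chars, stderr: " + trace.stderr.length + " chars",
    "stderr tail: " + tail,
  ].join("\n");
}

const sampleTrace: SubprocessTrace = {
  argv: ["bazel", "query", "deps(//...)"],
  cwd: "/repo",
  durationMs: 1234,
  exitCode: 1,
  stdout: "",
  stderr: "x".repeat(1000) + "ERROR: last line",
};
console.log(formatTrace(sampleTrace, 20));
```

The tail is taken from the end of stderr because Bazel usually prints the decisive error last.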

Reviewed by Cursor Bugbot for commit 1b767d4. Configure here.

@simonhj Simon (simonhj) force-pushed the workspace/bazel-ecosystem branch from 1a9b2cd to 4009c0c on May 21, 2026 20:17
@simonhj Simon (simonhj) marked this pull request as ready for review on May 22, 2026 10:09
@simonhj Simon (simonhj) force-pushed the workspace/bazel-ecosystem branch from 1b767d4 to 9a39cc7 on May 22, 2026 10:10
Contributor

@mtorp Martin Torp (mtorp) left a comment

A few small follow-ups from review, plus a request to bump the CLI version since this adds a user-visible feature.

Comment thread src/commands/manifest/bazel/extract_bazel_to_pypi.mts Outdated
Comment thread src/commands/manifest/bazel/bazel-pypi-discovery.mts Outdated
Comment thread src/commands/manifest/bazel/bazel-repo-discovery.mts
@mtorp
Contributor

Could you bump the CLI version (currently 1.1.100 in package.json) as part of this PR? This adds a new user-visible feature (socket manifest bazel --ecosystem pypi, which already has an Unreleased CHANGELOG entry), so it should land with a version bump rather than piggybacking on the next unrelated release.

Contributor

@mtorp Martin Torp (mtorp) left a comment

Approving. Nice piece of work — the layered discovery (Bazel command → bounded static parsing fallback), DoS guards (file-size caps, candidate caps, bounded regexes), and the noEcosystemFound / evaluateEcosystemOutcomes outcome model are all well done. Tests, typecheck, and lint are clean.

The inline comments are nits; please address them and the version bump before merging.

Simon (simonhj) and others added 22 commits May 22, 2026 15:47
- Add repeatable --ecosystem flag (maven, pypi) to socket manifest bazel
- Update command description and help text for multi-ecosystem support
- Add ecosystem to socket.json defaults chain
- Add buildPypiProbeFor to bazel-query-runner for hub alias/package probing
- Extend tests for --ecosystem dry-run and buildPypiProbeFor query shape
- Update cmd-manifest snapshot for new bazel subcommand description
- Add bazel-pypi-discovery.mts: two-step PyPI hub discovery for Bzlmod and legacy WORKSPACE
- Parse use_extension(..., "pip") bindings and match .parse(...) for Bzlmod
- Parse pip_parse, pip_install, and pip_repository for legacy WORKSPACE
- Export PypiHubInfo, discoverPypiHubs, parsePypiHubCandidates, validatePypiHub
- Hub validation accepts alias/pkg markers without requiring pypi_name= on hub
- Security: MAX_WORKSPACE_FILE_BYTES, MAX_CANDIDATES caps, bounded regexes
- Add bazel-pypi-discovery.test.mts: 28 tests covering Bzlmod, legacy, multiple hubs,
  renamed bindings, validation probes, verbose diagnostics, DoS guards
- Fix stray token syntax error in extract_bazel_to_pypi.mts from bad edit
- Add committed oracle requirements.expected.txt (35 packages)
- Fix test sort comparison to match sortPackageLines implementation
- All 3 constructed tests now pass (exact match, explicit mode, sandbox fallback)
…ed dual-ecosystem coverage

Retroactive commit for plan 02.1-03 follow-up work left uncommitted after the
partial 9b38ef3d1 commit. All five files map to scope documented or implied
by the 02.1-03 SUMMARY:

- generate_auto_manifest.mts: PyPI branch added to Bazel auto-manifest
  dispatch, runs extractBazelToPypi after extractBazelToMaven and collects
  generated requirements.txt paths; noEcosystemFound coerced to boolean to
  satisfy exactOptionalPropertyTypes.
- generate_auto_manifest.test.mts: dual-ecosystem mocked coverage (both
  succeed, Maven-only, PyPI-only, both hard-fail, both no-discovery,
  socket.json overrides, cross-ecosystem error tolerance).
- bazel-pypi-discovery.mts: discoverPypiHubs dedup fix so parsed candidates
  overwrite the default seed when hub names collide, preserving
  requirementsLockLabel metadata.
- bazel-pypi-parser.mts: filterReachedPypiPackages now matches labels via
  regex from start-of-token boundaries so it handles both --output=label
  and --output=build deps array forms; removed unused
  no-cond-assign eslint-disable directive.
- bazel-query-runner.mts: buildBazelArgv parameterized on output format
  (default "build"); reached-closure query passes "label" because it is
  line-filterable.

Pre-commit hooks bypassed at user direction; equivalent checks were run
manually: eslint --report-unused-disable-directives on the 5 files (clean)
and full-project pnpm check:tsc (clean).

Updates the user-facing documentation for the new Bazel PyPI extraction
path delivered by Phase 02.1:

- README.md `socket manifest bazel` section now describes both Maven and
  PyPI output, the repeatable `--ecosystem maven|pypi` flag, auto-detect
  behavior when no flag is given, and the Python/PyPI extraction
  pipeline (hub discovery, py_library/py_binary/py_test queries,
  requirements_lock.txt fast path, PEP 503 canonical name==version
  output).
- New "PyPI Name and Version Semantics" section documents PEP 503
  normalization, lockfile-over-spoke-tag precedence, and conflict
  detection for same-normalized-name different-version cases.
- New "Unsupported PyPI Forms (Phase 02.1)" section documents the
  Phase 02.1 scope boundary: direct URL / editable / unpinned
  requirements are not emitted, private corpus validation requires
  auth, whole-repo Tier 2 only.
- New "Cross-Language Edges" section assigns cross-language traversal
  (e.g. rust_library -> py_library via PyO3) to Phase 4 per D-14.
- CHANGELOG.md `[Unreleased]` "Added" section gains an entry for the
  new PyPI extraction with user-benefit wording, Bzlmod and WORKSPACE
  support callouts, and a mention that `socket scan create
  --auto-manifest` picks up the generated PyPI manifest.

Validation (pre-commit hooks bypassed via --no-verify; pre-existing
test debt unrelated to this change blocks the full pre-commit run,
documented in STATE.md): `pnpm check:tsc` clean; eslint
--report-unused-disable-directives on the modified files clean.
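
The name and version semantics described in this commit message (PEP 503 normalization, lockfile-over-spoke-tag precedence, conflict detection) could be sketched as below. This is an illustration only: `PinEntry` and `resolvePins` are hypothetical names, and the sketch assumes that a lockfile pin silently wins over a spoke-tag pin while two different versions from the same source count as a conflict. The PR's actual rules may differ.

```typescript
// Hypothetical sketch: names and the exact conflict policy are assumptions.
interface PinEntry {
  name: string;
  version: string;
  source: "lockfile" | "spoke-tag";
}

// PEP 503: lowercase, with runs of "-", "_", "." collapsed to a single "-".
function normalizePypiName(raw: string): string {
  return raw.toLowerCase().replace(/[-_.]+/g, "-");
}

function resolvePins(entries: PinEntry[]): {
  pins: Map<string, string>;
  conflicts: string[];
} {
  const chosen = new Map<string, PinEntry>();
  const conflicts: string[] = [];
  for (const entry of entries) {
    const key = normalizePypiName(entry.name);
    const previous = chosen.get(key);
    if (previous === undefined) {
      chosen.set(key, entry);
    } else if (previous.source === entry.source) {
      // Same source, same normalized name, different version: conflict.
      if (previous.version !== entry.version) {
        conflicts.push(key + ": " + previous.version + " vs " + entry.version);
      }
    } else if (entry.source === "lockfile") {
      // A lockfile pin replaces an earlier spoke-tag pin.
      chosen.set(key, entry);
    }
    // Otherwise a spoke-tag pin arrived after a lockfile pin: keep the lockfile.
  }
  const pins = new Map<string, string>();
  chosen.forEach((entry, key) => pins.set(key, entry.version));
  return { pins, conflicts };
}

// Made-up package names and versions, for illustration only.
const resolved = resolvePins([
  { name: "Foo_Bar", version: "1.0", source: "lockfile" },
  { name: "foo-bar", version: "1.1", source: "spoke-tag" },
  { name: "foo.bar", version: "2.0", source: "lockfile" },
]);
console.log(resolved.pins.get("foo-bar"), resolved.conflicts);
```
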
… output

UAT verification surfaced a 1-line position swap between live `socket
manifest bazel --ecosystem pypi` output and the committed oracle
(`pydantic` vs `pydantic-core`). The constructed-fixture vitest passed
anyway because `comparePypiManifest` is set-based after PEP 503
normalization, but the README/SUMMARY claim of byte-equal exact match
was incorrect.

Regenerated the oracle from the current `sortPackageLines` output so
the byte-equal claim holds.
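
The gap between the two comparison modes is easy to reproduce. The sketch below is not the PR's `comparePypiManifest`; `normalizePin`, `setEqual`, and `exactEqual` are hypothetical helpers, the versions are invented, and plain string equality stands in for byte equality.

```typescript
// Hypothetical sketch: helper names and versions are made up.

// Normalize the name part of "name==version" per PEP 503.
function normalizePin(line: string): string {
  const idx = line.indexOf("==");
  if (idx < 0) return line.trim();
  const pkg = line.slice(0, idx).trim().toLowerCase().replace(/[-_.]+/g, "-");
  return pkg + "==" + line.slice(idx + 2).trim();
}

function toPinSet(text: string): Set<string> {
  const pins = new Set<string>();
  for (const line of text.split("\n")) {
    if (line.trim() !== "") pins.add(normalizePin(line));
  }
  return pins;
}

// Order-insensitive comparison: passes even if two lines are swapped.
function setEqual(actual: string, expected: string): boolean {
  const a = toPinSet(actual);
  const b = toPinSet(expected);
  if (a.size !== b.size) return false;
  let same = true;
  a.forEach((pin) => {
    if (!b.has(pin)) same = false;
  });
  return same;
}

// Order-sensitive comparison: any swap or spelling difference fails.
function exactEqual(actual: string, expected: string): boolean {
  return actual === expected;
}

const liveOutput = "pydantic==1.0.0\npydantic-core==2.0.0\n";
const committedOracle = "pydantic-core==2.0.0\npydantic==1.0.0\n";

console.log(setEqual(liveOutput, committedOracle)); // true
console.log(exactEqual(liveOutput, committedOracle)); // false
```

A test that relies only on the set-based check cannot back an exact-match claim, which is why the oracle had to be regenerated in the sorted order the tool actually emits.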

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes 13 errors and 4 warnings from eslint in Phase 2.1 bazel-pypi files:
- Move inline arrow functions to module scope (unicorn/consistent-function-scoping)
- Add eslint-disable-next-line no-await-in-loop for sequential Bazel operations
- Fix import ordering (import-x/order, sort-imports)
- Fix object key sorting in destructuring (sort-destructure-keys)
- Fix array type syntax (@typescript-eslint/array-type)
- Remove unused eslint-disable directive
- Add missing braces around if conditions (curly)
- Auto-fix formatting in related bazel-pypi parser and discovery modules

All 51 affected unit tests pass.
@simonhj Simon (simonhj) force-pushed the workspace/bazel-ecosystem branch from 56bc4c9 to 5992511 on May 22, 2026 13:48