Skip to content

Optimize package metadata query in simple page retrieval #1242

@dkliban

Description

@dkliban

Problem

The package metadata query in the PyPI Simple API retrieve method is a significant contributor to database load in production. During peak load on 2026-05-20 at 16:00 UTC, 4 concurrent instances of this query pattern consumed 2.73 avg active sessions on the RDS instance (db.m7g.4xlarge, 16 vCPUs). Total DB load reached 21.9 avg active sessions, exceeding the CPU count.

Source

pulp_python/app/pypi/views.py lines 376-386:

packages = local_packages.annotate(
    repo_added_time=Subquery(repo_added_subquery)
).values(
    "filename",
    "sha256",
    "metadata_sha256",
    "requires_python",
    "size",
    "repo_added_time",
    "version",
)

SQL pattern observed (from RDS Performance Insights)

SELECT "python_pythonpackagecontent"."filename" AS "filename",
       "python_pythonpackagecontent"."sha256" AS "sha256",
       "python_pythonpackagecontent"."metadata_sha256" AS "metadata_sha256",
       "python_pythonpackagecontent"."requires_python" AS "requires_python",
       "python_pythonpackagecontent"."size" AS "size",
       "core_content"."pulp_created" AS "pulp_created",
       "python_pythonpackagecontent"."version" AS "version"
FROM "python_pythonpackagecontent"
INNER JOIN "core_content" ON (...)

The wait event breakdown during the spike was primarily CPU (17.9 avg active sessions), indicating the queries are compute-bound rather than I/O-bound.

Suggestions

  • Add database indexes on commonly filtered/joined columns
  • Consider caching results for frequently-accessed packages
  • Evaluate whether the Subquery annotation for repo_added_time can be optimized (e.g., via a JOIN instead of a correlated subquery)

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions