Skip to content

Run the Manager locally (make dev) via a NodeApi abstraction + DummyApi#353

Open
runleveldev wants to merge 18 commits into
mainfrom
329-fixes
Open

Run the Manager locally (make dev) via a NodeApi abstraction + DummyApi#353
runleveldev wants to merge 18 commits into
mainfrom
329-fixes

Conversation

@runleveldev

@runleveldev runleveldev commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Merge #351 and #352 first (this branch already contains them, rebased on current main which includes #347).

Why

Supersedes #329. That PR made the Manager runnable on localhost but did so in ways flagged in review: a mock that bypassed the job-runner, in-band signaling (apiUrl === 'local'), a SQLite-specific migration, and dev-login provisioning that clobbered the Docker Compose bootstrap.

This reworks the same goal — run the Manager standalone and create containers without Proxmox — but addresses the root causes and keeps the real provisioning code path intact. It also lays the groundwork for additional production-capable node providers behind a single NodeApi abstraction (starting with the dummy provider).

What changed

1. NodeApi abstraction (ProxmoxApi + new DummyApi)

  • Node.api() now dispatches on a new nodeType column:
    • proxmox (default) → authenticated ProxmoxApi (unchanged behavior).
    • dummy → new DummyApi (utils/dummy-api.js) implementing the same method surface, simulating the Proxmox calls.
  • No HTTP-handler short-circuit. Container creation still flows POST /containersJobjob-runner.jsbin/create-container.jsnode.api(). The full creation path is exercised; only the hypervisor calls are faked. Running the job-runner is required (and make dev does so).
  • DummyApi.clusterResources('lxc') derives the running set from the DB so the live container-status resolver (Container status API: compute live status instead of static DB column #352) reports dummy containers as running; the dev bootstrap gives the dummy node placeholder credentials so the resolver routes through node.api().

2. Provider-agnostic-ish node selection

Node selection only excludes nodes that can't provision (it picks a dummy node, or a node with full apiUrl/tokenId/secret), so a half-configured node is never selected and then failed in node.api(). The DELETE path likewise routes VM teardown through node.api(). Node.api() validates apiUrl alongside the credentials.

3. nodeType column (Postgres-safe)

New migration 20260605000000-add-node-type-to-nodes adds nodeType (STRING(50), default 'proxmox'). Existing rows backfill to proxmox via the column default; Node.bulkCreate (node import) also applies it, so production node selection is unchanged.

4. Root-cause fix for the erroneous UNIQUE on Containers.nodeId (via patch-package)

A v6 backport of sequelize/sequelize#17583 (patches/sequelize+6.37.8.patch) so SQLite's changeColumn/describeTable no longer marks members of a composite index as unique. Wired via patch-package (postinstall). Replaces the SQLite-specific migration.

5. Local dev environment out of /auth/dev

New bin/dev-bootstrap.js seeds a localhost site, a dummy node, and a localhost external domain. It no-ops if any Site already exists, so it never collides with the Docker Compose bootstrap. The /auth/dev login route is left clean. The groups + push-notification seeders are made idempotent so re-running them under make dev doesn't violate unique constraints.

6. SQL query logging gated by LOG_LEVEL=trace

Sequelize query logging (noisy — the job-runner polls the Jobs table every tick) is gated behind LOG_LEVEL=trace, in both dev and production; off otherwise. Pass it through: make dev LOG_LEVEL=trace.

7. @vscode/sqlite3 instead of the unmaintained sqlite3

The dev SQLite driver is the actively-maintained @vscode/sqlite3 fork, wired in via Sequelize v6 dialectModule. It builds from source (no prebuilt-binary glibc mismatches that broke make dev on LTS distros) and is a devDependency required lazily only when the sqlite dialect is selected — so production (Postgres) never installs or loads it (npm ci --omit=dev skips the native build entirely).

8. make dev

Lives in create-a-container/Makefile (the root Makefile delegates). Self-contained: installs server + client deps with npm install --no-package-lock (so it never rewrites the lockfile), runs migrations, seeds the localhost site + dummy node, builds the client, then runs server (nodemon) + job-runner + client watch together. No .env is needed — the server's defaults (SQLite at data/database.sqlite, an auto-generated session secret, dev login when not in production) are sufficient. db:migrate is fixed to call sequelize-cli.

9. Docs

New "Run the Manager Locally (make dev)" section in the development-workflow doc; README + release-pipeline pointers updated; LOG_LEVEL documented in example.env.

How this addresses the #329 review comments

Review comment Resolution
Mocks are a bad idea; prefer a NodeApi interface + DummyApi, with the job-runner still in the loop DummyApi implements the ProxmoxApi surface; creation runs through the job-runner and bin/create-container.js unchanged
In-band signaling (apiUrl === 'local'); want a dedicated field Dedicated nodeType column drives Node.api()
Dev-login defaults break the docker-compose bootstrap Provisioning moved out of /auth/dev into bin/dev-bootstrap.js, which no-ops if any Site exists
Don't merge SQLite-specific migrations (Postgres is the only supported DB) No SQLite migration; root cause fixed via patch-package backport of sequelize#17583
Redundant .gitignore entry make dev uses the default data/database.sqlite (already ignored by create-a-container/data/.gitignore)

Also addressed the Copilot review on this PR (provisionable-node filter + clearer error, Node.api() apiUrl validation, conditional LOG_LEVEL forwarding in the Makefile, larger non-reused DummyApi VMID space, real locally-administered MAC prefix, patch-package moved so --omit=dev installs don't fail).

Testing

  • Migrations run clean on fresh SQLite (incl. Container status API: compute live status instead of static DB column #352's remove-container-status); Containers has no column-level UNIQUE on nodeId.
  • End-to-end: bin/create-container.js against DummyApi lands the container, and the live status resolver reports it running (and creating/failed for the other states); DELETE routes through node.api() and removes the row.
  • npm ci (dev) and npm ci --omit=dev (production, no @vscode/sqlite3, no native build) both succeed.
  • make dev runs server + job-runner + client watch and does not modify the lockfile.

Notes

Closes #329

Admin-defined default environment variables were merged into a
container's DB record when the container was created. This froze the
defaults to their creation-time values and exposed them in the edit UI.

Now the container record stores only user-defined env vars. System
defaults (from Settings) and NVIDIA GPU defaults are merged in at
configure-time, just before the Proxmox API call, with user-defined
values taking precedence.

- models/container.js: buildLxcEnvConfig() now merges
  defaults < NVIDIA defaults < user vars; add getSystemDefaultEnvVars(),
  parseEnvironmentVars(), and nvidiaDefaultEnvVars() helpers.
- bin/create-container.js: merge system defaults at configure-time and
  stop writing merged env/entrypoint back into the DB record.
- bin/reconfigure-container.js: merge system defaults on edit so newly
  added defaults also reach existing containers.
- routers/api/v1/containers.js: stop baking NVIDIA defaults into the
  saved record on create.

Closes #349
Move all environment-variable merge logic into Container#buildLxcEnvConfig
so there is exactly one place that determines the env to deploy. The
method now loads the admin-defined system defaults itself (it became
async) and composes them with the NVIDIA and user-defined values:

  system defaults < NVIDIA defaults < user-defined values

Callers in create-container.js and reconfigure-container.js are reduced
to a single `await container.buildLxcEnvConfig()` call and no longer load
or pass defaults themselves. getSystemDefaultEnvVars, parseEnvironmentVars
and nvidiaDefaultEnvVars are now internal helpers of that method.
… the Proxmox env string

Addresses PR review: env var keys were only trimmed, so a key containing
'=' or NUL (or a non-string/array value) could corrupt the NUL-separated
KEY=value blob sent to Proxmox (splitting one var into many, breaking
parsing, or stringifying objects to "[object Object]").

Add Container.normalizeEnvVars() as the single definition of a valid env
var: input must be a plain object; keys are trimmed and must match a
conventional env-var name (no '=', NUL, or whitespace); values are coerced
to strings, with null/undefined, objects/arrays, and NUL-containing values
dropped.

Apply it at every ingest/load point:
- getSystemDefaultEnvVars() and parseEnvironmentVars() normalize their output.
- API create/update routes serialize user env vars via a shared
  serializeUserEnvVars() helper that normalizes before persisting.

buildLxcEnvConfig can now trust its inputs, so the per-pair guards are
dropped.
buildLxcEnvConfig always added env/entrypoint to Proxmox's `delete` list
when the container record had no value. On create the container is freshly
cloned from a template that may carry its own env/entrypoint, so emitting
`delete` actively wiped those template-provided defaults.

Add a deleteMissing option (default false). Proxmox's config PUT is a
partial update, so omitting a key leaves it untouched:

- create-container.js uses the default (false): empty env/entrypoint are
  omitted, preserving whatever the cloned template provides.
- reconfigure-container.js passes deleteMissing:true so clearing the last
  env var or removing a custom entrypoint still unsets it on the existing
  container.

The reconfigure restart-detection already ignores the `delete` key, so a
pure-delete config does not, by itself, trigger a restart.
Reconfigure uses deleteMissing, so any env/entrypoint that exists only on
the cloned template (and was never stored on the container) would be
unset on the first reconfigure — surprising the user. We can't re-query
the template later: templates are mutable Docker refs that may change or
disappear, and we don't cache them.

Compromise: at creation, after the template is cloned, read its LXC
config and fold its env/entrypoint into the container record as if the
user had supplied them (precedence: template < user). This guarantees
those values persist across future reconfigures without changing API
client expectations.

System and NVIDIA defaults are deliberately NOT persisted — they remain
configure-time-only (merged by buildLxcEnvConfig) so they aren't frozen
into the record or exposed as user values in the edit UI.

- models/container.js: add Container.parseLxcEnvString() and
  Container#persistTemplateDefaults().
- bin/create-container.js: call persistTemplateDefaults() with the cloned
  template's config before building/applying the LXC env config.
Resolves #350.

Container status was a static column updated only by the create/reconfigure
job scripts, so it drifted from reality whenever Proxmox changed out of band
or a job died mid-run. This replaces it with status computed on demand.

New utils/container-status.js resolves status from three sources:
  - Proxmox: does the LXC exist, and is it running/stopped?
  - Jobs: active create/reconfigure job, or did the last create job fail?
  - Config: does the live LXC config match what the API server expects?

Resolved statuses: running, offline, creating, restarting, failed, missing,
out-of-sync (plus unknown when Proxmox is unreachable).

New endpoint:
  GET /api/v1/sites/:siteId/containers/:id/status -> { data: { status } }

Non-breaking: the same live status is still embedded on the list/show/create
payloads, so existing consumers keep working.

The static Containers.status column is dropped (migration + model). All former
writers were updated to stop persisting status; templates.js now selects
provisioned containers by VMID + IPv4 instead of status='running'. The list
page queries the per-container status endpoint (seeded by the list value for
instant paint, polling while creating/restarting).

OpenAPI and README updated.
Live status is already embedded in the existing list/show/create container
payloads, so the dedicated GET /containers/:id/status endpoint is redundant.
Per-row polling + hydration on the index page is not needed at this time, so
revert the ContainerStatusBadge to a simple presentational badge that renders
the live status from the list response.

Removes the endpoint, the client query/key/result type, the OpenAPI path, and
the README endpoint section (the status-value reference is retained as a field
description). The ContainerStatus enum and the live resolver remain.
- out-of-sync is a warning, not a danger, badge variant
- templates.js routes containers by ipv4Address only (a VMID isn't required to route)
- container-status.js: match --container-id as a whole argument when looking up
  create/reconfigure jobs, so container 12 no longer matches container 123's
  jobs (which could misclassify status). LIKE narrows at the DB level; a regex
  confirms the id is terminated by a space or end-of-string.
- resource-requests.js: the applyResourceToExistingContainers query filtered on
  the now-removed Containers.status column ("Unknown column 'status'" after the
  migration). Use containerId != null (a VMID is required to reconfigure anyway).
- README: note the create endpoint also returns the live status field.
The calling convention for create/reconfigure job commands is stable and
deterministic, so match the command string exactly in the query rather than
fetching LIKE candidates and regex-filtering in JS:

- create: exact match on `node bin/create-container.js --container-id=<id>`
  (create commands never carry trailing args).
- reconfigure: exact match OR `<base> %` (exact base followed by a space), which
  covers the optional ` --<resource>=<value>` flags while still preventing id 12
  from matching id 123.
Resolving the container list status was O(N) DB queries because every container
ran its own create/reconfigure job lookup (up to 2 queries each). Add
prefetchLatestJobs() which fetches the latest create + reconfigure job for the
whole set of containers in a single query (bounded WHERE clauses, independent of
N), buckets them per container, and feeds the result into computeContainerStatus
via an optional `jobs` param.

computeContainerStatuses now issues one job query total instead of ~2N. The
single-container show path is unchanged (still queries per container when no
prefetched jobs are passed).
…ng, out-of-sync)

Resolving the container list was still slow because each container did per-item
work: a command-string job lookup for 'restarting', and a Proxmox lxcConfig call
for 'out-of-sync'. Both are inherently O(N) (Proxmox has no bulk config endpoint),
which dominated index latency.

Drop both statuses:
- 'restarting': not used by known API clients (launchpad only checks == running),
  and reconfigure jobs have no FK so detecting it required command-string queries.
- 'out-of-sync': required one lxcConfig call per in-Proxmox container; no bulk
  Proxmox config API exists to batch it.

Status now derives only from:
- Proxmox run-state, via one clusterResources('lxc') snapshot per node, and
- the create job, resolved from the eager-loaded creationJob association (or the
  creationJobId FK) — no command-string queries.

Result: the index makes one Proxmox call per node and zero per-container Proxmox
or DB job queries (verified at 30 containers). Remaining statuses: running,
offline, creating, failed, missing, unknown. PUT still enqueues a reconfigure job
and returns its jobId; only the 'restarting' status label is removed.

Updates the resolver, containers router (eager-load creationJob on the index),
client types/badges, OpenAPI enum, and README.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the Manager runnable standalone on localhost via make dev, adding a provider abstraction (Node.api() dispatching by nodeType) and a DummyApi so container creation can exercise the real job-runner path without needing Proxmox.

Changes:

  • Add nodeType with Node.api() dispatch (proxmox vs dummy) and implement DummyApi to simulate hypervisor calls.
  • Introduce make dev + bin/dev-bootstrap.js to bootstrap a local SQLite dev environment and run server + job-runner together.
  • Add a patch-package backport patch for Sequelize SQLite composite-index uniqueness behavior and gate Sequelize SQL logging behind LOG_LEVEL=trace.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
README.md Updates docs pointer to highlight make dev workflow.
mie-opensource-landing/docs/developers/development-workflow.md Adds a dedicated make dev section and reorganizes Docker-stack guidance.
Makefile Adds dev target to install, migrate, bootstrap, build, and run server + job-runner.
create-a-container/utils/dummy-api.js Introduces DummyApi implementing the NodeApi/ProxmoxApi method surface for local runs.
create-a-container/routers/api/v1/containers.js Makes node selection provider-agnostic and routes delete teardown through node.api().
create-a-container/patches/sequelize+6.37.8.patch Adds a patch-package patch for Sequelize v6 SQLite query-interface uniqueness handling.
create-a-container/package.json Adds postinstall patch-package hook and updates db migration script to sequelize-cli.
create-a-container/package-lock.json Locks patch-package and its transitive dependencies.
create-a-container/models/node.js Adds nodeType field and Node.api() dispatch to DummyApi vs ProxmoxApi.
create-a-container/migrations/20260605000000-add-node-type-to-nodes.js Adds nodeType column with default proxmox.
create-a-container/example.env Documents LOG_LEVEL behavior for SQL query logging.
create-a-container/config/config.js Gates Sequelize SQL logging behind LOG_LEVEL=trace (including production).
create-a-container/bin/dev-bootstrap.js Seeds an empty DB with a localhost site, dummy node, and localhost external domain.
Files not reviewed (1)
  • create-a-container/package-lock.json: Generated file

Comment thread create-a-container/routers/api/v1/containers.js Outdated
Comment thread create-a-container/routers/api/v1/containers.js Outdated
Comment thread create-a-container/models/node.js Outdated
Comment thread Makefile Outdated
Comment thread create-a-container/utils/dummy-api.js
Comment thread create-a-container/utils/dummy-api.js Outdated
Comment thread create-a-container/package.json
… + DummyApi

Run the Manager standalone on localhost (SQLite, `make dev`) and create
containers without Proxmox, keeping the real provisioning code path.

- NodeApi abstraction: Node.api() dispatches on a new `nodeType` column.
  `proxmox` (default) returns ProxmoxApi; `dummy` returns a new DummyApi that
  implements the same surface and simulates Proxmox calls. Creation still runs
  POST /containers -> Job -> job-runner -> bin/create-container.js -> node.api().
- DummyApi.clusterResources('lxc') derives the running set from the DB so the
  live container-status resolver (#352) reports dummy containers correctly;
  dev-bootstrap gives the dummy node placeholder creds so nodeHasCreds() passes.
- Provider-agnostic node selection that only excludes unprovisionable nodes
  (dummy, or proxmox with apiUrl/tokenId/secret); DELETE routes teardown through
  node.api(). Node.api() validates apiUrl alongside credentials.
- Postgres-safe migration adding `nodeType` (existing rows default to proxmox).
- Fix the erroneous UNIQUE on Containers.nodeId via patch-package (v6 backport of
  sequelize/sequelize#17583); replaces the SQLite-specific migration.
- bin/dev-bootstrap.js seeds a localhost site + dummy node + external domain,
  no-op if any Site exists, so it never clobbers the docker-compose bootstrap.
- Make idempotent seeders (groups, push-notification) so re-running them in
  `make dev` doesn't violate unique constraints.
- Gate Sequelize SQL query logging behind LOG_LEVEL=trace; off otherwise.
- Replace the unmaintained sqlite3 with @vscode/sqlite3 (dev-only, builds from
  source via dialectModule) so there is no prebuilt-binary glibc mismatch; prod
  (Postgres) never installs or loads it.
- `make dev` lives in create-a-container/Makefile (root delegates): self-contained
  install (--no-package-lock, so it never rewrites the lockfile), migrate, seed,
  then run server (nodemon) + job-runner + client watch together.
- Docs: "Run the Manager Locally (make dev)" section; README + release-pipeline
  pointers; LOG_LEVEL documented in example.env.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants