Skip to content

feat(node-removal): online node removal as inverse of cluster expansion#1104

Open
schmidt-scaled wants to merge 23 commits into
mainfrom
feature/node-removal
Open

feat(node-removal): online node removal as inverse of cluster expansion#1104
schmidt-scaled wants to merge 23 commits into
mainfrom
feature/node-removal

Conversation

@schmidt-scaled

Copy link
Copy Markdown
Contributor

Summary

Implements an online storage-node removal function that works as the inverse of cluster expansion (add_node). It is background, task-driven, idempotent and resumablesn remove validates preconditions then queues an FN_NODE_REMOVAL task; the new tasks_runner_node_removal service drives the flow:

shutdown → in_removal → rewire LVS replicas → remove/fail/migrate devices → removed

This replaces the old remove_storage_node semantics (which only handled an already-offline node and did no LVS rewiring).

Behavior

Preconditions (enforced before anything is queued):

  • target node is online;
  • every other non-removed node is online;
  • FTT headroom allows losing the node (_check_ftt_allows_node_removal);
  • the node hosts no LVols and no snapshots (operator migrates those separately, at a higher level);
  • every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target (catches e.g. 2-node clusters).

LVS rewiring:

  • Case A — the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them, clear cross-references.
  • Case B — replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.

Devices: each data device is driven online → removed → failed (queuing failure-migration on the surviving online nodes), then the flow waits for failed_and_migrated before flipping the node to removed.

Changes

  • models/base_model.py: new STATUS_IN_REMOVAL (code 13)
  • models/job_schedule.py: new FN_NODE_REMOVAL
  • controllers/tasks_controller.py: add_node_removal_task + dedup branch + get_active_node_removal_task; skip IN_REMOVAL nodes when fanning out device-migration tasks (their SPDK is dead)
  • storage_node_ops.py: rewrite remove_storage_node (validate + queue) + node_removal_orchestrate + Case A/B + device-decommission/finalize helpers
  • services/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)
  • scripts/docker-compose-swarm.yml: register TasksNodeRemovalRunner
  • simplyblock_cli/cli.py: updated sn remove help; --force-remove now only cancels active tasks
  • tests/test_node_removal.py: 21 unit tests

Testing

  • New: tests/test_node_removal.py — 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅
  • Full non-migration suite: 1137 passed, 213 skipped, no regressions.

Notes / review focus

  • Case B (recreating a replica on a fresh node while the cluster is online) is the net-new, highest-risk piece — it reuses recreate_lvstore_on_non_leader with the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.
  • Unit tests mock all data-plane RPCs; this has not yet been validated on a live cluster.

🤖 Generated with Claude Code

Add a background, task-driven node-removal orchestration that mirrors
add_node/cluster expansion in reverse. `sn remove` now validates and
queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the
idempotent, resumable flow:

  shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate
  devices -> removed

Preconditions (enforced before queueing): target ONLINE, every other
non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node
(operator migrates those separately), and every secondary/tertiary
replica the node hosts for other primaries has a host-disjoint
relocation target.

LVS rewiring:
  * Case A - the node's own primary LVS: tear down its (empty)
    secondary/tertiary replicas on the peers that host them.
  * Case B - replicas this node hosts for other primaries: re-host each
    on a fresh, anti-affinity-valid node (get_secondary_nodes /
    get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those
    primaries keep their fault tolerance.

Devices are driven online -> removed -> failed (queuing failure
migration on the surviving online nodes) and the flow waits for
failed_and_migrated before flipping the node to removed.

Changes:
  * base_model: new STATUS_IN_REMOVAL (code 13)
  * job_schedule: new FN_NODE_REMOVAL
  * tasks_controller: add_node_removal_task + dedup + getter; skip
    IN_REMOVAL nodes when fanning out device-migration tasks
  * storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate
    and Case A/B + device helpers
  * services/tasks_runner_node_removal.py: new runner service
  * docker-compose-swarm.yml: register TasksNodeRemovalRunner
  * tests/test_node_removal.py: 21 unit tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread simplyblock_core/storage_node_ops.py Fixed
Comment thread simplyblock_cli/cli.py Fixed

@mxsrc mxsrc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new functions that are being introduced use return-value based error reporting, since they are new I feel they should use exceptions for the error handling. Other than that I think the code looks good, I just have a few comments on two locations. It requires changes in the helm-chart to add the container there as well.

return cands[0] if cands else None


def node_removal_orchestrate(node_id, force_remove=False):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be moved to the task module, including its helpers. The tasks should be as self-contained as possible, this code should not be executed from anywhere else.

Comment thread simplyblock_core/storage_node_ops.py
# Conflicts:
#	simplyblock_cli/cli.py
#	simplyblock_core/controllers/tasks_controller.py
#	simplyblock_core/scripts/docker-compose-swarm.yml
#	simplyblock_core/storage_node_ops.py
try:
sec = db_controller.get_storage_node_by_id(primary.secondary_node_id)
exclude_mgmt_ips.append(sec.mgmt_ip)
except KeyError:
Comment thread simplyblock_core/storage_node_ops.py Fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants