feat(node-removal): online node removal as inverse of cluster expansion by schmidt-scaled · Pull Request #1104 · simplyblock/sbcli

schmidt-scaled · 2026-06-15T20:25:02Z

Summary

Implements an online storage-node removal function that works as the inverse of cluster expansion (add_node). It is background, task-driven, idempotent and resumable — sn remove validates preconditions then queues an FN_NODE_REMOVAL task; the new tasks_runner_node_removal service drives the flow:

shutdown → in_removal → rewire LVS replicas → remove/fail/migrate devices → removed
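One way to picture the "idempotent and resumable" property of this flow is a small phase runner that persists the current phase and picks up from it on every run. This is only a sketch under assumptions: `Phase`, `run_removal`, the `state` dict and the handler callables are hypothetical names for illustration and are not taken from the PR's code.

```python
from enum import Enum


class Phase(Enum):
    # Phase names follow the flow in the PR description; the enum itself is hypothetical.
    SHUTDOWN = "shutdown"
    IN_REMOVAL = "in_removal"
    REWIRE_LVS = "rewire_lvs"
    DEVICES = "devices"
    REMOVED = "removed"


ORDER = list(Phase)  # Enum iteration preserves definition order


def run_removal(state, handlers):
    """Advance from the persisted phase; each handler returns True once its phase is done."""
    while state["phase"] is not Phase.REMOVED:
        phase = state["phase"]
        if not handlers[phase](state):
            return False  # not finished yet; a later run resumes at this phase
        state["phase"] = ORDER[ORDER.index(phase) + 1]
    return True
```

Because the phase is stored in `state` rather than in local variables, calling `run_removal` again after an interruption skips the phases that already completed and re-enters only the unfinished one.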

This replaces the old remove_storage_node semantics (which only handled an already-offline node and did no LVS rewiring).

Behavior

Preconditions (enforced before anything is queued):

target node is online;
every other non-removed node is online;
FTT headroom allows losing the node (_check_ftt_allows_node_removal);
the node hosts no LVols and no snapshots (operator migrates those separately, at a higher level);
every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target (catches e.g. 2-node clusters).
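The first four preconditions above can be sketched as a single check that collects every violation before anything is queued. This is an illustration only: the `Node` dataclass, the status strings and `removal_precondition_errors` are hypothetical, the FTT check is reduced to a boolean because the internals of `_check_ftt_allows_node_removal` are not shown here, and the replica-relocation check is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a storage node record.
    node_id: str
    status: str = "online"
    lvols: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)


def removal_precondition_errors(target, all_nodes, ftt_ok=True):
    """Return a list of violated preconditions; an empty list means removal may be queued."""
    errors = []
    if target.status != "online":
        errors.append("target node is not online")
    for n in all_nodes:
        # Every other node must be online unless it was already removed.
        if n.node_id != target.node_id and n.status not in ("online", "removed"):
            errors.append(f"node {n.node_id} is {n.status}")
    if not ftt_ok:
        errors.append("FTT headroom does not allow losing the node")
    if target.lvols or target.snapshots:
        errors.append("node still hosts LVols or snapshots")
    return errors
```

Collecting all violations instead of stopping at the first one lets an operator fix everything in one pass before retrying `sn remove`.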

LVS rewiring:

Case A — the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them, clear cross-references.
Case B — replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.
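The host-disjoint requirement in Case B can be illustrated with a small candidate filter. It is a sketch under assumptions: nodes are plain dicts, `pick_relocation_target` is a hypothetical helper, and the real selection goes through get_secondary_nodes / get_secondary_nodes_2, whose signatures are not shown here.

```python
def pick_relocation_target(nodes, removed_id, exclude_mgmt_ips):
    """Pick the first online node that is neither the removed node nor on an excluded host."""
    cands = [
        n for n in nodes
        if n["id"] != removed_id
        and n["status"] == "online"
        and n["mgmt_ip"] not in exclude_mgmt_ips
    ]
    return cands[0] if cands else None
```

In a 2-node cluster the only node besides the one being removed is the primary itself, whose management IP is excluded, so the filter returns None. That is the situation the relocation precondition is meant to catch.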

Devices: each data device is driven online → removed → failed (queuing failure-migration on the surviving online nodes), then the flow waits for failed_and_migrated before flipping the node to removed.
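The device sequence and the final gate can be sketched as a transition table plus a completion check. The status strings follow the PR description, but `NEXT_STATUS`, `step_device` and `node_can_be_finalized` are hypothetical names, and the step from failed to failed_and_migrated is assumed to be performed by the queued migration tasks rather than by the removal flow itself.

```python
# Transitions driven directly by the removal flow (hypothetical table).
NEXT_STATUS = {"online": "removed", "removed": "failed"}


def step_device(status):
    """Return the next status the removal flow would set, or the same status if none applies."""
    return NEXT_STATUS.get(status, status)


def node_can_be_finalized(device_statuses):
    """The node may flip to removed only once every device is failed_and_migrated."""
    return all(s == "failed_and_migrated" for s in device_statuses)
```

Note that `all()` over an empty list is True, so in this sketch a node without data devices passes the gate immediately.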

Changes

models/base_model.py: new STATUS_IN_REMOVAL (code 13)
models/job_schedule.py: new FN_NODE_REMOVAL
controllers/tasks_controller.py: add_node_removal_task + dedup branch + get_active_node_removal_task; skip IN_REMOVAL nodes when fanning out device-migration tasks (their SPDK is dead)
storage_node_ops.py: rewrite remove_storage_node (validate + queue) + node_removal_orchestrate + Case A/B + device-decommission/finalize helpers
services/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)
scripts/docker-compose-swarm.yml: register TasksNodeRemovalRunner
simplyblock_cli/cli.py: updated sn remove help; --force-remove now only cancels active tasks
tests/test_node_removal.py: 21 unit tests
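The dedup branch mentioned for add_node_removal_task can be pictured as "return the active task if one exists, otherwise create one". The sketch below assumes tasks are plain dicts in a list; `queue_removal_once`, the field names and the `"node_removal"` string are hypothetical and do not reflect the actual value of FN_NODE_REMOVAL or the controller's storage layer.

```python
def queue_removal_once(tasks, node_id):
    """Queue a node-removal task unless an unfinished one already exists for this node."""
    for t in tasks:
        if (
            t["function"] == "node_removal"
            and t["node_id"] == node_id
            and t["status"] != "done"
        ):
            return t  # reuse the active task instead of queueing a duplicate
    task = {"function": "node_removal", "node_id": node_id, "status": "new"}
    tasks.append(task)
    return task
```

Calling it twice for the same node yields the same task object, which keeps a repeated `sn remove` from starting a second orchestration.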

Testing

New: tests/test_node_removal.py — 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅
Full non-migration suite: 1137 passed, 213 skipped, no regressions.

Notes / review focus

Case B (recreating a replica on a fresh node while the cluster is online) is the net-new, highest-risk piece — it reuses recreate_lvstore_on_non_leader with the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.
Unit tests mock all data-plane RPCs; this has not yet been validated on a live cluster.

🤖 Generated with Claude Code

Add a background, task-driven node-removal orchestration that mirrors add_node/cluster expansion in reverse. `sn remove` now validates and queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the idempotent, resumable flow:

shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate devices -> removed

Preconditions (enforced before queueing): target ONLINE, every other non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node (operator migrates those separately), and every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target.

LVS rewiring:
* Case A - the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them.
* Case B - replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance.

Devices are driven online -> removed -> failed (queuing failure migration on the surviving online nodes) and the flow waits for failed_and_migrated before flipping the node to removed.

Changes:
* base_model: new STATUS_IN_REMOVAL (code 13)
* job_schedule: new FN_NODE_REMOVAL
* tasks_controller: add_node_removal_task + dedup + getter; skip IN_REMOVAL nodes when fanning out device-migration tasks
* storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate and Case A/B + device helpers
* services/tasks_runner_node_removal.py: new runner service
* docker-compose-swarm.yml: register TasksNodeRemovalRunner
* tests/test_node_removal.py: 21 unit tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mxsrc

The newly introduced functions report errors through return values. Since they are new, I feel they should use exceptions for error handling instead. Other than that, I think the code looks good; I just have a few comments on two locations. The change also requires an update to the helm chart so that the new container is added there as well.
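To make the distinction in this comment concrete, here is a minimal comparison of the two styles. It is illustrative only: `NodeRemovalError` and both validator functions are hypothetical and are not taken from the PR's code.

```python
class NodeRemovalError(Exception):
    """Hypothetical error type for failed node-removal preconditions."""


def validate_return_style(node_status):
    # Return-value style: the caller must remember to check the result,
    # and the reason for the failure is lost.
    return node_status == "online"


def validate_raise_style(node_status):
    # Exception style: a failure cannot be silently ignored and carries its reason.
    if node_status != "online":
        raise NodeRemovalError(f"node must be online, got {node_status!r}")
```

With the exception style, a caller that forgets to handle the failure gets a traceback naming the cause instead of continuing with an invalid state.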

mxsrc · 2026-06-17T14:44:46Z

+    return cands[0] if cands else None
+
+
+def node_removal_orchestrate(node_id, force_remove=False):


I think this should be moved to the task module, including its helpers. Tasks should be as self-contained as possible, and this code should not be executed from anywhere else.


schmidt-scaled requested review from Hamdy-khader and mxsrc June 15, 2026 20:25

github-code-quality Bot found potential problems Jun 15, 2026


Comment thread simplyblock_core/storage_node_ops.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

mxsrc requested changes Jun 17, 2026


Merge branch 'main' into feature/node-removal

ee05e96

# Conflicts: # simplyblock_cli/cli.py # simplyblock_core/controllers/tasks_controller.py # simplyblock_core/scripts/docker-compose-swarm.yml # simplyblock_core/storage_node_ops.py

github-code-quality Bot found potential problems Jun 22, 2026


Comment thread simplyblock_core/storage_node_ops.py

+            try:
+                sec = db_controller.get_storage_node_by_id(primary.secondary_node_id)
+                exclude_mgmt_ips.append(sec.mgmt_ip)
+            except KeyError:

Hamdy-khader added 3 commits June 23, 2026 00:31

update: point SIMPLY_BLOCK_DOCKER_IMAGE to feature-node-removal

084fb3c

fix: update get_secondary_nodes to handle removed_node parameter

2181fad

remove: comment out phase 2 "in_removal" logic in storage node operations

4e6ff2b

github-code-quality Bot found potential problems Jun 24, 2026


Comment thread simplyblock_core/storage_node_ops.py Fixed

Hamdy-khader added 5 commits June 24, 2026 23:59

update: exclude node removal tasks in job task listing

d75b049

update: replace device removal logic with state update in storage node operations

ef5170e

update: reorder and adjust device removal and finalization phases in node removal process

f264c3f

update: add support for overridden device names during replacement and improve JM reassignment logic in node removal process

ea86f13

remove: drop deprecated "in_removal" status from StorageNode and related logic

515b7fa

Hamdy-khader approved these changes Jun 30, 2026


Hamdy-khader and others added 13 commits June 30, 2026 15:35

Merge branch 'main' into feature/node-removal

e5ab5db

update: reimplement sn suspend and sn resume commands to modify lvol allocation status instead of being deprecated placeholders

94289eb

Merge branch 'main' into feature/node-removal

42d51bc

# Conflicts: # simplyblock_core/storage_node_ops.py

fix: correct indentation for removed_node condition in storage node operations

9d102ef

update: ensure suspend and resume operations toggle auto-restart and persist state in DB

2902dc6

update: streamline hublvol creation by removing redundant secondary checks and consolidating logic

2b1f5e7

fix: handle overridden device names during JM reassignment and remote bdev checks

8f5cc1e

Merge branch 'main' into feature/node-removal

2da6fbd

fix linter

60f105e

fix unit tests

a652d44

fix unit tests 2

fa97530

Merge branch 'main' into feature/node-removal

92dfed1

Merge branch 'main' into feature/node-removal

f845f43

schmidt-scaled wants to merge 23 commits into main from feature/node-removal


4 participants
