feat(node-removal): online node removal as inverse of cluster expansion#1104
Open
schmidt-scaled wants to merge 23 commits into
Open
feat(node-removal): online node removal as inverse of cluster expansion#1104schmidt-scaled wants to merge 23 commits into
schmidt-scaled wants to merge 23 commits into
Conversation
Add a background, task-driven node-removal orchestration that mirrors
add_node/cluster expansion in reverse. `sn remove` now validates and
queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the
idempotent, resumable flow:
shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate
devices -> removed
Preconditions (enforced before queueing): target ONLINE, every other
non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node
(operator migrates those separately), and every secondary/tertiary
replica the node hosts for other primaries has a host-disjoint
relocation target.
LVS rewiring:
* Case A - the node's own primary LVS: tear down its (empty)
secondary/tertiary replicas on the peers that host them.
* Case B - replicas this node hosts for other primaries: re-host each
on a fresh, anti-affinity-valid node (get_secondary_nodes /
get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those
primaries keep their fault tolerance.
Devices are driven online -> removed -> failed (queuing failure
migration on the surviving online nodes) and the flow waits for
failed_and_migrated before flipping the node to removed.
Changes:
* base_model: new STATUS_IN_REMOVAL (code 13)
* job_schedule: new FN_NODE_REMOVAL
* tasks_controller: add_node_removal_task + dedup + getter; skip
IN_REMOVAL nodes when fanning out device-migration tasks
* storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate
and Case A/B + device helpers
* services/tasks_runner_node_removal.py: new runner service
* docker-compose-swarm.yml: register TasksNodeRemovalRunner
* tests/test_node_removal.py: 21 unit tests
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mxsrc
requested changes
Jun 17, 2026
mxsrc
left a comment
Contributor
There was a problem hiding this comment.
The new functions that are being introduced use return-value based error reporting, since they are new I feel they should use exceptions for the error handling. Other than that I think the code looks good, I just have a few comments on two locations. It requires changes in the helm-chart to add the container there as well.
| return cands[0] if cands else None | ||
|
|
||
|
|
||
| def node_removal_orchestrate(node_id, force_remove=False): |
Contributor
There was a problem hiding this comment.
I think this should be moved to the task module, including its helpers. The tasks should be as self-contained as possible, this code should not be executed from anywhere else.
# Conflicts: # simplyblock_cli/cli.py # simplyblock_core/controllers/tasks_controller.py # simplyblock_core/scripts/docker-compose-swarm.yml # simplyblock_core/storage_node_ops.py
| try: | ||
| sec = db_controller.get_storage_node_by_id(primary.secondary_node_id) | ||
| exclude_mgmt_ips.append(sec.mgmt_ip) | ||
| except KeyError: |
…node removal process
…d improve JM reassignment logic in node removal process
Hamdy-khader
approved these changes
Jun 30, 2026
…vol allocation status instead of being deprecated placeholders
# Conflicts: # simplyblock_core/storage_node_ops.py
…and persist state in DB
…hecks and consolidating logic
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements an online storage-node removal function that works as the inverse of cluster expansion (
add_node). It is background, task-driven, idempotent and resumable —sn removevalidates preconditions then queues anFN_NODE_REMOVALtask; the newtasks_runner_node_removalservice drives the flow:This replaces the old
remove_storage_nodesemantics (which only handled an already-offline node and did no LVS rewiring).Behavior
Preconditions (enforced before anything is queued):
online;online;_check_ftt_allows_node_removal);LVS rewiring:
get_secondary_nodes/get_secondary_nodes_2+recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.Devices: each data device is driven
online → removed → failed(queuing failure-migration on the surviving online nodes), then the flow waits forfailed_and_migratedbefore flipping the node toremoved.Changes
models/base_model.py: newSTATUS_IN_REMOVAL(code 13)models/job_schedule.py: newFN_NODE_REMOVALcontrollers/tasks_controller.py:add_node_removal_task+ dedup branch +get_active_node_removal_task; skipIN_REMOVALnodes when fanning out device-migration tasks (their SPDK is dead)storage_node_ops.py: rewriteremove_storage_node(validate + queue) +node_removal_orchestrate+ Case A/B + device-decommission/finalize helpersservices/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)scripts/docker-compose-swarm.yml: registerTasksNodeRemovalRunnersimplyblock_cli/cli.py: updatedsn removehelp;--force-removenow only cancels active taskstests/test_node_removal.py: 21 unit testsTesting
tests/test_node_removal.py— 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅Notes / review focus
recreate_lvstore_on_non_leaderwith the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.🤖 Generated with Claude Code