Updates clusteroperator_controller progressing by theobarberbany · Pull Request #473 · openshift/cluster-cloud-controller-manager-operator

theobarberbany · 2026-06-16T13:39:27Z

Updates the logic in the clusteroperator_controller to handle progressing more robustly.

The previous fix showed a jump from 14 -> 85% in sippy, but was still possible to set progressing=True then immediately unset on next reconcile.

This isn't that useful to a cluster admin trying to watch the progress of the rollout / upgrade.

This change keeps Progressing=True until we see the deployment/daemonset actually stabilise.

Summary by CodeRabbit

Bug Fixes
- Improved Cluster Operator reconciliation to wait for actual Deployment and DaemonSet rollout convergence before leaving the progressing state.
- Standardized requeue timing during active rollouts and enhanced status handling for stalled deployments when rollout deadlines are exceeded.
Tests
- Expanded reconciliation and rollout-completion unit coverage with multi-step flows that validate progressing/available transitions based on live rollout conditions and convergence criteria.

coderabbitai · 2026-06-16T13:39:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 637221d4-4437-4481-94bd-cf0bbbdca55d

📥 Commits

Reviewing files that changed from the base of the PR and between a71d6a0 and e05d774.

📒 Files selected for processing (3)

pkg/controllers/clusteroperator_controller.go
pkg/controllers/clusteroperator_controller_test.go
pkg/controllers/status.go

🚧 Files skipped from review as they are similar to previous changes (3)

pkg/controllers/status.go
pkg/controllers/clusteroperator_controller_test.go
pkg/controllers/clusteroperator_controller.go

Walkthrough

The controller's sync() function now delays returning non-progressing status until all managed Deployments and DaemonSets complete their rollouts. Four new helpers—isDeploymentRolloutComplete, isDeploymentRolloutStalled, isDaemonSetRolloutComplete, and operandsConverged()—evaluate live object status to gate the Progressing→Available transition. The reconciler requeues with a fixed 30-second delay while operands are progressing, and a new ensureStatusProgressing optimization skips redundant status updates. Tests verify the three-reconcile flow and unit-test rollout completion predicates.

Changes

Operand Convergence and Status Progression

Layer / File(s)	Summary
Rollout completion detection helpers `pkg/controllers/clusteroperator_controller.go`	Adds `apps/v1` import. Implements `isDeploymentRolloutComplete` checking observed-generation currency and `DeploymentProgressing` condition `True`/`NewReplicaSetAvailable`. Implements `isDeploymentRolloutStalled` detecting deadline-exceeded deployments. Implements `isDaemonSetRolloutComplete` verifying observed-generation, `UpdatedNumberScheduled` ≥ `DesiredNumberScheduled`, and zero unavailable/misscheduled counts. Implements `operandsConverged()` fetching live Deployment and DaemonSet objects and returning false if any rollout incomplete or NotFound.
Sync integration and requeue behavior `pkg/controllers/clusteroperator_controller.go`	`sync()` now calls `operandsConverged()` when no resource updates are applied; if operands have not converged, ensures operator status is progressing and continues reconciling. `Reconcile()` returns `ctrl.Result` with `RequeueAfter: 30s` whenever operands are progressing, replacing immediate return behavior.
Status progression optimization `pkg/controllers/status.go`	`ensureStatusProgressing` short-circuits status update when both `OperatorProgressing` and `OperatorUpgradeable` conditions are already `ConditionTrue`, skipping redundant progressing condition applications.
Unit tests for rollout helpers `pkg/controllers/clusteroperator_controller_test.go`	"Rollout completion checks" suite unit-tests `isDeploymentRolloutComplete` across missing conditions, observed-generation lag, and various `DeploymentProgressing` reason/status combinations; `isDeploymentRolloutStalled` for progressing/deadline scenarios; `isDaemonSetRolloutComplete` across generation and scheduled/updated/unavailable/misscheduled thresholds. Adds `operandsConverged` context verifying false→true transition when deployment status is patched with `DeploymentProgressing=True`.
Integration tests for convergence flow `pkg/controllers/clusteroperator_controller_test.go`	Updates "non-final resource changed" test context with `operandsConverged` predicate verification. Refactors "Reconcile progressing flow" from two-step to three-step: second reconcile verifies `Progressing` remains `True` while rollout is incomplete, then deployment status is patched between reconciles and third reconcile expects `Progressing=False` and `Available=True`.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	Several assertions lack meaningful failure messages per requirement 4. Examples: `Expect(err).To(Succeed())` (lines 632, 641, 662), `Expect(updated).To(BeTrue())` (line 636), and `Expect(cl.Get(......	Add assertion messages to all Expect() calls lacking messages, e.g., `Expect(err).To(Succeed(), "failed to apply resources")` and `Expect(updated).To(BeTrue(), "applyResources should report resources updated")`
Title check	❓ Inconclusive	The title 'Updates clusteroperator_controller progressing' is vague and lacks specificity. While it references the main file and component being updated, it does not clearly convey what the change accomplishes or the key improvement (preventing Progressing status oscillation).	Consider a more descriptive title such as 'Keep Progressing condition true until deployment stabilizes' or 'Fix Progressing status oscillation in clusteroperator controller' that better captures the core improvement.

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All Ginkgo test names (It, Describe, Context, When, By) in the modified test file are stable and deterministic with no dynamic values, generated identifiers, timestamps, or pod/node/namespace names...
Microshift Test Compatibility	✅ Passed	PR adds only unit test changes in pkg/controllers/, not new e2e tests. The custom check specifically applies to "Ginkgo e2e tests"; these are unit/integration tests that don't run against a live cl...
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests added. Changes are only to unit tests in pkg/controllers/ using controller-runtime envtest, not cluster-based e2e tests. Check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR modifies only controller reconciliation logic to check deployment/daemonset rollout status, without introducing any deployment manifests, scheduling constraints, affinity rules, nodeSelector...
Ote Binary Stdout Contract	✅ Passed	PR modifies controller library code (pkg/controllers/ package), not process-level entry points. All logging uses klog which defaults to stderr. No fmt.Print, log.Print, or os.Stdout writes detect...
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The PR contains only unit tests using Ginkgo syntax in an envtest environment, not e2e tests. The check applies specifically to "Ginkgo e2e tests," which are not present in this PR.
No-Weak-Crypto	✅ Passed	No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons detected. PR only modifies Kubernetes operator status and rollout completion logic.
Container-Privileges	✅ Passed	PR modifies only Go controller code files; contains no Kubernetes manifests, Dockerfile, or container security configurations to check for privileged settings.
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data (passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data) found in logging statements added to clusteroperator_controller.go, status.go, or tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pkg/controllers/clusteroperator_controller_test.go (1)
1144-1265: ⚡ Quick win

Add edge-case tests for stale generations and daemonset availability.

The new suite doesn’t guard against two important false-positive cases: stale ObservedGeneration and NumberUnavailable > 0. Adding these would protect the convergence contract from regressing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/clusteroperator_controller_test.go` around lines 1144 - 1265,
Add two edge-case tests to protect against false-positive rollout completion
checks. In the Deployment context, add a test case that verifies
isDeploymentRolloutComplete returns false when ObservedGeneration in the status
is stale (less than the deployment's generation in spec), even if the
Progressing condition indicates success. In the DaemonSet context, add a test
case that verifies isDaemonSetRolloutComplete returns false when
NumberUnavailable is greater than 0, even when UpdatedNumberScheduled matches or
exceeds DesiredNumberScheduled and NumberMisscheduled is 0.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 168-174: When operands have not yet converged (the `!converged`
condition in the clusteroperator_controller.go around lines 168-174), the
function returns true without updating the ClusterOperator status. This leaves
the Progressing condition as False even though operands are actively rolling
out. Update the ClusterOperator status to set Progressing=True before returning
when converged is false, ensuring the status accurately reflects that operands
are still in the process of rolling out.
- Around line 204-218: The isDeploymentRolloutComplete function must be enhanced
to verify the controller has observed the current generation and that all
replicas are actually updated and available. In addition to checking the
Progressing condition, add validations that deploy.Status.ObservedGeneration >=
deploy.Metadata.Generation, deploy.Status.UpdatedReplicas >=
deploy.Spec.Replicas, and deploy.Status.AvailableReplicas >=
deploy.Spec.Replicas. Similarly, the isDaemonSetRolloutComplete function needs
to replace the >= operator with == when comparing UpdatedNumberScheduled to
DesiredNumberScheduled to ensure all nodes are updated, and add two additional
checks: verify ds.Status.ObservedGeneration >= ds.Metadata.Generation and
confirm ds.Status.NumberAvailable >= ds.Status.DesiredNumberScheduled to ensure
all desired daemon pods are actually running.

---

Nitpick comments:
In `@pkg/controllers/clusteroperator_controller_test.go`:
- Around line 1144-1265: Add two edge-case tests to protect against
false-positive rollout completion checks. In the Deployment context, add a test
case that verifies isDeploymentRolloutComplete returns false when
ObservedGeneration in the status is stale (less than the deployment's generation
in spec), even if the Progressing condition indicates success. In the DaemonSet
context, add a test case that verifies isDaemonSetRolloutComplete returns false
when NumberUnavailable is greater than 0, even when UpdatedNumberScheduled
matches or exceeds DesiredNumberScheduled and NumberMisscheduled is 0.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6c6ccf8e-002e-4339-96f9-4289dbde175a

📥 Commits

Reviewing files that changed from the base of the PR and between d3f8b6a and 3098a39.

📒 Files selected for processing (2)

pkg/controllers/clusteroperator_controller.go
pkg/controllers/clusteroperator_controller_test.go

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pkg/controllers/clusteroperator_controller_test.go (1)

644-659: ⚡ Quick win

Extract duplicated deployment rollout-completion patch into a shared test helper.

The same status-patch block appears in three tests. Extracting it into one helper will reduce drift when rollout-completion criteria change.

♻️ Proposed refactor

+func markDeploymentRolloutComplete(ctx context.Context, c client.Client, dep *appsv1.Deployment) {
+	live := &appsv1.Deployment{}
+	Expect(c.Get(ctx, client.ObjectKeyFromObject(dep), live)).To(Succeed())
+	live.Status.ObservedGeneration = live.Generation
+	live.Status.Conditions = []appsv1.DeploymentCondition{
+		{
+			Type:   appsv1.DeploymentProgressing,
+			Status: corev1.ConditionTrue,
+			Reason: "NewReplicaSetAvailable",
+		},
+	}
+	Expect(c.Status().Update(ctx, live)).To(Succeed())
+}

Then replace each repeated inline patch block with:

-				live := &appsv1.Deployment{}
-				Expect(cl.Get(context.Background(), client.ObjectKeyFromObject(dep), live)).To(Succeed())
-				live.Status.ObservedGeneration = live.Generation
-				live.Status.Conditions = []appsv1.DeploymentCondition{
-					{
-						Type:   appsv1.DeploymentProgressing,
-						Status: corev1.ConditionTrue,
-						Reason: "NewReplicaSetAvailable",
-					},
-				}
-				Expect(cl.Status().Update(context.Background(), live)).To(Succeed())
+				markDeploymentRolloutComplete(context.Background(), cl, dep)

Also applies to: 698-713, 1112-1127

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/clusteroperator_controller_test.go` around lines 644 - 659,
The deployment status patching logic is duplicated across multiple test cases.
Create a shared test helper function (e.g., patchDeploymentRolloutComplete) that
encapsulates the logic of fetching a Deployment, setting its ObservedGeneration
to match Generation, adding the DeploymentProgressing condition with status True
and reason NewReplicaSetAvailable, and updating the status via the test client.
Then replace all three inline patch blocks with calls to this new helper
function, passing the client and deployment resource as parameters. This will
eliminate drift and make future changes to rollout-completion criteria easier to
maintain.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Line 254: The error message in the fmt.Errorf call violates the ST1005 linting
rule by starting with an uppercase letter. In the clusteroperator_controller.go
file where the error is returned for the ProgressDeadlineExceeded condition,
change the error message to begin with a lowercase letter instead of uppercase,
so "Deployment" becomes "deployment" to comply with Go error message
conventions.

In `@pkg/controllers/status.go`:
- Around line 113-123: The early return condition in ensureStatusProgressing at
lines 119-121 is too narrow and skips calling setStatusProgressing when the
Progressing condition is already True, which can leave stale
OperatorUpgradeable=False conditions from previous degraded states. Remove or
modify the conditional check that returns early (the if statement checking
cond.Status == configv1.ConditionTrue) to ensure setStatusProgressing is always
called so that all status conditions including Upgradeable are properly
reconciled during healthy-progressing rollouts.

---

Nitpick comments:
In `@pkg/controllers/clusteroperator_controller_test.go`:
- Around line 644-659: The deployment status patching logic is duplicated across
multiple test cases. Create a shared test helper function (e.g.,
patchDeploymentRolloutComplete) that encapsulates the logic of fetching a
Deployment, setting its ObservedGeneration to match Generation, adding the
DeploymentProgressing condition with status True and reason
NewReplicaSetAvailable, and updating the status via the test client. Then
replace all three inline patch blocks with calls to this new helper function,
passing the client and deployment resource as parameters. This will eliminate
drift and make future changes to rollout-completion criteria easier to maintain.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ab410a5-a9d1-4a24-bbea-c971b554f9cf

📥 Commits

Reviewing files that changed from the base of the PR and between 3098a39 and 620ea7f.

📒 Files selected for processing (3)

pkg/controllers/clusteroperator_controller.go
pkg/controllers/clusteroperator_controller_test.go
pkg/controllers/status.go

Updates the logic in the clusteroperator_controller to handle progressing more robustly. The previous fix showed a jump from 14 -> 85% in sippy, but was still possible to set progressing=True then immediately unset on next reconcile. This isn't that useful to a cluster admin trying to watch the progress of the rollout / upgrade. This change keeps Progressing=True until we see the deployment/daemonset actually stabilise.

nrb · 2026-06-16T16:47:44Z

/approve

openshift-ci · 2026-06-16T16:48:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [nrb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-06-16T17:57:59Z

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/level0-clusterinfra-azure-ipi-proxy-tests	`e05d774`	link	false	`/test level0-clusterinfra-azure-ipi-proxy-tests`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from mdbooth and racheljpg June 16, 2026 13:41

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread pkg/controllers/clusteroperator_controller.go

Comment thread pkg/controllers/clusteroperator_controller.go

theobarberbany force-pushed the progressing-fix-2 branch 2 times, most recently from 620ea7f to a71d6a0 Compare June 16, 2026 14:38

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread pkg/controllers/clusteroperator_controller.go Outdated

Comment thread pkg/controllers/status.go

theobarberbany force-pushed the progressing-fix-2 branch from a71d6a0 to e05d774 Compare June 16, 2026 14:47

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates clusteroperator_controller progressing#473

Updates clusteroperator_controller progressing#473
theobarberbany wants to merge 1 commit into
openshift:mainfrom
theobarberbany:progressing-fix-2

theobarberbany commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

nrb commented Jun 16, 2026

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

theobarberbany commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nrb commented Jun 16, 2026

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

theobarberbany commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading