Skip to content

Updates clusteroperator_controller progressing#473

Open
theobarberbany wants to merge 1 commit into
openshift:mainfrom
theobarberbany:progressing-fix-2
Open

Updates clusteroperator_controller progressing#473
theobarberbany wants to merge 1 commit into
openshift:mainfrom
theobarberbany:progressing-fix-2

Conversation

@theobarberbany

@theobarberbany theobarberbany commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Updates the logic in the clusteroperator_controller to handle progressing more robustly.

The previous fix showed a jump from 14 -> 85% in sippy, but was still possible to set progressing=True then immediately unset on next reconcile.

This isn't that useful to a cluster admin trying to watch the progress of the rollout / upgrade.

This change keeps Progressing=True until we see the deployment/daemonset actually stabilise.

Summary by CodeRabbit

  • Bug Fixes
    • Improved Cluster Operator reconciliation to wait for actual Deployment and DaemonSet rollout convergence before leaving the progressing state.
    • Standardized requeue timing during active rollouts and enhanced status handling for stalled deployments when rollout deadlines are exceeded.
  • Tests
    • Expanded reconciliation and rollout-completion unit coverage with multi-step flows that validate progressing/available transitions based on live rollout conditions and convergence criteria.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 637221d4-4437-4481-94bd-cf0bbbdca55d

📥 Commits

Reviewing files that changed from the base of the PR and between a71d6a0 and e05d774.

📒 Files selected for processing (3)
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go
  • pkg/controllers/status.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/controllers/status.go
  • pkg/controllers/clusteroperator_controller_test.go
  • pkg/controllers/clusteroperator_controller.go

Walkthrough

The controller's sync() function now delays returning non-progressing status until all managed Deployments and DaemonSets complete their rollouts. Four new helpers—isDeploymentRolloutComplete, isDeploymentRolloutStalled, isDaemonSetRolloutComplete, and operandsConverged()—evaluate live object status to gate the Progressing→Available transition. The reconciler requeues with a fixed 30-second delay while operands are progressing, and a new ensureStatusProgressing optimization skips redundant status updates. Tests verify the three-reconcile flow and unit-test rollout completion predicates.

Changes

Operand Convergence and Status Progression

Layer / File(s) Summary
Rollout completion detection helpers
pkg/controllers/clusteroperator_controller.go
Adds apps/v1 import. Implements isDeploymentRolloutComplete checking observed-generation currency and DeploymentProgressing condition True/NewReplicaSetAvailable. Implements isDeploymentRolloutStalled detecting deadline-exceeded deployments. Implements isDaemonSetRolloutComplete verifying observed-generation, UpdatedNumberScheduledDesiredNumberScheduled, and zero unavailable/misscheduled counts. Implements operandsConverged() fetching live Deployment and DaemonSet objects and returning false if any rollout incomplete or NotFound.
Sync integration and requeue behavior
pkg/controllers/clusteroperator_controller.go
sync() now calls operandsConverged() when no resource updates are applied; if operands have not converged, ensures operator status is progressing and continues reconciling. Reconcile() returns ctrl.Result with RequeueAfter: 30s whenever operands are progressing, replacing immediate return behavior.
Status progression optimization
pkg/controllers/status.go
ensureStatusProgressing short-circuits status update when both OperatorProgressing and OperatorUpgradeable conditions are already ConditionTrue, skipping redundant progressing condition applications.
Unit tests for rollout helpers
pkg/controllers/clusteroperator_controller_test.go
"Rollout completion checks" suite unit-tests isDeploymentRolloutComplete across missing conditions, observed-generation lag, and various DeploymentProgressing reason/status combinations; isDeploymentRolloutStalled for progressing/deadline scenarios; isDaemonSetRolloutComplete across generation and scheduled/updated/unavailable/misscheduled thresholds. Adds operandsConverged context verifying false→true transition when deployment status is patched with DeploymentProgressing=True.
Integration tests for convergence flow
pkg/controllers/clusteroperator_controller_test.go
Updates "non-final resource changed" test context with operandsConverged predicate verification. Refactors "Reconcile progressing flow" from two-step to three-step: second reconcile verifies Progressing remains True while rollout is incomplete, then deployment status is patched between reconciles and third reconcile expects Progressing=False and Available=True.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Several assertions lack meaningful failure messages per requirement 4. Examples: Expect(err).To(Succeed()) (lines 632, 641, 662), Expect(updated).To(BeTrue()) (line 636), and `Expect(cl.Get(...... Add assertion messages to all Expect() calls lacking messages, e.g., Expect(err).To(Succeed(), "failed to apply resources") and Expect(updated).To(BeTrue(), "applyResources should report resources updated")
Title check ❓ Inconclusive The title 'Updates clusteroperator_controller progressing' is vague and lacks specificity. While it references the main file and component being updated, it does not clearly convey what the change accomplishes or the key improvement (preventing Progressing status oscillation). Consider a more descriptive title such as 'Keep Progressing condition true until deployment stabilizes' or 'Fix Progressing status oscillation in clusteroperator controller' that better captures the core improvement.
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test names (It, Describe, Context, When, By) in the modified test file are stable and deterministic with no dynamic values, generated identifiers, timestamps, or pod/node/namespace names...
Microshift Test Compatibility ✅ Passed PR adds only unit test changes in pkg/controllers/, not new e2e tests. The custom check specifically applies to "Ginkgo e2e tests"; these are unit/integration tests that don't run against a live cl...
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added. Changes are only to unit tests in pkg/controllers/ using controller-runtime envtest, not cluster-based e2e tests. Check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed The PR modifies only controller reconciliation logic to check deployment/daemonset rollout status, without introducing any deployment manifests, scheduling constraints, affinity rules, nodeSelector...
Ote Binary Stdout Contract ✅ Passed PR modifies controller library code (pkg/controllers/ package), not process-level entry points. All logging uses klog which defaults to stderr. No fmt.Print*, log.Print*, or os.Stdout writes detect...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR contains only unit tests using Ginkgo syntax in an envtest environment, not e2e tests. The check applies specifically to "Ginkgo e2e tests," which are not present in this PR.
No-Weak-Crypto ✅ Passed No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons detected. PR only modifies Kubernetes operator status and rollout completion logic.
Container-Privileges ✅ Passed PR modifies only Go controller code files; contains no Kubernetes manifests, Dockerfile, or container security configurations to check for privileged settings.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data) found in logging statements added to clusteroperator_controller.go, status.go, or tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from mdbooth and racheljpg June 16, 2026 13:41

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/controllers/clusteroperator_controller_test.go (1)

1144-1265: ⚡ Quick win

Add edge-case tests for stale generations and daemonset availability.

The new suite doesn’t guard against two important false-positive cases: stale ObservedGeneration and NumberUnavailable > 0. Adding these would protect the convergence contract from regressing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/clusteroperator_controller_test.go` around lines 1144 - 1265,
Add two edge-case tests to protect against false-positive rollout completion
checks. In the Deployment context, add a test case that verifies
isDeploymentRolloutComplete returns false when ObservedGeneration in the status
is stale (less than the deployment's generation in spec), even if the
Progressing condition indicates success. In the DaemonSet context, add a test
case that verifies isDaemonSetRolloutComplete returns false when
NumberUnavailable is greater than 0, even when UpdatedNumberScheduled matches or
exceeds DesiredNumberScheduled and NumberMisscheduled is 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 168-174: When operands have not yet converged (the `!converged`
condition in the clusteroperator_controller.go around lines 168-174), the
function returns true without updating the ClusterOperator status. This leaves
the Progressing condition as False even though operands are actively rolling
out. Update the ClusterOperator status to set Progressing=True before returning
when converged is false, ensuring the status accurately reflects that operands
are still in the process of rolling out.
- Around line 204-218: The isDeploymentRolloutComplete function must be enhanced
to verify the controller has observed the current generation and that all
replicas are actually updated and available. In addition to checking the
Progressing condition, add validations that deploy.Status.ObservedGeneration >=
deploy.Metadata.Generation, deploy.Status.UpdatedReplicas >=
deploy.Spec.Replicas, and deploy.Status.AvailableReplicas >=
deploy.Spec.Replicas. Similarly, the isDaemonSetRolloutComplete function needs
to replace the >= operator with == when comparing UpdatedNumberScheduled to
DesiredNumberScheduled to ensure all nodes are updated, and add two additional
checks: verify ds.Status.ObservedGeneration >= ds.Metadata.Generation and
confirm ds.Status.NumberAvailable >= ds.Status.DesiredNumberScheduled to ensure
all desired daemon pods are actually running.

---

Nitpick comments:
In `@pkg/controllers/clusteroperator_controller_test.go`:
- Around line 1144-1265: Add two edge-case tests to protect against
false-positive rollout completion checks. In the Deployment context, add a test
case that verifies isDeploymentRolloutComplete returns false when
ObservedGeneration in the status is stale (less than the deployment's generation
in spec), even if the Progressing condition indicates success. In the DaemonSet
context, add a test case that verifies isDaemonSetRolloutComplete returns false
when NumberUnavailable is greater than 0, even when UpdatedNumberScheduled
matches or exceeds DesiredNumberScheduled and NumberMisscheduled is 0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6c6ccf8e-002e-4339-96f9-4289dbde175a

📥 Commits

Reviewing files that changed from the base of the PR and between d3f8b6a and 3098a39.

📒 Files selected for processing (2)
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go

Comment thread pkg/controllers/clusteroperator_controller.go
Comment thread pkg/controllers/clusteroperator_controller.go
@theobarberbany theobarberbany force-pushed the progressing-fix-2 branch 2 times, most recently from 620ea7f to a71d6a0 Compare June 16, 2026 14:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/controllers/clusteroperator_controller_test.go (1)

644-659: ⚡ Quick win

Extract duplicated deployment rollout-completion patch into a shared test helper.

The same status-patch block appears in three tests. Extracting it into one helper will reduce drift when rollout-completion criteria change.

♻️ Proposed refactor
+func markDeploymentRolloutComplete(ctx context.Context, c client.Client, dep *appsv1.Deployment) {
+	live := &appsv1.Deployment{}
+	Expect(c.Get(ctx, client.ObjectKeyFromObject(dep), live)).To(Succeed())
+	live.Status.ObservedGeneration = live.Generation
+	live.Status.Conditions = []appsv1.DeploymentCondition{
+		{
+			Type:   appsv1.DeploymentProgressing,
+			Status: corev1.ConditionTrue,
+			Reason: "NewReplicaSetAvailable",
+		},
+	}
+	Expect(c.Status().Update(ctx, live)).To(Succeed())
+}

Then replace each repeated inline patch block with:

-				live := &appsv1.Deployment{}
-				Expect(cl.Get(context.Background(), client.ObjectKeyFromObject(dep), live)).To(Succeed())
-				live.Status.ObservedGeneration = live.Generation
-				live.Status.Conditions = []appsv1.DeploymentCondition{
-					{
-						Type:   appsv1.DeploymentProgressing,
-						Status: corev1.ConditionTrue,
-						Reason: "NewReplicaSetAvailable",
-					},
-				}
-				Expect(cl.Status().Update(context.Background(), live)).To(Succeed())
+				markDeploymentRolloutComplete(context.Background(), cl, dep)

Also applies to: 698-713, 1112-1127

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/clusteroperator_controller_test.go` around lines 644 - 659,
The deployment status patching logic is duplicated across multiple test cases.
Create a shared test helper function (e.g., patchDeploymentRolloutComplete) that
encapsulates the logic of fetching a Deployment, setting its ObservedGeneration
to match Generation, adding the DeploymentProgressing condition with status True
and reason NewReplicaSetAvailable, and updating the status via the test client.
Then replace all three inline patch blocks with calls to this new helper
function, passing the client and deployment resource as parameters. This will
eliminate drift and make future changes to rollout-completion criteria easier to
maintain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Line 254: The error message in the fmt.Errorf call violates the ST1005 linting
rule by starting with an uppercase letter. In the clusteroperator_controller.go
file where the error is returned for the ProgressDeadlineExceeded condition,
change the error message to begin with a lowercase letter instead of uppercase,
so "Deployment" becomes "deployment" to comply with Go error message
conventions.

In `@pkg/controllers/status.go`:
- Around line 113-123: The early return condition in ensureStatusProgressing at
lines 119-121 is too narrow and skips calling setStatusProgressing when the
Progressing condition is already True, which can leave stale
OperatorUpgradeable=False conditions from previous degraded states. Remove or
modify the conditional check that returns early (the if statement checking
cond.Status == configv1.ConditionTrue) to ensure setStatusProgressing is always
called so that all status conditions including Upgradeable are properly
reconciled during healthy-progressing rollouts.

---

Nitpick comments:
In `@pkg/controllers/clusteroperator_controller_test.go`:
- Around line 644-659: The deployment status patching logic is duplicated across
multiple test cases. Create a shared test helper function (e.g.,
patchDeploymentRolloutComplete) that encapsulates the logic of fetching a
Deployment, setting its ObservedGeneration to match Generation, adding the
DeploymentProgressing condition with status True and reason
NewReplicaSetAvailable, and updating the status via the test client. Then
replace all three inline patch blocks with calls to this new helper function,
passing the client and deployment resource as parameters. This will eliminate
drift and make future changes to rollout-completion criteria easier to maintain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ab410a5-a9d1-4a24-bbea-c971b554f9cf

📥 Commits

Reviewing files that changed from the base of the PR and between 3098a39 and 620ea7f.

📒 Files selected for processing (3)
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go
  • pkg/controllers/status.go

Comment thread pkg/controllers/clusteroperator_controller.go Outdated
Comment thread pkg/controllers/status.go
Updates the logic in the clusteroperator_controller to handle
progressing more robustly.

The previous fix showed a jump from 14 -> 85% in sippy, but was still
possible to set progressing=True then immediately unset on next
reconcile.

This isn't that useful to a cluster admin trying to watch the progress
of the rollout / upgrade.

This change keeps Progressing=True until we see the deployment/daemonset
actually  stabilise.
@nrb

nrb commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/level0-clusterinfra-azure-ipi-proxy-tests e05d774 link false /test level0-clusterinfra-azure-ipi-proxy-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants