OCPBUGS-88561: Increase machineset-controller startup probe failure threshold by mkowalski · Pull Request #1509 · openshift/machine-api-operator

mkowalski · 2026-06-15T13:00:44Z

Problem

During 4.22→5.0 major upgrades on Single Node OpenShift (SNO), the machineset-controller container repeatedly fails its startup probe and enters CrashLoopBackOff. On SNO, all pods restart simultaneously on a single node (1052 containers in a 38-minute window in the failing run), overloading the kube-apiserver. The machineset-controller needs API server round-trips for cache sync and leader election, and under this load those round-trips take longer than the current 5-minute startup probe window (30 failures × 10s period).

Each CrashLoopBackOff restart resets the cache sync, making convergence harder. The probe failure events repeat pathologically (21-22 times), which triggers the monitor test regression.

Fix

Increase FailureThreshold from 30 to 60, extending the startup probe tolerance from 5 to 10 minutes. This gives the machineset-controller enough time to complete cache sync under heavy API server load without entering CrashLoopBackOff.

The startup probe runs only while the container is starting; once it succeeds, the unchanged readiness and liveness probes take over. Increasing the startup threshold therefore has no impact on steady-state behavior.

Evidence

Journal analysis from a failing run (2066200127957110784):

17:03:46 — CRI-O creates the pod sandbox
17:09:56 — First probe failures: connection refused (process not listening yet)
17:13:26 — Probes switch to context deadline exceeded (/readyz too slow under API server load)
17:14:19 — CrashLoopBackOff begins after exhausting 30-failure threshold
17:14→17:51 — Repeated CrashLoopBackOff cycles with exponential backoff
During this window: 1052 containers being created/started on the single node

Pass rates: Base (4.22) 87/87 (100%) → Sample (5.0) 19/23 (82.6%) on the SNO upgrade job.

Bug

https://redhat.atlassian.net/browse/OCPBUGS-88561

🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

Chores
- Improved startup stability of the operator component by increasing failure tolerance during initialization.

During major upgrades on Single Node OpenShift (SNO), the machineset-controller container repeatedly fails its startup probe and enters CrashLoopBackOff. On SNO, all pods restart simultaneously on a single node (~1000 containers), overloading the API server. The machineset-controller needs API server round-trips for cache sync and leader election, which exceed the current 5-minute startup probe window (30 failures × 10s period).

Increase FailureThreshold from 30 to 60, extending the startup probe tolerance from 5 to 10 minutes. This gives the machineset-controller enough time to complete cache sync under heavy API server load without entering CrashLoopBackOff.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88561
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
Signed-off-by: Mateusz Kowalski <mko@redhat.com>

coderabbitai · 2026-06-15T13:00:59Z

Walkthrough

In pkg/operator/sync.go, the StartupProbe FailureThreshold for the machineset-controller container is increased from 30 to 60. All other probe parameters remain unchanged.

Changes

machineset-controller StartupProbe FailureThreshold update

| Layer / File(s) | Summary |
|---|---|
| StartupProbe FailureThreshold increase `pkg/operator/sync.go` | `FailureThreshold` in the `machineset-controller` container's `StartupProbe` is changed from `30` to `60`; all other probe fields are unchanged. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR only modifies pkg/operator/sync.go (FailureThreshold change). All Ginkgo test names in codebase are stable/deterministic with no dynamic content like pod names, timestamps, UUIDs, or node names... |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code in this PR; only standard Go unit tests in sync_test.go. Custom check for Ginkgo test quality is not applicable. |
| Microshift Test Compatibility | ✅ Passed | PR only modifies pkg/operator/sync.go to change a numeric constant (FailureThreshold 30→60) in container probe configuration; no new Ginkgo e2e tests are added. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The change only modifies a container probe configuration parameter in pkg/operator/sync.go, making this check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR only increases the StartupProbe FailureThreshold from 30 to 60. It introduces no new scheduling constraints (affinity, topology spread, nodeSelector, PDB, etc.), and actually improves SNO s... |
| Ote Binary Stdout Contract | ✅ Passed | PR changes only a Kubernetes Probe FailureThreshold constant (30→60) in a library package file with no main(), init(), logging setup, or stdout writes. Not applicable to OTE Binary Stdout Contract... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies container probe configuration in pkg/operator/sync.go, not Ginkgo e2e tests. Check only applies to new test additions. |
| No-Weak-Crypto | ✅ Passed | The PR only changes a Kubernetes probe FailureThreshold from 30 to 60. No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons are introduced. |
| Container-Privileges | ✅ Passed | The PR modifies only a StartupProbe FailureThreshold value (30→60) in pkg/operator/sync.go. No privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN capability, allowPrivilegeEscalation, or ro... |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR changes only a FailureThreshold configuration value (30→60) with no logging modifications or sensitive data exposure. Existing logging statements are generic error/info messages without sensitiv... |
| Title check | ✅ Passed | The PR title directly and clearly summarizes the main change: increasing the machineset-controller startup probe failure threshold from 30 to 60, which is the exact modification described in the raw summary. |

openshift-ci · 2026-06-15T13:01:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign damdo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-06-15T13:04:44Z

@mkowalski: This pull request references Jira Issue OCPBUGS-88561, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

mkowalski · 2026-06-15T13:05:30Z

/jira refresh

openshift-ci-robot · 2026-06-15T13:05:36Z

@mkowalski: This pull request references Jira Issue OCPBUGS-88561, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci · 2026-06-15T17:03:42Z

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from mdbooth and racheljpg June 15, 2026 13:01

mkowalski changed the title ~~Increase machineset-controller startup probe failure threshold~~ OCPBUGS-88561: Increase machineset-controller startup probe failure threshold Jun 15, 2026

openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Jun 15, 2026

openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) label Jun 15, 2026

openshift-ci-robot removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label Jun 15, 2026

mkowalski wants to merge 1 commit into openshift:main from mkowalski:fix/increase-machineset-startup-probe-tolerance
