OCPBUGS-88561: Increase machineset-controller startup probe failure threshold #1509
During major upgrades on Single Node OpenShift (SNO), the machineset-controller container repeatedly fails its startup probe and enters CrashLoopBackOff. On SNO, all pods restart simultaneously on a single node (~1000 containers), overloading the API server. The machineset-controller needs API server round-trips for cache sync and leader election, which exceed the current 5-minute startup probe window (30 failures × 10s period).

Increase FailureThreshold from 30 to 60, extending the startup probe tolerance from 5 to 10 minutes. This gives the machineset-controller enough time to complete cache sync under heavy API server load without entering CrashLoopBackOff.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88561
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
@mkowalski: This pull request references Jira Issue OCPBUGS-88561, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@mkowalski: This pull request references Jira Issue OCPBUGS-88561, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
@mkowalski: all tests passed!
Problem
During 4.22→5.0 major upgrades on Single Node OpenShift (SNO), the machineset-controller container repeatedly fails its startup probe and enters CrashLoopBackOff. On SNO, all pods restart simultaneously on a single node (~1052 containers in a 38-minute window), overloading the kube-apiserver. The machineset-controller needs API server round-trips for cache sync and leader election, which exceed the current 5-minute startup probe window (30 failures × 10s period).

Each CrashLoopBackOff restart resets the cache sync, making convergence harder. The probe failure events repeat 21-22 times pathologically, triggering the monitor test regression.
Fix
Increase FailureThreshold from 30 to 60, extending the startup probe tolerance from 5 to 10 minutes. This gives the machineset-controller enough time to complete cache sync under heavy API server load without entering CrashLoopBackOff.

The startup probe runs only during container initialization; once it passes, the readiness and liveness probes (unchanged) take over. Increasing the startup threshold has no impact on steady-state behavior.
Evidence
Journal analysis from a failing run (2066200127957110784):
- connection refused (process not listening yet)
- context deadline exceeded (/readyz too slow under API server load)

Pass rates: Base (4.22) 87/87 (100%) → Sample (5.0) 19/23 (82.6%) on the SNO upgrade job.
Bug
https://redhat.atlassian.net/browse/OCPBUGS-88561
🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.