chore: sync main→staging infra (drop Faro, CSS isolate, cloudflared probes) #169
Merged
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
The cloudflared deployment was hitting ~21 restarts/hour (20,640
restarts over 41 days). Root cause is in the `livenessProbe`:
```yaml
livenessProbe:
  httpGet: { path: /ready, port: 2000 }
  failureThreshold: 1 # ← one failure kills the pod
  periodSeconds: 10
```
`/ready` only returns 200 when at least one tunnel connection is active.
Cloudflare rotates edge connections periodically (and packet loss / DNS
jitter can briefly drop them too), so a transient non-200 is **expected
behavior, not a fault**. With `failureThreshold: 1`, kubelet kills the
pod on the first miss → cloudflared shuts down gracefully (exit 0) → pod
restarts → reconnects → next rotation kills it again. Death loop.
## Changes
- **Drop `livenessProbe`.** cloudflared handles reconnection internally;
k8s doesn't need to force-kill it.
- **Add `startupProbe`** (`failureThreshold: 30`, `periodSeconds: 10`) —
5-min budget for the initial tunnel connection.
- **Add `readinessProbe`** (`failureThreshold: 3`) — gates rolling
deploys so we don't terminate the old pod until the new one is actually
connected.
- **Pin image** `cloudflare/cloudflared:latest` → `2026.5.0`. `:latest`
blocks rollback and lets a bad upstream push break us silently.
- **`--loglevel debug` → `info`** — debug isn't appropriate for prod and
is noisy in logs.
- **Add `resources` requests/limits** — `50m/64Mi` request, `500m/256Mi`
limit. cloudflared is lightweight; these leave headroom for spikes.
Kept `replicas: 2` for HA during rolling updates and connection
rotation.
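
As a rough illustration, the probe, image, and resource changes listed above might look like this in the container spec. This is a sketch, not the manifest from the PR: the container name and the readiness `periodSeconds` are assumptions, the `/ready` path and port 2000 are carried over from the removed `livenessProbe`, and the thresholds, image tag, and resource figures come from the bullets above.

```yaml
# Hypothetical container spec fragment; values not stated in this PR are marked as assumptions.
containers:
  - name: cloudflared                        # assumed container name
    image: cloudflare/cloudflared:2026.5.0   # pinned instead of :latest
    # No livenessProbe: cloudflared handles reconnection internally.
    startupProbe:
      httpGet: { path: /ready, port: 2000 }
      failureThreshold: 30                   # 30 x 10s = 5-minute budget for the first connection
      periodSeconds: 10
    readinessProbe:
      httpGet: { path: /ready, port: 2000 }
      failureThreshold: 3
      periodSeconds: 10                      # assumption: period not stated in the PR
    resources:
      requests: { cpu: 50m, memory: 64Mi }
      limits: { cpu: 500m, memory: 256Mi }
```

A failing readiness probe only marks the pod NotReady; unlike a liveness failure it does not restart the container, so a brief non-200 during edge rotation no longer triggers a restart.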
## Test plan
- [ ] `helm template helm-chart` renders without error
- [ ] Deploy to staging, confirm both pods reach Ready
- [ ] Watch `kubectl get pod -l pod=cloudflared -w` for 24h, expect 0
restarts
- [ ] Trigger a rolling update, confirm new pod becomes Ready before old
terminates
- [ ] Verify tunnel still serves traffic end-to-end
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- **Slim code highlighting**: replace `@streamdown/code` (bundles all 50+ Shiki grammars) with a custom plugin loading only 13 languages + 2 themes. `paperdebugger.js` 17 MB → 9.4 MB.
- **Remove Grafana Faro** observability SDK from the extension bundle and build pipeline.
- **Isolate extension CSS from Overleaf**: scope all CSS under a `.pd-scope` class (via `postcss-prefix-selector`, content-script build only) so Tailwind preflight + heroui no longer leak into and mutate the Overleaf page. All UI surfaces (floating window, embed sidebar, heroui popovers via a dedicated `#pd-portal`, toolbar buttons) are self-scoped.
- **Telemetry**: LLM TTFT, JS heap gauge, event capture.

Automated: tsc + build clean.

Manual (needs a browser on Overleaf):
- Overleaf's own styling no longer altered
- All 5 display modes render styled (floating, right-fixed, bottom-fixed, fullscreen, embed)
- Dark mode in floating + embed
- heroui modals/tooltips/selects styled
- Toolbar button + "Add to Chat" styled

🤖 Generated with [Claude Code](https://claude.com/claude-code)

https://claude.ai/code/session_01ECe2qZWextwVCycC9EveFe

---------
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Junyi-99 added a commit that referenced this pull request on Jun 25, 2026
Reconcile the Tailwind v4 upgrade with the CSS-isolation + OTEL work that landed on staging (via #169 back-porting #168).

Resolution:
- CSS isolation kept on the PostCSS path: postcss.config.js runs @tailwindcss/postcss then postcss-prefix-selector (.pd-scope) for the default content-script build. Dropped @tailwindcss/vite: the Vite plugin bypasses PostCSS, which would break the prefix step. Verified: 3086 .pd-scope prefixes emitted in the default bundle.
- tailwind.config.js deleted (v4 ports it to CSS: @theme/@plugin/@source/@custom-variant in index.css), per the v4 upgrade guide.
- vite.config.ts: staging's produce()-based config + #125 path aliases. Added setAutoFreeze(false): immer froze the config and Vite mutates resolve.conditions at startup.
- package.json: dropped Faro (staging replaced it with OpenTelemetry), kept OTEL + web-vitals + apple-sign-in, took #125's version bumps, dropped autoprefixer (v4 built-in). Aligned @eslint/js to eslint 9 (eslint 10 pulls react-hooks 7 which floods 44 new-rule errors; separate migration).
- Regenerated bun.lock and package-lock.json.

Build: exit 0, all 5 targets. Lint: 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ECe2qZWextwVCycC9EveFe
## Why
`main` and `staging` had diverged (12 main-only / 81 staging-only commits). Feature-wise main ⊇ staging: the only thing staging genuinely lacked was the infrastructure / release-hygiene batch landed on main. This back-ports just that batch so the divergence stops growing.

A full `main→staging` merge was rejected: it drags in 6–7 conflicts from squash-duplicated feature work (BYOK, model-selection, cost-track) that already exists on both lines under different PR numbers. Cherry-picking only the genuinely-missing infra commits keeps the change surgical.

## What (cherry-picked from main)
- #168 slim bundle / drop Faro / isolate CSS from Overleaf: staging was still shipping `@grafana/faro-*` and had no `.pd-scope` CSS isolation (Overleaf style-bleed bug). 20 webapp files applied cleanly; only `package-lock.json` conflicted (regenerated). `bun.lock` also re-synced to drop Faro.
- #167 cloudflared probe loop fix: replaces the `failureThreshold:1` livenessProbe with startup/readiness probes.

Skipped (already on staging under different PR numbers): CI-workflow dedup (#138), backend-caller permissions (#146), contributing docs (#127), nodeSelector add (#137).

## Verification
`npm run build` → exit 0, all 5 targets emitted, no Faro errors. `tsc -b` clean. No Faro refs remain in `src/`, `package.json`, `bun.lock`, or `package-lock.json`.

🤖 Generated with Claude Code