chore: sync main→staging infra (drop Faro, CSS isolate, cloudflared probes) #169

Merged: Junyi-99 merged 3 commits into staging from chore/sync-main-to-staging on Jun 25, 2026

@Junyi-99

Why

main and staging had diverged (12 main-only / 81 staging-only commits). Feature-wise main ⊇ staging — the only thing staging genuinely lacked was the infrastructure / release-hygiene batch landed on main. This back-ports just that batch so the divergence stops growing.

A full main→staging merge was rejected: it drags in 6–7 conflicts from squash-duplicated feature work (BYOK, model-selection, cost-track) that already exists on both lines under different PR numbers. Cherry-picking only the genuinely-missing infra commits keeps the change surgical.

What (cherry-picked from main)

  • #168 slim bundle / drop Faro / isolate CSS from Overleaf — staging was still shipping @grafana/faro-* and had no .pd-scope CSS isolation (Overleaf style-bleed bug). 20 webapp files applied cleanly; only package-lock.json conflicted (regenerated). bun.lock also re-synced to drop Faro.
  • #167 cloudflared probe loop fix — replaces the failureThreshold:1 livenessProbe with startup/readiness probes.
  • nodeSelector cleanup — remove hardcoded values from prod values.

Skipped (already on staging under different PR numbers): CI-workflow dedup (#138), backend-caller permissions (#146), contributing docs (#127), nodeSelector add (#137).

Verification

  • npm run build → exit 0, all 5 targets emitted, no Faro errors.
  • tsc -b clean. No Faro refs remain in src/, package.json, bun.lock, or package-lock.json.

🤖 Generated with Claude Code

Junyi-99 and others added 3 commits June 25, 2026 18:19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

The cloudflared deployment was hitting ~21 restarts/hour (20,640
restarts over 41 days). Root cause is in the `livenessProbe`:

```yaml
livenessProbe:
  httpGet: { path: /ready, port: 2000 }
  failureThreshold: 1   # ← one failure kills the pod
  periodSeconds: 10
```

`/ready` only returns 200 when at least one tunnel connection is active.
Cloudflare rotates edge connections periodically (and packet loss / DNS
jitter can briefly drop them too), so a transient non-200 is **expected
behavior, not a fault**. With `failureThreshold: 1`, kubelet kills the
pod on the first miss → cloudflared shuts down gracefully (exit 0) → pod
restarts → reconnects → next rotation kills it again. The result is a
restart loop.

## Changes

- **Drop `livenessProbe`.** cloudflared handles reconnection internally;
k8s doesn't need to force-kill it.
- **Add `startupProbe`** (`failureThreshold: 30`, `periodSeconds: 10`) —
5-min budget for the initial tunnel connection.
- **Add `readinessProbe`** (`failureThreshold: 3`) — gates rolling
deploys so we don't terminate the old pod until the new one is actually
connected.
- **Pin image** `cloudflare/cloudflared:latest` → `2026.5.0`. `:latest`
blocks rollback and lets a bad upstream push break us silently.
- **`--loglevel debug` → `info`** — debug isn't appropriate for prod and
is noisy in logs.
- **Add `resources` requests/limits** — `50m/64Mi` request, `500m/256Mi`
limit. cloudflared is lightweight; these leave headroom for spikes.

Kept `replicas: 2` for HA during rolling updates and connection
rotation.
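
The probe, image, and resource changes listed above could look roughly like the fragment below. This is a sketch, not the chart's actual template: the image tag, `failureThreshold` values, the startup `periodSeconds`, and the resource figures come from the list above, while the container name and the reuse of `/ready` on port 2000 for the new probes are assumptions carried over from the old `livenessProbe`.

```yaml
# Hypothetical container spec after this change (names are illustrative).
containers:
  - name: cloudflared                        # assumed container name
    image: cloudflare/cloudflared:2026.5.0   # pinned; previously :latest
    # No livenessProbe: cloudflared reconnects on its own.
    startupProbe:
      httpGet: { path: /ready, port: 2000 }  # assumed: same endpoint as the old probe
      failureThreshold: 30
      periodSeconds: 10                      # 30 x 10 s = 300 s startup budget
    readinessProbe:
      httpGet: { path: /ready, port: 2000 }  # assumed: same endpoint as the old probe
      failureThreshold: 3                    # periodSeconds left at the Kubernetes default (10 s)
    resources:
      requests: { cpu: 50m, memory: 64Mi }
      limits: { cpu: 500m, memory: 256Mi }
```

A failing readiness probe only removes the pod from service endpoints and holds back a rolling update; unlike a liveness probe, it never restarts the container, which is why a transient non-200 from `/ready` no longer causes a restart.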

## Test plan

- [ ] `helm template helm-chart` renders without error
- [ ] Deploy to staging, confirm both pods reach Ready
- [ ] Watch `kubectl get pod -l pod=cloudflared -w` for 24h, expect 0
restarts
- [ ] Trigger a rolling update, confirm new pod becomes Ready before old
terminates
- [ ] Verify tunnel still serves traffic end-to-end

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- **Slim code highlighting** — replace `@streamdown/code` (bundles all
50+ Shiki grammars) with a custom plugin loading only 13 languages + 2
themes. `paperdebugger.js` 17 MB → 9.4 MB.
- **Remove Grafana Faro** observability SDK from the extension bundle
and build pipeline.
- **Isolate extension CSS from Overleaf** — scope all CSS under a
`.pd-scope` class (via `postcss-prefix-selector`, content-script build
only) so Tailwind preflight + heroui no longer leak into and mutate the
Overleaf page. All UI surfaces (floating window, embed sidebar, heroui
popovers via a dedicated `#pd-portal`, toolbar buttons) are self-scoped.
- **Telemetry** — LLM TTFT, JS heap gauge, event capture.

Automated: tsc + build clean. Manual (needs a browser on Overleaf):
- Overleaf's own styling no longer altered
- All 5 display modes render styled (floating, right-fixed,
bottom-fixed, fullscreen, embed)
- Dark mode in floating + embed
- heroui modals/tooltips/selects styled
- Toolbar button + "Add to Chat" styled

🤖 Generated with [Claude Code](https://claude.com/claude-code)

https://claude.ai/code/session_01ECe2qZWextwVCycC9EveFe

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Junyi-99 merged commit 24438ec into staging on Jun 25, 2026
1 check passed
Junyi-99 deleted the chore/sync-main-to-staging branch on June 25, 2026 10:24
Junyi-99 added a commit that referenced this pull request Jun 25, 2026
Reconcile the Tailwind v4 upgrade with the CSS-isolation + OTEL work that
landed on staging (via #169 back-porting #168).

Resolution:
- CSS isolation kept on the PostCSS path: postcss.config.js runs
  @tailwindcss/postcss then postcss-prefix-selector (.pd-scope) for the
  default content-script build. Dropped @tailwindcss/vite — the Vite plugin
  bypasses PostCSS, which would break the prefix step. Verified: 3086
  .pd-scope prefixes emitted in the default bundle.
- tailwind.config.js deleted (v4 ports it to CSS: @theme/@plugin/@source/
  @custom-variant in index.css), per the v4 upgrade guide.
- vite.config.ts: staging's produce()-based config + #125 path aliases.
  Added setAutoFreeze(false): immer freezes the produced config, but Vite
  mutates resolve.conditions at startup.
- package.json: dropped Faro (staging replaced it with OpenTelemetry), kept
  OTEL + web-vitals + apple-sign-in, took #125's version bumps, dropped
  autoprefixer (built into v4). Aligned @eslint/js to eslint 9 (eslint 10
  pulls in react-hooks 7, which raises 44 new-rule errors; that is a
  separate migration).
- Regenerated bun.lock and package-lock.json.

Build: exit 0, all 5 targets. Lint: 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ECe2qZWextwVCycC9EveFe