Skip to content

cert-checker: fix logging & push metrics#8763

Open
lenaunderwood22 wants to merge 21 commits into
mainfrom
cert-checker-logs
Open

cert-checker: fix logging & push metrics#8763
lenaunderwood22 wants to merge 21 commits into
mainfrom
cert-checker-logs

Conversation

@lenaunderwood22
Copy link
Copy Markdown
Contributor

Fixes #8753

@lenaunderwood22 lenaunderwood22 requested a review from a team as a code owner May 26, 2026 21:25
@lenaunderwood22 lenaunderwood22 requested a review from ezekiel May 26, 2026 21:25
ezekiel
ezekiel previously approved these changes May 26, 2026
@ezekiel ezekiel requested review from a team and aarongable and removed request for a team May 26, 2026 21:56
@github-actions
Copy link
Copy Markdown
Contributor

@lenaunderwood22, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@ezekiel ezekiel requested review from a team and removed request for aarongable May 26, 2026 21:56
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
CertChecker struct {
DB cmd.DBConfig
DB cmd.DBConfig
PushgatewayURL string `validate:"omitempty,url"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put this lower in the config. The cmd.DBConfig and cmd.HostnamePolicyConfig are big structs that are shared among many configs, we should keep them together at the top. This new config item is very specific to just cert-checker, so it should go lower.

Personally I'd put it right after Workers.

Unrelated question: last I heard, the plan was for cert-checker to be configured with the push gateway's Consul service name, and to do a SRV record lookup to get the real endpoint to push to. Is it actually easy to just configure this with the URL directly, no need for Consul?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to right after Workers.

We can configure this with just the URL. We can do consul lookup on the deploy side, and in fact we do this already for the deploy of an internal tool.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Could possibly lead to some slippage if cert-checker remains deployed with the same URL for a whole week, while the push gateway gets redeployed and changes ports or something midweek, but I assume that's not very likely (e.g. the push gateway is, in practice, always at the same place).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is uglier and harder to do in practice than I thought. I think it would be better to either have pushgatewayURL as a flag instead of being part of cert-checker's config, or remove the config altogether and have cert-checker do the consul lookups. My leaning is for the former, which would match how the internal tool does this, but I'm open to either.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the internal tool get the pushgateway service as a flag, and then does Consul SRV lookups on that to get the actual address? IMO we should do the exact same thing. (Which is also how our ratelimits code discovers Redis, and how the WFEs discover nonce servers.)

You can see some example code here:

That all sounds a bit complicated, and it sorta is. But I think it's the only way we can ensure that the SRV lookup happens right before cert-checker attempts to push metrics; populating the pushgateway url directly into the config or as the command line flag means the value will be determined a long time before cert-checker is ready to push.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the example links! Implemented pushgateway lookup in 02e95e5

Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go
Comment thread cmd/cert-checker/main.go
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/shell.go
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
aarongable
aarongable previously approved these changes May 28, 2026
@aarongable aarongable requested a review from ezekiel May 28, 2026 23:33
ezekiel
ezekiel previously approved these changes May 28, 2026
@lenaunderwood22 lenaunderwood22 dismissed stale reviews from ezekiel and aarongable via 15ce586 May 29, 2026 14:50
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates cert-checker to emit structured audit logs (instead of printing a large multi-line JSON blob to stdout) and to publish run metrics (good/bad cert counts, latency, last-run timestamp) to a Prometheus Pushgateway discovered via DNS SRV (intended to replace an external wrapper script per issue #8753).

Changes:

  • Replace stdout JSON report output with per-certificate audit error logs plus a final audit summary.
  • Add Prometheus metrics for cert-checker results and push them to a Pushgateway (incl. SRV discovery + shared cmd.PushMetrics helper).
  • Update tests/config fixtures and vendor the Prometheus push subpackage.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
cmd/cert-checker/main.go Switch to audit logging, add metrics, add SRV-based Pushgateway discovery and push behavior.
cmd/cert-checker/main_test.go Update tests to assert audit log emission rather than JSON report contents.
cmd/shell.go Add PushMetrics helper using Prometheus push client.
test/config/cert-checker.json Remove unexpiredOnly field from fixture config.
test/config-next/cert-checker.json Remove unexpiredOnly field from fixture config.
vendor/modules.txt Add vendored package path for prometheus/push.
vendor/github.com/prometheus/client_golang/prometheus/push/push.go Vendor Prometheus Pushgateway client implementation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/shell.go
Comment thread cmd/cert-checker/main.go Outdated
Comment thread cmd/cert-checker/main.go Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve cert-checker's log and metric output

4 participants