cert-checker: fix logging & push metrics#8763
Conversation
|
@lenaunderwood22, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values. |
| CertChecker struct { | ||
| DB cmd.DBConfig | ||
| DB cmd.DBConfig | ||
| PushgatewayURL string `validate:"omitempty,url"` |
There was a problem hiding this comment.
nit: put this lower in the config. The cmd.DBConfig and cmd.HostnamePolicyConfig are big structs that are shared among many configs, we should keep them together at the top. This new config item is very specific to just cert-checker, so it should go lower.
Personally I'd put it right after Workers.
Unrelated question: last I heard, the plan was for cert-checker to be configured with the push gateway's Consul service name, and to do a SRV record lookup to get the real endpoint to push to. Is it actually easy to just configure this with the URL directly, no need for Consul?
There was a problem hiding this comment.
Moved to right after Workers.
We can configure this with just the URL. We can do consul lookup on the deploy side, and in fact we do this already for the deploy of an internal tool.
There was a problem hiding this comment.
Nice! Could possibly lead to some slippage if cert-checker remains deployed with the same URL for a whole week, while the push gateway gets redeployed and changes ports or something midweek, but I assume that's not very likely (e.g. the push gateway is, in practice, always at the same place).
There was a problem hiding this comment.
Actually this is uglier and harder to do in practice than I thought. I think it would be better to either have pushgatewayURL as a flag instead of being part of cert-checker's config, or remove the config altogether and have cert-checker do the consul lookups. My leaning is for the former, which would match how the internal tool does this, but I'm open to either.
There was a problem hiding this comment.
Doesn't the internal tool get the pushgateway service as a flag, and then does Consul SRV lookups on that to get the actual address? IMO we should do the exact same thing. (Which is also how our ratelimits code discovers Redis, and how the WFEs discover nonce servers.)
You can see some example code here:
- First, take the dns server address and service name to look up as config parameters
- Use the dns server address to build a custom resolver that will hit that nameserver when doing lookups
- Get a list of targets by doing a SRV lookup
- Convert those targets to concrete addresses by doing a A/AAAA lookup
- Use one of the resolved addresses as the target for the push.
That all sounds a bit complicated, and it sorta is. But I think it's the only way we can ensure that the SRV lookup happens right before cert-checker attempts to push metrics; populating the pushgateway url directly into the config or as the command line flag means the value will be determined a long time before cert-checker is ready to push.
There was a problem hiding this comment.
Thank you for all the example links! Implemented pushgateway lookup in 02e95e5
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
There was a problem hiding this comment.
Pull request overview
This PR updates cert-checker to emit structured audit logs (instead of printing a large multi-line JSON blob to stdout) and to publish run metrics (good/bad cert counts, latency, last-run timestamp) to a Prometheus Pushgateway discovered via DNS SRV (intended to replace an external wrapper script per issue #8753).
Changes:
- Replace stdout JSON report output with per-certificate audit error logs plus a final audit summary.
- Add Prometheus metrics for cert-checker results and push them to a Pushgateway (incl. SRV discovery + shared
cmd.PushMetricshelper). - Update tests/config fixtures and vendor the Prometheus
pushsubpackage.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/cert-checker/main.go |
Switch to audit logging, add metrics, add SRV-based Pushgateway discovery and push behavior. |
cmd/cert-checker/main_test.go |
Update tests to assert audit log emission rather than JSON report contents. |
cmd/shell.go |
Add PushMetrics helper using Prometheus push client. |
test/config/cert-checker.json |
Remove unexpiredOnly field from fixture config. |
test/config-next/cert-checker.json |
Remove unexpiredOnly field from fixture config. |
vendor/modules.txt |
Add vendored package path for prometheus/push. |
vendor/github.com/prometheus/client_golang/prometheus/push/push.go |
Vendor Prometheus Pushgateway client implementation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Fixes #8753