TLS certificate expiry on self-hosted services
The single most common "my self-hosted thing broke" cause isn't a software bug. It's a TLS certificate that expired because the auto-renewal cron quietly stopped running three months ago. This post is about making that stop happening.
Why this keeps going wrong
- Let's Encrypt certs are 90 days. That's short by design, to keep you renewing. It also means every configuration mistake shows up ~85 days later, long after you'd remember what you changed.
- Renewal crons are silent on failure. certbot renew writes to a log nobody reads unless the cert is already dead.
- Reverse proxies cache the old cert. Caddy and Traefik usually get this right. Raw nginx + certbot doesn't always reload on renewal; you end up serving an expired cert even though the on-disk file is fresh.
- DNS-01 renewals depend on API tokens. Cloudflare rotated the required API scopes a year ago; every Cloudflare-DNS-01 setup older than that is silently broken.
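For the nginx case in the list above, one common fix is a certbot deploy hook: certbot runs any executable it finds in /etc/letsencrypt/renewal-hooks/deploy after a successful renewal. The sketch below installs such a hook. The helper name install_reload_hook and the file name reload-nginx.sh are made up for illustration, and it assumes nginx is managed by systemd.

```shell
#!/bin/sh
# Sketch: install a certbot deploy hook that reloads nginx after renewal.
# Assumptions: the function name and hook file name are illustrative, and
# nginx runs as a systemd unit called "nginx".
install_reload_hook() {
  # Default to certbot's deploy-hook directory; accept an override for testing.
  hook_dir="${1:-/etc/letsencrypt/renewal-hooks/deploy}"
  hook="$hook_dir/reload-nginx.sh"
  mkdir -p "$hook_dir" || return 1
  cat > "$hook" <<'EOF'
#!/bin/sh
# Validate the config first so a bad edit does not take the proxy down.
nginx -t && systemctl reload nginx
EOF
  chmod +x "$hook"
}
```

Calling install_reload_hook with no argument (as root) writes the hook into certbot's default directory; passing a path writes it elsewhere so you can inspect it first.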
The 60-30-14 rule
Alert on three progressively louder thresholds:
- 60 days: info-level notification. Probably fine, but a heads-up if you're going on holiday.
- 30 days: warning. Something's probably wrong with auto-renewal. Investigate within a week.
- 14 days: critical. Get up and fix this today.
Most renewals happen at around 30 days remaining (certbot's default for Let's Encrypt's 90-day certs), so a 30-day alert that doesn't clear within 48 hours means something is actually broken, not just "hasn't triggered yet."
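As a minimal sketch, the three thresholds can be reduced to one function that maps days remaining to a severity label. The function name and the label strings are illustrative, and the comparisons are strict less-than, so a cert with exactly 30 days left still counts as info rather than warning.

```shell
#!/bin/sh
# Map days-until-expiry to a severity label using the 60/30/14 thresholds.
# The function name and labels are illustrative, not taken from any tool.
severity_for_days() {
  days=$1
  if [ "$days" -lt 14 ]; then
    echo critical      # also covers negative values, i.e. already expired
  elif [ "$days" -lt 30 ]; then
    echo warning
  elif [ "$days" -lt 60 ]; then
    echo info
  else
    echo ok
  fi
}
```

For example, severity_for_days 21 prints warning, which is the case where renewal should already have run.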
How to actually monitor it
Option A: run openssl from cron
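alert_to="you@example.com"  # placeholder: set this to your own address
for host in grafana.example.com git.example.com plex.example.com; do
  # Fetch the certificate the server actually presents and pull out notAfter.
  expiry=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  # No date means the connection or the parse failed; alert on that too.
  [ -n "$expiry" ] || { echo "ALERT: no cert read from $host" | mail -s "TLS expiry" "$alert_to"; continue; }
  # GNU date (Linux) parses the string with -d; BSD/macOS date needs -j -f.
  end=$(date -d "$expiry" +%s 2>/dev/null \
    || date -j -f "%b %e %H:%M:%S %Y %Z" "$expiry" +%s)
  days=$(( (end - $(date +%s)) / 86400 ))
  if [ "$days" -lt 30 ]; then
    echo "ALERT: $host expires in $days days" | mail -s "TLS expiry" "$alert_to"
  fi
done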
Works. Writes itself out of your memory in six months.
Option B: Noxen
Every Noxen scan inspects the TLS certificate on each TLS-capable open port (443, 465, 636, 993, 995, 8443, 9443), parses the full X.509 structure, and emits findings for:
- Expires in < 14 days (HIGH).
- Expires in < 30 days (MEDIUM).
- Weak signature algorithm (SHA-1, MD5).
- RSA key < 2048 bits.
- Self-signed with no SAN.
- Negotiated protocol is TLS 1.0 / 1.1 / SSLv3.
- Negotiated cipher is RC4 / 3DES / CBC-mode (no AEAD such as GCM).
The diff-from-yesterday view highlights when a cert has renewed — so you don't have to guess whether the renewal cron ran. Absence of a "cert expiry changed" entry after 85 days is itself the alert: "renewal should have happened by now; it didn't."
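The same diff-from-yesterday idea can be approximated with a small state file per host. This is a sketch of the pattern only, not a description of how Noxen implements it; the function name, the state-directory layout, and the new/changed/unchanged output words are all hypothetical.

```shell
#!/bin/sh
# Sketch of a diff-from-yesterday check for one host's certificate.
# Arguments: hostname, the cert's current notAfter string, a state directory.
# Prints "new" on first sight, "changed" if notAfter differs from the stored
# value (the cert renewed), and "unchanged" otherwise.
check_expiry_change() {
  host=$1
  not_after=$2
  state_dir=$3
  state_file="$state_dir/$host.notafter"
  mkdir -p "$state_dir" || return 1
  if [ ! -f "$state_file" ]; then
    result=new
  elif [ "$(cat "$state_file")" = "$not_after" ]; then
    result=unchanged
  else
    result=changed
  fi
  # Remember today's value for tomorrow's comparison.
  printf '%s\n' "$not_after" > "$state_file"
  echo "$result"
}
```

Run daily with the notAfter string from openssl x509 -noout -enddate; a long run of unchanged results past the expected renewal date is the signal to investigate.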
Longer-term fixes
- Switch to Caddy. Automatic TLS is the default; there's no separate renewal job to forget. For most homelab reverse-proxy use, Caddy removes the whole problem class.
- Use DNS-01 wherever possible. HTTP-01 requires port 80 open, which breaks when your ISP blocks it or your LAN-only service isn't reachable from the internet.
- Pin certbot to DNS-01 with a long-lived API token. For Cloudflare: a scoped token with just DNS Edit + Zone Read on the specific zone.
- Monitor renewals, not just expiry. If you know renewal happens at T-30 days, alert on "cert didn't renew" at T-25, not on "cert expired" at T-0.
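The last bullet amounts to a single comparison: the cert is overdue once its days remaining drop to the renewal point minus a grace period. A minimal sketch, assuming renewal at 30 days remaining and a 5-day grace period as defaults; the function name and both parameters are illustrative.

```shell
#!/bin/sh
# Succeeds (exit status 0) when a cert should already have renewed.
# Arguments: days remaining, renewal point in days (default 30),
# grace period in days (default 5). With the defaults this fires at T-25.
renewal_overdue() {
  days_remaining=$1
  renew_at=${2:-30}
  grace=${3:-5}
  [ "$days_remaining" -le $(( renew_at - grace )) ]
}
```

With the defaults, renewal_overdue 26 fails and renewal_overdue 25 succeeds, so it can sit directly in an if statement ahead of whatever sends the alert.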
The deeper point
Expiring certs are a visible failure mode. Every single one of them was preceded by something silent: a renewal that didn't run, a token that got rotated, a config that didn't reload. The fix is to make the silent thing loud — a week-by-week monitoring loop that notices "hey, this should have changed by now, why hasn't it?"
That's the diff-from-yesterday pattern, applied to TLS. And it generalises: every silent failure is fixable once you can answer, quickly, "what should have changed and didn't?"