Your control plane stopped converging — silently? Catch the reconcile loop that dies, no-ops, or rolls back.
A self-converging control plane — ansible-pull on a systemd timer, reconciling api.steada.dev and api.keynest.dev — fails in the worst way: invisibly. The timer dies, a run throws and swallows the error, a bad commit triggers a rollback, or a run exits 0 having applied nothing. The API still answers 200 the whole time. Here is the recipe that turns a silently broken reconcile loop into an alert — a health-gated heartbeat plus a converge-status freshness check.
A reconcile loop fails as an absence, not an error.
The whole point of a self-converging control plane is that it keeps the desired state applied without anyone watching. So when the convergence loop stops, the symptom is… nothing. The box stays up, the API keeps serving the last-applied config, and drift accumulates silently until the next deploy fails in a confusing way. You cannot monitor that with an uptime check — the endpoint is fine. You monitor it the way you monitor any guaranteed-to-happen event: a dead-man’s switch that expects a heartbeat on success and raises an incident when it doesn’t arrive, plus proof the last reconcile actually applied.
Five ways a convergence loop breaks in silence
The convergence timer stopped or was masked
Symptom: steada-converge.timer is dead or masked; nobody notices until a deploy is needed.
Catch: The success heartbeat stops arriving. Nightlamp opens an incident at ~1.5x the expected interval — that is the dead-man's switch for a timer that simply stopped firing.
ansible-pull threw, but the error was swallowed
Symptom: A git, SSM, or network error aborts the run, and the failure never surfaces.
Catch: Because the heartbeat fires only after a health-gated success, a thrown run withholds the ping — the missed heartbeat is the alert. Surface the error in converge-status too.
A bad commit fails validation and converge rolls back
Symptom: caddy validate (or your config check) rejects a commit; converge rolls back but keeps failing.
Catch: The rollback withholds the success ping and leaves converge-status ok:false. Both the heartbeat miss and the converge-status keyword check fire, pointing at the offending config_rev.
The node's SSM-reader credential expired
Symptom: Secrets can't be fetched, so convergence can't complete even though the box is up.
Catch: No health-gated success means no ping; the heartbeat incident fires while /healthz may still be green — exactly the gap a plain uptime check misses.
The timer fires but convergence is a silent no-op
Symptom: Runs show green and exit 0, but the desired state was never applied — drift grows.
Catch: Don't trust exit 0. converge-status reports ok:true only when last_success is within the last 20 minutes; a no-op lets it go stale and the http_keyword check fails.
Ping on a health-gated success, not on every run
The key idea: POST the heartbeat URL only after the post-converge health gate passes — both api.steada.dev/healthz and api.keynest.dev/healthz returning 200. A failed or rolled-back converge withholds the ping, and Nightlamp opens an incident at roughly 1.5x the expected interval. Pinging on every timer fire would hide the exact rollback you want to catch.
# after ansible-pull applies desired state:
if curl -fsS https://api.steada.dev/healthz >/dev/null \
&& curl -fsS https://api.keynest.dev/healthz >/dev/null; then
# both vhosts healthy → record success + ping Nightlamp
write_converge_status ok=true last_success="$(date -Is)" config_rev="$(git rev-parse HEAD)"
curl -fsS -X POST "$NIGHTLAMP_HEARTBEAT_URL" >/dev/null
else
write_converge_status ok=false last_success="$LAST_SUCCESS"
# no ping → Nightlamp opens an incident at ~1.5x the interval
fiThree Nightlamp checks, one for each failure shape
Map the recipe onto Nightlamp’s real check schema: a heartbeat for “it stopped succeeding,” an http_keyword on the status artifact for “it ran but did nothing,” and a plain status check for external liveness independent of the converge loop.
# 1 — dead-man's switch: success heartbeat (job pings after the health gate)
check_type: heartbeat
expected_interval_seconds: 600 # job cadence: every 10 min
# grace defaults to ~half the interval → incident if no ping within ~900s
# 2 — "ran but did nothing": converge-status freshness
check_type: http_keyword
url: https://api.steada.dev/converge-status
keyword: '"ok":true' # true only when last_success is recent
interval_seconds: 300 # evaluate every 5 min
# 3 — external liveness, independent of the converge loop
check_type: http_status
url: https://api.steada.dev/healthz
expected_status_codes: [200]
interval_seconds: 60Pause the switch during a deliberate stop
When you intentionally stop convergence for a deploy or migration, pause the heartbeat so it doesn’t page: set paused_until to the end of the window via PATCH /monitors/{id} (or the Pause control on the monitor page). The evaluator skips paused monitors and resumes automatically when the window passes — the same mechanism the recipe’s suppress_when: scheduled-maintenance intent maps onto today.
Want the reconcile loop to page you the first interval it fails?
Start a trial, create a heartbeat monitor for the converge job and an http_keyword check on its status artifact, and Nightlamp will alert you the first interval convergence stops succeeding — and the first time it runs but applies nothing — instead of at the next failed deploy.
Start 14-day trial · no card