Why doesn't a normal uptime check catch a dead convergence loop?

Uptime checks prove the API still answers 200 — but a control plane whose convergence timer died keeps serving the last-applied state perfectly while drift quietly accumulates. The failure is the absence of a successful reconcile, not a down endpoint, so you need a dead-man's switch (heartbeat) plus proof the last converge actually applied, not just an HTTP probe.

How do I alert when converge runs but silently does nothing?

Have the node serve a converge-status artifact — e.g. {"ok":true,"last_success":" ","config_rev":" "} where ok is true only when last_success is recent — and point a Nightlamp http_keyword check at it asserting "ok":true. A timer that fires, exits 0, and applies nothing leaves last_success stale, ok flips to false, and the keyword check fails.

Should the heartbeat ping fire when the timer runs, or only when convergence succeeds?

Only on a health-gated success. POST the heartbeat URL after the post-converge health gate passes — api.steada.dev/healthz returning 200. A failed or rolled-back converge withholds the ping, so Nightlamp opens an incident at roughly 1.5x the expected interval. Pinging on every timer fire would mask exactly the rollback you are trying to detect.

How do I avoid alerts during planned maintenance?

Pause the monitor for the maintenance window — set paused_until to a future timestamp via PATCH /monitors/{id} (or the Pause control in the UI). A paused monitor is skipped by the evaluator, so a deliberately stopped convergence loop during a deploy won't page on-call. It resumes automatically when the window passes.

Control plane & self-convergence

Your control plane stopped converging — silently? Catch the reconcile loop that dies, no-ops, or rolls back.

A self-converging control plane (ansible-pull on a systemd timer, reconciling a managed datastore like Steada's api.steada.dev) fails in the worst way: invisibly. The timer dies, a run throws and the error is swallowed, a bad commit triggers a rollback, or a run exits 0 having applied nothing. The API answers 200 the whole time. This recipe turns a silently broken reconcile loop into an alert by pairing a health-gated heartbeat with a converge-status freshness check.


why it happens

A reconcile loop fails as an absence, not an error.

The whole point of a self-converging control plane is that it keeps the desired state applied without anyone watching. So when the convergence loop stops, the symptom is… nothing. The box stays up, the API keeps serving the last-applied config, and drift accumulates silently until the next deploy fails in a confusing way. You cannot monitor that with an uptime check — the endpoint is fine. You monitor it the way you monitor any guaranteed-to-happen event: a dead-man’s switch that expects a heartbeat on success and raises an incident when it doesn’t arrive, plus proof the last reconcile actually applied.

the failure modes

Five ways a convergence loop breaks in silence

The convergence timer stopped or was masked

Symptom: steada-converge.timer is dead or masked; nobody notices until a deploy is needed.

Catch: The success heartbeat stops arriving. Nightlamp opens an incident at ~1.5x the expected interval — that is the dead-man's switch for a timer that simply stopped firing.

ansible-pull threw, but the error was swallowed

Symptom: A git, SSM, or network error aborts the run, and the failure never surfaces.

Catch: Because the heartbeat fires only after a health-gated success, a thrown run withholds the ping — the missed heartbeat is the alert. Surface the error in converge-status too.

A bad commit fails validation and converge rolls back

Symptom: caddy validate (or your config check) rejects a commit; converge rolls back, and every later run fails the same way until the commit is fixed.

Catch: The rollback withholds the success ping and leaves converge-status ok:false. Both the heartbeat miss and the converge-status keyword check fire, pointing at the offending config_rev.

The node's SSM-reader credential expired

Symptom: Secrets can't be fetched, so convergence can't complete even though the box is up.

Catch: No health-gated success means no ping; the heartbeat incident fires while /healthz may still be green — exactly the gap a plain uptime check misses.

The timer fires but convergence is a silent no-op

Symptom: Runs show green and exit 0, but the desired state was never applied — drift grows.

Catch: Don't trust exit 0. converge-status reports ok:true only when last_success is within the last 20 minutes; a no-op lets it go stale and the http_keyword check fails.
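One way to make the freshness rule concrete is to derive ok from the age of last_success each time the artifact is rendered, rather than storing it once. The sketch below is an assumption about how you might do that, not Nightlamp's or Steada's implementation: converge_ok is a hypothetical helper, timestamps are passed as epoch seconds, and the 1200-second default mirrors the 20-minute window above.

```shell
# Hypothetical helper: prints "true" when the last successful converge is
# recent enough, "false" otherwise. Timestamps are epoch seconds.
# Usage: converge_ok LAST_SUCCESS_EPOCH NOW_EPOCH [MAX_AGE_SECONDS]
converge_ok() {
  local last_success=$1 now=$2 max_age=${3:-1200}  # 1200 s = 20 minutes
  local age=$(( now - last_success ))
  # A negative age means a clock problem; treat it as not fresh.
  if [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]; then
    echo true
  else
    echo false
  fi
}

converge_ok 1700000000 1700000600   # 10 minutes old: prints true
converge_ok 1700000000 1700001500   # 25 minutes old: prints false
```

With GNU date, an ISO timestamp such as the one written by date -Is can be converted first with date -d "$last_success" +%s. Evaluating freshness at read time matters because a stored ok:true would otherwise stay true forever once the loop stops writing.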

the recipe

Ping on a health-gated success, not on every run

The key idea: POST the heartbeat URL only after the post-converge health gate passes — api.steada.dev/healthz returning 200. A failed or rolled-back converge withholds the ping, and Nightlamp opens an incident at roughly 1.5x the expected interval. Pinging on every timer fire would hide the exact rollback you want to catch.

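# after ansible-pull applies desired state:
# write_converge_status is a helper you provide; LAST_SUCCESS holds the
# timestamp of the previous good run, read back from the status artifact.
if curl -fsS --max-time 10 https://api.steada.dev/healthz >/dev/null; then
  # vhost healthy → record success + ping Nightlamp
  write_converge_status ok=true last_success="$(date -Is)" config_rev="$(git rev-parse HEAD)"
  curl -fsS --max-time 10 -X POST "$NIGHTLAMP_HEARTBEAT_URL" >/dev/null
else
  # gate failed → keep the previous success timestamp so the artifact goes stale
  # (also record the rejected rev here if converge-status should point at it)
  write_converge_status ok=false last_success="$LAST_SUCCESS"
  # no ping → Nightlamp opens an incident at ~1.5x the interval
fi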
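The snippet above calls write_converge_status without defining it. A minimal sketch of such a helper might look like the following. Everything here is an assumption for illustration: the default file path, the key=value argument style (inferred from the calls above), and the lack of JSON escaping.

```shell
# Hypothetical write_converge_status: writes key=value arguments as a small
# JSON artifact. The default path is an assumption; point it at whatever
# file your web server serves as /converge-status.
CONVERGE_STATUS_FILE=${CONVERGE_STATUS_FILE:-/var/www/converge-status.json}

write_converge_status() {
  local ok=false last_success="" config_rev="" kv tmp
  for kv in "$@"; do
    case $kv in
      ok=*)           ok=${kv#ok=} ;;
      last_success=*) last_success=${kv#last_success=} ;;
      config_rev=*)   config_rev=${kv#config_rev=} ;;
    esac
  done
  # Write to a temp file in the same directory, then rename it, so a reader
  # never sees a half-written artifact.
  tmp=$(mktemp "${CONVERGE_STATUS_FILE}.XXXXXX") || return 1
  # Values are assumed to contain no quotes or backslashes (no JSON escaping).
  printf '{"ok":%s,"last_success":"%s","config_rev":"%s"}\n' \
    "$ok" "$last_success" "$config_rev" > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$CONVERGE_STATUS_FILE"
}
```

The output deliberately contains no whitespace around the colon, so a keyword check for "ok":true matches the success case byte for byte.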

how to catch it automatically

Three Nightlamp checks, one for each failure shape

Map the recipe onto Nightlamp’s real check schema: a heartbeat for “it stopped succeeding,” an http_keyword on the status artifact for “it ran but did nothing,” and a plain status check for external liveness independent of the converge loop.

# 1 — dead-man's switch: success heartbeat (job pings after the health gate)
check_type: heartbeat
expected_interval_seconds: 600     # job cadence: every 10 min
# grace defaults to ~half the interval → incident if no ping within ~900s

# 2 — "ran but did nothing": converge-status freshness
check_type: http_keyword
url: https://api.steada.dev/converge-status
keyword: '"ok":true'               # true only when last_success is recent
interval_seconds: 300              # evaluate every 5 min

# 3 — external liveness, independent of the converge loop
check_type: http_status
url: https://api.steada.dev/healthz
expected_status_codes: [200]
interval_seconds: 60

Timer stopped / masked

Swallowed ansible-pull error

Failed validate → rollback

Expired SSM credential

Silent no-op (stale config_rev)

Drift on either vhost
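The heartbeat timing in check 1 can be worked through as plain arithmetic. This sketch assumes the grace period is exactly half the expected interval (the config comment only says "~half"), and both function names are hypothetical rather than part of Nightlamp.

```shell
# Assumption: grace is exactly half the expected interval.
# Usage: heartbeat_deadline_seconds EXPECTED_INTERVAL_SECONDS
heartbeat_deadline_seconds() {
  local interval=$1
  local grace=$(( interval / 2 ))
  echo $(( interval + grace ))
}

# Usage: heartbeat_overdue LAST_PING_EPOCH NOW_EPOCH EXPECTED_INTERVAL_SECONDS
# Returns 0 (overdue) when the silence exceeds interval + grace.
heartbeat_overdue() {
  local last_ping=$1 now=$2 interval=$3
  [ $(( now - last_ping )) -gt "$(heartbeat_deadline_seconds "$interval")" ]
}

heartbeat_deadline_seconds 600   # prints 900
```

Under that assumption, a 10-minute job that runs up to 5 minutes late does not page, while a run that is missed entirely does, because the 900-second deadline passes before the following run at 1200 seconds.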

planned maintenance

Pause the switch during a deliberate stop

When you intentionally stop convergence for a deploy or migration, pause the heartbeat so it doesn’t page: set paused_until to the end of the window via PATCH /monitors/{id} (or use the Pause control on the monitor page). The evaluator skips paused monitors and resumes them automatically when the window passes. This is the mechanism that the recipe’s suppress_when: scheduled-maintenance intent maps onto today.
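A pause request could be scripted along these lines. Only the PATCH /monitors/{id} route and the paused_until field come from the text above. The base URL variable, the bearer-token header, and the JSON body shape are assumptions, so check them against the Nightlamp API reference before relying on this.

```shell
# Assumptions (not confirmed by the recipe): NIGHTLAMP_API_URL is the API base
# URL, NIGHTLAMP_API_TOKEN is a bearer token, and the PATCH body is JSON.

# Build the JSON body for a pause that ends at the given ISO 8601 timestamp.
pause_payload() {
  printf '{"paused_until":"%s"}' "$1"
}

# Usage: pause_monitor MONITOR_ID PAUSED_UNTIL
pause_monitor() {
  local monitor_id=$1 paused_until=$2
  curl -fsS -X PATCH "$NIGHTLAMP_API_URL/monitors/$monitor_id" \
    -H "Authorization: Bearer $NIGHTLAMP_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(pause_payload "$paused_until")"
}

pause_payload "2030-01-01T03:00:00Z"   # prints {"paused_until":"2030-01-01T03:00:00Z"}
```

Setting an explicit end time, rather than pausing indefinitely, means a forgotten maintenance window cannot leave the dead-man's switch disabled.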

Want the reconcile loop to page you the first interval it fails?

Start a trial, create a heartbeat monitor for the converge job and an http_keyword check on its status artifact, and Nightlamp will alert you the first interval convergence stops succeeding — and the first time it runs but applies nothing — instead of at the next failed deploy.

Start 14-day trial · no card
