
Health Model

GSLB continuously probes every IP behind a record and drives each one through a four-state health model. Only IPs in a served state (passing or warning) appear in DNS answers, so the model is what makes GSLB answers reflect reality. This page covers the probe configuration, the states and transitions, and how state maps to DNS inclusion.

Probe types

A record's probe object defines its health check. Three probe types are supported:

| Type | What it does | Success criteria |
| --- | --- | --- |
| http | HTTP GET to path on port | Response status matches expected_status_codes. |
| https | HTTPS GET (TLS) to path on port | As http, plus TLS handshake. Optional skip_ssl_verify for self-signed certs. |
| tcp | TCP connect to port | Connection establishes successfully. |
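The http/https success criterion can be illustrated with a small matcher for the expected_status_codes syntax (single codes and inclusive ranges, as listed under Probe configuration). This is a minimal sketch, not the checker's actual code: the function name `status_matches` is hypothetical, and it assumes every entry is either "NNN" or "NNN-NNN".

```python
def status_matches(status: int, expected: list[str]) -> bool:
    """Return True if status falls in any entry of expected.

    Assumption: each entry is a single code ("301") or an
    inclusive range ("200-299"); other forms are not handled.
    """
    for entry in expected:
        low, sep, high = entry.partition("-")
        if sep:  # range form "low-high"
            if int(low) <= status <= int(high):
                return True
        elif int(low) == status:  # single-code form
            return True
    return False


DEFAULT_EXPECTED = ["200-399"]  # documented default

print(status_matches(204, ["200-299", "301"]))  # True
print(status_matches(302, ["200-299", "301"]))  # False
print(status_matches(302, DEFAULT_EXPECTED))    # True
```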

Probe configuration

| Field | Applies to | Notes |
| --- | --- | --- |
| type | all | http, https, or tcp. |
| port | all | Target port. |
| path | http/https | Request path (default /). |
| host_header | http/https | Explicit Host header for the probe. |
| interval | all | Seconds between probes: one of 10, 20, 30, 60, 90, 120, 180, 300. |
| timeout | all | Probe timeout, 0.1 to 3.0 seconds. |
| enabled | all | false pauses probing without deleting the config. |
| warning_threshold | all | Consecutive failures before warning. |
| critical_threshold | all | Consecutive failures before critical. |
| passing_threshold | all | Consecutive successes required to fully recover (default 1, max 10). Values > 1 enable the anti-flapping recovery state. |
| expected_status_codes | http/https | Accepted status ranges, e.g. ["200-299", "301"]. Default ["200-399"]. |
| follow_redirects | http/https | Follow HTTP redirects (default true). |
| skip_ssl_verify | https | Skip TLS certificate verification (default false). |

Pausing a probe (enabled: false) keeps the configuration and the last known health state but stops issuing checks — useful for planned maintenance where you don't want probe noise. To stop checking and drop the config entirely, send probe: null on a record update.
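As an illustration of the constraints in the table, the sketch below checks a probe object before it is sent. It is a hedged example rather than the API's own validation: `validate_probe` is a hypothetical helper, the sample values (port 443, path /healthz, and so on) are made up, and the timeout range is read as 0.1 to 3.0 seconds. Fields that are absent are not checked, since this page does not say which ones are required.

```python
ALLOWED_INTERVALS = {10, 20, 30, 60, 90, 120, 180, 300}
PROBE_TYPES = {"http", "https", "tcp"}


def validate_probe(probe: dict) -> list[str]:
    """Return a list of problems found in a probe config (empty if none)."""
    problems = []
    if probe.get("type") not in PROBE_TYPES:
        problems.append("type must be http, https, or tcp")
    if "interval" in probe and probe["interval"] not in ALLOWED_INTERVALS:
        problems.append("interval must be one of the allowed values")
    # Assumption: the timeout range is 0.1 to 3.0 seconds inclusive.
    if "timeout" in probe and not 0.1 <= probe["timeout"] <= 3.0:
        problems.append("timeout must be between 0.1 and 3.0 seconds")
    if not 1 <= probe.get("passing_threshold", 1) <= 10:
        problems.append("passing_threshold must be between 1 and 10")
    return problems


# Illustrative probe object; the values are examples, not recommendations.
probe = {
    "type": "https",
    "port": 443,
    "path": "/healthz",
    "interval": 30,
    "timeout": 2.0,
    "enabled": True,
    "warning_threshold": 1,
    "critical_threshold": 3,
    "passing_threshold": 2,
    "expected_status_codes": ["200-299"],
}

print(validate_probe(probe))                     # []
print(validate_probe({**probe, "interval": 45}))
```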

Health states

Each IP is always in exactly one of four states:

| State | Meaning | In DNS answers? |
| --- | --- | --- |
| passing | Healthy: no consecutive failures. | ✅ Yes |
| warning | Degraded: failing but not yet evicted. Probed more aggressively. | ✅ Yes |
| critical | Unhealthy: evicted from DNS. Circuit breaker active. | ❌ No |
| recovery | Coming back: succeeding again but held until stable (only when passing_threshold > 1). | ❌ No |

The rule that matters for traffic: passing and warning IPs are served; critical and recovery IPs are excluded. warning is deliberately still served — it is an early-warning signal, not an eviction. Eviction happens only at critical.

At a glance: probe results walk each IP through the four states, and only two of them (passing and warning) are served in DNS.

State transitions

Failure path

passing ──(failures ≥ warning_threshold)──► warning ──(failures ≥ critical_threshold)──► critical

Consecutive probe failures walk an IP down: passing → warning once failures reach warning_threshold, then warning → critical once they reach critical_threshold.
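The failure path reduces to a comparison of the consecutive-failure count against the two thresholds. A minimal sketch, assuming the counter keeps running across the warning state (the function name `state_after_failures` is hypothetical):

```python
def state_after_failures(failures: int, warning_threshold: int,
                         critical_threshold: int) -> str:
    """Map a count of consecutive failures to a state on the failure path."""
    if failures >= critical_threshold:
        return "critical"
    if failures >= warning_threshold:
        return "warning"
    return "passing"


# With warning_threshold=1 and critical_threshold=3:
for n in range(4):
    print(n, state_after_failures(n, 1, 3))
# 0 passing
# 1 warning
# 2 warning
# 3 critical
```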

Recovery path

Recovery depends on passing_threshold:

Default (passing_threshold = 1) — direct recovery. A single successful probe returns a warning or critical IP straight to passing.

Anti-flapping mode (passing_threshold > 1) — recovery is gated through the recovery state:

critical/warning ──(first success)──► recovery ──(passing_threshold consecutive successes)──► passing
   ▲                                     │
   └────────────(any failure)────────────┘

A recovering IP must string together passing_threshold consecutive successes before it is trusted back into DNS. A single failure during recovery drops it back to critical and resets the recovery counter. This prevents an unstable endpoint from flapping in and out of rotation.

Example with warning_threshold=1, critical_threshold=3, passing_threshold=2:

critical + success → recovery (1/2 successes) ← still excluded from DNS
recovery + success → passing (2/2 successes) ← back in DNS
recovery + failure → critical (reset)
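Both paths can be combined into one small state machine. The sketch below is an interpretation of the rules on this page, not the health checker's implementation: the class `HealthTracker` and its field names are hypothetical, and it assumes the first success after a failure counts as success 1 of passing_threshold, as in the example above.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    warning_threshold: int
    critical_threshold: int
    passing_threshold: int = 1
    state: str = "passing"
    failures: int = 0   # consecutive failures
    successes: int = 0  # consecutive successes while recovering

    def on_result(self, ok: bool) -> str:
        """Apply one probe result and return the new state."""
        if ok:
            self._on_success()
        else:
            self._on_failure()
        return self.state

    def _on_success(self) -> None:
        self.failures = 0
        if self.state == "passing":
            return
        if self.passing_threshold <= 1:
            self.state = "passing"  # direct recovery
            return
        # Anti-flapping: count successes and hold the IP in recovery.
        if self.state == "recovery":
            self.successes += 1
        else:
            self.successes = 1
        if self.successes >= self.passing_threshold:
            self.state, self.successes = "passing", 0
        else:
            self.state = "recovery"

    def _on_failure(self) -> None:
        self.successes = 0
        self.failures += 1
        if self.state in ("critical", "recovery"):
            self.state = "critical"  # any failure while evicted keeps it evicted
        elif self.failures >= self.critical_threshold:
            self.state = "critical"
        elif self.failures >= self.warning_threshold:
            self.state = "warning"


t = HealthTracker(warning_threshold=1, critical_threshold=3, passing_threshold=2)
for ok in [False, False, False, True, False, True, True]:
    print(ok, "->", t.on_result(ok))
```

The run walks the IP to critical, shows a failed recovery attempt falling back to critical, and ends with two consecutive successes returning it to passing.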

Manual overrides

An operator can force a state via PUT /api/v3/gslb/:id/ips/:ip (see Records & IPs). The override sets a manual_reset_at timestamp and resets the failure counter, so the next probe establishes a fresh baseline. Overrides are not sticky: the health checker keeps probing and will move the IP again on real results. Forcing passing on a genuinely down endpoint simply flips back to critical on the next failed probe.

Probe scheduling and the circuit breaker

State also governs how often an IP is probed:

| State | Next probe |
| --- | --- |
| passing | Normal interval. |
| warning | interval / 2, for increased monitoring and faster failover detection. |
| recovery | interval / 2, for faster verification while recovering. |
| critical | Graduated backoff (circuit breaker). |

For critical IPs, a circuit breaker applies a graduated backoff so Elchi stops hammering a persistently dead endpoint while still checking often enough to detect recovery quickly. The delay is the probe interval multiplied by a growing factor (1, 2, 3, 5, 8, 12 in the sequences below) and is capped at 300 seconds (5 minutes):

10s interval: 10s → 20s → 30s → 50s → 80s → 120s (last step, below the cap)
30s interval: 30s → 60s → 90s → 150s → 240s → 300s (cap)
60s interval: 60s → 120s → 180s → 300s (cap)

A manual health-state change clears the backoff so the IP is re-probed immediately (within roughly one second).
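The scheduling rules can be sketched as a single function. The multipliers (1, 2, 3, 5, 8, 12) are inferred from the three sequences above rather than taken from a documented constant, and `next_probe_delay` and its `critical_probes` parameter are hypothetical names, so treat this as an approximation of the behaviour.

```python
BACKOFF_MULTIPLIERS = (1, 2, 3, 5, 8, 12)  # inferred from the documented sequences
BACKOFF_CAP_SECONDS = 300


def next_probe_delay(state: str, interval: int, critical_probes: int = 0) -> float:
    """Seconds until the next probe.

    critical_probes is the number of probes already scheduled while the IP
    has been critical (0 for the first one); it is ignored in other states.
    """
    if state in ("warning", "recovery"):
        return interval / 2
    if state == "critical":
        step = min(critical_probes, len(BACKOFF_MULTIPLIERS) - 1)
        return min(interval * BACKOFF_MULTIPLIERS[step], BACKOFF_CAP_SECONDS)
    return interval  # passing


print([next_probe_delay("critical", 10, n) for n in range(6)])
# [10, 20, 30, 50, 80, 120]
print([next_probe_delay("critical", 30, n) for n in range(6)])
# [30, 60, 90, 150, 240, 300]
print(next_probe_delay("warning", 30))
# 15.0
```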

How state reaches DNS

When elchi-coredns polls the snapshot API, the Controller builds each record's answer from its IP health documents, excluding any IP in critical or recovery. The result:

  • Some IPs healthy → the record's A answer contains the passing/warning IPs.
  • All IPs critical/recovery (or the record is disabled) → the answer contains no IPs, and the snapshot returns the record's failover FQDN so CoreDNS can steer to the backup zone.
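The two cases above can be sketched as a filter over per-IP states. This is illustrative only: the function `build_answer`, the returned dictionary shape, and the example FQDN are assumptions, not the snapshot API's actual schema.

```python
SERVED_STATES = {"passing", "warning"}


def build_answer(ip_states: dict[str, str], failover_fqdn: str,
                 enabled: bool = True) -> dict:
    """Build a record's answer from a mapping of IP -> health state."""
    served = []
    if enabled:
        served = sorted(ip for ip, state in ip_states.items()
                        if state in SERVED_STATES)
    if served:
        return {"ips": served, "failover": None}
    # No servable IPs (or record disabled): hand back the failover FQDN.
    return {"ips": [], "failover": failover_fqdn}


states = {"192.0.2.10": "passing", "192.0.2.11": "critical", "192.0.2.12": "warning"}
print(build_answer(states, "backup.example.net"))
# {'ips': ['192.0.2.10', '192.0.2.12'], 'failover': None}

all_down = {"192.0.2.10": "critical", "192.0.2.11": "recovery"}
print(build_answer(all_down, "backup.example.net"))
# {'ips': [], 'failover': 'backup.example.net'}
```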

Because DNS answers are cached by resolvers and clients for up to the record's TTL, choose a TTL that balances failover speed against query volume. Shorter TTLs propagate evictions faster but increase resolver traffic. Combined with warning-state monitoring at half interval and the circuit breaker, this gives fast, resource-proportional failover.

Each probe result is appended to the IP's status_history (state, timestamp, response code, response time, and an error message on failure), which you can review per IP or clear — see Records & IPs. Aggregate probe health is on the Statistics page.