Skip to main content

GSLB Statistics

The GSLB Statistics page is the observability surface for the health checker. It surfaces the probe and health-check metrics that each Controller pushes — how many IPs are healthy, whether probes are succeeding, how fast they respond, and how the internal probing machinery is holding up. Use it to confirm GSLB is doing its job and to catch trouble before it reaches DNS.

Scope: controller and time range

Two selectors at the top scope every chart:

  • ControllerAll Controllers, or a specific controller. Because the health checker shards its work across controllers (each owns a subset of shards), per-controller scoping lets you spot an imbalanced or unhealthy node. Aggregate with All.
  • Time rangeLast 15 minutes, Last 1 hour (default), Last 6 hours, or Last 24 hours. Timelines and rates are computed over this window; a Refresh button re-pulls on demand.

IP health at a glance

Four headline cards summarize the IP population:

CardMeaning
Total IPsAll IPs under health checking.
Healthy IPsIPs in passing (served in DNS), shown against the total.
Critical IPsIPs evicted from DNS. A non-zero, rising count is your primary alert signal.
Backoff ActiveIPs currently in circuit-breaker backoff (repeatedly-failing critical IPs being probed on graduated backoff).

The IP Health Distribution donut breaks the population into Healthy / Warning / Critical, and a footnote surfaces the Warning IPs count — the early-warning tier that is still served but failing checks.

Probe success and errors

  • Probe Success Rate — a gauge (red < 30%, amber 30–70%, green ≥ 70%) with raw success and failure counts beneath it. This is the single best "is GSLB healthy?" indicator.
  • Success Rate Timeline — success percentage over the selected window; a dip localizes when things went wrong.
  • Error Breakdown — a donut of failure causes (e.g. timeout, connection refused, DNS failure). The health checker categorizes 21+ error types, so this tells you why probes fail — a wave of connection_refused points at dead backends, timeout at network or overloaded targets, dns_failure at resolution problems.

Latency

  • Probe Latency Timeline — average probe latency over the window, with Min / Avg / Max tags. Rising latency often precedes failures — endpoints slow down before they fall over — so treat a climbing average as an early warning even while the success rate still looks fine.

Health-checker internals

Three panels expose the probing machinery itself. These are capacity/saturation signals — watch them when probe throughput is high or you're scaling the IP population.

Worker Pool — the shared, CPU-aware probe worker pool:

MetricWatch for
Active WorkersCurrent pool size (auto-scales with load).
Queue DepthPending probe tasks — sustained non-zero means workers can't keep up.
Result QueuePending probe results awaiting processing.
Result Queue CapacityPercent full; the bar turns red above 80% (backpressure risk).

Timewheel Scheduler — the per-IP probe scheduler:

MetricMeaning
Current LoadTasks currently scheduled in the wheel.
Current SlotThe slot the wheel is executing.
Scheduled TotalCumulative tasks scheduled.
Executed TotalCumulative probes fired — should climb steadily; a flat line means no probing is happening.

Write Buffer — batched persistence of health-state changes to MongoDB:

MetricWatch for
Buffer SizePending buffered updates.
Flush TotalCumulative flushes to the database.
Flush ErrorsNon-zero (red) indicates database write problems.
Avg Flush DurationPer-flush latency; rising values point at database pressure.
Buffer CapacityPercent full; red above 80%.

A footer strip also reports Owned Shards (how many partitions this controller is responsible for) and Write Buffer Updates Total.

What to monitor

For day-to-day operation, the signals that matter most:

  1. Critical IPs and the Success Rate gauge — the health outcome. A jump in critical count or a drop below your success-rate floor means real endpoints are failing.
  2. Error Breakdown — the cause, so you know where to look (backend, network, or DNS).
  3. Latency Timeline — the leading indicator, catching degradation before eviction.
  4. Owned Shards balance across controllers — an uneven distribution, or a controller with zero executed probes, signals a sharding or HA problem (see Registry & HA).
  5. Flush Errors and queue capacity bars — infrastructure saturation that can stall state updates from reaching DNS.

For the per-IP detail behind these aggregates — individual health states, backoff, and probe error messages — drill into a record's IPs in Records & IPs.