Metrics, Audit & Health

Shield is built to be operated: every decision, failure, and resource bound is observable. This page catalogs the Prometheus metrics, the audit pipeline and its ClickHouse schema, and the loopback health endpoints. The Overview dashboard in the UI is built on exactly these series.

Prometheus metrics

All metrics live under the elchi_shield_ namespace and carry a constant instance label (--instance-id, default <hostname>-shield) so a fleet of sidecars never mixes series. Per-request series additionally carry a listener label — the first ext_proc request_attribute Envoy sends (by convention the node id, listener::project::ip), falling back to --listener-id. Histograms use exponential buckets sized for sub-millisecond work.
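
As a rough illustration of what exponential buckets sized for sub-millisecond work can look like, the sketch below generates upper bounds that grow by a constant factor. The start value, factor, and count are assumptions chosen for illustration; the bucket layout Shield actually uses is not stated on this page.

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds, each `factor` times the previous one."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    bounds = []
    bound = start
    for _ in range(count):
        bounds.append(bound)
        bound *= factor
    return bounds


# Hypothetical layout: 50 microseconds, doubling 12 times (values in seconds).
BUCKETS = exponential_buckets(start=0.00005, factor=2.0, count=12)
```

With a layout like this, most of the resolution sits below one millisecond while the top bucket still covers slow outliers.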

Request tallies

Metric | Labels | Meaning
requests_total | listener | Requests processed (counted once per request, on the request direction)
requests_allowed_total | listener | Requests allowed
requests_blocked_total | listener | Requests blocked (a block on either direction counts)
detections_total | listener | Detect-mode would-block detections (request still allowed)
shadow_detections_total | listener | Shadow-mode would-block detections

Findings

Metric | Labels | Meaning
findings_total | listener, engine, action | Findings by the engine that produced them and the action taken (block / detect / shadow). Structural body checks carry their own engine labels: dlp, body_size (truncation guard), body_decode (undecodable encoding); the cross-engine scorer reports as anomaly

Body handling

Metric | Labels | Meaning
body_inspected_bytes_total | listener | Body bytes decoded and inspected
body_mutations_total | listener | Bodies rewritten by DLP redaction
body_budget_rejections_total | listener, reason | Bodies truncated/blocked at intake by a memory bound: per_request_cap or inflight_budget
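
The two rejection reasons suggest a two-level memory bound: one per request and one shared across all streams. The sketch below shows one way such an admission check could work. The class, its method names, and the exact comparison rules are hypothetical; only the two reason labels come from the table above.

```python
import threading


class BodyBudget:
    """Admit body bytes against a per-request cap and a shared in-flight budget."""

    def __init__(self, per_request_cap: int, inflight_budget: int) -> None:
        self.per_request_cap = per_request_cap
        self.inflight_budget = inflight_budget
        self.inflight = 0  # analogous to the inflight_body_bytes gauge
        self._lock = threading.Lock()

    def reserve(self, already_buffered: int, chunk: int):
        """Return None if the chunk is admitted, else the rejection reason label."""
        if already_buffered + chunk > self.per_request_cap:
            return "per_request_cap"
        with self._lock:
            if self.inflight + chunk > self.inflight_budget:
                return "inflight_budget"
            self.inflight += chunk
        return None

    def release(self, nbytes: int) -> None:
        """Give bytes back when a stream finishes or its buffer is freed."""
        with self._lock:
            self.inflight = max(0, self.inflight - nbytes)
```

Checking the per-request cap first keeps a single oversized body from consuming shared budget that other streams could use.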

Latency and pipeline

Metric | Labels | Meaning
processing_latency_seconds | listener, phase | Histogram of per-phase processing latency (request_headers, request_body, response_headers, response_body)
stage_latency_seconds | stage | Histogram of per-pipeline-stage latency
stage_actions_total | stage, action | Stage results by action (shows which check does the work)

Reliability

Metric | Labels | Meaning
extproc_errors_total | kind | ext_proc stream errors by kind (recovered panic, transport drop, pipeline-build failure, …)
timeouts_total | - | Per-request processing timeouts
fail_open_total | - | Fail-open posture applications (inspection failed, request allowed)
fail_close_total | - | Fail-close posture applications (inspection failed, request denied)

Config

Metric | Labels | Meaning
active_config_version | version | Active config version (value is always 1; join on the label)
config_reload_success_total / config_reload_failure_total | - | Reload outcomes
config_last_reload_success_timestamp_seconds | - | Unix time of the last successful reload
config_reload_failures_consecutive | - | Consecutive failed reloads (0 after a success); the "edge is stuck on last-good" signal
config_age_seconds | - | Seconds since the active config was built

Audit pipeline

Metric | Labels | Meaning
audit_enabled | - | 1 if an audit sink is active, 0 if audit is off, including a configured sink that failed to init and silently degraded. Alert on 0 where a sink is expected
audit_events_dropped_total | - | Events dropped by the bounded queue or rate cap (a forensic gap)
audit_export_errors_total | - | Events the sink failed to write (e.g. ClickHouse unreachable)
audit_queue_depth | - | Current depth of the async audit queue

Live gauges and build info

Metric | Labels | Meaning
streams_in_flight | - | ext_proc streams currently being served
inflight_body_bytes | - | Body bytes currently buffered across all streams
build_info | version, revision, goversion, build_time | Build metadata (value 1); join to confirm rollouts

The registry also exports the standard go_* collectors (goroutines, heap/GC detail, scheduler latency) and process_* collectors (CPU, open FDs, RSS), all carrying the instance label. go_goroutines is the canonical leak signal.

Metric delivery: scrape or push

/metrics on the loopback HTTP server (--http-addr, default 127.0.0.1:9001) is always scrapeable. Setting --metrics-otlp-endpoint host:port additionally pushes the same registry to an OTel Collector over OTLP/gRPC on a fixed interval (--metrics-otlp-interval, default 15s) — the collector forwards them on (e.g. to VictoriaMetrics), matching Envoy's stats-sink pipeline. Push init is non-fatal: a down collector never stops Shield, and the scrape endpoint keeps working. The OTLP resource carries service.name=elchi-shield and service.instance.id=<instance>.
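
For ad-hoc checks on the box itself, the scrape endpoint can be read with a few lines of Python. This is a minimal sketch rather than a full parser for the Prometheus text format: it assumes samples carry no trailing timestamp, and the function names are illustrative.

```python
import urllib.request


def parse_samples(text: str) -> dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comment lines.

    Assumes no trailing timestamp, so the value is the last token on the line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(url: str = "http://127.0.0.1:9001/metrics") -> dict[str, float]:
    """Fetch and parse the loopback scrape endpoint (default --http-addr)."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling scrape() on a host running Shield returns a dictionary keyed by the full series string, which is enough for a quick look at a counter without a Prometheus server.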

Audit sinks

Audit events are emitted asynchronously (bounded queue, drop-on-full — never blocking the request path) to one of two sinks:

  • ClickHouse: the default whenever --audit-clickhouse-dsn is set. Batched inserts (default 500 rows, 1s time-based flush) into the central ClickHouse.
  • OTLP: --audit-otel-endpoint sends events to an OTel Collector instead.

There is no local-file sink.

When neither sink is configured, audit is simply off: events are skipped, never written to a local file. A misconfigured or unreachable sink degrades to no-audit (non-fatal — traffic is unaffected); watch audit_enabled to catch a sidecar that booted without the audit you expected.
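
The bounded, drop-on-full emission described above can be sketched as a queue whose producer never waits. The class and method names are hypothetical; the point is only that a full queue costs a counter increment rather than request latency.

```python
import queue


class AuditQueue:
    """Bounded queue that drops instead of blocking when full."""

    def __init__(self, capacity: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # analogous to audit_events_dropped_total

    def emit(self, event: dict) -> bool:
        """Enqueue without blocking; count and discard the event if full."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def depth(self) -> int:
        """Current backlog, analogous to audit_queue_depth."""
        return self._q.qsize()

    def drain(self, max_rows: int) -> list[dict]:
        """Pull up to max_rows events for one batched write to the sink."""
        batch = []
        while len(batch) < max_rows:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

A background writer would call drain with the batch size and flush on a timer, which is where the row-count and time-based flush settings of a sink come in.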

Volume is bounded at the source: findings (block/detect/shadow) are always audited, while the allow stream is sampled per policy (sampling_rate, default 0.05) with an optional global cap (--audit-max-per-sec).
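
A minimal sketch of that volume policy follows. It assumes the global cap applies only to the sampled allow stream and uses a fixed one-second window; both are assumptions made for illustration, as are the class and parameter names.

```python
import random
import time


class AuditSampler:
    """Always keep findings; sample allows; optionally cap allows per second."""

    FINDING_ACTIONS = {"block", "detect", "shadow"}

    def __init__(self, sampling_rate=0.05, max_per_sec=None, rng=None,
                 clock=time.monotonic):
        self.sampling_rate = sampling_rate
        self.max_per_sec = max_per_sec  # None means no global cap
        self._rng = rng if rng is not None else random.Random()
        self._clock = clock
        self._window = None
        self._count = 0

    def should_audit(self, action: str) -> bool:
        """Decide whether an event with this action is emitted."""
        if action in self.FINDING_ACTIONS:
            return True  # findings are never sampled away
        if self._rng.random() >= self.sampling_rate:
            return False
        return self._under_cap()

    def _under_cap(self) -> bool:
        """Fixed-window limiter: at most max_per_sec events per whole second."""
        if self.max_per_sec is None:
            return True
        window = int(self._clock())
        if window != self._window:
            self._window, self._count = window, 0
        if self._count >= self.max_per_sec:
            return False
        self._count += 1
        return True
```

At the default rate of 0.05, roughly one allowed request in twenty produces an audit row, while every block, detect, and shadow finding is kept.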

The ClickHouse table

Shield provisions elchi_shield_audit (name overridable with --audit-clickhouse-table) best-effort at startup — a pre-provisioned table with an INSERT-only user also works:

CREATE TABLE IF NOT EXISTS elchi_shield_audit (
    ts DateTime64(3) CODEC(DoubleDelta, ZSTD),
    instance LowCardinality(String),
    node_id LowCardinality(String),
    project_id LowCardinality(String),
    listener LowCardinality(String),
    request_id String CODEC(ZSTD),
    phase LowCardinality(String),
    direction LowCardinality(String),
    action LowCardinality(String),
    severity LowCardinality(String),
    reason String CODEC(ZSTD),
    rule_id LowCardinality(String),
    policy_id LowCardinality(String),
    engine LowCardinality(String),
    host LowCardinality(String),
    path String CODEC(ZSTD),
    method LowCardinality(String),
    status_code UInt16 CODEC(ZSTD),
    config_version LowCardinality(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (project_id, ts)
TTL toDateTime(ts) + INTERVAL 7 DAY
Key properties: dictionary-encoded (LowCardinality) and ZSTD-compressed columns keep it small; daily partitions with a row TTL (default 7 days, --audit-clickhouse-ttl-days) bound it in time — old partitions are dropped whole. Shield also maintains a per-minute rollup (elchi_shield_audit_1m + a materialized view) that the backend's event summaries can aggregate instead of scanning raw rows. And by design, the table stores no header or body values — the path is query-stripped and reason strings carry no request content (see Security Events for the full redaction model).
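
To illustrate what "query-stripped" means for the path column, the sketch below keeps everything before the first question mark. The function name is hypothetical and this is not Shield's own code; it only shows the effect on a stored value.

```python
def audit_path(raw_path: str) -> str:
    """Return the path without its query string.

    Everything from the first '?' onward is discarded, so query parameters
    (which may hold tokens or personal data) never reach the audit row.
    """
    return raw_path.partition("?")[0]
```

For example, a request for /login?token=abc would be recorded with the path /login.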

Health endpoints

All on the loopback HTTP server (never exposed off-box):

Endpoint | Purpose
/healthz | Liveness: the process is up
/readyz | Readiness: has a non-empty, valid config to enforce. A sidecar with no policy is deliberately not ready
/configz | The active config: version (content hash), hash, source files, domain count, built-at/age, instance and build info, plus the last reload error and cumulative rejected-reload count. This is what elchi-client polls to confirm a push (see Deploying Policies to Edges)
/policyz?host=&path=&method=&content_type= | Decision explainability: resolves a request shape to its policy and reports the structure: policy id, mode, fail posture, timeout, body-inspection flags, engine names, and the exact stage order per pipeline. Structure only; never rules, secrets, or payloads
/debug/pprof/* | Go profiling (on by default, --pprof)
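
Confirming a push by polling /configz can be sketched as a small retry loop. The response format is not specified on this page, so the sketch assumes a JSON object with a "version" key; that key name, the function names, and the retry parameters are all hypothetical.

```python
import json
import time
import urllib.request


def fetch_configz(url: str = "http://127.0.0.1:9001/configz") -> dict:
    """GET /configz and decode it, assuming the endpoint returns a JSON object."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)


def wait_for_version(expected: str, fetch=fetch_configz, attempts: int = 10,
                     delay: float = 1.0, sleep=time.sleep) -> bool:
    """Poll until the active version equals `expected` or attempts run out.

    The "version" key is a hypothetical field name used for illustration.
    """
    for attempt in range(attempts):
        try:
            if fetch().get("version") == expected:
                return True
        except (OSError, ValueError):
            pass  # endpoint unreachable or body not valid JSON; retry
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Passing fetch and sleep as parameters keeps the loop testable without a running sidecar.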

Sample PromQL

# Blocked share of traffic per edge (5m):
sum by (instance) (rate(elchi_shield_requests_blocked_total[5m]))
/ sum by (instance) (rate(elchi_shield_requests_total[5m]))

# p99 processing latency per phase:
histogram_quantile(0.99,
sum by (le, phase) (rate(elchi_shield_processing_latency_seconds_bucket[5m])))

# Which engine is blocking (top 5):
topk(5, sum by (engine) (rate(elchi_shield_findings_total{action="block"}[5m])))

# ALERT: an edge is rejecting config pushes (stuck on last-good):
elchi_shield_config_reload_failures_consecutive > 0

# ALERT: audit evidence is being lost:
rate(elchi_shield_audit_events_dropped_total[5m]) > 0
or rate(elchi_shield_audit_export_errors_total[5m]) > 0

# ALERT: inspection failures are closing traffic:
rate(elchi_shield_fail_close_total[5m]) > 0
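
To make the blocked-share query concrete, the sketch below computes the same ratio from two raw counter readings taken some seconds apart. It is a simplification: PromQL's rate() also extrapolates to the window edges, which is omitted here, and the dictionary keys and function names are illustrative.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase between two counter samples.

    A drop in value is treated as a counter reset (process restart), so the
    increase is taken to be the current value.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    increase = curr - prev if curr >= prev else curr
    return increase / seconds


def blocked_share(prev: dict, curr: dict, seconds: float) -> float:
    """Blocked requests as a fraction of all requests over the interval."""
    total = counter_rate(prev["requests_total"], curr["requests_total"], seconds)
    blocked = counter_rate(prev["requests_blocked_total"],
                           curr["requests_blocked_total"], seconds)
    return blocked / total if total > 0 else 0.0
```

Dividing rates rather than raw counter values is what makes the ratio reflect recent traffic instead of the lifetime of the process.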