Metrics, Audit & Health

Shield is built to be operated: every decision, failure, and resource bound is observable. This page catalogs the Prometheus metrics, the audit pipeline and its ClickHouse schema, and the loopback health endpoints. The Overview dashboard in the UI is built on exactly these series.

Prometheus metrics

All metrics live under the elchi_shield_ namespace; the tables below omit that prefix, so requests_total is exported as elchi_shield_requests_total. Every series carries a constant instance label (--instance-id, default <hostname>-shield) so a fleet of sidecars never mixes series. Per-request series additionally carry a listener label: the first ext_proc request_attribute Envoy sends (by convention the node id, listener::project::ip), falling back to --listener-id. Histograms use exponential buckets sized for sub-millisecond work.
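
The listener label rule above can be sketched roughly as follows. This is an illustration only: the function names and the list-of-strings input are hypothetical stand-ins, not Shield's actual code, and the sample node id is made up.

```python
from typing import NamedTuple, Optional, Sequence


class NodeID(NamedTuple):
    listener: str
    project: str
    ip: str


def resolve_listener_label(request_attributes: Sequence[str], listener_id: str) -> str:
    """Use the first request attribute if present, else the --listener-id fallback."""
    first = request_attributes[0] if request_attributes else ""
    return first or listener_id


def split_node_id(label: str) -> Optional[NodeID]:
    """Split a conventional listener::project::ip label; None if it does not fit."""
    # maxsplit=2 keeps any further "::" (e.g. in an IPv6 address) inside the ip part.
    parts = label.split("::", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return NodeID(*parts)
```

A label that does not follow the convention (for example a plain --listener-id value) simply yields None from split_node_id, so dashboards can still group by the raw label.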

Request tallies

Metric	Labels	Meaning
`requests_total`	`listener`	Requests processed (counted once per request, on the request direction)
`requests_allowed_total`	`listener`	Requests allowed
`requests_blocked_total`	`listener`	Requests blocked (a block on either direction counts)
`detections_total`	`listener`	Detect-mode would-block detections (request still allowed)
`shadow_detections_total`	`listener`	Shadow-mode would-block detections

Findings

Metric	Labels	Meaning
`findings_total`	`listener`, `engine`, `action`	Findings by the engine that produced them and the action taken (`block` / `detect` / `shadow`). Structural body checks carry their own engine labels: `dlp`, `body_size` (truncation guard), `body_decode` (undecodable encoding); the cross-engine scorer reports as `anomaly`

Body handling

Metric	Labels	Meaning
`body_inspected_bytes_total`	`listener`	Body bytes decoded and inspected
`body_mutations_total`	`listener`	Bodies rewritten by DLP redaction
`body_budget_rejections_total`	`listener`, `reason`	Bodies truncated/blocked at intake by a memory bound: `per_request_cap` or `inflight_budget`
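
To make the two rejection reasons concrete, here is a minimal sketch of a byte budget with a per-request cap and a shared in-flight budget. The class, its method names, and the order of the two checks are assumptions made for illustration; the page does not describe how Shield implements these bounds.

```python
from typing import Optional


class BodyBudget:
    """Hypothetical intake guard: a per-request cap plus a shared in-flight budget."""

    def __init__(self, per_request_cap: int, inflight_budget: int) -> None:
        self.per_request_cap = per_request_cap
        self.inflight_budget = inflight_budget
        self.inflight = 0  # bytes currently buffered across all streams

    def admit(self, already_buffered: int, chunk_len: int) -> Optional[str]:
        """Return None if the chunk may be buffered, else the rejection reason."""
        if already_buffered + chunk_len > self.per_request_cap:
            return "per_request_cap"
        if self.inflight + chunk_len > self.inflight_budget:
            return "inflight_budget"
        self.inflight += chunk_len
        return None

    def release(self, nbytes: int) -> None:
        """Give bytes back once a stream finishes or its body is dropped."""
        self.inflight = max(0, self.inflight - nbytes)
```

The sketch is single-threaded; a real implementation serving concurrent streams would need atomic or locked accounting. The returned strings mirror the reason label values listed in the table.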

Latency and pipeline

Metric	Labels	Meaning
`processing_latency_seconds`	`listener`, `phase`	Histogram of per-phase processing latency (`request_headers`, `request_body`, `response_headers`, `response_body`)
`stage_latency_seconds`	`stage`	Histogram of per-pipeline-stage latency
`stage_actions_total`	`stage`, `action`	Stage results by action — which check does the work

Reliability

Metric	Labels	Meaning
`extproc_errors_total`	`kind`	ext_proc stream errors by kind (recovered panic, transport drop, pipeline-build failure, …)
`timeouts_total`	—	Per-request processing timeouts
`fail_open_total`	—	Fail-open posture applications (inspection failed, request allowed)
`fail_close_total`	—	Fail-close posture applications (inspection failed, request denied)

Config

Metric	Labels	Meaning
`active_config_version`	`version`	Active config version (value is always 1; join on the label)
`config_reload_success_total` / `config_reload_failure_total`	—	Reload outcomes
`config_last_reload_success_timestamp_seconds`	—	Unix time of the last successful reload
`config_reload_failures_consecutive`	—	Consecutive failed reloads (0 after a success) — the "edge is stuck on last-good" signal
`config_age_seconds`	—	Seconds since the active config was built

Audit pipeline

Metric	Labels	Meaning
`audit_enabled`	—	1 if an audit sink is active, 0 if audit is off — including a configured sink that failed to init and silently degraded. Alert on 0 where a sink is expected
`audit_events_dropped_total`	—	Events dropped by the bounded queue or rate cap (a forensic gap)
`audit_export_errors_total`	—	Events the sink failed to write (e.g. ClickHouse unreachable)
`audit_queue_depth`	—	Current depth of the async audit queue

Live gauges and build info

Metric	Labels	Meaning
`streams_in_flight`	—	ext_proc streams currently being served
`inflight_body_bytes`	—	Body bytes currently buffered across all streams
`build_info`	`version`, `revision`, `goversion`, `build_time`	Build metadata (value 1) — join to confirm rollouts

The registry also exports the standard go_* collectors (goroutines, heap/GC detail, scheduler latency) and process_* collectors (CPU, open FDs, RSS), all carrying the instance label. A steadily climbing go_goroutines is the canonical goroutine-leak signal.

Metric delivery: scrape or push

/metrics on the loopback HTTP server (--http-addr, default 127.0.0.1:9001) is always scrapeable. Setting --metrics-otlp-endpoint host:port additionally pushes the same registry to an OTel Collector over OTLP/gRPC on a fixed interval (--metrics-otlp-interval, default 15s); the collector then forwards the metrics onward (e.g. to VictoriaMetrics), matching Envoy's stats-sink pipeline. Push init is non-fatal: a down collector never stops Shield, and the scrape endpoint keeps working. The OTLP resource carries service.name=elchi-shield and service.instance.id=<instance>.
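
For ad-hoc checks on a box, the scrape output can be read without a Prometheus server. The sketch below parses Prometheus text exposition lines and derives two of the warnings this page recommends. The helper names are hypothetical, and the parser is deliberately simplified: it sums each metric across its label sets and does not handle label values that contain a closing brace.

```python
import re
from typing import Dict, List

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text: str) -> Dict[str, float]:
    """Sum sample values per metric name, skipping comments and blank lines."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def health_warnings(samples: Dict[str, float], audit_expected: bool = True) -> List[str]:
    """Turn a few gauges from this page into human-readable warnings."""
    warnings = []
    if samples.get("elchi_shield_config_reload_failures_consecutive", 0) > 0:
        warnings.append("config reloads are failing; edge is on last-good config")
    if audit_expected and samples.get("elchi_shield_audit_enabled", 0) == 0:
        warnings.append("audit is off where a sink is expected")
    return warnings
```

Feed parse_samples the response body of http://127.0.0.1:9001/metrics (the default --http-addr) fetched with any HTTP client, then pass the result to health_warnings.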

Audit sinks

Audit events are emitted asynchronously (bounded queue, drop-on-full — never blocking the request path) to one of two sinks:

ClickHouse — the default whenever --audit-clickhouse-dsn is set. Batched inserts (default 500 rows, 1s time-based flush) into the central ClickHouse.
OTLP — --audit-otel-endpoint sends events to an OTel Collector instead.
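
The bounded-queue, drop-on-full behaviour can be illustrated with a small sketch. The class and method names are hypothetical and the queue size is arbitrary; only the pattern (never block the producer, count what is dropped, drain in batches) comes from the description above.

```python
import queue
from typing import Any, List


class AuditQueue:
    """Hypothetical bounded audit queue that drops instead of blocking."""

    def __init__(self, maxsize: int) -> None:
        self._q: "queue.Queue[Any]" = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # would feed a counter like audit_events_dropped_total

    def emit(self, event: Any) -> bool:
        """Enqueue without waiting; on a full queue, count the drop and move on."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int) -> List[Any]:
        """Take up to max_batch events for one batched insert."""
        batch: List[Any] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch

    def depth(self) -> int:
        """Current queue depth, comparable to the audit_queue_depth gauge."""
        return self._q.qsize()
```

A background worker would call drain(500) at least once per second to mirror the batch defaults quoted above; the request path only ever calls emit, which returns immediately either way.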

There is no local-file sink

When neither sink is configured, audit is simply off: events are skipped, never written to a local file. A misconfigured or unreachable sink degrades to no-audit (non-fatal — traffic is unaffected); watch audit_enabled to catch a sidecar that booted without the audit you expected.

Volume is bounded at the source: findings (block/detect/shadow) are always audited, while the allow stream is sampled per policy (sampling_rate, default 0.05) with an optional global cap (--audit-max-per-sec).
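
A minimal sketch of that sampling rule, assuming the decision is an independent random draw per allowed request (the page does not say how Shield samples). The function name is hypothetical, and the optional global cap (--audit-max-per-sec) is not modelled here.

```python
import random

# Actions that represent findings; per the text above these are always audited.
FINDING_ACTIONS = {"block", "detect", "shadow"}


def should_audit(action: str, sampling_rate: float = 0.05, rng=random) -> bool:
    """Always keep findings; keep allowed requests with probability sampling_rate."""
    if action in FINDING_ACTIONS:
        return True
    # rng.random() is in [0.0, 1.0), so a rate of 0 keeps nothing and 1 keeps everything.
    return rng.random() < sampling_rate
```

Passing a seeded random.Random instance as rng makes the decision reproducible in tests.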

The ClickHouse table

Shield provisions elchi_shield_audit (name overridable with --audit-clickhouse-table) best-effort at startup — a pre-provisioned table with an INSERT-only user also works:

CREATE TABLE IF NOT EXISTS elchi_shield_audit (
    ts             DateTime64(3) CODEC(DoubleDelta, ZSTD),
    instance       LowCardinality(String),
    node_id        LowCardinality(String),
    project_id     LowCardinality(String),
    listener       LowCardinality(String),
    request_id     String CODEC(ZSTD),
    phase          LowCardinality(String),
    direction      LowCardinality(String),
    action         LowCardinality(String),
    severity       LowCardinality(String),
    reason         String CODEC(ZSTD),
    rule_id        LowCardinality(String),
    policy_id      LowCardinality(String),
    engine         LowCardinality(String),
    host           LowCardinality(String),
    path           String CODEC(ZSTD),
    method         LowCardinality(String),
    status_code    UInt16 CODEC(ZSTD),
    config_version LowCardinality(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (project_id, ts)
TTL toDateTime(ts) + INTERVAL 7 DAY

Key properties: dictionary-encoded (LowCardinality) and ZSTD-compressed columns keep the table small, and daily partitions with a row TTL (default 7 days, --audit-clickhouse-ttl-days) bound it in time, so old partitions are dropped whole. Shield also maintains a per-minute rollup (elchi_shield_audit_1m plus a materialized view) that the backend's event summaries can aggregate instead of scanning raw rows. By design, the table stores no header or body values: the path is query-stripped and reason strings carry no request content (see Security Events for the full redaction model).
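
As an illustration of what "query-stripped" means for the path column, the sketch below drops everything from the first ? or # onward. How Shield itself strips the query is not described here, so treat the function (a hypothetical name) as a stand-in.

```python
def strip_query(raw_path: str) -> str:
    """Keep only the path component; discard the query string and any fragment."""
    return raw_path.split("?", 1)[0].split("#", 1)[0]
```

With this rule, a request for /login?token=abc would be stored as /login, so credentials or identifiers passed as query parameters never reach the audit table.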

Health endpoints

All on the loopback HTTP server (never exposed off-box):

Endpoint	Purpose
`/healthz`	Liveness — the process is up
`/readyz`	Readiness — has a non-empty, valid config to enforce. A sidecar with no policy is deliberately not ready
`/configz`	The active config: version (content hash), hash, source files, domain count, built-at/age, instance and build info, plus the last reload error and cumulative rejected-reload count — this is what `elchi-client` polls to confirm a push (see Deploying Policies to Edges)
`/policyz?host=&path=&method=&content_type=`	Decision explainability: resolves a request shape to its policy and reports the structure — policy id, mode, fail posture, timeout, body-inspection flags, engine names, and the exact stage order per pipeline. Structure only; never rules, secrets, or payloads
`/debug/pprof/*`	Go profiling (on by default, `--pprof`)
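
A small client for these endpoints might look like the sketch below. It assumes the default --http-addr and that /readyz answers 200 when ready and a non-2xx status otherwise, which this page does not state; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://127.0.0.1:9001"  # default --http-addr; adjust if overridden


def policyz_url(host: str, path: str, method: str = "GET",
                content_type: str = "", base: str = BASE) -> str:
    """Build a /policyz URL for a request shape, percent-encoding each value."""
    query = urlencode({"host": host, "path": path,
                       "method": method, "content_type": content_type})
    return f"{base}/policyz?{query}"


def is_ready(base: str = BASE, opener=urlopen, timeout: float = 2.0) -> bool:
    """True if /readyz answers 200; connection errors and non-2xx count as not ready."""
    try:
        with opener(f"{base}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

The opener parameter exists only so the check can be exercised without a running sidecar; in normal use the default urlopen is enough. Because the server is loopback-only, both helpers must run on the same host as Shield.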

Sample PromQL

# Blocked share of traffic per edge (5m):
sum by (instance) (rate(elchi_shield_requests_blocked_total[5m]))
  / sum by (instance) (rate(elchi_shield_requests_total[5m]))

# p99 processing latency per phase:
histogram_quantile(0.99,
  sum by (le, phase) (rate(elchi_shield_processing_latency_seconds_bucket[5m])))

# Which engine is blocking (top 5):
topk(5, sum by (engine) (rate(elchi_shield_findings_total{action="block"}[5m])))

# ALERT: an edge is rejecting config pushes (stuck on last-good):
elchi_shield_config_reload_failures_consecutive > 0

# ALERT: audit evidence is being lost:
rate(elchi_shield_audit_events_dropped_total[5m]) > 0
  or rate(elchi_shield_audit_export_errors_total[5m]) > 0

# ALERT: inspection failures are closing traffic:
rate(elchi_shield_fail_close_total[5m]) > 0
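
To show what the p99 query above computes, here is a simplified Python rendering of the bucket interpolation that histogram_quantile performs. It works on one set of cumulative bucket counts rather than on rate() output, the bucket bounds in the usage note are invented for illustration, and edge cases are reduced to the essentials.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper bound `le`, cumulative count) pairs.

    The list is expected to include the +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if len(buckets) < 2 or total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile lies beyond the last finite bound; report that bound.
                return buckets[-2][0]
            if count == prev_count:
                return le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan
```

For example, with buckets (0.001, 50), (0.002, 90), (0.004, 100), (+Inf, 100), the 99th percentile falls nine tenths of the way through the 0.002 to 0.004 bucket, giving roughly 0.0038 seconds. The estimate can never be more precise than the bucket layout, which is why the bucket boundaries matter for sub-millisecond work.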
