
Modes & Fail Postures

Every resolved policy carries two orthogonal settings: the mode (what happens when an inspection finds something) and the fail posture (what happens when an engine errors or times out). Getting the distinction right is the difference between a safe rollout and a self-inflicted outage — this page covers both, plus the recommended rollout sequence.

Both fields live in PolicySpec and can be set at any scope (spec.defaults, domain.policy, route.policy) — the most-specific set value wins, per the inheritance model.
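As a minimal sketch of that inheritance, the fragment below sets a value at each of the three scopes. It assumes the domain-level block sits directly under the domain entry (this page names it only as domain.policy), and the policy name, host, and path are hypothetical:

```yaml
apiVersion: sentinel.elchi.io/v1
kind: SecurityPolicy
metadata:
  name: inheritance-sketch            # hypothetical name
spec:
  defaults:
    mode: detect                      # least specific: applies unless overridden
    fail_mode: fail_open

  domains:
    - hosts: ["shop.example.com"]     # hypothetical host
      policy:                         # assumed placement of domain.policy
        fail_mode: fail_close         # overrides spec.defaults for this domain
      routes:
        - match:
            path_prefix: "/checkout/" # hypothetical path
          policy:
            mode: block               # most specific: overrides the inherited detect
```

If inheritance is resolved per field, as "the most-specific set value wins" suggests, /checkout/ would resolve to mode: block with fail_mode: fail_close, and every other path on the host would resolve to mode: detect with fail_mode: fail_close.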

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mode | enum | block | One of block, detect, shadow, off. The enforcement posture. |
| fail_mode | enum | fail_open | One of fail_open, fail_close. Behavior when an engine errors or times out. |

The four modes

| Mode | What happens to the request | Metric fed |
| --- | --- | --- |
| block | Enforce. A finding blocks the request: the client gets an immediate 403. | requests_blocked_total |
| detect | Evaluate everything, record the finding as a would-block, but allow the request (monitor mode). | detections_total |
| shadow | Evaluate as if blocking, log what would have been blocked, allow the request. | shadow_detections_total |
| off | Skip inspection entirely: the request continues untouched. | |

Behavior details worth internalizing:

  • In block mode the first finding short-circuits: later stages do not run, and one finding is emitted.
  • In detect and shadow modes the pipeline keeps running after a finding, so you get the full detection trail — every finding is recorded and audited individually, not just the highest-severity one. This is exactly what makes these modes useful for tuning: you see everything that would fire.
  • Findings in detect/shadow are always audited regardless of sampling_rate (sampling applies only to plain allow decisions — see Built-in Checks).
  • mode: off is a per-policy switch and still goes through policy resolution. For paths that should skip resolution entirely (health checks, metrics), use spec.exclude instead (see The SecurityPolicy Model).

Note

mode: off combined with inspect_request_body: true or inspect_response_body: true is rejected at load — inspecting a body while off can never do anything.
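For illustration, a route policy like the following would be expected to fail validation at load under that rule. The path is hypothetical, and the exact error message is not shown on this page:

```yaml
routes:
  - match:
      path_prefix: "/internal/"     # hypothetical path
    policy:
      mode: off
      inspect_request_body: true    # contradicts mode: off, so the policy is rejected at load
```

Either drop the inspect_* flag or choose a mode other than off.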

fail_open vs fail_close

The fail posture governs internal engine errors only — for example, a Coraza body-processing failure, a GeoIP database read error, or the per-request timeout expiring.

| Posture | On engine error / timeout | Metric fed |
| --- | --- | --- |
| fail_open | Allow the request. | fail_open_total |
| fail_close | Block the request with a fixed reason (fail_close_error / fail_close_timeout). The underlying error string is never leaked into the audit reason. | fail_close_total |

Fail posture is NOT an auth bypass valve

A missing or invalid credential always blocks, regardless of fail_mode. Auth engines (JWT, JWKS, API key, HMAC signing, HTTP signatures, mTLS/XFCC) report a failed or absent credential as a finding, never as an error — there is no anonymous pass-through. fail_open will not let an unauthenticated request through; it only governs the rare case where an engine itself malfunctions.

Two more rules that shape behavior under errors:

  • A confirmed block always beats an error. If one engine blocks while another errors, the request blocks — the fail posture never overrides a real finding.
  • When no policy resolves at all, the default is fail-open, so a Shield bug never blackholes traffic to hosts you haven't onboarded.

Posture guidance:

  • fail_open (the default) is right for most WAF-style policies — a WAF bug must never take down legitimate traffic.
  • fail_close is right for positive-security policies where an error would otherwise silently disable the control: OpenAPI validation, country allow-lists, and DLP on sensitive response surfaces (see Data Loss Prevention).

fail_open silently disables positive-security allow-lists

A fail_open policy with a country allow-list in IP reputation fails open on a GeoIP database read error — the geo-fence quietly stops fencing. Any allow-list-style control (country allow-lists, OpenAPI positive security, DLP on a sensitive surface) should be paired with fail_close, because for these an error that allows is exactly the failure you deployed them to prevent.
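A minimal sketch of that pairing, assuming a hypothetical geo-fenced route on a host that otherwise keeps the fail_open default. The allow-list engine's own settings are omitted because their keys are not shown on this page:

```yaml
spec:
  defaults:
    fail_mode: fail_open                # fine for WAF-style checks
  domains:
    - hosts: ["api.example.com"]
      routes:
        - match:
            path_prefix: "/partners/"   # hypothetical geo-fenced surface
          policy:
            fail_mode: fail_close       # an engine error or timeout blocks instead of allowing
            # allow-list engine settings omitted: keys not shown on this page
```

Scoping fail_close to the route keeps the rest of the host on fail_open, so an engine fault only blocks traffic on the surface where allowing by mistake is the worse outcome.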

Recommended rollout sequence

Never turn on block for a new policy against live traffic. The safe sequence:

1. detect — establish the baseline and tune

policy:
  mode: detect

Detect evaluates everything, records each finding as a would-block, and allows the request — the monitor mode for initial tuning against real traffic. Watch:

  • detections_total — the would-block rate. If it is a meaningful fraction of traffic, you have false positives to hunt down before going further.
  • Individual would-block events in Security Events — each carries the engine, rule id, and reason, so you can identify noisy rules (for Coraza, collect exclude_rule_ids candidates here — see Coraza WAF).

Use this phase to tune thresholds (Coraza paranoia/anomaly levels, bot scores, anomaly_threshold) until the detection stream contains only traffic you genuinely want to block. Also confirm fail_open_total/timeouts_total stay at zero — a nonzero timeout rate means your timeout budget is too tight for the body sizes you inspect (see Body Inspection & Limits).

2. shadow — dress-rehearse enforcement

policy:
  mode: shadow

Shadow evaluates the tuned policy exactly as if it were blocking and logs what would have been blocked (shadow_detections_total), still with zero traffic impact. This is the final verification pass: after tuning in detect, the shadow stream should contain only traffic you are willing to reject. Any remaining legitimate request showing up in Security Events here is a block you just avoided shipping. Promote when the shadow stream has been clean (or attack-only) for a representative traffic window — including whatever weekly batch jobs, webhook retries, and traffic peaks your service sees.

3. block — enforce

policy:
  mode: block

Flip to enforcement. shadow_detections_total for this policy drops to zero and requests_blocked_total picks up the same traffic. Keep Security Events open for the first few hours: every block is audited, so a false positive shows up immediately with the exact rule and reason.

Per-route mode overrides

Because mode is a scalar PolicySpec field, you can run different postures on different surfaces of the same domain — enforce where you are confident, monitor where you are still tuning:

apiVersion: sentinel.elchi.io/v1
kind: SecurityPolicy
metadata:
  name: api-rollout
spec:
  defaults:
    mode: block # enforced everywhere by default
    fail_mode: fail_open

  domains:
    - hosts: ["api.example.com"]
      routes:
        # Mature surface: enforce.
        - match:
            path_prefix: "/v1/"
          policy:
            mode: block

        # New surface still being tuned: monitor only, fail closed on errors.
        - match:
            path_prefix: "/v2/"
          policy:
            mode: detect
            fail_mode: fail_close

        # Not yet onboarded: skip inspection for this route.
        - match:
            path_prefix: "/legacy/"
          policy:
            mode: off

Tip

The Coraza engine always runs in enforcing mode internally (Shield forces SecRuleEngine On); the policy mode is the single switch that decides whether a CRS hit actually blocks. Never try to do monitor-mode via SecLang DetectionOnly — use mode: detect. See Coraza WAF.