/etc/elchi/validate.sh — per-node post-install audit
Read-only. The installer drops this on EVERY node so you can confirm the install end-to-end without leaning on the orchestrator. Run it on each machine after install (or any time you want a sanity check):
sudo /etc/elchi/validate.sh
What it walks (in order):
- Topology context: this node's index, role flags (runs_mongo, runs_otel, runs_coredns, …), backend_variants set.
- Systemd: every elchi-* unit + mongod / grafana-server / nginx (where present). Active = ✓, activating = warning, failed/inactive = ✗. Watchdog timer state checked separately.
- Listening ports: compares ss -lntp against expected per-node + M1-singleton + per-variant control-plane ports. Flags any M1-only port that shows up on Mn (and vice-versa).
- Service health: OTel :13133 health (every node). M1 only: mongod ping via mongosh, VictoriaMetrics /api/v1/query, Grafana /api/health.
- Envoy admin: /ready, /clusters health flags for every cluster (registry / controller-rest / otel / grafana / vm + every per-node controller and control-plane), /listeners bind verification.
- Config integrity: sha256 of envoy.yaml, topology.full.yaml, ports.full.json, nodes.list, tls/server.crt. Compare hashes by hand across nodes to confirm the bundle distributed cleanly.
- Stale variant detection: flags any /etc/elchi/<variant>/ dir or elchi-control-plane-<sanitized>@.service unit whose tag isn't in the active backend_variants list.
- System tuning: sysctl probe (somaxconn, max_map_count, swappiness, fs.file-max, fs.inotify.*), THP state, swap state, mongod LimitNOFILE/LimitMEMLOCK, envoy LimitNOFILE.
- CoreDNS GSLB ports (when enabled): 53/tcp, 53/udp, 8053.
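To make the listening-ports step concrete, here is a minimal sketch of the general technique: parse `ss -lnt` output and report expected ports that nothing is bound to. It is not the logic validate.sh actually uses. The function name `missing_ports` is hypothetical, and the ports in the usage line are only the ones this page names (OTel health on 13133, CoreDNS on 8053).

```shell
# Hypothetical helper, not part of validate.sh.
# Reads `ss -lnt` output on stdin; arguments are the TCP ports expected to listen.
# Prints "MISSING <port>" for each absent port and exits non-zero if any is absent.
missing_ports() (
  # Field 4 is "Local Address:Port"; keep the text after the last colon when it
  # is numeric, which also skips the header row.
  listening=$(awk '{ n = split($4, a, ":"); if (a[n] ~ /^[0-9]+$/) print a[n] }' | sort -u)
  rc=0
  for port in "$@"; do
    if ! printf '%s\n' "$listening" | grep -qx "$port"; then
      echo "MISSING $port"
      rc=1
    fi
  done
  [ "$rc" -eq 0 ]
)

# Usage:
#   ss -lnt | missing_ports 13133 8053
```

This only covers the "expected but absent" direction. Flagging an M1-only port on Mn would need the per-role port lists, which are not reproduced here.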
Output is colored, with a final PASS / WARN / FAIL count. Exit code is non-zero on any FAIL.
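Because the exit code is non-zero on any FAIL, the script can be driven from a small wrapper. The sketch below is one possible way to run it across several machines and summarize by exit status. `audit_nodes` and the `REMOTE` override are hypothetical names, hostnames are supplied by the caller, and it assumes SSH access with sudo that does not prompt for a password.

```shell
# Hypothetical wrapper, not shipped by the installer.
# Runs the audit on each host given as an argument and returns non-zero if any
# host exits non-zero. REMOTE defaults to ssh and can be overridden.
audit_nodes() (
  failed=0
  for host in "$@"; do
    if ${REMOTE:-ssh} "$host" sudo /etc/elchi/validate.sh; then
      echo "$host: clean exit (PASS/WARN only)"
    else
      # A non-zero status can also mean the host was unreachable.
      echo "$host: FAIL or unreachable (exit $?)"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
)

# Usage (hostnames are placeholders):
#   audit_nodes m1.example m2.example
```

A WARN does not change the exit code, so the colored output is still worth reading even when every host exits cleanly.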
Why per-node?
The installer renders the Envoy bootstrap and bundle on M1 and SCPs them to Mn; drift between nodes is the most common cause of "weird symptoms". Running validate on each box and diffing the sha256 lines surfaces it in one shell command.
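If you want the hashes without the rest of the audit output, the comparison can be sketched as below. This is an illustration, not the code in validate.sh. `bundle_hashes` is a hypothetical helper, and the assumption that the five files sit directly under /etc/elchi is mine, so pass the real directory if yours differs.

```shell
# Hypothetical helper, not part of validate.sh.
# Prints "<sha256>  <relative path>" for each bundle file named in the
# Config integrity step, or "MISSING  <path>" if the file is absent.
# $1 = bundle directory (assumed default: /etc/elchi).
bundle_hashes() (
  dir="${1:-/etc/elchi}"
  for f in envoy.yaml topology.full.yaml ports.full.json nodes.list tls/server.crt; do
    if [ -f "$dir/$f" ]; then
      # Relative paths keep the output identical across nodes when contents match.
      printf '%s  %s\n' "$(sha256sum "$dir/$f" | cut -d' ' -f1)" "$f"
    else
      printf '%s  %s\n' "MISSING" "$f"
    fi
  done
)

# Usage: save the output on each node, copy the files to one machine, then diff:
#   bundle_hashes > /tmp/hashes.$(hostname).txt
#   diff /tmp/hashes.m1.txt /tmp/hashes.m2.txt
```

An empty diff means the two nodes hold byte-identical copies of those five files. Any differing line names the file that drifted.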