source: docs/guide/concepts/Monitoring and Self-Observation.md
Concepts

Monitoring and Self-Observation

What It Is

FortrOS monitors itself. No Prometheus, no Grafana, no external monitoring agents. Every maintainer collects health metrics from its host and shares them via gossip. The org knows its own health without any infrastructure beyond what it already has.

The interface is a topology-aware map: pannable, zoomable, showing the org's physical structure (regions, sites, racks, nodes) with color-coded health states and tiered notifications that suppress noise by correlating failures to root causes.

Why It Matters

Traditional monitoring is bolted on: install Prometheus, configure scrape targets, deploy Grafana, set up alerting rules, maintain the monitoring infrastructure alongside the thing it monitors. The monitoring system itself can fail, go stale, or disagree with reality.

FortrOS's maintainers already talk to each other via gossip. They already know who's alive, who's unreachable, and what state the org is in. Making health metrics part of gossip is a natural extension, not a separate system.

How It Works

Self-Reporting

Each maintainer collects metrics from its own host at regular intervals (10 seconds to 5 minutes depending on the metric type):

| Category | Metrics |
| --- | --- |
| System | Per-core CPU utilization; RAM (used/available/zram ratio); per-device swap breakdown; disk I/O |
| Hardware | SMART attributes, temperatures, fan speeds, GPU utilization |
| Services | VM states, service health, process supervision events (s6 restarts) |
| Org | Gossip round-trip times, CRDT sync state, cert validity, pending-confirmation counts |
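As an illustration of what System-category collection looks like, here is a minimal sketch of parsing `/proc/meminfo` into RAM metrics. The function name and the returned shape are hypothetical, not the actual maintainer API; only the `/proc/meminfo` format itself is standard Linux.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip malformed lines
        key, rest = line.split(":", 1)
        # e.g. "MemTotal:       16384 kB" -> values["MemTotal"] = 16384
        values[key] = int(rest.strip().split()[0])
    return values
```

In practice the maintainer would call this on `open("/proc/meminfo").read()` at its collection interval; parsing from text keeps the function testable without a live `/proc`.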

Metrics are stored in ephemeral in-memory ring buffers on each node. On reboot, local history is lost -- but other nodes retain gossip-derived summaries. This is intentional: long-term storage is an optional org service, not baked into the monitoring layer.
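The per-node ring buffer described above can be sketched as a fixed-capacity deque of recent samples; the class and method names are illustrative, and the capacity would depend on the metric's collection interval:

```python
from collections import deque

class MetricRing:
    """Ephemeral in-memory history for one metric: fixed capacity, lost on reboot."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)  # oldest sample dropped at capacity

    def record(self, ts, value):
        self.samples.append((ts, value))

    def latest(self):
        return self.samples[-1] if self.samples else None

    def window(self, since_ts):
        """Samples at or after since_ts, for rolling up gossip summaries."""
        return [(t, v) for t, v in self.samples if t >= since_ts]
```

The `maxlen` deque makes the intentional data loss explicit: memory use is bounded per metric, and history simply wraps rather than being flushed to disk.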

Topology-Aware Alerting

The Topology Map drives notification hierarchy. Alerts are tiered by scope and correlated by infrastructure:

| Tier | Scope | Example | Suppresses |
| --- | --- | --- | --- |
| Tier 1 | Org-wide | Network partition, multi-site outage | Everything below |
| Tier 2 | Site/rack | Switch group down, PDU failure | Individual host alerts behind the failed infrastructure |
| Tier 3 | Individual host | Disk SMART warning, service crash | Nothing |
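The suppression rule in the tier table can be sketched as a scope-containment check: a lower-tier (broader) alert hides any higher-tier alert whose scope falls entirely inside it. The alert dict shape is an assumption for illustration.

```python
def visible_alerts(alerts):
    """alerts: dicts with 'tier' (1 = broadest) and 'scope' (set of host ids).

    An alert is hidden if some broader alert's scope contains its scope.
    """
    visible = []
    for alert in alerts:
        covered = any(
            other["tier"] < alert["tier"] and alert["scope"] <= other["scope"]
            for other in alerts
        )
        if not covered:
            visible.append(alert)
    return visible
```

A Tier 1 alert would carry an org-wide scope, so the same subset test suppresses everything below it without a special case.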

Root cause correlation: If 3 hosts on the same switch go offline simultaneously, the alert is "switch-group-1 connectivity issue" -- not 3 separate "host unreachable" alerts. A service failure preceded by disk errors links the two: "service X failed due to disk errors on /dev/sdb."

This correlation uses the topology map: nodes that share a failure domain (same rack, same switch, same PDU) are correlated when they fail together. Individual failures are reported individually. Infrastructure failures are reported as infrastructure, with affected nodes listed underneath.
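A minimal sketch of that failure-domain grouping, assuming the topology map can be reduced to a host-to-switch mapping (the function name, dict shapes, and threshold are illustrative):

```python
from collections import defaultdict

def correlate(failed_hosts, host_to_switch, threshold=2):
    """Collapse simultaneous failures that share a switch into one alert."""
    by_switch = defaultdict(list)
    for host in failed_hosts:
        by_switch[host_to_switch.get(host, "unknown")].append(host)

    alerts = []
    for switch, hosts in sorted(by_switch.items()):
        if switch != "unknown" and len(hosts) >= threshold:
            # Infrastructure failure: one alert, affected hosts listed underneath
            alerts.append({"kind": "infrastructure",
                           "target": switch,
                           "affected": sorted(hosts)})
        else:
            # Individual failures are reported individually
            for host in sorted(hosts):
                alerts.append({"kind": "host", "target": host})
    return alerts
```

The same grouping generalizes to any shared failure domain (rack, PDU) by swapping the mapping; time-windowing of "simultaneous" is omitted here for brevity.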

The Map Interface

The admin interface is a visual topology map served by any maintainer (no dedicated monitoring server). The UI is web-based (WebSocket for live updates) and shows:

  • Org level: Regions and sites, with aggregate health per site
  • Site level: Racks and nodes, with color-coded health states (green = healthy, yellow = degraded, red = critical, gray = unreachable, pulsing = state change in progress)
  • Node level: Click a node to see its metrics, services, VMs, disk health, and gossip state
  • Drill-down: The admin interface asks the relevant node (or a peer in its zone) for detailed metrics. Detail stays where it's relevant, not replicated everywhere.

No External Dependencies

The monitoring system requires nothing beyond what FortrOS already provides:

  • Data collection: Maintainer reads local /proc, /sys, SMART, s6 state
  • Data distribution: Gossip carries health summaries
  • Alerting: Maintainer evaluates alert rules locally and correlates them via the topology map
  • UI: Any maintainer serves the web interface
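To make "gossip carries health summaries" concrete, here is a sketch of the kind of compact, rolled-up record a maintainer might gossip instead of raw samples. Every field name here is an assumption for illustration, not the actual wire format.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class HealthSummary:
    """Rolled-up per-node health, small enough to ride along in a gossip round."""
    node_id: str
    state: str                # "healthy" | "degraded" | "critical" | "unreachable"
    cpu_pct: float            # recent average across cores
    ram_available_kb: int
    alerts: list = field(default_factory=list)  # active Tier 3 alert ids, if any

    def to_gossip(self):
        # JSON stands in for whatever encoding the gossip layer actually uses
        return json.dumps(asdict(self)).encode()
```

The design point is the split between summary and detail: this record is replicated everywhere via gossip, while the raw ring-buffer samples stay on the node and are fetched only on drill-down.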

If the org wants long-term analytics (capacity planning, trend analysis), Prometheus can be deployed as a Tier 2 org service that scrapes maintainers. But it's optional -- the built-in system handles real-time health and alerting without it.

How FortrOS Uses It

  • Partition detection: If gossip splits into isolated groups, the alert shows the partition boundary, not just "hosts unreachable."
  • Hardware lifecycle: SMART prediction surfaces "disk likely to fail within weeks" alerts. The placement service proactively re-replicates shards off degrading disks before they fail.
  • Rolling upgrade tracking: During a rolling upgrade (10 Sustaining the Org), the map shows which nodes are upgraded, which are pending, and which are being drained. The admin sees the upgrade's progress geographically.
  • Workload health: VM and container health (from the reconciler's observed state) is overlaid on the node map. A failed workload shows on the node that was running it.
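Partition detection, the first item above, amounts to finding connected components in the gossip reachability graph: each component is one side of the partition, and the missing edges between components are the boundary. A sketch under that reading (the graph representation is illustrative):

```python
def partitions(nodes, reachable_pairs):
    """nodes: iterable of node ids; reachable_pairs: set of frozenset pairs.

    Returns the connected components of the reachability graph: one set per
    isolated gossip group.
    """
    adj = {n: set() for n in nodes}
    for pair in reachable_pairs:
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)

    seen, groups = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, component = [n], set()
        while stack:  # iterative DFS over the reachability graph
            cur = stack.pop()
            if cur in component:
                continue
            component.add(cur)
            stack.extend(adj[cur] - component)
        seen |= component
        groups.append(component)
    return groups
```

More than one component means a partition, and reporting the components (rather than each unreachable host) is what turns "hosts unreachable" into "partition boundary."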

Links