TLS and Certificates

What It Is

TLS (Transport Layer Security) is the protocol that puts the "S" in HTTPS. It provides encrypted, authenticated communication between two endpoints. If you've built a web app that uses HTTPS, you've used TLS -- your browser verified the server's certificate, negotiated an encrypted channel, and sent data through it.

X.509 certificates are the identity documents of TLS. A certificate binds a public key to an identity (a domain name, an IP address, or a machine name) and is signed by a Certificate Authority (CA) that vouches for the binding.

Why It Matters

In a cluster of machines, every connection between nodes needs to answer two questions:

Am I talking to who I think I'm talking to? (Authentication)
Can anyone else read this? (Encryption)

TLS answers both. Without it, a compromised network switch or a rogue machine on the same subnet could intercept, read, or modify any traffic between legitimate nodes.

How It Works

The TLS Handshake (Simplified)

When a client connects to a server over TLS:

Client Hello: Client sends supported TLS versions and cipher suites
Server Hello: Server picks a version and cipher suite, sends its certificate
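Client verifies: Client checks that the certificate chains to a CA in its trusted list, is within its validity period, and names the host the client meant to reach. The server also proves it holds the matching private key, so a certificate that passes these checks means the server is who it claims to be.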
Key exchange: Client and server agree on a shared secret using Diffie-Hellman (or similar). This produces session keys that encrypt all subsequent traffic.
Encrypted communication: All data is encrypted with the session keys.
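
The client side of the steps above can be sketched with Python's standard ssl module. This is a minimal illustration, not how FortrOS itself connects: the function names are hypothetical, and the TLS 1.2 floor is an assumption chosen for the example.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client context that verifies the server's certificate."""
    # Loads the platform's trusted root CAs and turns on chain and
    # hostname verification.
    ctx = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_parameters(host: str, port: int = 443, timeout: float = 5.0):
    """Run a handshake and return (protocol version, cipher suite name)."""
    ctx = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket performs the whole handshake: hello messages,
        # certificate verification, and key exchange. It raises
        # ssl.SSLCertVerificationError if verification fails.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]
```

Calling negotiated_parameters with a reachable HTTPS host returns whatever version and cipher suite the two sides agreed on; the result depends on the server and the local OpenSSL build.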

Certificates and CAs

A certificate contains:

The subject's public key
The subject's identity (Common Name, Subject Alternative Names)
The issuer's identity (the CA that signed it)
Validity period (not before / not after)
The CA's digital signature over all of the above

The trust model is hierarchical: you trust a root CA, the root CA signs intermediate CAs, intermediates sign leaf certificates. You verify the chain from leaf back to root. If any link is invalid (expired, wrong issuer, tampered), the chain breaks.
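
The leaf-to-root walk can be illustrated with a toy model. The sketch below checks only validity windows and issuer linkage; it deliberately omits signature verification, which a real verifier must perform with each issuer's public key. ToyCert and chain_is_plausible are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ToyCert:
    """A stand-in for a certificate: names and validity only, no keys."""
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime


def chain_is_plausible(chain, trusted_roots, now):
    """Walk leaf -> root, checking validity windows and issuer linkage.

    Signature checks are omitted, so this is NOT a security check.
    """
    if not chain:
        return False
    for i, cert in enumerate(chain):
        if not (cert.not_before <= now <= cert.not_after):
            return False  # expired or not yet valid
        if i + 1 < len(chain) and cert.issuer != chain[i + 1].subject:
            return False  # wrong issuer: the link is broken
    # The last certificate must be a root we already trust.
    return chain[-1].subject in trusted_roots
```

A chain ordered leaf, intermediate, root passes only when every certificate is inside its validity window, each issuer matches the next subject, and the final subject is in the trusted set.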

For public websites, root CAs are operated by organizations like DigiCert, Let's Encrypt, and Sectigo. Your browser trusts roughly 100-150 root CAs by default.

For internal clusters, you run your own CA. This means:

You control exactly which certificates are trusted
No dependency on external companies
No cost per certificate
But you must manage the CA key securely

mTLS (Mutual TLS)

In standard TLS, only the server presents a certificate. The client is anonymous at the TLS layer (authentication happens at the application layer -- usernames, API keys, etc.).

In mutual TLS (mTLS), both sides present certificates:

Server presents its certificate -> client verifies
Client presents its certificate -> server verifies

Both verifications use the same CA chain. After the handshake, each side knows cryptographically who the other is. This is the standard for machine-to-machine communication in:

Kubernetes (API server <-> kubelets)
Service meshes (Istio, Linkerd)
HashiCorp Consul (agent RPC)
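FortrOS (all inter-node communication follows the same mutual-authentication model, but with raw Ed25519 keys instead of X.509 certificates, as described below)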
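
A minimal sketch of how the two sides of an mTLS connection might be configured with Python's ssl module. The function names and the file-path parameters are hypothetical; the paths are optional here only so the contexts can be built without real key material.

```python
import ssl


def make_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server context that demands and verifies a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Without this line the server would accept anonymous clients.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file is not None:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file is not None:
        # The CA whose signatures on client certificates are accepted.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def make_mtls_client_context(cert_file=None, key_file=None, ca_file=None):
    """Client context that verifies the server and presents its own cert."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname checks by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_file is not None:
        # The certificate the client presents when the server asks for one.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file is not None:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The only structural difference from one-way TLS is that the server sets CERT_REQUIRED and the client loads a certificate of its own; both sides point at the same CA file.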

Certificate Pinning

CA-based trust says "trust any certificate signed by a trusted CA." This is flexible but vulnerable to CA compromise -- if an attacker obtains a certificate for your server's name from any trusted CA, they can impersonate that server.

Certificate pinning says "for this specific connection, only trust this specific certificate (or this specific CA)." FortrOS uses pinning for the preboot's TLS connection to the org gateway: the org CA's public key is embedded in the preboot UKI's initramfs. Even if a global CA like DigiCert is compromised, it can't issue a certificate that the preboot would accept.
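
One simple form of pinning compares a SHA-256 fingerprint of the presented certificate against a value shipped with the client. This is an illustration of the general idea only: FortrOS pins the org CA's public key rather than a leaf fingerprint, and the function names below are hypothetical.

```python
import hashlib
import hmac


def fingerprint(der_cert: bytes) -> str:
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()


def matches_pin(der_cert: bytes, pinned_hex: str) -> bool:
    """True only if the certificate is exactly the pinned one."""
    # compare_digest runs in constant time, so the comparison does not
    # reveal how many leading characters matched.
    return hmac.compare_digest(fingerprint(der_cert), pinned_hex.lower())
```

With Python's ssl module, the DER bytes of the peer certificate are available from getpeercert(binary_form=True) on a connected TLS socket. Pinning a CA instead would mean building a context that loads only that CA as a trust anchor.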

Short-Lived Certificates

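Traditional certificates last months or years. Revoking them is hard -- Certificate Revocation Lists (CRLs) must reach every verifier and go stale between updates, and OCSP depends on a responder that is reachable at connection time, so both are unreliable in distributed systems.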

Short-lived certificates (hours to days) solve this differently: don't revoke, just stop renewing. If a node is compromised, its certificate expires within hours. No CRL distribution, no OCSP responders, no race conditions. The tradeoff is that the CA must be highly available for renewals.
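
The renewal side of this tradeoff comes down to simple date arithmetic. The sketch below assumes a policy of renewing once a fixed fraction of the lifetime has elapsed; the 0.5 default and the function names are illustrative assumptions, not something any particular CA mandates.

```python
from datetime import datetime, timedelta, timezone


def renew_at(not_before: datetime, not_after: datetime, fraction: float = 0.5) -> datetime:
    """Time at which renewal should start, as a fraction of the lifetime."""
    lifetime = not_after - not_before
    return not_before + lifetime * fraction


def needs_renewal(not_before: datetime, not_after: datetime,
                  now: datetime, fraction: float = 0.5) -> bool:
    """True once the renewal point has passed."""
    return now >= renew_at(not_before, not_after, fraction)
```

Renewing well before expiry leaves the remaining lifetime as slack: with a 24-hour certificate and a fraction of 0.5, the CA can be unreachable for up to 12 hours before a node loses its identity.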

FortrOS does not issue per-node X.509 certificates. The org CA exists as a signing key for enrollment records and invite manifests, but connection-level authentication runs on raw Ed25519 public keys rather than X.509 certs.

How FortrOS Uses It

Transport TLS: FortrOS uses TLS where the transport actually needs it:

WAN reach-in via the Gateway rides through Cloudflare (TLS terminated at the edge).
fortros-generation-authority listens on localhost with a server cert baked at org creation time; the provisioner is the only client, and it verifies that cert.
Intra-overlay (node-to-node) traffic is already encrypted by WireGuard, so no second TLS layer is added.

Connection authentication: conn_auth (Ed25519). Every protocol between maintainers, and every WebSocket route on the provisioner, runs a raw-Ed25519 challenge-response handshake before any application data:

Client -> Server:  pubkey=<hex> action=<name> [kv...]
Server -> Client:  <32 random bytes>       (challenge)
Client -> Server:  sig=<hex>                (Ed25519 over challenge)
Server verifies:   sig valid against pubkey AND pubkey in org member list

A revocation is a CRDT update to the member list; the next handshake from that pubkey fails. No CRL, no OCSP, no cert expiry clock.
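
The server side of this exchange can be sketched as follows. This is not the FortrOS implementation: the function names are hypothetical, the member list is modeled as a plain set of lowercase hex strings, and because Ed25519 is not in the Python standard library the signature check is passed in as a callable. In practice that callable could wrap Ed25519PublicKey.from_public_bytes(...).verify(...) from the cryptography package, which raises InvalidSignature on failure.

```python
import secrets
from typing import Callable, Dict, Set

# (pubkey bytes, message, signature) -> True if the signature is valid.
Verifier = Callable[[bytes, bytes, bytes], bool]


def parse_fields(line: str, *required: str) -> Dict[str, str]:
    """Parse space-separated key=value pairs, e.g. 'pubkey=<hex> action=<name>'."""
    fields = dict(item.split("=", 1) for item in line.split())
    for name in required:
        if name not in fields:
            raise ValueError("missing field: " + name)
    return fields


def new_challenge() -> bytes:
    """32 random bytes from the OS CSPRNG, fresh for every handshake."""
    return secrets.token_bytes(32)


def accept(hello: str, challenge: bytes, sig_line: str,
           members: Set[str], verify: Verifier) -> bool:
    """Accept only if the signature is valid AND the key is a current member."""
    try:
        pubkey_hex = parse_fields(hello, "pubkey", "action")["pubkey"]
        sig_hex = parse_fields(sig_line, "sig")["sig"]
        pubkey = bytes.fromhex(pubkey_hex)
        sig = bytes.fromhex(sig_hex)
    except ValueError:
        return False  # malformed message or bad hex
    return verify(pubkey, challenge, sig) and pubkey_hex.lower() in members
```

Checking membership at handshake time is what makes revocation immediate in this model: removing a key from the set is enough for the next call to accept to return False.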

Preboot -> org gateway: Transport is server-authenticated TLS (via Cloudflare). The preboot authenticates itself via H(preboot_secret) as a Bearer-ish token over the WebSocket, followed by conn_auth proving possession of its preboot Ed25519 signing key for returning-boot flows. The preboot never needs a "current certificate" -- its identity is the TPM-sealed secret + signing key, both cert-free.

Main OS -> other nodes: conn_auth over the WireGuard overlay. Each node's Ed25519 signing key is its identity; the org's CRDT member list is the trust store. Used for TreeSync (state replication), shard transfer (storage), and service-to-service communication.

Revocation lifecycle: No renewal, because there's nothing to renew. Revoking a node writes a CRDT update marking its pubkey as revoked; gossip propagates the update; the next conn_auth handshake from that pubkey fails at every org service.
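
One way to model a member list that converges under gossip is a two-phase set: one grow-only set of added keys and one grow-only set of revoked keys, merged by union. The source does not say which CRDT FortrOS uses, so treat this as an assumed illustration; MemberList and its methods are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class MemberList:
    """Two-phase set: a key is a member if added and never revoked."""
    added: FrozenSet[str] = frozenset()
    revoked: FrozenSet[str] = frozenset()

    def add(self, pubkey: str) -> "MemberList":
        return MemberList(self.added | {pubkey}, self.revoked)

    def revoke(self, pubkey: str) -> "MemberList":
        # Revocation only grows the revoked set; nothing is ever deleted.
        return MemberList(self.added, self.revoked | {pubkey})

    def merge(self, other: "MemberList") -> "MemberList":
        # Union is commutative, associative, and idempotent, so replicas
        # converge regardless of the order in which updates arrive.
        return MemberList(self.added | other.added,
                          self.revoked | other.revoked)

    def is_member(self, pubkey: str) -> bool:
        return pubkey in self.added and pubkey not in self.revoked
```

In this model a revoked key can never be re-added, which matches the idea that a compromised node gets a new identity rather than its old one back.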

Alternatives

Pre-shared keys (PSK): Symmetric keys shared in advance. Simpler than certificates but key distribution and rotation are harder. WireGuard supports an optional PSK as an extra layer on top of its public-key handshake (FortrOS does not use PSKs for identity; it runs conn_auth with Ed25519 keys on top of WireGuard encryption).

OAuth / API tokens: Common in web services but stateful (require a token store/validator). Not suitable for peer-to-peer systems without a central authority.

SPIFFE/SVID: A standard identity format layered on top of X.509. SPIFFE IDs are URIs in the certificate's SAN field. SPIRE issues SVIDs automatically based on platform attestation. More automated than manual cert management but adds infrastructure complexity.
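
Since a SPIFFE ID is just a URI, its parts can be pulled out with the standard library. The sketch below covers only the basic shape (spiffe scheme, trust domain, path) and does not enforce every rule in the SPIFFE specification; parse_spiffe_id is a hypothetical helper and example.org is a placeholder trust domain.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(uri: str):
    """Split a SPIFFE ID into (trust_domain, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError("not a SPIFFE ID: " + uri)
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment: " + uri)
    # The authority component is the trust domain; the path names the workload.
    return parts.netloc, parts.path
```

For example, parse_spiffe_id("spiffe://example.org/ns/default/sa/web") yields the trust domain example.org and the path /ns/default/sa/web.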