Architecture
A walk-through of what actually happens when you run UTM. Starts from the outside (what the user sees) and drills into each layer. Written for a new engineer joining the project or a beta tester curious about the internals.
1. What UTM is, one paragraph
A self-hosted WireGuard overlay VPN with a central coordinator that
tracks who’s allowed on the mesh, and per-peer agents that set up and
maintain local tunnels. Peers get an overlay IP (e.g. 10.77.0.5) and can
reach other peers by that IP regardless of what physical network they’re on.
Designed to work over CGNAT and through tactical radio meshes (Silvus), which
means it can’t assume any peer has a reachable public IP.
2. Topology
          +---------------------+
          |     Coordinator     |
          |  wg0: 10.77.0.1     |
          |  api: :8080         |
          |  wg:  :51820 UDP    |
          +---------+-----------+
                    | (hub relay, every peer starts here)
     +--------------+-------------+
     |              |             |
+----v----+   +-----v----+   +----v----+
| Peer A  |   | Peer B   |   | Peer C  |
| 10.77.  |   | 10.77.   |   | 10.77.  |
|   0.2   |   |   0.3    |   |   0.4   |
+---------+   +----------+   +---------+
     ^              ^             ^
     +- direct p2p when possible -+
The coord is a WireGuard peer like any other — its wg0 address is always
10.77.0.1. When peer A sends a packet to peer B, the default route is
through the coord (AllowedIPs on the coord peer = the full /24). The
coord IP-forwards the packet back out to peer B’s tunnel. Hub-and-spoke.
When two peers can reach each other directly (both have good NAT, or
they’re on the same LAN), the agent detects it and adds a direct peer
entry with AllowedIPs pinned to just that peer’s /32. Traffic then
bypasses the coord. We still keep the /24 coord route as a fallback in
case the direct handshake goes stale.
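To make the precedence concrete, here is roughly how the two entries would render in wg showconf terms. Peer B’s endpoint and the pubkey placeholders are made up for illustration; the coord endpoint and overlay addresses are the ones used throughout this doc. WireGuard picks the most specific AllowedIPs match for each outgoing packet, so the /32 wins whenever it’s present:

    [Peer]
    # coordinator: catch-all route for the whole overlay
    PublicKey = <coord pubkey>
    Endpoint = 73.140.176.8:51820
    AllowedIPs = 10.77.0.0/24

    [Peer]
    # peer B: direct path; the /32 is more specific than the /24 above
    PublicKey = <peer B pubkey>
    Endpoint = 203.0.113.7:51820
    AllowedIPs = 10.77.0.3/32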
3. Components
cmd/
├── coordinator/   Linux-only daemon. Owns the mesh state (sqlite DB),
│                  runs its own wg0, enforces ACLs via nftables.
├── agent/         Runs on every peer. Brings up wg0, talks to the coord,
│                  reconciles peer config. Linux + Windows.
├── utm-client/    Native GUI for peers (Join/Leave mesh, status).
│                  WebKitGTK on Linux, WebView2 on Windows.
├── utm-admin/     Native GUI for the admin (same technology, points at
│                  the coord's /ui).
└── utmctl/        CLI for the admin — scriptable, pairs with utm-admin.
internal/
├── api/           Shared request/response types. One source of truth
│                  for what coord and agent agree to send each other.
├── db/            SQLite layer (peers, rules, tokens, coordinators).
├── wg/            Kernel-WireGuard wrapper (coord only — needs root).
├── wgagent/       Userspace WireGuard wrapper (agents, cross-platform).
├── acl/           Compiles role-based rules into nftables rulesets.
├── replication/   Primary → replica state push for multi-coord clusters.
└── update/        Self-update pipeline (manifest fetch, SHA verify,
                   install.sh --upgrade, rollback).
The shared internal/api/types.go matters a lot — it’s the contract. Any
field change there affects both coord and agent.
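For flavor, here is a minimal sketch of what the enrollment pair plausibly looks like, reconstructed from the JSON reply shown in section 4 (struct and field names are illustrative, not lifted from the real file):

    package api

    // EnrollRequest is what the agent POSTs to the coord's /enroll.
    type EnrollRequest struct {
        Token  string `json:"token"`
        Pubkey string `json:"pubkey"`
    }

    // EnrollResponse mirrors the enrollment reply shown in section 4.
    type EnrollResponse struct {
        AssignedIP      string   `json:"assigned_ip"`
        ServerPubkey    string   `json:"server_pubkey"`
        ServerEndpoint  string   `json:"server_endpoint"`
        ServerEndpoints []string `json:"server_endpoints"`
        NetworkCIDR     string   `json:"network_cidr"`
        CoordinatorIP   string   `json:"coordinator_ip"`
    }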
4. Enrollment: from “add peer” click to working tunnel
Admin side
- Admin UI sends POST /admin/peers with a name + role.
- Coord allocates the next free IP in 10.77.0.0/24 (skips .1, which is always the coord’s), writes a DB row for the peer with a blank pubkey.
- Coord issues a one-shot enrollment token (24h expiry) linked to that peer row.
- Admin UI shows a URL + token pair. Admin sends it to the end user (email, Signal, QR code, however).
Peer side
- End user opens the UTM app on their laptop. The app polls the local agent at http://127.0.0.1:51821/status.
- Agent is running in “unenrolled mode” — no tunnel, just the local API answering with {"enrolled": false}. App renders the Join form.
- User pastes the URL + token. App POSTs to the agent’s /enroll.
- Agent generates a WireGuard keypair locally (private key never leaves this machine), then POSTs {token, pubkey} to the coord’s /enroll.
- Coord atomically consumes the token (can only be used once), writes the peer’s pubkey to the DB, pushes the peer into wg0 with allowed_ips = <peer_ip>/32, and replies with:

      {
        "assigned_ip": "10.77.0.5",
        "server_pubkey": "...",
        "server_endpoint": "73.140.176.8:51820",
        "server_endpoints": ["73.140.176.8:51820", "10.0.0.61:51820"],
        "network_cidr": "10.77.0.0/24",
        "coordinator_ip": "10.77.0.1"
      }

- Agent writes this to state.json, then exits cleanly (status 0). systemd on Linux / SCM with failure actions on Windows restarts the service. On the next startup the agent sees state.json exists and boots in “enrolled mode” (see the sketch after this list).
- Enrolled mode: ip link add wg0 ..., add the coord as a peer with allowed-ips = the network CIDR (catch-all route through the coord), then start polling /config on a 30s ticker.
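The exit-and-restart trick means the agent never hot-swaps modes in-process; mode selection is one check at startup. A sketch of that decision (loadState, runUnenrolled, and runEnrolled are hypothetical names; only state.json and the two modes come from the flow above):

    // Mode is decided once, at startup, purely by whether state.json exists.
    func run() error {
        st, err := loadState(statePath()) // /etc/utm-agent/state.json on Linux
        if errors.Is(err, os.ErrNotExist) {
            return runUnenrolled() // no tunnel; serve the Join API on 127.0.0.1:51821
        }
        if err != nil {
            return err
        }
        return runEnrolled(st) // bring up wg0, start the /config poll loop
    }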
5. Endpoint probing (the v0.15 feature)
The coord advertises every address it’s reachable on — public first,
then each RFC1918 LAN IP on its non-loopback interfaces. This is in the
enrollment response (server_endpoints) and in every /config response
(on the coord’s own entry in the coordinators[] list).
The agent’s runHandshakeProbeLoop ticks every 5 seconds:
every 5s:
look up the current coord peer on wg0
if handshake age < 15s:
mark the current endpoint sticky (save as active_wg_endpoint)
else if we've been stale for 15s+ and cooldown has elapsed:
rotate to next endpoint in the ring
push the new endpoint to wg0 via AddPeer (upsert)
save state
Why this matters: if your laptop is on the same LAN as the coord, your router probably doesn’t support NAT hairpin — the public IP isn’t reachable from inside your own network. Without probing, you’d be stuck. With it: first endpoint fails for 15s, rotate to the LAN IP, handshake in ~1s, connected.
The sticky part (active_wg_endpoint in state.json) means a reboot
doesn’t re-probe from scratch — we remember the one that worked.
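In Go, the loop body comes out to roughly the following. This is a sketch: the helper names (coordHandshakeAge, currentEndpoint, cooldownElapsed, nextEndpoint, setCoordEndpoint) are assumptions, while the 5s/15s thresholds, the ring rotation, the AddPeer upsert, and the sticky save are from the pseudocode above:

    func (a *Agent) runHandshakeProbeLoop(ctx context.Context) {
        tick := time.NewTicker(5 * time.Second)
        defer tick.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-tick.C:
                if a.coordHandshakeAge() < 15*time.Second {
                    // Healthy: remember the endpoint that is working.
                    a.state.ActiveWGEndpoint = a.currentEndpoint()
                    a.saveState()
                    continue
                }
                if a.cooldownElapsed() {
                    // Stale for 15s+: rotate the ring and upsert the coord
                    // peer on wg0 with the next candidate endpoint.
                    a.setCoordEndpoint(a.nextEndpoint())
                    a.saveState()
                }
            }
        }
    }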
6. Direct peer-to-peer path
Hub-and-spoke is the default because it always works, but it’s inefficient for peer-to-peer traffic (every packet goes through the coord and back). The coord tries to help peers find each other directly:
- Coord’s runEndpointTracker samples wg show wg0 every 10s and records each peer’s current endpoint (the ip:port their UDP packets are arriving from, as observed by the kernel).
- In /config responses, the coord includes every other enrolled peer’s pubkey + observed endpoint.
- Agent’s applyConfig adds those peers as direct WG peers with allowed_ips = peer_ip/32 (sketched below). The more-specific /32 wins against the coord’s /24 catch-all.
- If a direct peer’s handshake goes stale (>90s), the agent drops it from the peer set — traffic falls back through the coord via the /24 route.
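A sketch of that reconcile step (field and helper names are assumptions; the /32 pinning and the 90s staleness cutoff are from the list above):

    func (a *Agent) applyConfig(cfg api.Config) {
        // Add every other enrolled peer as a direct WG peer. The /32
        // AllowedIPs is more specific than the coord's /24, so it wins.
        for _, p := range cfg.Peers {
            if p.Pubkey == a.state.Pubkey || p.Endpoint == "" {
                continue
            }
            a.wg.AddPeer(p.Pubkey, p.Endpoint, p.OverlayIP+"/32")
        }
        // Drop direct peers whose handshake went stale; their traffic
        // falls back through the coord via the /24 catch-all route.
        for _, p := range a.directPeers() {
            if a.handshakeAge(p) > 90*time.Second {
                a.wg.RemovePeer(p.Pubkey)
            }
        }
    }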
This is the “try direct, fall back to hub” pattern — simpler than ICE or STUN and works because we don’t need symmetric-NAT traversal, just good enough NAT that UDP state survives.
7. ACL enforcement
Roles are strings attached to each peer: user, operator, admin by
default (extensible). Rules are role→role pairs: “users can reach
operators”. A 3×3 grid in the admin UI.
The enforcement happens on the coord via nftables. When any peer
tries to route a packet through the coord (hub path), the coord’s
nftables forward chain checks:
- source IP’s role
- destination IP’s role
- is there a rule allowing src_role → dst_role?
No rule = packet dropped. Return traffic for an established flow is auto-allowed via conntrack.
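For a feel of what internal/acl emits, here is a hand-written approximation of a ruleset with a single allowed pair. Table and chain names are guesses; the addresses are the example peers from section 2:

    table inet utm {
        chain forward {
            type filter hook forward priority filter; policy drop;

            # return traffic for established flows
            ct state established,related accept

            # rule user -> operator: peer A (10.77.0.2) may reach peer B (10.77.0.3)
            ip saddr 10.77.0.2 ip daddr 10.77.0.3 accept
        }
    }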
syncACL() runs after every peer create/delete/role change or rule
change. It rebuilds the entire ruleset from the DB and atomically swaps
it in. Rebuilding is O(peers × rules), but the DB is small (100 nodes
tops), so a full rebuild takes microseconds.
Direct peer-to-peer traffic bypasses the coord, so ACLs don’t apply. Acceptable tradeoff: if two peers are allowed to talk per the ACL at all, direct is fine; if they’re not, the coord-side firewall would drop hub traffic and there’s no return path for direct.
8. State persistence
Nothing is in-memory-only. Reboots always bring back exactly the state that was last persisted:
- Coord: /var/lib/utm/utm.db (SQLite: peers, rules, tokens, coordinators). Config in /etc/utm/coordinator.env (tokens, endpoints, WG port). Server keypair in /var/lib/utm/server-keys.json.
- Agent (Linux): /etc/utm-agent/state.json (pubkey, assigned IP, current coord, known coords list, sticky endpoint).
- Agent (Windows): C:\ProgramData\UTM\state.json (same shape).
The state files are plain JSON, intentionally — humans can read them and we can hand-edit in an emergency without needing any UTM tooling. They’re also small (~1KB for a typical agent).
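An illustrative agent state.json (every key except active_wg_endpoint is a guess at the naming; the contents are the ones listed above):

    {
      "pubkey": "base64-encoded-public-key=",
      "assigned_ip": "10.77.0.5",
      "current_coordinator": "73.140.176.8:8080",
      "coordinators": ["73.140.176.8:8080"],
      "active_wg_endpoint": "10.0.0.61:51820"
    }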
9. Self-updates
The coord fetches a manifest JSON every 6 hours from a configurable URL
(UTM_UPDATE_MANIFEST_URL). The manifest lists the latest version,
SHA-256, and download URL for each component.
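A plausible manifest shape. The schema isn’t documented here, so field names and URLs are placeholders; the information carried (version, SHA-256, download URL per component) is what this section requires:

    {
      "version": "0.15.0",
      "components": {
        "coordinator": {
          "url": "https://updates.example.com/utm-coordinator-0.15.0-linux-amd64.tar.gz",
          "sha256": "<64 hex chars>"
        },
        "agent": {
          "url": "https://updates.example.com/utm-agent-0.15.0-linux-amd64.tar.gz",
          "sha256": "<64 hex chars>"
        }
      }
    }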
When admin clicks “Apply” in Settings:
- Coord downloads the tarball to /opt/utm/staging/.
- Verifies SHA-256 matches the manifest.
- Extracts to /opt/utm/<version>/.
- Runs that version’s install.sh --upgrade — because the installer ships alongside the binaries, it always knows the current version’s prerequisites. If v0.15 needs a new apt package, the new install.sh installs it automatically.
- install.sh --upgrade atomically swaps binaries in /usr/local/bin/, keeping the old ones as .prev. Then it exec’s systemctl restart utm-coordinator.
- If the new coord fails to come back up (health check), there’s a rollback button that moves .prev back into place.
The key design decision: every tarball contains its own installer, so we don’t have to maintain forward-compat in an older installer. The running coord downloads the new tarball and runs the new script, which knows everything the new version needs.
10. Multi-coord clusters (primary + replicas)
A single coord is a single point of failure. UTM supports an active/passive cluster:
- One coord is role=primary, handles all writes.
- Zero or more coords are role=replica, read-only.
- Primary pushes state changes to replicas via HTTP (POST /internal/push) on every write — peer upsert, rule upsert, coord upsert.
- Replicas heartbeat back to the primary every 15s so the primary knows who’s alive.
- Peers know about every coord in the cluster via the /config response’s coordinators[] list. If the current coord stops responding, the agent’s pollOnce rotates through the list until one answers (sketched below).
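A sketch of that rotation (fetchConfig and coordIdx are hypothetical; the behavior, walking the coordinators[] list until someone answers and then sticking, is from the list above):

    func (a *Agent) pollOnce(ctx context.Context) (*api.Config, error) {
        coords := a.state.Coordinators // kept fresh from /config's coordinators[]
        for i := 0; i < len(coords); i++ {
            idx := (a.coordIdx + i) % len(coords)
            cfg, err := a.fetchConfig(ctx, coords[idx])
            if err != nil {
                continue // this coord is down or unreachable; try the next
            }
            a.coordIdx = idx // stick with the coord that answered
            return cfg, nil
        }
        return nil, errors.New("no coordinator reachable")
    }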
This is intentionally simple — not Raft, not Paxos. Primary writes, replicas read-only, failover is manual (admin runs a command to promote a replica to primary). A 100-node mesh doesn’t need distributed consensus; it needs one reliable coord with a warm standby.
11. Why certain choices
Userspace WireGuard (wireguard-go) on agents, not kernel wg.
Windows doesn’t have kernel WG without Wintun. Kernel WG also requires
CAP_NET_ADMIN which is awkward for user-mode tooling. Userspace works
everywhere (Linux, Windows, later macOS) with zero OS-level config.
Coord uses kernel wg because it’s server-class and performance matters.
Two separate processes on each peer (agent + utm-client).
The agent is a system service that needs to survive user logins/logouts.
The utm-client is a per-user GUI app that opens a webview. Keeping them
separate means the agent doesn’t need any GUI dependencies (webkit2gtk
on Linux, WebView2 on Windows) — just a plain Go binary. They talk via
the loopback HTTP API on 127.0.0.1:51821.
Embed the web UI in the Go binary. //go:embed all:web bundles
HTML/CSS/JS into the executable, served over a local HTTP listener on a
random port. Single binary, no “oops forgot to copy the assets” bug.
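The whole pattern is about a dozen lines. A sketch (serveUI is a made-up name; the //go:embed all:web directive is the one the project uses):

    import (
        "embed"
        "io/fs"
        "net"
        "net/http"
    )

    //go:embed all:web
    var webFS embed.FS

    // serveUI serves the embedded assets on a random loopback port and
    // returns the URL the webview should open.
    func serveUI() (string, error) {
        sub, err := fs.Sub(webFS, "web") // strip the "web/" prefix
        if err != nil {
            return "", err
        }
        ln, err := net.Listen("tcp", "127.0.0.1:0") // :0 = kernel picks a free port
        if err != nil {
            return "", err
        }
        go http.Serve(ln, http.FileServer(http.FS(sub)))
        return "http://" + ln.Addr().String() + "/", nil
    }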
Apple-style CSS in vanilla HTML. No React/Vue/framework. The UI is small enough (5 pages) that a framework would be more code than the pages themselves. Reduces build complexity and binary size.
Folder-per-version on disk. ~/Documents/Projects/UTM-vX.Y.Z/ is
a full copy, not a git branch. We don’t need branching for a small
project with one maintainer; we need to be able to rm -rf a version
and know nothing else got touched. build-dist.sh vendors all Go deps
so builds are reproducible even offline.
12. Failure modes + debugging
- “agent not responding” in the app → agent service is stopped or crashed. Check sudo systemctl status utm-agent (Linux) or Get-Service UTMAgent (Windows).
- “connecting — no handshake yet” → the network layer is reaching the coord but the WG handshake isn’t completing. Check sudo wg show wg0 on the coord for a peer with the agent’s pubkey; if it’s missing, enrollment didn’t land. If it’s there but latest handshake is never, UDP 51820 isn’t reaching the coord (router not forwarding, or peer on the same LAN as the coord without hairpin — the v0.15 probe should rotate to the LAN endpoint within 20s).
- “enrolled peer but coord has no record” → the token was consumed on a different coord, or the peer was deleted after enrollment. /admin/peers is the source of truth; Leave mesh on the peer and rejoin with a fresh token.
- Settings Save → “Load failed” → coord exited for self-restart but systemd didn’t bring it back (must be Restart=always, not Restart=on-failure, because the coord exits cleanly — see the snippet below).
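That last one is easy to get wrong, and both daemons lean on clean-exit-then-restart (the agent’s enrollment hand-off in section 4, the coord’s self-update restart in section 9). The relevant lines in the unit file look like this (a snippet with an assumed binary path, not the full unit):

    [Service]
    ExecStart=/usr/local/bin/utm-coordinator
    Restart=always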
13. Ports used
- 8080/tcp — coord HTTP API (configurable via --listen). Admin UI, agent polls, enrollment. Needs internet reachability for remote peers.
- 51820/udp — coord WireGuard. Needs internet reachability for remote peers to handshake.
- 51821/tcp — agent loopback-only status API. Never exposed externally.
- Random high port — utm-client’s local web server (embedded UI). Loopback-only.
14. Cryptography
- WireGuard: Curve25519 + ChaCha20-Poly1305 + BLAKE2s. Standard WG, no custom crypto.
- Admin token: 32-char base64 random, rotated via UI, stored in /etc/utm/coordinator.env. Bearer token on every admin API call.
- Enrollment token: 24-char base64 random, one-shot, 24h expiry, stored in coord DB. Consumed atomically at /enroll.
- Tarball integrity: SHA-256 in the manifest, verified before extraction. Prevents tampered downloads even over plain HTTP (we use HTTPS anyway).
- Server identity: coord’s WireGuard public key is in the enrollment response; the agent pins it. A MITM on the HTTP channel can’t replace the coord without the pubkey mismatch being immediate.
15. Code reading order
If you want to understand the whole system, read in this order:
1. internal/api/types.go — the contract.
2. cmd/coordinator/main.go handlers — handleEnroll, handleConfig, handleCreatePeer.
3. cmd/agent/main.go — main, doEnroll, runPollLoop, applyConfig, runHandshakeProbeLoop.
4. cmd/utm-client/main.go + web/app.js — native window → webview → polls the loopback API.
5. internal/update/update.go — self-update pipeline.
Total code is ~3000 lines of Go + ~1500 lines of JS/CSS/HTML. Small enough to read cover-to-cover in an afternoon.