Architecture
A walk-through of what actually happens when you run UTM. Starts from the outside (what the user sees) and drills into each layer. Written for a new engineer joining the project or a beta tester curious about the internals.
1. What UTM is, one paragraph
A self-hosted WireGuard overlay VPN with a central coordinator that
tracks who’s allowed on the mesh, and per-peer agents that set up and
maintain local tunnels. Peers get an overlay IP (e.g. 10.77.0.5) and can
reach other peers by that IP regardless of what physical network they’re on.
Designed to work over CGNAT and through tactical radio meshes (Silvus), which
means it can’t assume any peer has a reachable public IP.
2. Topology
          +---------------------+
          |     Coordinator     |
          |  wg0: 10.77.0.1     |
          |  api: :8080         |
          |  wg:  :51820 UDP    |
          +---------+-----------+
                    | (hub relay, every peer starts here)
     +--------------+-------------+
     |              |             |
+----v----+   +-----v----+   +----v----+
| Peer A  |   | Peer B   |   | Peer C  |
| 10.77.  |   | 10.77.   |   | 10.77.  |
|   0.2   |   |   0.3    |   |   0.4   |
+---------+   +----------+   +---------+
     ^              ^             ^
     +- direct p2p when possible -+
The coord is a WireGuard peer like any other — its wg0 address is always
10.77.0.1. When peer A sends a packet to peer B, the default route is
through the coord (AllowedIPs on the coord peer = the full /24). The
coord IP-forwards the packet back out to peer B’s tunnel. Hub-and-spoke.
When two peers can reach each other directly (both have good NAT, or
they’re on the same LAN), the agent detects it and adds a direct peer
entry with AllowedIPs pinned to just that peer’s /32. Traffic then
bypasses the coord. We still keep the /24 coord route as a fallback in
case the direct handshake goes stale.
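To make the precedence concrete, here is roughly how the two entries would render in wg showconf terms. Peer B’s endpoint and the pubkey placeholders are made up for illustration; the coord endpoint and overlay addresses are the ones used throughout this doc. WireGuard picks the most specific AllowedIPs match for each outgoing packet, so the /32 wins whenever it’s present:

    [Peer]
    # coordinator: catch-all route for the whole overlay
    PublicKey = <coord pubkey>
    Endpoint = 73.140.176.8:51820
    AllowedIPs = 10.77.0.0/24

    [Peer]
    # peer B: direct path; the /32 is more specific than the /24 above
    PublicKey = <peer B pubkey>
    Endpoint = 203.0.113.7:51820
    AllowedIPs = 10.77.0.3/32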
3. Components
cmd/
├── coordinator/   Linux-only daemon. Owns the mesh state (sqlite DB),
│                  runs its own wg0, enforces ACLs via nftables.
├── agent/         Runs on every peer. Brings up wg0, talks to the coord,
│                  reconciles peer config. Linux + Windows.
├── utm-client/    Native GUI for peers (Join/Leave mesh, status).
│                  WebKitGTK on Linux, WebView2 on Windows.
├── utm-admin/     Native GUI for the admin (same technology, points at
│                  the coord's /ui).
└── utmctl/        CLI for the admin — scriptable, pairs with utm-admin.
internal/
├── api/           Shared request/response types. One source of truth
│                  for what coord and agent agree to send each other.
├── db/            SQLite layer (peers, rules, tokens, coordinators).
├── wg/            Kernel-WireGuard wrapper (coord only — needs root).
├── wgagent/       Userspace WireGuard wrapper (agents, cross-platform).
├── acl/           Compiles role-based rules into nftables rulesets.
├── replication/   Primary → replica state push for multi-coord clusters.
└── update/        Self-update pipeline (manifest fetch, SHA verify,
                   install.sh --upgrade, rollback).
The shared internal/api/types.go matters a lot — it’s the contract. Any
field change there affects both coord and agent.
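For flavor, here is a minimal sketch of what the enrollment pair plausibly looks like, reconstructed from the JSON reply shown in section 4 (struct and field names are illustrative, not lifted from the real file):

    package api

    // EnrollRequest is what the agent POSTs to the coord's /enroll.
    type EnrollRequest struct {
        Token  string `json:"token"`
        Pubkey string `json:"pubkey"`
    }

    // EnrollResponse mirrors the enrollment reply shown in section 4.
    type EnrollResponse struct {
        AssignedIP      string   `json:"assigned_ip"`
        ServerPubkey    string   `json:"server_pubkey"`
        ServerEndpoint  string   `json:"server_endpoint"`
        ServerEndpoints []string `json:"server_endpoints"`
        NetworkCIDR     string   `json:"network_cidr"`
        CoordinatorIP   string   `json:"coordinator_ip"`
    }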
4. Enrollment: from “add peer” click to working tunnel
Admin side
- Admin UI sends POST /admin/peers with a name + role.
- Coord allocates the next free IP in 10.77.0.0/24 (skips .1, which is always the coord’s), writes a DB row for the peer with a blank pubkey.
- Coord issues a one-shot enrollment token (24h expiry) linked to that peer row.
- Admin UI shows a URL + token pair. Admin sends it to the end user (email, Signal, QR code, however).
Peer side
- End user opens the UTM app on their laptop. The app polls the local agent at http://127.0.0.1:51821/status.
- Agent is running in “unenrolled mode” — no tunnel, just the local API answering with {"enrolled": false}. App renders the Join form.
- User pastes the URL + token. App POSTs to the agent’s /enroll.
- Agent generates a WireGuard keypair locally (private key never leaves this machine), then POSTs {token, pubkey} to the coord’s /enroll.
- Coord atomically consumes the token (can only be used once), writes the peer’s pubkey to the DB, pushes the peer into wg0 with allowed_ips = <peer_ip>/32, and replies with:

      {
        "assigned_ip": "10.77.0.5",
        "server_pubkey": "...",
        "server_endpoint": "73.140.176.8:51820",
        "server_endpoints": ["73.140.176.8:51820", "10.0.0.61:51820"],
        "network_cidr": "10.77.0.0/24",
        "coordinator_ip": "10.77.0.1"
      }

- Agent writes this to state.json, then exits cleanly (status 0). systemd on Linux / SCM with failure actions on Windows restarts the service. On the next startup the agent sees state.json exists and boots in “enrolled mode” (see the sketch after this list).
- Enrolled mode: ip link add wg0 ..., add the coord as a peer with allowed-ips = the network CIDR (catch-all route through the coord), then start polling /config on a 30s ticker.
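The exit-and-restart trick means the agent never hot-swaps modes in-process; mode selection is one check at startup. A sketch of that decision (loadState, runUnenrolled, and runEnrolled are hypothetical names; only state.json and the two modes come from the flow above):

    // Mode is decided once, at startup, purely by whether state.json exists.
    func run() error {
        st, err := loadState(statePath()) // /etc/utm-agent/state.json on Linux
        if errors.Is(err, os.ErrNotExist) {
            return runUnenrolled() // no tunnel; serve the Join API on 127.0.0.1:51821
        }
        if err != nil {
            return err
        }
        return runEnrolled(st) // bring up wg0, start the /config poll loop
    }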
5. Endpoint probing (the v0.15 feature)
The coord advertises every address it’s reachable on — public first,
then each RFC1918 LAN IP on its non-loopback interfaces. This is in the
enrollment response (server_endpoints) and in every /config response
(on the coord’s own entry in the coordinators[] list).
The agent’s runHandshakeProbeLoop ticks every 5 seconds:
every 5s:
look up the current coord peer on wg0
if handshake age < 15s:
mark the current endpoint sticky (save as active_wg_endpoint)
else if we've been stale for 15s+ and cooldown has elapsed:
rotate to next endpoint in the ring
push the new endpoint to wg0 via AddPeer (upsert)
save state
Why this matters: if your laptop is on the same LAN as the coord, your router probably doesn’t support NAT hairpin — the public IP isn’t reachable from inside your own network. Without probing, you’d be stuck. With it: first endpoint fails for 15s, rotate to the LAN IP, handshake in ~1s, connected.
The sticky part (active_wg_endpoint in state.json) means a reboot
doesn’t re-probe from scratch — we remember the one that worked.
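In Go, the loop body comes out to roughly the following. This is a sketch: the helper names (coordHandshakeAge, currentEndpoint, cooldownElapsed, nextEndpoint, setCoordEndpoint) are assumptions, while the 5s/15s thresholds, the ring rotation, the AddPeer upsert, and the sticky save are from the pseudocode above:

    func (a *Agent) runHandshakeProbeLoop(ctx context.Context) {
        tick := time.NewTicker(5 * time.Second)
        defer tick.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-tick.C:
                if a.coordHandshakeAge() < 15*time.Second {
                    // Healthy: remember the endpoint that is working.
                    a.state.ActiveWGEndpoint = a.currentEndpoint()
                    a.saveState()
                    continue
                }
                if a.cooldownElapsed() {
                    // Stale for 15s+: rotate the ring and upsert the coord
                    // peer on wg0 with the next candidate endpoint.
                    a.setCoordEndpoint(a.nextEndpoint())
                    a.saveState()
                }
            }
        }
    }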
6. Direct peer-to-peer path
Hub-and-spoke is the default because it always works, but it’s inefficient for peer-to-peer traffic (every packet goes through the coord and back). The coord tries to help peers find each other directly:
- Coord’s runEndpointTracker samples wg show wg0 every 10s and records each peer’s current endpoint (the ip:port their UDP packets are arriving from, as observed by the kernel).
- In /config responses, the coord includes every other enrolled peer’s pubkey + observed endpoint.
- Agent’s applyConfig adds those peers as direct WG peers with allowed_ips = peer_ip/32 (sketched below). The more-specific /32 wins against the coord’s /24 catch-all.
- If a direct peer’s handshake goes stale (>90s), the agent drops it from the peer set — traffic falls back through the coord via the /24 route.
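A sketch of that reconcile step (field and helper names are assumptions; the /32 pinning and the 90s staleness cutoff are from the list above):

    func (a *Agent) applyConfig(cfg api.Config) {
        // Add every other enrolled peer as a direct WG peer. The /32
        // AllowedIPs is more specific than the coord's /24, so it wins.
        for _, p := range cfg.Peers {
            if p.Pubkey == a.state.Pubkey || p.Endpoint == "" {
                continue
            }
            a.wg.AddPeer(p.Pubkey, p.Endpoint, p.OverlayIP+"/32")
        }
        // Drop direct peers whose handshake went stale; their traffic
        // falls back through the coord via the /24 catch-all route.
        for _, p := range a.directPeers() {
            if a.handshakeAge(p) > 90*time.Second {
                a.wg.RemovePeer(p.Pubkey)
            }
        }
    }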
This is the “try direct, fall back to hub” pattern — simpler than ICE or STUN and works because we don’t need symmetric-NAT traversal, just good enough NAT that UDP state survives.
7. ACL enforcement
Roles are strings attached to each peer: user, operator, admin by
default (extensible). Rules are role→role pairs: “users can reach
operators”. A 3×3 grid in the admin UI.
The enforcement happens on the coord via nftables. When any peer
tries to route a packet through the coord (hub path), the coord’s
nftables forward chain checks:
- source IP’s role
- destination IP’s role
- is there a rule allowing src_role → dst_role?
No rule = packet dropped. Return traffic for an established flow is auto-allowed via conntrack.
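For a feel of what internal/acl emits, here is a hand-written approximation of a ruleset with a single allowed pair. Table and chain names are guesses; the addresses are the example peers from section 2:

    table inet utm {
        chain forward {
            type filter hook forward priority filter; policy drop;

            # return traffic for established flows
            ct state established,related accept

            # rule user -> operator: peer A (10.77.0.2) may reach peer B (10.77.0.3)
            ip saddr 10.77.0.2 ip daddr 10.77.0.3 accept
        }
    }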
syncACL() runs after every peer create/delete/role change or rule
change. It rebuilds the entire ruleset from the DB and atomically swaps
it in. Rebuilding is O(peers × rules), but the DB is small (100 nodes
tops), so a full rebuild takes microseconds.
Direct peer-to-peer traffic bypasses the coord, so ACLs don’t apply. Acceptable tradeoff: if two peers are allowed to talk per the ACL at all, direct is fine; if they’re not, the coord-side firewall would drop hub traffic and there’s no return path for direct.
8. State persistence
Nothing is in-memory-only. Reboots always bring back exactly the state that was last persisted:
- Coord: /var/lib/utm/utm.db (SQLite: peers, rules, tokens, coordinators). Config in /etc/utm/coordinator.env (tokens, endpoints, WG port). Server keypair in /var/lib/utm/server-keys.json.
- Agent (Linux): /etc/utm-agent/state.json (pubkey, assigned IP, current coord, known coords list, sticky endpoint).
- Agent (Windows): C:\ProgramData\UTM\state.json (same shape).
The state files are plain JSON, intentionally — humans can read them and we can hand-edit in an emergency without needing any UTM tooling. They’re also small (~1KB for a typical agent).
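An illustrative agent state.json (every key except active_wg_endpoint is a guess at the naming; the contents are the ones listed above):

    {
      "pubkey": "base64-encoded-public-key=",
      "assigned_ip": "10.77.0.5",
      "current_coordinator": "73.140.176.8:8080",
      "coordinators": ["73.140.176.8:8080"],
      "active_wg_endpoint": "10.0.0.61:51820"
    }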
9. Self-updates
The coord fetches a manifest JSON every 6 hours from a configurable URL
(UTM_UPDATE_MANIFEST_URL). The manifest lists the latest version,
SHA-256, and download URL for each component.
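A plausible manifest shape. The schema isn’t documented here, so field names and URLs are placeholders; the information carried (version, SHA-256, download URL per component) is what this section requires:

    {
      "version": "0.15.0",
      "components": {
        "coordinator": {
          "url": "https://updates.example.com/utm-coordinator-0.15.0-linux-amd64.tar.gz",
          "sha256": "<64 hex chars>"
        },
        "agent": {
          "url": "https://updates.example.com/utm-agent-0.15.0-linux-amd64.tar.gz",
          "sha256": "<64 hex chars>"
        }
      }
    }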
When admin clicks “Apply” in Settings:
- Coord downloads the tarball to /opt/utm/staging/.
- Verifies SHA-256 matches the manifest.
- Extracts to /opt/utm/<version>/.
- Runs that version’s install.sh --upgrade — because the installer ships alongside the binaries, it always knows the current version’s prerequisites. If v0.15 needs a new apt package, the new install.sh installs it automatically.
- install.sh --upgrade atomically swaps binaries in /usr/local/bin/, keeping the old ones as .prev. Then it exec’s systemctl restart utm-coordinator.
- If the new coord fails to come back up (health check), there’s a rollback button that moves .prev back into place.
The key design decision: every tarball contains its own installer, so we don’t have to maintain forward-compat in an older installer. The running coord downloads the new tarball and runs the new script, which knows everything the new version needs.
10. Multi-coord clusters (primary + replicas)
A single coord is a single point of failure. UTM supports an active/passive cluster:
- One coord is role=primary, handles all writes.
- Zero or more coords are role=replica, read-only.
- Primary pushes state changes to replicas via HTTP (POST /internal/push) on every write — peer upsert, rule upsert, coord upsert.
- Replicas heartbeat back to the primary every 15s so the primary knows who’s alive.
- Peers know about every coord in the cluster via the /config response’s coordinators[] list. If the current coord stops responding, the agent’s pollOnce rotates through the list until one answers (sketched below).
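A sketch of that rotation (fetchConfig and coordIdx are hypothetical; the behavior, walking the coordinators[] list until someone answers and then sticking, is from the list above):

    func (a *Agent) pollOnce(ctx context.Context) (*api.Config, error) {
        coords := a.state.Coordinators // kept fresh from /config's coordinators[]
        for i := 0; i < len(coords); i++ {
            idx := (a.coordIdx + i) % len(coords)
            cfg, err := a.fetchConfig(ctx, coords[idx])
            if err != nil {
                continue // this coord is down or unreachable; try the next
            }
            a.coordIdx = idx // stick with the coord that answered
            return cfg, nil
        }
        return nil, errors.New("no coordinator reachable")
    }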
This is intentionally simple — not Raft, not Paxos. Primary writes, replicas read-only, failover is manual (admin runs a command to promote a replica to primary). A 100-node mesh doesn’t need distributed consensus; it needs one reliable coord with a warm standby.
11. Why certain choices
Userspace WireGuard (wireguard-go) on agents, not kernel wg.
Windows doesn’t have kernel WG without Wintun. Kernel WG also requires
CAP_NET_ADMIN which is awkward for user-mode tooling. Userspace works
everywhere (Linux, Windows, later macOS) with zero OS-level config.
Coord uses kernel wg because it’s server-class and performance matters.
Two separate processes on each peer (agent + utm-client).
The agent is a system service that needs to survive user logins/logouts.
The utm-client is a per-user GUI app that opens a webview. Keeping them
separate means the agent doesn’t need any GUI dependencies (webkit2gtk
on Linux, WebView2 on Windows) — just a plain Go binary. They talk via
the loopback HTTP API on 127.0.0.1:51821.
Embed the web UI in the Go binary. //go:embed all:web bundles
HTML/CSS/JS into the executable, served over a local HTTP listener on a
random port. Single binary, no “oops forgot to copy the assets” bug.
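The whole pattern is about a dozen lines. A sketch (serveUI is a made-up name; the //go:embed all:web directive is the one the project uses):

    import (
        "embed"
        "io/fs"
        "net"
        "net/http"
    )

    //go:embed all:web
    var webFS embed.FS

    // serveUI serves the embedded assets on a random loopback port and
    // returns the URL the webview should open.
    func serveUI() (string, error) {
        sub, err := fs.Sub(webFS, "web") // strip the "web/" prefix
        if err != nil {
            return "", err
        }
        ln, err := net.Listen("tcp", "127.0.0.1:0") // :0 = kernel picks a free port
        if err != nil {
            return "", err
        }
        go http.Serve(ln, http.FileServer(http.FS(sub)))
        return "http://" + ln.Addr().String() + "/", nil
    }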
Apple-style CSS in vanilla HTML. No React/Vue/framework. The UI is small enough (5 pages) that a framework would be more code than the pages themselves. Reduces build complexity and binary size.
Folder-per-version on disk. ~/Documents/Projects/UTM-vX.Y.Z/ is
a full copy, not a git branch. We don’t need branching for a small
project with one maintainer; we need to be able to rm -rf a version
and know nothing else got touched. build-dist.sh vendors all Go deps
so builds are reproducible even offline.
12. Failure modes + debugging
- “agent not responding” in the app → agent service is stopped or crashed. Check sudo systemctl status utm-agent (Linux) or Get-Service UTMAgent (Windows).
- “connecting — no handshake yet” → the network layer is reaching the coord but the WG handshake isn’t completing. Check sudo wg show wg0 on the coord for a peer with the agent’s pubkey; if it’s missing, enrollment didn’t land. If it’s there but latest handshake is never, UDP 51820 isn’t reaching the coord (router not forwarding, or peer on the same LAN as the coord without hairpin — the v0.15 probe should rotate to the LAN endpoint within 20s).
- “enrolled peer but coord has no record” → the token was consumed on a different coord, or the peer was deleted after enrollment. /admin/peers is the source of truth; Leave mesh on the peer and rejoin with a fresh token.
- Settings Save → “Load failed” → coord exited for self-restart but systemd didn’t bring it back (must be Restart=always, not Restart=on-failure, because the coord exits cleanly — see the snippet below).
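That last one is easy to get wrong, and both daemons lean on clean-exit-then-restart (the agent’s enrollment hand-off in section 4, the coord’s self-update restart in section 9). The relevant lines in the unit file look like this (a snippet with an assumed binary path, not the full unit):

    [Service]
    ExecStart=/usr/local/bin/utm-coordinator
    Restart=always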
13. Ports used
- 8080/tcp — coord HTTP API (configurable via --listen). Admin UI, agent polls, enrollment. Needs internet reachability for remote peers.
- 51820/udp — coord WireGuard. Needs internet reachability for remote peers to handshake.
- 51821/tcp — agent loopback-only status API. Never exposed externally.
- Random high port — utm-client’s local web server (embedded UI). Loopback-only.
14. Cryptography
- WireGuard: Curve25519 + ChaCha20-Poly1305 + BLAKE2s. Standard WG, no custom crypto.
- Admin token: 32-char base64 random, rotated via UI, stored in /etc/utm/coordinator.env. Bearer token on every admin API call.
- Enrollment token: 24-char base64 random, one-shot, 24h expiry, stored in coord DB. Consumed atomically at /enroll.
- Tarball integrity: SHA-256 in the manifest, verified before extraction. Prevents tampered downloads even over plain HTTP (we use HTTPS anyway).
- Server identity: coord’s WireGuard public key is in the enrollment response; the agent pins it. A MITM on the HTTP channel can’t replace the coord without the pubkey mismatch being immediate.
15. Code reading order
If you want to understand the whole system, read in this order:
1. internal/api/types.go — the contract.
2. cmd/coordinator/main.go handlers — handleEnroll, handleConfig, handleCreatePeer.
3. cmd/agent/main.go — main, doEnroll, runPollLoop, applyConfig, runHandshakeProbeLoop.
4. cmd/utm-client/main.go + web/app.js — native window → webview → polls the loopback API.
5. internal/update/update.go — self-update pipeline.
Total code is ~3000 lines of Go + ~1500 lines of JS/CSS/HTML. Small enough to read cover-to-cover in an afternoon.