A multi-tenant AI agent operating system running as stateless compute on a hardened Ubuntu VPS, with Postgres as the single source of truth and a zero-trust private mesh as the default access plane.
Built and operated solo. This page exists to document the architecture and the three engineering decisions I'd defend in any room.
The system originally split state between a SQLite database and a JSON file on disk. The application wrote to both. This is a dual-write architecture, and dual-write architectures are a failure class — not a bug. The two stores drifted. The UI showed one version of reality, the database showed another, and I ended up SSH'd in at midnight reconciling state by hand. More than once.
I stopped patching individual symptoms and moved Postgres to the role of single source of truth. The VPS became stateless compute. Every read and write goes to one place. There's no second store to drift against.
The tradeoff: I now depend on Postgres availability and pay a network round-trip on every state access. I took that trade because the failure mode I was leaving behind — silent data drift I only noticed when the UI lied to me — is much worse than the failure mode I was moving toward, which is a loud, obvious outage if Postgres goes down. Loud failures are debuggable. Silent drift erodes trust in the whole system.
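The failure class is easy to demonstrate. The sketch below is hypothetical (the function and table names are mine, and in-memory SQLite stands in for the real stores so it runs anywhere); it shows why a crash between two unrelated writes produces exactly the silent drift described above, while a single transactional store can't.

```python
import sqlite3

def save_task_dual(db, json_state, task, crash_before_second_write=False):
    """The failure class: two stores, two writes, no shared transaction."""
    db.execute("UPDATE tasks SET status = ? WHERE id = ?",
               (task["status"], task["id"]))
    db.commit()                                    # write #1 lands
    if crash_before_second_write:                  # process dies here
        raise RuntimeError("crashed between writes")
    json_state[task["id"]] = task["status"]        # write #2 never happens

def save_task_single(db, task):
    """One store, one transaction: commits on success, rolls back on error."""
    with db:
        db.execute("UPDATE tasks SET status = ? WHERE id = ?",
                   (task["status"], task["id"]))

# Demonstrate the drift with an in-memory stand-in.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO tasks VALUES ('t1', 'pending')")
json_state = {"t1": "pending"}

try:
    save_task_dual(db, json_state, {"id": "t1", "status": "done"},
                   crash_before_second_write=True)
except RuntimeError:
    pass

# The two "sources of truth" now disagree, and nothing raised an alarm.
in_db = db.execute("SELECT status FROM tasks WHERE id = 't1'").fetchone()[0]
```

After the simulated crash, the database says `done` while the JSON state still says `pending`. No exception points at the inconsistency itself; you only find it when the UI lies to you.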
The system runs on a public VPS, but exactly one port is reachable from the public internet. Everything else — admin, SSH, the UI, inter-service traffic — runs over a private Tailscale mesh. The one public port exists because edge functions in a managed serverless environment have to call back into the dispatch engine, and they don't live on my Tailnet. There's no way around it.
Every other service binds to the Tailscale interface, never to 0.0.0.0. That's enforced as an invariant: every deploy runs a check that greps the running container list for any binding to all interfaces, and the deploy fails if anything comes back. It's not a code review item. It's a build gate.
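The gate itself is small. This is a sketch of the idea rather than the production script: it parses `docker ps` output for any mapping bound to all interfaces, with an allowlist for the one sanctioned public service (the service name here is hypothetical).

```python
import subprocess
import sys

def find_public_bindings(ps_output: str, allow: frozenset = frozenset()) -> list:
    """Return container names whose port mappings bind all interfaces
    (0.0.0.0 or [::]), skipping explicitly allowlisted containers."""
    offenders = []
    for line in ps_output.splitlines():
        name, _, ports = line.partition("\t")
        if name in allow:
            continue
        if "0.0.0.0" in ports or "[::]" in ports:
            offenders.append(name)
    return offenders

def main() -> int:
    # `docker ps --format` emits one "name<TAB>ports" line per container,
    # e.g. "web\t100.64.0.5:8080->8080/tcp" or "api\t0.0.0.0:443->443/tcp".
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Ports}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # "dispatch-callback" is a placeholder for the one public service.
    bad = find_public_bindings(out, allow=frozenset({"dispatch-callback"}))
    if bad:
        print(f"deploy blocked: public bindings: {', '.join(bad)}",
              file=sys.stderr)
        return 1
    return 0
```

Wired into the deploy script as `sys.exit(main())`, a nonzero exit fails the pipeline, which is what makes it a build gate rather than a reminder.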
I structured it that way because public exposure isn't a thing you decide once at architecture time. It's a thing that creeps in. Someone — including me, at midnight, debugging — adds a port mapping "just for now," and it stays. The only way I've found to prevent that is to make the unsafe state mechanically impossible to ship.
The tradeoff: friction. Every device I want to use has to be on the mesh first. If I'm somewhere without it, I can't get in. I treated that friction as the feature, not the cost. The system handles sensitive data; anything that lets me in casually lets an attacker in casually.
When I started running the agent system in production, average per-call context was around 8,000 tokens. At any meaningful volume, that's the cost line that matters — not infrastructure, not storage, but the model calls themselves.

Two things were going wrong. First, I was re-sending the same system prompts and agent personas on every call, even when nothing about them had changed. Second, I was doing bulk retrieval — pulling in large chunks of context "just in case the agent needs it" rather than figuring out what it actually needed for this specific call.
I made two changes:
Prompt caching for the stable parts of context (system prompts, agent definitions, things that don't change call-to-call). Full token cost is paid only on the first call; repeat calls read that stable prefix from cache.
Surgical retrieval instead of bulk. Stopped pulling whole documents into context. Started pulling only the specific passages relevant to the current task.
Roughly a 65% reduction in per-call token usage, and the monthly model bill moved with it. Nothing got dumber — response quality improved, because the model wasn't being asked to find a needle in a haystack of irrelevant context on every call.
Live components from the production system. These are visual representations of the real interfaces — task dispatch, workforce management, and cost tracking.
This is a solo-operated production system. It's not a team project, it's not a research artifact, and it's not architected for hyperscale. It's architected for the constraints I actually have: one operator, real users, sensitive data, and a tight cost budget against frontier models.
The decisions above are the ones that matter most, told honestly, with the tradeoffs visible. If any of them are wrong, I'd rather find out in a conversation than in production.