
Agent VOS

A multi-tenant AI agent operating system running as stateless compute on a hardened Ubuntu VPS, with Postgres as the single source of truth and a zero-trust private mesh as the default access plane.

Built and operated solo. This page exists to document the architecture and the three engineering decisions I'd defend in any room.

How the system is wired

flowchart TB
  subgraph Devices["Authorized Devices"]
    L["Laptop — primary"]
    D["Desktop — local inference"]
    P["iPhone"]
  end
  subgraph Mesh["Tailscale Private Mesh"]
    TS(("zero-trust\naccess plane"))
  end
  subgraph VPS["Hardened VPS — Stateless Compute"]
    direction TB
    DISP["Dispatch Engine"]
    UI["Command Center UI"]
    AGENTS["Agent Workforce\nmulti-provider routing"]
  end
  subgraph SoT["Single Source of Truth"]
    PG[("Postgres")]
  end
  subgraph Public["Public Internet — One Door"]
    EDGE["Edge Function Callbacks"]
  end
  L --> TS
  D --> TS
  P --> TS
  TS --> VPS
  EDGE -->|"single audited port"| DISP
  DISP <--> PG
  UI <--> PG
  AGENTS <--> PG
  AGENTS -->|"fallback chain"| PROVIDERS["Anthropic / OpenAI /\nOpenRouter / local Ollama"]
  classDef stateless fill:#1a2332,stroke:#38bdf8,color:#e2e8f0
  classDef sot fill:#1a2332,stroke:#fbbf24,color:#e2e8f0
  classDef public fill:#1a2332,stroke:#f87171,color:#e2e8f0
  classDef mesh fill:#1a2332,stroke:#818cf8,color:#e2e8f0
  classDef devices fill:#1a2332,stroke:#4ade80,color:#e2e8f0
  class VPS,DISP,UI,AGENTS stateless
  class PG sot
  class EDGE public
  class TS mesh
  class L,D,P devices
Reading the diagram: every device reaches the system through Tailscale. The VPS holds no durable state — if it burned down tomorrow, nothing of value would be lost. Postgres is the only thing that has to survive. Exactly one port on the VPS is reachable from the public internet, and it exists for one specific reason explained below.

Three decisions I'd defend cold

Decision 01

Stateless VPS, stateful Postgres

The system originally split state between a SQLite database and a JSON file on disk. The application wrote to both. This is a dual-write architecture, and dual-write architectures are a failure class — not a bug. The two stores drifted. The UI showed one version of reality, the database showed another, and I ended up SSH'd in at midnight reconciling state by hand. More than once.

I stopped patching individual symptoms and moved Postgres to the role of single source of truth. The VPS became stateless compute. Every read and write goes to one place. There's no second store to drift against.

The tradeoff: I now depend on Postgres availability and pay a network round-trip on every state access. I took that trade because the failure mode I was leaving behind — silent data drift I only noticed when the UI lied to me — is much worse than the failure mode I was moving toward, which is a loud, obvious outage if Postgres goes down. Loud failures are debuggable. Silent drift erodes trust in the whole system.
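The shape of the fix is small enough to sketch. A minimal stand-in, with an in-memory SQLite connection playing the role of the managed Postgres instance so the example runs anywhere; the table and column names are invented:

```python
import sqlite3

class TaskStore:
    """Single source of truth: every read and write goes through one
    store. There is no second copy (JSON file, cache) that could drift."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, status TEXT)"
        )

    def set_status(self, task_id, status):
        # One write path: the database row is the only record.
        self.conn.execute(
            "INSERT INTO tasks (id, status) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
            (task_id, status),
        )
        self.conn.commit()

    def get_status(self, task_id):
        # One read path: the UI and the dispatcher both call this,
        # so they can never show two versions of reality.
        row = self.conn.execute(
            "SELECT status FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        return row[0] if row else None

store = TaskStore(sqlite3.connect(":memory:"))
store.set_status("t-1", "active")
store.set_status("t-1", "review")
print(store.get_status("t-1"))  # the one and only answer: review
```

The point is the absence of a second write target, not the schema: the dual-write bug becomes unrepresentable.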

When you find yourself writing reconciliation scripts, the bug isn't in the script you're about to write. It's in the architecture that made the script necessary.
Decision 02

One door, watched carefully

The system runs on a public VPS, but exactly one port is reachable from the public internet. Everything else — admin, SSH, the UI, inter-service traffic — runs over a private Tailscale mesh. The one public port exists because edge functions in a managed serverless environment have to call back into the dispatch engine, and they don't live on my Tailnet. There's no way around it.

Every other service binds to the Tailscale interface, never to 0.0.0.0. That's enforced as an invariant: every deploy runs a check that greps the running container list for any binding to all interfaces, and the deploy fails if anything comes back. It's not a code review item. It's a build gate.
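The gate itself is simple. A sketch of the check, with the parsing factored out so it runs without Docker; the input format follows `docker ps --format '{{.Names}}\t{{.Ports}}'`, and the container names and addresses in the sample are invented:

```python
def find_public_bindings(docker_ps_ports: str) -> list[str]:
    """Return lines that publish a port on all interfaces.

    A binding like `0.0.0.0:8080->...` (or the IPv6 form `[::]:8080`)
    is reachable from the public internet and fails the gate. A binding
    on a specific private address (e.g. the Tailscale interface) passes.
    """
    offenders = []
    for line in docker_ps_ports.splitlines():
        if "0.0.0.0:" in line or "[::]:" in line:
            offenders.append(line.strip())
    return offenders

sample = """\
dispatch-engine\t100.64.0.1:8088->8088/tcp
command-center\t100.64.0.1:3000->3000/tcp
debug-shell\t0.0.0.0:9229->9229/tcp"""

bad = find_public_bindings(sample)
print(bad)  # only the debug-shell line trips the gate
```

In the deploy sequence, a non-empty result fails the build outright rather than emitting a warning.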

I structured it that way because public exposure isn't a thing you decide once at architecture time. It's a thing that creeps in. Someone — including me, at midnight, debugging — adds a port mapping "just for now," and it stays. The only way I've found to prevent that is to make the unsafe state mechanically impossible to ship.

The tradeoff: friction. Every device I want to use has to be on the mesh first. If I'm somewhere without it, I can't get in. I treated that friction as the feature, not the cost. The system handles sensitive data; anything that lets me in casually lets an attacker in casually.

One door, watched carefully, is more defensible than ten doors you mean to watch.
Decision 03

Token efficiency: ~8K to ~2.8K average per call

When I started running the agent system in production, average per-call context was around 8,000 tokens. At any meaningful volume, that's the cost line that matters — not infrastructure, not storage, the model calls themselves.

Two things were going wrong. First, I was re-sending the same system prompts and agent personas on every call, even when nothing about them had changed. Second, I was doing bulk retrieval — pulling in large chunks of context "just in case the agent needs it" rather than figuring out what it actually needed for this specific call.

I made two changes:

Prompt caching for the stable parts of context (system prompts, agent definitions, things that don't change call-to-call). Full token cost only on the first call.

Surgical retrieval instead of bulk. Stopped pulling whole documents into context. Started pulling only the specific passages relevant to the current task.
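For the caching half, this is roughly what a request body looks like using the shape of Anthropic's Messages API, where a `cache_control` marker on a system block tells the provider to cache the stable prefix. No request is sent here; the model id, persona text, and task are illustrative:

```python
# Stable across calls: persona and rules. This is the part worth caching.
SYSTEM_PROMPT = "You are Bert, the development agent. Follow the house style guide."

def build_request(task: str) -> dict:
    return {
        "model": "claude-sonnet-4",  # illustrative model id
        "max_tokens": 1024,
        # The marked prefix is cached provider-side after the first call,
        # so its full token cost is paid once, not on every dispatch.
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the per-call, task-specific context travels at full price.
        "messages": [{"role": "user", "content": task}],
    }

req = build_request("Audit Q2 campaign performance data")
```

The split matters more than the API details: anything identical call-to-call goes in the cached prefix, and only what changes rides in the messages.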

Before: ~8K avg tokens / call
After: ~2.8K avg tokens / call

Roughly a 65% reduction, and the monthly model bill moved with it. Nothing got dumber — response quality improved, because the model wasn't being asked to find a needle in a haystack of irrelevant context on every call.

The default instinct with LLMs is to give the model more context, because more context feels safer. It's actually the opposite. More context is more cost, more latency, and more noise to filter through. The discipline is asking, on every call, what's the minimum the model needs to do this job well — and designing the retrieval layer to deliver exactly that.
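The retrieval half of that discipline can be sketched as ranking stored passages against the current task and pulling only the top few into context. Word-overlap scoring is a deliberately naive stand-in for whatever similarity measure the real retrieval layer uses, and the passages are invented:

```python
def score(query: str, passage: str) -> int:
    # Naive relevance: count of shared lowercase words.
    q = set(query.lower().split())
    return len(q & set(passage.lower().split()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    # Surgical retrieval: only the k best passages enter the context
    # window, never the whole document set.
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]

passages = [
    "Campaign performance is reported weekly in the analytics table.",
    "The deploy sequence has ten steps and a health check.",
    "Q2 campaign spend and conversion data live in the metrics schema.",
]
print(retrieve("audit Q2 campaign performance", passages, k=2))
```

Swapping the scorer for embeddings changes nothing structural: the cap `k` (or a token budget in its place) is what keeps per-call context at the minimum.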

Inside the Command Center

Snapshots of the production Command Center interfaces, rendered here as static data: task dispatch (duty board), workforce management (org chart), and cost tracking.

command-center / duty-board

Draft (2)
- Audit Q2 campaign performance data (Queen, P2)
- Draft API docs for client endpoints (Bert, P2)

Active (3)
- Build landing page for Lumora rebrand (Dua, P0)
- Implement multi-provider fallback routing (Bert, P1)
- Design social media content calendar (Queen, P1)

Review (2)
- SEO optimization for client portfolio site (Dua, P1)
- Cost tracking dashboard aggregation (Bert, P2)

Approved (2)
- Email automation workflow setup (Queen, P0)
- Database migration to managed Postgres (Bert, P0)
dispatch-engine / live-feed
command-center / org-chart
- Cooper: Orchestrator
- Queen: Marketing Lead
- Dua: Marketing, Creative
- Bert: CTO / Development
command-center / cost-tracker

Today: $1.47 (23 dispatches)
This Week: $8.32 (142 dispatches)
Avg / Call: $0.058 (~2,800 tokens avg)

Cost by Model
- Sonnet 4: $5.16
- Haiku 3.5: $2.00
- GPT-4o mini: $0.83
- Ollama local: $0.00

Cost by Agent
- Bert: $3.74
- Queen: $2.50
- Dua: $1.66
- Cooper: $0.42

High level

Host: Ubuntu 24.04 LTS on a single VPS
Network: Tailscale mesh, UFW + Fail2ban + iptables, single audited public ingress
Runtime: Docker Compose, all services bound to private interface
State: Managed Postgres as single source of truth
Models: Multi-provider routing (Anthropic, OpenAI, OpenRouter, local Ollama)
Security: SSH key-only, root disabled, GPG-encrypted backups, context sanitization
Deploy: 10-step sequence with health check and binding-invariant gate
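The multi-provider routing above can be sketched as a simple fallback chain: try each provider in order, fall through on failure, and land on local Ollama as the zero-cost last resort. The provider callables here are invented stand-ins for the real SDK wrappers:

```python
class ProviderError(Exception):
    """Raised by a provider wrapper on rate limit, timeout, or outage."""

def route(task: str, providers: list) -> str:
    last_err = None
    for name, call in providers:
        try:
            return call(task)
        except ProviderError as e:
            # Record the failure and fall through to the next provider.
            last_err = e
    raise RuntimeError(f"all providers failed: {last_err}")

def flaky(task):
    # Stand-in for a hosted provider that is currently down.
    raise ProviderError("rate limited")

def local_ollama(task):
    # Stand-in for the local model: always available, costs nothing.
    return f"handled locally: {task}"

chain = [("anthropic", flaky), ("openai", flaky), ("ollama", local_ollama)]
print(route("draft API docs", chain))  # handled locally: draft API docs
```

Ordering the chain by preference rather than price is a policy choice; the mechanism stays the same either way.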

What this is, and what it isn't

This is a solo-operated production system. It's not a team project, it's not a research artifact, and it's not architected for hyperscale. It's architected for the constraints I actually have: one operator, real users, sensitive data, and a tight cost budget against frontier models.

The decisions above are the ones that matter most, told honestly, with the tradeoffs visible. If any of them are wrong, I'd rather find out in a conversation than in production.