System Overview
Multi-tenant AWS platform that deploys hardened, fully isolated customer environments running AI agents. Two customer templates ship today: a baseline environment for custom apps, and the packaged M8trx Agent (Paperclip + Hermes + Claude Code) reachable on a per-customer HTTPS subdomain. A shared Brain Telemetry control plane connects every Agent instance to a central Brain over a Tailscale mesh — for monitoring, key rotation, and operator visibility — without ever exposing a customer instance to another customer.
Executive Summary
m8trxDeployer deploys and manages fully isolated, hardened AWS environments for customers running AI-powered business applications. Today every customer gets their own VPC, KMS key, S3 bucket, IAM role, and EC2 instance — zero shared infrastructure between customer data planes. AI agents that process sensitive data (email, financials, business documents) run in Docker containers managed by the customer's own compose stack, with seccomp, AppArmor, non-root users, and strict resource and network limits.
Two customer templates ship today: baseline (operator-managed Docker compose, single port) and the packaged M8trx Agent (Paperclip orchestrator + Caddy TLS + Postgres + local Llama, reachable at {customer}.{platform-domain} over HTTPS via Let's Encrypt). The Agent template auto-provisions a Cloudflare DNS record, a per-customer Tailscale device on the Brain mesh, and a tag-bound ephemeral auth key on every deploy.
The platform is designed for minimal customer interaction — customers receive a URL and credentials, nothing else. All infrastructure, security, monitoring, and the per-customer Brain key rotation are handled by your team through a unified operations dashboard. Admins reach the dashboard over Tailscale (recommended) or AWS SSM port-forwarding — zero ports open to the public internet. The dashboard supports multi-admin use; any team member can deploy a new customer, rotate the agent UI password, run shell commands via SSM, view logs, take snapshots, or quarantine an instance without ever touching the AWS console.
Where Cloudflare fits. We use Cloudflare for one thing: DNS only (records created with proxied=false). Each agent gets a subdomain under our platform zone (m8trx.ai). The Cloudflare API token, zone ID, and platform domain live in dashboard settings. We do not use Cloudflare's HTTPS proxy, Tunnel, or Access — origin TLS is terminated on the customer instance by Caddy with a Let's Encrypt cert obtained at first boot.
Planned (not yet shipped): a multi-tenant compute mode that packs multiple low-traffic customers onto a single hardened EC2 host. Today every customer is on its own EC2 — but the agent stack is already a Docker compose project (Path B), the project name is namespaced per-customer, the storage is a per-customer KMS-encrypted volume, and each agent already runs its own tailscaled with a tag-bound identity. The remaining work is host-level orchestration of multiple compose projects, a credential broker for per-customer IAM, and a SNI-routing reverse proxy. Goal: drop per-customer cost from ~$37/mo to ~$8–12/mo for small customers without weakening the per-customer trust boundary. See the Compute tab for the full breakdown.
Every Agent customer gets a single, stable Customer UID (cust_<name>) that drives the brain key, Tailscale tag, SSM path, and subdomain. A customer can run one agent or many — all share the same tag, forming a per-customer Tailscale "VLAN". Cross-customer traffic is denied by ACL; cross-customer AWS access is denied by IAM and KMS. A new Agent customer deploys to a live {name}.m8trx.ai in ~5 minutes.
Platform Readiness — current state of one-time setup
- Hardened base AMI: built by Packer and referenced in base.tfvars.json; the M8trx Agent template uses the same AMI and bootstraps the agent stack at first boot (Path B — git+compose, no agent-specific AMI bake).
- Terraform state backend: m8trx-terraform-state (versioned, KMS-encrypted, public access blocked) + DynamoDB lock table terraform-locks in us-east-2. Backend configured in terraform/main.tf.
- Cloudflare DNS: m8trx.ai is live on Cloudflare. API token + zone ID stored in dashboard settings (or env: CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN). Subdomains created on deploy with proxied=false — Caddy on the customer instance terminates origin TLS via Let's Encrypt.
- Brain Telemetry & Tailscale: per-customer keys and settings are written pre-deploy to SSM under /m8trx/<name>/*. Destroy reverses all of it. See Brain & Tailscale tab.
- Customer shell access: SSM Session Manager only; sessions land as ubuntu (not root, not ssm-user).
- Operations dashboard: HTTP Basic Auth on every route except /api/health. DASHBOARD_PASS required at start (no default — process refuses to boot without it). PID lock prevents duplicate instances. cron watchdog auto-restarts every minute.
- Organization SCPs: policies/scp-guardrails.json is complete. Action: attach to the OU in AWS Organizations console — cannot verify application status from this dashboard.
Per-Customer Isolation
Dedicated VPC with public subnet, internet gateway, and VPC flow logs. Cost-optimized (no NAT/ALB). No cross-customer network path exists at the AWS layer.
Dedicated KMS CMK per customer with auto-rotation. EBS, S3, and Brain-telemetry SSM SecureStrings encrypted. One customer's key cannot decrypt another's data.
For the M8trx Agent template, the Paperclip stack runs as a Docker compose project (Caddy + Paperclip + Postgres + local Llama + bridge), each container non-root with seccomp, dropped caps, and resource limits. AppArmor on the host.
Each Agent customer joins a Tailscale tailnet under a tag-bound, ephemeral 90-day auth key, and gets a unique Brain customer key for telemetry. Cross-customer ACLs reject all traffic; the Brain reaches each instance only via its tag.
Network Architecture
Each customer is fully isolated at the AWS network layer. There are no VPC peering connections, no shared subnets, and no AWS-internal paths between customer VPCs. Operator and Brain reach each customer only over Tailscale (an encrypted overlay) using customer-specific tags. Each Agent customer is also reachable from the public internet on its own subdomain over HTTPS, with TLS terminated by Caddy on the instance.
VPC Design (Per Customer)
| Component | CIDR / Config | Purpose |
|---|---|---|
| VPC | 10.{octet}.0.0/16 (octet allocated stably from the customer-name hash — see the sketch below the table) | Isolated network per customer; the octet is not reused after deletion until it is freed |
| Public Subnet (x1) | /24 in AZ-a | EC2 instance with EIP — outbound for SSM, GHCR pulls, Brain, Tailscale; inbound only on the template-specific port(s) |
| Internet Gateway | Free, attached to VPC | Outbound internet (no NAT — saves ~$32/mo per customer) |
| Flow Logs | CloudWatch, 30-day retention | Full audit trail of all VPC traffic |
| VPC Peering | None | Operator and Brain reach customers over Tailscale, not VPC peering. (mgmt-proxy module retired 2026-04-21.) |
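How a name maps to a stable CIDR octet can be pictured with a small sketch. This is illustrative only — the deployer's actual allocator (hash, range, and collision handling) may differ:

# Illustrative: derive a stable second octet from the customer name.
NAME="acme"
OCTET=$(( 0x$(printf '%s' "$NAME" | sha256sum | cut -c1-4) % 241 + 10 ))   # keeps octet in 10–250
echo "customer ${NAME} -> 10.${OCTET}.0.0/16"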
Security Group Per Template
| Template | Ingress | Egress | Notes |
|---|---|---|---|
| Baseline (custom-app) | :8080/tcp from 0.0.0.0/0 | All | Single application port. Operator brings their own compose stack. |
| M8trx Agent | :80/tcp + :443/tcp from 0.0.0.0/0 | All | Caddy serves HTTPS on :443 with a Let's Encrypt cert. :80 stays open for ACME HTTP-01 challenges and direct-IP HTTP fallback (auto-redirect disabled). |
DNS & Subdomain Architecture
Every M8trx Agent instance gets its own HTTPS subdomain under the platform zone (m8trx.ai). Subdomains are created and torn down by the dashboard through the Cloudflare API (dashboard/services/domain.py) — DNS only (proxied=false). Cloudflare does not proxy traffic; TLS is terminated on the customer instance by Caddy using a Let's Encrypt cert obtained at first boot via HTTP-01. Endpoints: GET / POST / DELETE /api/platform/domains/{customer_name}.
When a customer has one agent the subdomain label defaults to the customer name (acme.m8trx.ai). When a customer has multiple agents, each instance is deployed with a distinct subdomain label so its DNS record points at its own EIP — the convention is {customer-name}-{role} (e.g. acme-inbox.m8trx.ai, acme-finance.m8trx.ai) or {customer-name}-{n}. The deploy form's Subdomain field is the operator's lever for this — it accepts any DNS-safe label, and the same label is used as the customer's display name on the welcome card.
{name}.m8trx.ai → Cloudflare DNS (A record, proxied=false)
→ Customer EIP :443
→ Caddy (origin TLS via Let's Encrypt)
→ Paperclip / Bridge container (basic-auth)
| Property | Detail |
|---|---|
| Domain cost | ~$10/yr for the platform root domain; subdomains free and unlimited |
| HTTPS | Caddy + Let's Encrypt on the customer instance — auto-renewed every ~60 days. No platform-side cert management. |
| Why proxied=false | Origin TLS gives end-to-end encryption to the instance and avoids Cloudflare-edge plan limits (cert renewal, payload size, websocket quirks). |
| On deploy | Dashboard upserts an A record after Terraform apply succeeds. auto_https disable_redirects + a customer-hostname block + basic_auth are appended to compose/Caddyfile on first boot. |
| On destroy | Dashboard removes the A record and rebuilds consolidated Cloudflare rules. |
| Settings | Cloudflare API token + zone ID + platform domain stored in dashboard settings (state.json), with env-var fallback (CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN). |
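For reference, the "On deploy" upsert performed by domain.py is equivalent to this raw Cloudflare API call. Sketch only — the hostname and EIP are placeholders, and the dashboard additionally handles update-vs-create and error reporting:

# Create the DNS-only A record for a new Agent customer.
# CF_API_TOKEN / CF_ZONE_ID come from dashboard settings; the IP is the Terraform-output EIP.
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "type": "A",
    "name": "acme.m8trx.ai",
    "content": "3.12.34.56",
    "ttl": 300,
    "proxied": false,
    "comment": "m8trx-managed:acme"
  }'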
Tailscale Overlay = Per-Customer "VLAN"
Operator and Brain traffic to customer instances does not traverse the public internet — every M8trx Agent instance joins a private Tailscale tailnet under a customer-specific tag (tag:m8trx-cust-<customer-uid>). All of one customer's agents share the same tag, so an ACL rule that says "src=tag, dst=tag:*" gives them a private network they can use to call each other. There is no equivalent rule across different customer tags, so cross-customer traffic is denied by default. Functionally, this is a per-customer VLAN built in software.
The dashboard mints a tag-bound, ephemeral, reusable, 90-day auth key per customer and writes it to SSM (the same key is reused across that customer's agents — that's why it's reusable). Each customer instance's user_data.sh joins the tailnet at first boot with that key, hostname set to the EC2 instance-id so the dashboard can find it for Tailnet Lock approval.
| Multi-agent scenario | What that means on the network |
|---|---|
| Customer "acme" has one agent | One tailscaled device, one tag, one subdomain. The accept rule on the tag is in place but unused. |
Customer "acme" adds a second agent (e.g. acme-finance) | Second instance joins the tailnet under the same tag. Operator picks a distinct subdomain (acme-finance.m8trx.ai). The two agents can now reach each other on Tailscale-internal IPs by hostname (e.g. http://i-0abc...) without any further config — the intra-tag accept rule allows it. |
| Customer "beta" deploys an agent | Different tag. Cannot see "acme" agents on Tailscale at all. No ACL rule connects the two tags. |
| Operator (you) | Tagged tag:mgmt, with explicit accept rules to all customer tags. Reaches every agent for monitoring and shell. |
| Property | Detail |
|---|---|
| Auth key | Tag-bound (only registers as tag:m8trx-cust-<id>), ephemeral (auto-revoked at expiry), reusable, 90-day TTL |
| SSH | --ssh=false on customer device (we use SSM Session Manager for shell access) |
| ACL | Per-tag tagOwners + intra-customer-only accept rule. No cross-customer rule exists, so Tailscale rejects traffic between different customer tags by default. |
| Approval | Tailnet Lock — dashboard polls Tailscale for the device by hostname (=instance-id) and approves it after Terraform apply succeeds. Manual approval still possible from Tailscale admin console as a fallback. |
| Decommission | Customer destroy: ephemeral keys auto-revoke at expiry; the device drops off the tailnet when the EC2 is gone (no manual revoke needed in normal flow). |
Compute Architecture
Two customer templates ship today. Both run on the same hardened Ubuntu 22.04 base AMI (Packer-baked) with identical OS controls. They differ in what runs above the OS and how the workload bootstraps. A planned third mode will pack multiple low-traffic customers onto a shared host without weakening per-customer isolation.
Two Customer Templates
| Aspect | Baseline (custom-app) | M8trx Agent |
|---|---|---|
| Terraform module | terraform/modules/ec2 | terraform/modules/ec2-m8trx-agent |
| tfvars map | customers | m8trx_agent_customers |
| Resource naming suffix | none — m8trx-{name}-* | -agent — m8trx-{name}-agent-* (avoids collisions) |
| Inbound port(s) | :8080 | :80 + :443 |
| TLS | Operator's responsibility | Caddy + Let's Encrypt on instance (per-customer subdomain) |
| Bootstrap | Minimal — operator supplies their own compose / dockerfile / app | Path B — vanilla AMI + first-boot git clone M8trxAgent + docker compose up |
| Brain telemetry / Tailscale | Not wired | Required (6 SSM SecureStrings written pre-deploy) |
| Sample customers | keithenterprises | agent2 |
M8trx Agent First-Boot Bootstrap (Path B)
The M8trx Agent template uses the same hardened AMI as the baseline — no agent-specific Packer build. The full stack is fetched and started at first boot via cloud-init. This means deploying a new Agent customer always picks up the latest M8trxAgent source (subject to a fixed --depth 1 clone of main); operator re-bakes are not in the critical path.
Set hostname to instance-id
So the dashboard's post-deploy Tailnet Lock approval poll can find the device: hostnamectl set-hostname $(IMDSv2 instance-id).
Install runtime deps (idempotent)
awscli, git, jq, curl, docker.io, docker-compose-plugin, tailscale. Wait for the Docker socket to be ready (up to 30s) before continuing.
Read 6 SSM SecureString params
brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password — all under /m8trx/<customer-name>/. The instance role can read only its own path; see Brain & Tailscale tab.
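Each read is a plain SSM GetParameter call; for example (path shown for a hypothetical customer acme):

# Read one of the per-customer SecureStrings (decrypted via alias/aws/ssm).
BRAIN_KEY=$(aws ssm get-parameter \
  --name "/m8trx/acme/brain-key" \
  --with-decryption \
  --query 'Parameter.Value' --output text)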
Join the tailnet
tailscale up --auth-key=<ssm> --hostname=<instance-id> --ssh=false --accept-routes=false --accept-dns=false --reset --timeout=120s.
Write /etc/m8trx/brain.env (mode 0600)
Three lines: BRAIN_URL, BRAIN_API_KEY, BRAIN_CUSTOMER_ID. umask 077 in a subshell prevents the brief 0644 race that an after-the-fact chmod would leave.
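The umask trick looks like this (sketch; variable names are illustrative):

# Create brain.env with 0600 permissions from the moment it exists —
# no brief world-readable window that an after-the-fact chmod would leave.
(
  umask 077
  cat > /etc/m8trx/brain.env <<EOF
BRAIN_URL=${BRAIN_URL}
BRAIN_API_KEY=${BRAIN_API_KEY}
BRAIN_CUSTOMER_ID=${BRAIN_CUSTOMER_ID}
EOF
)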
Clone the agent repo, then strip the PAT
git clone --depth 1 https://oauth2:<PAT>@github.com/M8trxInfra/M8trxAgent.git /opt/m8trx-agent, then git remote set-url origin https://github.com/M8trxInfra/M8trxAgent.git so the PAT is gone from .git/config. unset GH_PAT clears the env var.
Generate compose/.env (mode 0700)
Random Postgres credentials, random JWT secret, allowed-hostnames list (localhost + EIP + customer hostname). The agent UI password is bcrypt-hashed by caddy hash-password; every literal $ is doubled so docker-compose interpolation survives.
Patch the Caddyfile and compose for hostname-aware TLS
Prepend { auto_https disable_redirects } (so direct-IP HTTP keeps working alongside the hostname block), append a {name}.{platform-domain} { basic_auth … reverse_proxy … } block, and add "443:443" to the paperclip service ports. Idempotent: skips if already applied.
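The resulting Caddyfile edits look roughly like this. Sketch only — the real bootstrap also guards for idempotency, and the reverse_proxy upstream shown here (paperclip:3100) is an assumption; the shipped block may target the bridge instead:

# 1. Prepend the global option (a Caddy global options block must be first in the file).
printf '{\n\tauto_https disable_redirects\n}\n\n' | cat - compose/Caddyfile > /tmp/Caddyfile \
  && mv /tmp/Caddyfile compose/Caddyfile

# 2. Append the customer-hostname HTTPS block.
#    HASH = output of `caddy hash-password` (stored in compose/.env with every $ doubled
#    so docker-compose interpolation keeps it intact).
cat >> compose/Caddyfile <<EOF

acme.m8trx.ai {
    basic_auth {
        admin ${HASH}
    }
    reverse_proxy paperclip:3100
}
EOF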
Bring up the stack
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --pull always. The .prod.yml override pulls images from GHCR — no local builds in production. First boot ends when caddy obtains its certificate.
Cloud-init output is tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging. Total user-data is kept under the AWS 16 KB user-data hard limit — anything bigger goes into the cloned repo, not user-data.
Audit Rules (auditd) — Immutable, both templates
| What's Monitored | Rule Key |
|---|---|
| Authentication config changes (/etc/pam.d, shadow, passwd) | auth_config |
| Sudo usage and sudoers changes | sudoers |
| SSH config modifications | sshd_config |
| File deletions by non-system users | file_deletion |
| Privilege escalation (execve as root) | privilege_escalation |
| Agent data directory access | agent_data_access |
| Kernel module loading | kernel_modules |
Planned: Multi-Tenant Compute (foundations in place, host-level orchestration pending)
Today, every customer is on their own EC2 — strong isolation, but ~$37/mo even when a customer is sending two emails a week (~$30 of that is the t3.medium itself). The planned multi-tenant compute mode packs multiple low-traffic customers onto a single hardened EC2 host, while keeping the per-customer trust boundary intact. Goal: ~$8–12/mo per small customer at the per-customer slice level, with the same external behaviour (separate subdomain, separate Brain customer key, separate Tailscale tag, separate KMS volume).
This is not a redesign — it's an optimization on top of what Path B already gives us. The agent stack is already a Docker compose project, the project name is already namespaced per customer, the storage is already a per-customer KMS-encrypted EBS volume, and each agent already runs its own tailscaled with a tag-bound identity. Most of the per-customer trust boundary is therefore already in place at the compose level; the missing pieces are at the host level.
| Concern | Status today (single-tenant per host) | Multi-tenant change |
|---|---|---|
| Compute | Done — One docker compose project per customer, namespaced m8trx-{name}_*. Each stack runs as non-root with seccomp + dropped caps + per-service resource limits. (The agent template already bakes Docker into first-boot — Path B.) | Pending — Run multiple compose projects on one host. User-namespace remap to give each customer its own UID/GID range. Per-project cgroup limits. |
| Network | Done — Per-customer compose creates its own Docker network. Caddy on the instance terminates TLS for one customer subdomain. | Pending — Host-level SNI-routing reverse proxy in front of the per-project Caddys (or per-project ENI with separate EIP per customer). Each subdomain still terminates with that customer's own Caddy + LE cert. |
| Storage | Done — Per-customer KMS-encrypted EBS volume. Per-customer S3 bucket with TLS + KMS enforced. | Pending — Multiple per-customer EBS volumes attached to the same host, each mount-namespaced into one compose project. A leaked container can't read another customer's mount because the KMS grant is on a different key. |
| IAM | Partial — Today the EC2 instance role is the customer role (1:1). Multi-tenant changes that. | Pending — Host runs a credential broker bound to localhost. Each compose project authenticates with a short-lived bearer (mounted from a per-project secret); broker exchanges it for STS creds on that customer's IAM role. Container iptables blocks IMDS as today. |
| Tailscale identity | Done — Per-customer tag-bound auth key, per-customer tagOwner, intra-customer accept rule. The reusable flag means one key onboards multiple agents under the same tag. | Pending — Per-customer tailscaled in its own network namespace on the shared host (instead of the host-level tailscaled used today). Each compose project sees only its own customer's tag. |
| Brain identity | Done — Per-customer UID + bearer minted by the brain. The customer's bearer token never leaves that customer's compose project. | No change — same flow, same SSM scope per customer. |
| Blast radius | EC2 compromise affects one customer. | Host-kernel compromise affects every customer on that host (same as any multi-tenant K8s node). Critical / regulated customers stay on dedicated EC2 — opt-in per customer at deploy. |
| Migration | — | Existing customers stay on dedicated EC2; new low-traffic customers can opt in. Dashboard exposes the choice on the deploy form. |
Status: most per-customer guarantees are already enforced at the compose / KMS / Tailscale layer — Path B got us there. The remaining work is host-level: a multi-project compose orchestrator, a credential broker, an SNI-routing reverse proxy, and a deploy-form opt-in. No new Terraform module yet.
IAM & Access Control
Each customer instance assumes a unique IAM role with a permission boundary that hard-caps maximum privileges.
Organization SCPs
StopLogging, DeleteTrail, UpdateTrail
DeleteDetector, DisassociateFromMaster
StopConfigurationRecorder, DeleteConfigurationRecorder
All actions blocked for root principal
RunInstances denied unless HttpTokens=required
RunInstances denied if ec2:Encrypted=false
PutBucketPublicAccessBlock/Policy/ACL blocked
EC2/RDS/Lambda restricted to us-east-1, us-west-2
Brain Telemetry — additional inline scope
Each M8trx Agent customer's instance role carries one extra inline policy, brain_telemetry_ssm_read, scoped to only that customer's SSM path. The permission boundary explicitly permits ssm:GetParameter / ssm:GetParameters so the inline policy isn't blocked at the boundary. Full detail with example policy JSON: Brain & Tailscale tab.
| Resource ARN | Allowed actions | Why scoped this way |
|---|---|---|
| arn:aws:ssm:*:*:parameter/m8trx/<own-customer-name>/* | ssm:GetParameter, ssm:GetParameters | A leaked customer instance role cannot read another customer's brain key, Tailscale auth key, or agent UI password — pivot fails at the IAM layer, not just at runtime. |
Brain Telemetry & Tailscale Mesh
A control plane that connects every M8trx Agent customer instance to a central Brain service over a Tailscale tailnet. The Brain stores per-customer metadata (events, agents, API keys) and is what the operator dashboard reads from when monitoring fleet posture. The mesh is the only network path the Brain ever uses to reach a customer; there is no public-internet path from Brain to customer.
What is the Brain? A separate, internally-hosted service (different repo, different deploy) that holds the customer registry, per-customer API keys, and per-customer event/agent telemetry. The deployer-side dashboard talks to the Brain over HTTP+bearer (the brain_admin_token), and customer instances talk to the Brain over their per-customer bearer (brain-key). This page documents the integration points only — the Brain's internals (storage, indexing, alerting) live in the Brain repo's own docs.
The Customer UID — one identifier ties everything together
Every Agent customer has a single, stable, public identifier — the Customer UID — derived from the deployer-side customer name on first deploy and never reused. The UID drives every other per-customer artifact in the platform; getting it right matters because it is the only string we use to scope cross-system access.
| Artifact | Shape | Example (for customer name acme) | Where it's enforced |
|---|---|---|---|
| Customer UID (public id) | cust_<name> with -→_ | cust_acme | Brain customer table |
| Customer bearer token (private) | Random string, returned by Brain on mint | br_xxx… | Brain authenticates every event with it |
| Tailscale tag | tag:m8trx-cust-<uid> | tag:m8trx-cust-acme | Tailscale ACL tagOwners + intra-customer accept rule |
| SSM namespace | /m8trx/<name>/* | /m8trx/acme/* | Per-customer IAM brain_telemetry_ssm_read scope |
| Cloudflare subdomain | {label}.{platform-domain} | acme.m8trx.ai or acme-finance.m8trx.ai (per agent when multiple) | Cloudflare DNS A record + Caddy hostname block |
| Terraform tags | Customer = <name>-agent | Customer = acme-agent | tfvars + every AWS resource tag |
A customer can run multiple agents (e.g. an inbox agent and a finance agent). All of that customer's agents share the same UID, the same Tailscale tag, the same brain customer entry, and the same SSM namespace — but get distinct subdomains and distinct EC2 instances. The shared tag is what gives them a private network to talk to each other on (the per-customer "VLAN" — see Network tab).
Status: the infra layer (Tailscale tagging, brain mint idempotency, SSM scoping, Cloudflare provisioning) is already ready for multiple agents under one UID — the auth key is reusable, the tag's accept rule has no device-count limit, and the SSM IAM scope wildcards over the namespace. The piece that is not yet wired is the deploy form: today it treats each new deploy as a new customer (1 deployer-name = 1 brain mint = 1 EC2). The next step is a "Tenant" dropdown that lets the operator add a new agent under an existing customer UID — re-using the brain customer, the Tailscale tag, and the SSM namespace, while creating a fresh subdomain and EC2.
SSM Parameter Layout (per customer)
All parameters live under /m8trx/<deployer-customer-name>/* (the deployer name, not the brain id — so user-data needs only the one name terraform already templates in). Four are SecureString (KMS-encrypted via alias/aws/ssm); only brain-customer-id and brain-url are stored in the clear.
| Parameter | Type | Source | Used by |
|---|---|---|---|
| brain-key | SecureString | Brain mint response (pre-deploy) | Paperclip → Brain bearer |
| brain-customer-id | String | Derived: cust_ + name with -→_ | Paperclip event payloads |
| brain-url | String | Settings (one value, fleet-wide) | Paperclip → Brain endpoint |
| tailscale-auth-key | SecureString | Tailscale mint response (pre-deploy) | tailscaled at first boot |
| agent-fetch-pat | SecureString | Settings (one value, fleet-wide; read-only on M8trxAgent) | git clone at first boot — stripped from .git/config immediately |
| agent-ui-password | SecureString | Settings default; rotatable per-customer | Caddy basic-auth (bcrypt hashed at first boot via caddy hash-password) |
Pre-Deploy Lifecycle (provision)
Order matters. Each step streams progress to the dashboard's job log via on_output. If any step fails, the deploy aborts before terraform apply touches AWS — so a half-provisioned customer never enters the brain. Source: dashboard/services/brain_telemetry.py.
Mint a Brain customer key
POST {brain_url}/admin/customers with bearer = brain_admin_token, body { customer_id: cust_acme, name: acme }. Returns { api_key, key_id }. A 409 Conflict means the brain already has this id — abort, since reusing it would let the new instance read the previous instance's events.
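The mint is a single authenticated POST. Sketch only — the Brain is an internal service, so treat the path and payload below as the shapes this page describes rather than a public API:

# Mint the per-customer Brain key; a 409 here aborts the whole deploy.
curl -s -X POST "${BRAIN_URL}/admin/customers" \
  -H "Authorization: Bearer ${BRAIN_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"customer_id": "cust_acme", "name": "acme"}'
# -> {"api_key": "br_…", "key_id": "…"}   (api_key becomes the customer's brain-key SSM param)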
Upsert Tailscale ACL entries (must precede key mint)
Tailscale rejects an auth key whose tag isn't already declared in tagOwners. GET the ACL with the If-Match ETag, add tag:m8trx-cust-<id> → autogroup:admin to tagOwners if missing, and add an intra-customer accept rule (src=tag, dst=tag:*) if missing. Single POST back. Idempotent — no write if no change.
Mint a tag-bound, ephemeral, reusable, 90-day Tailscale auth key
POST /api/v2/tailnet/{tailnet}/keys with capabilities devices.create.tags=[tag:m8trx-cust-<id>], ephemeral=true, reusable=true, preauthorized=false (Tailnet Lock approval is a separate post-apply step), expirySeconds=90×86400. Returns the bearer auth key.
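Steps 2 and 3 map onto the Tailscale v2 API roughly as follows. Sketch only — the tailnet name and tag are examples, and the real client also merges the tagOwners entry and intra-customer accept rule into the fetched policy before POSTing it back:

TAILNET="M8trxInfra.github"
TAG="tag:m8trx-cust-acme"

# Step 2 — fetch the ACL (the ETag enables optimistic locking), edit, POST back with If-Match.
curl -s -D headers.txt -u "${TAILSCALE_API_TOKEN}:" \
  -H "Accept: application/json" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl" > acl.json
ETAG=$(grep -i '^etag:' headers.txt | cut -d' ' -f2 | tr -d '\r')
# ... add "${TAG}": ["autogroup:admin"] to tagOwners and the intra-customer accept rule ...
curl -s -X POST -u "${TAILSCALE_API_TOKEN}:" \
  -H "If-Match: ${ETAG}" -H "Content-Type: application/json" \
  --data @acl.json \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl"

# Step 3 — mint the tag-bound, ephemeral, reusable, 90-day auth key.
curl -s -X POST -u "${TAILSCALE_API_TOKEN}:" \
  -H "Content-Type: application/json" \
  --data "{\"capabilities\": {\"devices\": {\"create\": {
            \"reusable\": true, \"ephemeral\": true, \"preauthorized\": false,
            \"tags\": [\"${TAG}\"]}}},
          \"expirySeconds\": $((90*86400))}" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/keys"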
Write 6 SSM parameters
All under /m8trx/<name>/. Done last — if Tailscale failed in step 2 or 3, no SSM is touched, so a re-run starts clean. agent_fetch_pat is the same value across all customers but copied per-customer so it fits the existing per-customer IAM read scope; no fleet-level IAM change needed.
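Each write is a plain PutParameter; for example (customer name and value are placeholders):

# Write one of the six per-customer parameters (SecureString = KMS-encrypted via alias/aws/ssm).
aws ssm put-parameter \
  --name "/m8trx/acme/tailscale-auth-key" \
  --type SecureString \
  --value "${TS_AUTH_KEY}" \
  --overwrite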
Post-Apply: Tailnet Lock approval
After terraform apply succeeds, the dashboard polls Tailscale for a device with hostname == <instance-id>. When it appears (cloud-init has finished step 4 of bootstrap), the dashboard calls POST /api/v2/device/{device_id}/key with {"keyExpiryDisabled": false} to approve it under Tailnet Lock. This step is best-effort: if cloud-init is slow, the operator can still approve manually from the Tailscale admin console; the deploy job logs a warning rather than failing.
Destroy Lifecycle
| Step | Action | Notes |
|---|---|---|
| 1 | Pre-destroy: confirm termination protection is OFF on the EC2 | Refuses to start if on — operator must explicitly disable, to avoid leaving SSM/IAM/SG cleanup half-done while the instance survives |
| 2 | Run terraform destroy | Deletes EC2, SG, IAM, KMS (30-day window), VPC, S3, etc. Cloudflare A record removed by dashboard separately |
| 3 | Delete the 6 SSM parameters under /m8trx/<name>/* | Idempotent — missing params are skipped |
| 4 | DELETE {brain_url}/admin/customers/<brain_customer_id> | Cascades on the brain side: events, agents, api_keys, customer row |
| 5 | Tailscale auth key auto-revokes | Ephemeral + 90-day expiry — no explicit revoke API call needed in the normal path |
Settings UI (one-time, fleet-wide)
All fleet-wide brain-telemetry settings are entered once via the dashboard's topbar settings cog (📡 modal). GET endpoints return masked values (last 4 chars + bullets); POST writes them to state.json.
| Setting | What it is | Used during |
|---|---|---|
| tailscale_api_token | OAuth client / API token with ACL + key + device-approval scopes | provision + destroy |
| tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github) | provision + destroy |
| brain_url | Base URL of the brain server (Tailscale-internal preferred) | provision + destroy + customer-instance runtime |
| brain_admin_token | Brain admin bearer — only used by the dashboard, never written to a customer instance | provision + destroy |
| agent_fetch_pat | GitHub PAT with read-only scope on M8trxInfra/M8trxAgent | provision (copied to per-customer SSM) |
| agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers (rotatable per-customer afterwards) | provision (copied to per-customer SSM) |
IAM Scope (instance-side)
Each customer's EC2 instance role has a scoped brain_telemetry_ssm_read inline policy. The role can read only its own customer's path — a leaked instance role cannot pivot to another customer's brain key, Tailscale auth key, or PAT.
resource "aws_iam_role_policy" "brain_telemetry_ssm_read" {
policy = jsonencode({
Statement = [{
Sid = "ReadOwnBrainTelemetryParams"
Effect = "Allow"
Action = ["ssm:GetParameter", "ssm:GetParameters"]
Resource = "arn:aws:ssm:*:*:parameter/m8trx/${local.ssm_customer_name}/*"
}]
})
}
The permission boundary on each instance role also explicitly permits ssm:GetParameter + ssm:GetParameters — without that, the policy above would still be denied at the boundary. local.ssm_customer_name strips any -agent resource-naming suffix back to the deployer name, so SSM paths and tfvars keys stay aligned.
Brain-Side Monitoring (what the team sees)
Once an Agent customer is online, Paperclip POSTs events (login, task created, task settled, finance/leads activity, errors) to the brain over the customer's bearer. The brain rolls these into per-customer dashboards (covered in the brain repo's own docs). Things the team monitors at the platform layer:
- Customer presence on the tailnet — every Agent customer should appear as a single device tagged tag:m8trx-cust-<id>. A missing or duplicate device is a deploy/health red flag.
- Last-event recency per customer — gap > 24h on a paid customer is worth investigating (instance down, brain unreachable, paperclip crash loop).
- Auth-key expiry pressure — keys are 90-day ephemeral; the team rotates ahead of expiry by re-deploying or rotating manually. Operator dashboard has a planned "key TTL" column on the customer list.
- Cross-customer ACL drift — if anyone manually edits the Tailscale ACL outside of tailscale_client.ensure_customer_acl_entries and adds a cross-tag accept rule, that breaks the per-customer trust boundary. Periodic ACL diff against the expected shape is the planned defense (a sketch follows below).
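A minimal version of that periodic drift check, assuming jq over the Tailscale ACL API (the script itself is illustrative, not shipped code):

# Flag any ACL rule that touches more than one distinct customer tag.
TAILNET="M8trxInfra.github"
curl -s -u "${TAILSCALE_API_TOKEN}:" \
  -H "Accept: application/json" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl" |
jq -r '.acls[]
  | [.src[]?, .dst[]?]
  | map(select(startswith("tag:m8trx-cust-")))
  | map(sub(":[0-9*]+$"; ""))           # strip :port suffixes on dst entries
  | unique
  | select(length > 1)
  | "cross-customer rule touches: \(join(", "))"'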
Failure Modes & Operator Actions
| What goes wrong | Symptom | Fix |
|---|---|---|
| Brain mint returns 409 | Deploy aborts with "Brain already has a customer with id…" | Pick a different deployer name, or revoke the orphaned brain customer first via the brain admin UI |
| Tailscale ACL upsert 412 (precondition failed) | Race against another concurrent dashboard run | Retry — uses ETag-based optimistic locking, second attempt usually wins |
| Tailnet Lock approval times out post-apply | Warning in deploy log, customer instance can't reach brain yet | Approve manually in Tailscale admin console; cloud-init may still be running |
| SSM put fails after Tailscale key minted | Half-provisioned state — auth key exists, SSM is empty | Re-run deploy: brain mint will 409 (so abort and fall through to manual re-provision), or revoke the unused auth key from Tailscale and start over |
| Operator rotates brain_admin_token | All future provisions and destroys use the new token | No customer-instance impact — instances hold their own per-customer brain-key, not the admin token |
Data Protection
All data encrypted at rest with per-customer KMS keys and in transit with TLS.
AI Agent Sandboxing
AI agents handle sensitive customer data — email, financial documents, chats, business records. They run inside a Docker compose project with multiple independent restriction layers, so a single failure (a vulnerable Python dep, a leaked secret, a user-data exploit) does not give code-execution on the host or read access to another customer. The mechanisms below apply to both templates; the M8trx Agent template adds Caddy + Postgres + Ollama as additional compose services bound to the same protections.
Per-Service Notes
| Service | Internal Port | Exposed? | Talks to |
|---|---|---|---|
| caddy | :80, :443 | Yes (host SG) | Internet (LE ACME challenge), m8trx-bridge:3200, paperclip:3100 |
| paperclip | :3100 | No (only via Caddy) | postgres, Anthropic API (per-customer key), Brain (per-customer bearer) |
| m8trx-bridge | :3200 | No (only via Caddy) | paperclip, Brain |
| postgres | :5432 | No (compose-internal) | Only the paperclip service in the same project |
| ollama | :11434 | No (compose-internal) | Local LLM inference fallback when Claude is unreachable |
Management Access
Two distinct access paths for the operator team: (1) reaching the operations dashboard over Tailscale, and (2) reaching individual customer instances for shell or telemetry. Neither path uses a dedicated management proxy any more — the mgmt-proxy Terraform module was retired on 2026-04-21 because Tailscale-on-the-dashboard-host plus AWS SSM Session Manager cover both paths at zero extra cost.
Option Comparison
| Option | Reaches | Cost | Admin Needs | Add / Remove Admin | Best For |
|---|---|---|---|---|---|
| A. Tailscale | Dashboard | $0 (free 100 devices) | Tailscale app + invite to tailnet | Approve / delete device — instant | Daily-driver for the whole team |
| B. SSM Port-Forward | Dashboard | $0 | AWS CLI + SSM plugin + IAM user/role with ssm:StartSession + MFA | Disable IAM user / role | Break-glass when Tailscale is unavailable |
| C. SSM Session Manager | Customer instance shell | $0 | Triggered from dashboard "Shell Access" panel, or AWS CLI | Same — IAM-controlled | Forensics, debugging, ad-hoc commands |
Setup — Tailscale on the Dashboard Host
Install Tailscale on the dashboard EC2 + your laptop
# On the dashboard host (one-time):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorise
# Then on your laptop, install Tailscale and sign in to the same tailnet.
Bind the dashboard to its Tailscale IP
export DASHBOARD_HOST=$(tailscale ip -4 | head -1) # e.g. 100.98.195.91
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>' # required, no default
cd dashboard && python main.py
A cron watchdog re-runs this command if the process dies — see scripts/cron/watchdog.sh. PID lock at dashboard/dashboard.pid prevents duplicates.
Add more admins
Each admin installs Tailscale and joins the tailnet. You approve their device in the Tailscale admin console; they can immediately reach the dashboard at https://<tailscale-ip>:8443 and sign in with the dashboard's HTTP Basic credentials. Revocation: delete their device — instant.
Setup — SSM Port-Forward (break-glass for the dashboard)
# Install SSM plugin: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
aws ssm start-session \
--target i-DASHBOARD_INSTANCE_ID \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["8443"],"localPortNumber":["8443"]}'
# Open https://localhost:8443 in your browser.
Setup — SSM Session Manager (customer-instance shell)
Triggered from the dashboard's per-customer detail view. The dashboard runs ssm:SendCommand with an inline AWS-RunShellScript document, polling every 3s for output (max 300s). Sessions land as the ubuntu user (not root, not ssm-user) — so paths and ownership match what the customer's compose stack expects. Commands are validated against a blocklist (rm -rf /, mkfs, shutdown, pipes-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
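Under the hood, the panel's commands are ordinary SSM RunCommand invocations; a rough CLI equivalent of one round-trip (instance id is a placeholder):

# Send a command and fetch its output — roughly what the dashboard polls for every 3s.
CMD_ID=$(aws ssm send-command \
  --instance-ids i-0123456789abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker compose ps"]' \
  --query 'Command.CommandId' --output text)

aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id i-0123456789abcdef0 \
  --query 'StandardOutputContent' --output text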
# Or from the AWS CLI directly:
aws ssm start-session --target i-CUSTOMER_INSTANCE_ID
# lands as ubuntu, full audit in CloudTrail
Security Properties
| Property | Detail |
|---|---|
| Dashboard exposure | Binds to its Tailscale IP only — not accessible on the public IP, not on AWS-internal IPs from outside the tailnet |
| Network auth | Tailscale: device approval + Tailnet Lock; SSM: IAM + MFA, audited in CloudTrail |
| Application auth | HTTP Basic Auth middleware on every route except /api/health |
| Security headers | X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Cache-Control: no-store on all responses |
| Password policy | DASHBOARD_PASS env var required — no default. Dashboard refuses to start without it. |
| PID lock | dashboard.pid prevents duplicate instances from running simultaneously |
| Audit trail | Tailscale admin console (devices + sessions); CloudTrail (every SSM session); dashboard job log (every dashboard action) |
| Customer-instance access | SSM Session Manager only — no SSH, no Tailscale-SSH (--ssh=false). Sessions land as ubuntu. |
| Encryption | WireGuard (Tailscale); TLS (SSM); TLS (Caddy + LE on customer instances) |
Monitoring & Detection
Multi-layered monitoring with automated alerting across all customer environments.
Incident Response
One-command quarantine triggered from the dashboard UI or CLI.
Security Controls Matrix
Every threat has at least 3 independent control layers. If one fails, the others still protect.
| Threat | Layer 1 | Layer 2 | Layer 3 |
|---|---|---|---|
| SSRF credential theft | IMDSv2 required | Hop limit = 1 | iptables block 169.254.x.x |
| Lateral movement | Separate VPCs | Per-customer IAM | Per-customer KMS |
| Agent breakout | Docker + seccomp | AppArmor profile | Non-root + read-only FS |
| Data exfiltration | Network rate limit | VPC flow logs | S3 TLS + KMS enforced |
| Credential compromise | No static keys | Permission boundaries | Unauthorized API alerts |
| Crypto mining | GuardDuty | CPU alarm > 90% | Container CPU limits |
| Security tampering | SCPs deny disable | IAM deny in boundary | CloudTrail validation + GuardDuty |
| Unauthorized dashboard | Tailscale-only binding | Auth middleware (all routes) | Security headers + PID lock |
| Shell injection (dashboard) | shell=False (subprocess) | Command blocklist on /exec | Input regex validation |
| Credential leakage | No default password | Passwords never logged | /api/health strips AWS info |
| Audit log tampering | auditd immutable (-e 2) | Shipped to CloudWatch | CloudTrail validation |
| Cross-customer telemetry leak | Per-customer Brain key | Tailscale tag-only ACL | SSM IAM scoped to /m8trx/<own>/* |
| Stale agent UI password | Per-customer Caddy basic-auth | Rotatable via dashboard (SSM SendCommand) | Plaintext base64 in transit only |
Per-User Access Control
Each customer instance supports multiple users with role-based access, per-user data isolation, granular permissions, and full audit logging. This is the application layer running inside the Docker sandbox — distinct from the deployer-side dashboard auth (HTTP Basic Auth) and from the per-customer Caddy basic-auth that gates the agent UI. The model below applies to the baseline (custom-app) template; the M8trx Agent template uses Paperclip's own user/permission model — see Paperclip docs.
Roles & Permissions Matrix
| Capability | admin | user | readonly |
|---|---|---|---|
| View own files | Yes | Yes | Yes |
| Upload / write files | Yes | Yes | No |
| Delete own files | Yes | Yes | No |
| View other users' files | Yes | No | No |
| Create / manage users | Yes | No | No |
| Reset user passwords | Yes | No | No |
| View full audit log | Yes | Own events only | Own events only |
| Use agent features | Yes | Per permission | No |
Granular Permissions
Access email processing features and the emails/ data directory. Required for agents that handle customer email.
Access financial document processing and the financial/ directory. Enables stricter audit logging on these operations.
Access chat/messaging features and the chats/ data directory.
Access general document storage and the documents/ data directory.
Access web portal features for the customer's public-facing services.
API Endpoints
| Endpoint | Method | Who | What it does |
|---|---|---|---|
| /auth/login | POST | Anyone | Authenticate, receive JWT token |
| /auth/me | GET | Authenticated | View own profile and permissions |
| /auth/change-password | POST | Authenticated | Change own password (min 12 chars) |
| /users | GET/POST | Admin | List all users / create new user |
| /users/{id} | GET/PATCH/DELETE | Admin | View / update / disable user |
| /users/{id}/reset-password | POST | Admin | Reset user password (returns temp pw) |
| /data/{category} | GET/POST | User+ | List own files / upload file |
| /data/{category}/{file} | GET/DELETE | User+ | Download / delete own file |
| /data/admin/{user}/{cat} | GET | Admin | List any user's files |
| /data/admin/{user}/{cat}/{file} | GET | Admin | Read any user's file |
| /audit | GET | Authenticated | View audit log (scoped to role) |
| /health | GET | Anyone | Container health check |
Security Controls
| Control | Implementation |
|---|---|
| Password hashing | PBKDF2-SHA256 with 100,000 iterations + random salt |
| Token signing | HMAC-SHA256, secret from AWS Secrets Manager |
| Token expiry | 30 minutes — forces re-authentication |
| Path traversal | Filename stripped to basename, resolved path checked against data root |
| File size | 50 MB maximum per upload |
| User deletion | Soft-delete only — account disabled, data preserved for audit |
| Default admin | Created on first boot with temporary password printed to logs |
| Audit trail | Append-only JSONL, shipped to CloudWatch, tamper-evident |
Repository Structure
Per-Customer Cost Breakdown (one customer, t3.medium, audited 2026-05-05)
Defaults from terraform/environments/prod/*.tfvars.json: instance_type=t3.medium, volume_size_gb=30. Retention from terraform/modules/vpc/main.tf (flow logs: 30d) and terraform/modules/ec2/user_data.sh (syslog 90d, auth 365d, audit 365d, agent 90d).
| Item | Baseline | M8trx Agent | Notes |
|---|---|---|---|
| EC2 (t3.medium, on-demand, us-east-2) | ~$30.00 | ~$30.00 | Same instance class for both templates |
| EBS root (30 GB gp3) | ~$2.40 | ~$2.40 | $0.08/GB/mo gp3 baseline; encrypted with per-customer KMS CMK |
| EBS snapshots (DLM, daily 03:00, 14-day retain) | ~$2.00 | ~$2.00 | $0.05/GB-month; only the delta of each snapshot is billed after the first |
| KMS CMK (1 per customer, auto-rotate) | ~$1.00 | ~$1.00 | $1/key/month + tiny API costs |
| CloudWatch Logs ingestion + storage | ~$2.00 | ~$0.50 | Baseline ships VPC flow (30d) + syslog (90d) + auth (365d) + audit (365d) + agent (90d) via the CW agent. Agent template ships VPC flow only — Path B's user_data does not install the CW agent (deliberate, keeps user_data under the 16 KB limit). Dashboard fills the gap for memory/disk via SSM RunCommand. |
| CloudWatch alarms (per-customer high-CPU) | ~$0.10 | ~$0.10 | $0.10 per standard-resolution alarm |
| GuardDuty marginal (per-customer flow log + CT events) | ~$0.50 | ~$0.50 | Detector is account-wide but ingestion volume is per-customer |
| Data transfer (egress, IGW use) | ~$0.30 | ~$0.50 | Outbound EC2 → internet at $0.09/GB. Brain telemetry rides Tailscale and is billed once. Agent template has slightly more egress (GHCR pulls, LE renewals, Anthropic). |
| Total per agent instance | ~$38 / mo | ~$37 / mo | A multi-agent customer pays per agent (2 agents ≈ $74–$76/mo). Heavier traffic = more flow logs + more egress; expect +$2–$5 for a busy customer. |
Shared platform cost: ~$3–6 / mo. SNS topic (free tier covers low-volume), Terraform state S3 bucket + DynamoDB lock (~$1/mo combined), CloudTrail (first trail free), AWS Config recorder ($0.003 per recorded item — adds up at scale, but trivial for the small fleet). The mgmt-proxy ($7/mo) was removed on 2026-04-21; Tailscale + SSM cover the same access surface at $0. Tailscale is on the free 100-device tier; the Brain server runs on existing infrastructure.
Cost Audit Findings (2026-05-05)
- Flow-log retention drift. 3 of 5 customer flow-log groups in production have retentionInDays = None (never expire) — m8trx-cleantest4-agent, m8trx-dustin-bot, m8trx-smoke-test-2. The VPC module sets retention_in_days = 30, so these are likely log groups created by VPC Flow Logs auto-provisioning before the Terraform-managed group existed. Action: set retention manually with aws logs put-retention-policy --log-group-name <name> --retention-in-days 30 for the three drifted groups, then audit the VPC module to ensure the log group is always created by Terraform (not auto-provisioned by AWS) so retention is enforced from day one.
- CW agent custom metrics are empty in practice. The baseline template's user_data installs amazon-cloudwatch-agent and configures cpu/mem/disk/net publishing under namespace m8trx/<customer>, but aws cloudwatch list-metrics returns empty for both m8trx/keithenterprises and m8trx/agent2. The user_data has || echo "Warning" fallbacks so failures don't abort boot — they just hide. Action: SSM into the baseline instance and run amazon-cloudwatch-agent-ctl -a status to see whether the agent is running and authorized to push metrics. If the install is failing silently, decide: fix it, or remove the metrics block from user_data and lean on the dashboard's SSM fallback for memory/disk, like the Agent template already does.
- Path B's missing CW agent is intentional. Not a finding — surfaced here for clarity. The Agent template's user_data is deliberately minimal (the 16 KB AWS user-data hard limit drove a slim-down in 94a1d04). The dashboard already handles this — Phase 3 → Monitoring notes that memory/disk are fetched on demand via SSM RunCommand when CW agent data is unavailable.
Deployment Guide
Complete step-by-step instructions for initial platform setup and deploying new customer instances. Follow these in order.
Phase 1 — One-Time Platform Setup
These steps are done once to set up the platform infrastructure. After this, deploying a new customer is a single dashboard click that takes ~5 minutes.
Build the hardened base AMI (Packer)
The hardened Ubuntu 22.04 AMI is built by Packer and used by both customer templates — the M8trx Agent template no longer bakes its own AMI (Path B installs the agent stack at first boot via git+compose).
cd packer
./build-ami.sh # runs `packer build` and prints the new AMI ID
What the build does: kernel parameter lockdown (ASLR, ptrace, SYN flood protection), auditd with immutable rules, fail2ban, UFW firewall, AppArmor profiles, SSH locked down (no password auth, no root login), unattended security updates, AIDE file integrity, core dumps disabled, unnecessary packages removed, cron restricted to root. Note the AMI ID — you'll set it in tfvars in step 3.
Terraform state backend (already configured)
The repository ships with the S3 + DynamoDB backend wired up. State lives in m8trx-terraform-state (versioned, KMS-encrypted, public-access blocked) with locking in DynamoDB table terraform-locks in us-east-2. No action needed unless you're running this in a fresh AWS account, in which case create the bucket + table once and update terraform/main.tf.
Configure Terraform tfvars
Edit terraform/environments/prod/base.tfvars.json with the platform-wide values. Per-customer maps (customers, m8trx_agent_customers) live in their own tfvars files and are managed by the dashboard — no manual editing on customer add/remove.
{
"aws_region": "us-east-2",
"project_name": "m8trx",
"admin_email": "ops@m8trx.ai",
"base_ami_id": "ami-XXXXXXXXX", // from step 1, used by both templates
"m8trx_agent_ami_id": "ami-XXXXXXXXX", // same as base_ami_id (Path B)
"platform_domain": "m8trx.ai"
}
Tailscale on the dashboard host (mgmt access)
The dashboard binds to a Tailscale IP — there is no public ingress on the operations dashboard. Install Tailscale on the dashboard EC2, join the same tailnet your operators use, and bind the FastAPI server to the Tailscale IP. Detail under the Mgmt Access tab.
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorize
There is no separate management proxy or management VPC any more — that module was retired on 2026-04-21. Tailscale-on-the-dashboard plus AWS SSM cover the same ground at lower cost.
Start the dashboard (with required env)
cd dashboard
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export DASHBOARD_HOST=$(tailscale ip -4 | head -1)
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>' # REQUIRED — no default
export AWS_REGION=us-east-2
export PROJECT_NAME=m8trx
python3 main.py
A cron watchdog auto-restarts the process every minute if it dies (see scripts/cron/watchdog.sh). PID lock at dashboard/dashboard.pid prevents duplicates. After every code change, kill and restart: pkill -f "python main.py"; sleep 2; rm -f dashboard/dashboard.pid.
Configure Cloudflare (subdomain provisioning)
Open the 📡 settings cog in the dashboard topbar. Enter the Cloudflare API token (zone:DNS:edit + zone:zone:read), zone ID, and platform domain (m8trx.ai). The dashboard validates the token on save. Subdomains are created with proxied=false so origin TLS via Caddy + Let's Encrypt works end-to-end.
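Token validation on save maps to Cloudflare's verify endpoint; checking it by hand looks like this:

# Confirm the token is valid and active before trusting it for deploys.
curl -s "https://api.cloudflare.com/client/v4/user/tokens/verify" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json"
# -> {"result": {"status": "active", ...}, "success": true}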
Configure Brain Telemetry (M8trx Agent customers only)
In the same 📡 settings modal, fill the Brain & Tailscale section — these fields are required only for deploying M8trx Agent customers; baseline (custom-app) deploys ignore them.
| Field | What it is |
|---|---|
| tailscale_api_token | Tailscale OAuth client / API token with ACL + key + device-approval scopes |
| tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github) |
| brain_url | Internal URL of the brain server (Tailscale-internal address preferred) |
| brain_admin_token | Brain admin bearer — used by the dashboard, never written to a customer instance |
| agent_fetch_pat | GitHub PAT (read-only on M8trxInfra/M8trxAgent) — used at customer first-boot to clone the agent repo |
| agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers — operators rotate per-customer afterwards |
All fields are written to state.json. GET endpoints return masked values; full plaintext is only available to whoever holds DASHBOARD_PASS.
Apply Organization SCPs
Apply the guardrail policies at the AWS Organization level — these prevent anyone (including a leaked admin role) from disabling security services, launching unencrypted instances, or making S3 buckets public.
# In AWS Organizations console:
# Policies > Service control policies > Create policy
# Paste contents of policies/scp-guardrails.json
# Attach to the OU containing your account
Test SCPs in a sandbox account first. An overly broad SCP can lock you out of your own account.
Phase 2 — Deploying a New Customer (Repeatable)
After Phase 1, deploying a new customer is a single dashboard click that takes ~5 minutes. The form picks the template; the dashboard handles brain-telemetry provisioning, terraform apply, Cloudflare DNS, and Tailnet Lock approval as one orchestrated job.
Fill the Deploy Form
On the dashboard, click "Deploy New". The fields shown depend on the template you pick.
| Field | Example | Validation / Notes |
|---|---|---|
| Customer Name | acme | 3–32 chars, lowercase, alphanumeric + hyphens. Used in all resource names. Must be unique across both templates. |
| Template | m8trx_agent or baseline | Picks which Terraform module + tfvars map is used. Determines whether brain-telemetry provisioning runs. |
| Instance Type | t3.medium (default) | Whitelisted dropdown. |
| Volume Size | 30 GB | 20–500 GB. Encrypted EBS with the customer's per-customer KMS key. |
| Subdomain (Agent only) | acme | Optional label. Defaults to the customer name. Final hostname is {label}.{platform-domain}. |
| Agent UI password (Agent only) | auto-generated | Caddy basic-auth on the customer subdomain. Shown in cleartext in the dashboard with a copy button. |
Click "Deploy Secure Instance". The form validates all inputs before submitting.
Pre-Deploy: Brain Telemetry Provisioning (Agent template only)
For the M8trx Agent template, the dashboard runs brain_telemetry.provision_for_customer() before Terraform sees anything — so a half-provisioned customer never enters the brain or AWS state. Steps stream into the deploy job log:
- Brain mint — POST /admin/customers; aborts on 409 (duplicate brain id).
- Tailscale ACL upsert — adds tag:m8trx-cust-<id> tagOwner + intra-customer accept rule (must precede auth-key mint).
- Tailscale auth key mint — tag-bound, ephemeral, reusable, 90-day.
- SSM put × 6 — /m8trx/<name>/{brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password} as SecureStrings.
Terraform Apply (creates AWS resources)
The dashboard runs terraform apply -var-file=<template>.tfvars in a background job (terraform_runner, shell=False). Progress streams to the Jobs tab.
| Resource | Detail |
|---|---|
| VPC | Isolated VPC, single public subnet, IGW. CIDR octet allocated stably from the customer name (re-deploys don't shift). Flow logs to CloudWatch (30-day). |
| Security Group | Baseline: :8080 only. M8trx Agent: :80 + :443. Outbound: all (SSM, GHCR, Brain, Anthropic, Tailscale). |
| KMS Key | Customer-dedicated CMK with auto-rotation, key policy scoped to that customer's IAM role. |
| S3 Bucket | Encrypted, versioned, public-access-blocked, TLS-enforced, wrong-key-rejected. |
| IAM Role | Instance role with permission boundary. M8trx Agent additionally gets the brain_telemetry_ssm_read inline policy scoped to /m8trx/<own-name>/*. |
| EC2 Instance | Hardened Ubuntu, IMDSv2 required (hop=1), encrypted EBS, EIP. M8trx Agent template runs the 9-step Path B user-data bootstrap. |
| DLM Snapshots | Daily EBS snapshots at 03:00 UTC, 14-day retention. |
Post-Apply: DNS + Tailnet Lock (Agent template only)
After Terraform apply succeeds, the dashboard runs two best-effort follow-ups:
- Cloudflare DNS upsert — A record {label}.{platform-domain} → instance EIP, proxied=false, comment m8trx-managed:customer.
- Tailnet Lock approval — polls Tailscale for a device with hostname == <instance-id>; once it appears (cloud-init step 4 ran), POSTs {"keyExpiryDisabled": false} to approve. If cloud-init is slow, the operator can approve manually from the Tailscale admin console.
First-Boot Bootstrap (automatic, Agent template — full detail in Compute tab)
- Set hostname to instance-id (so Tailnet Lock can find the device).
- Install
docker.io,docker-compose-plugin,tailscale. - Read 6 SSM SecureStrings →
/etc/m8trx/brain.env(mode 0600). tailscale up --auth-key=… --hostname=<instance-id> --ssh=false.git clone --depth 1the agent repo (PAT stripped from.git/configimmediately).- Generate
compose/.envwith random Postgres creds + bcrypt-hashed agent UI password. - Patch the Caddyfile for the customer hostname + LE; add
:443to the compose ports. docker compose up -d --pull always. First boot ends when Caddy obtains its certificate.
Logs tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging via the dashboard's Shell Access panel.
Hand the Customer Their Welcome Package
The dashboard generates a per-customer welcome card on the customer detail page: URL (the subdomain), username (admin), and password (the agent UI password — shown cleartext, copy button). Operator copies into a 1Password share / secure email and sends.
If the customer needs the password rotated later, the operator clicks "Change Agent UI Password" on the customer panel — see Phase 3.
Verify Security Posture
On the customer's row, expand the Security Posture panel (API-level checks) and click "Run Check" on the Deep Security Check panel (live SSM hardening checks, ~15–30s). Both should be green before declaring the customer ready.
| Check | What it verifies |
|---|---|
| IMDSv2 required (hop=1) | Instance metadata requires token + 1-hop limit (no container forwarding) |
| EBS encryption | All volumes encrypted with the customer's KMS key |
| SG rules | Only :8080 (baseline) or :80+:443 (Agent) ingress |
| auditd / fail2ban / UFW / AppArmor | Each running with immutable / active config |
| Unattended-upgrades | Automatic security patching active |
| Agent stack health (Agent template) | Compose project up; Caddy has a valid LE cert; paperclip /health green |
| Brain reachability (Agent template) | Tailscale device present + Tailnet-Lock approved + last paperclip→Brain event < 5 min ago |
Phase 3 — Ongoing Operations
Check the Monitoring tab for CloudWatch alarms, GuardDuty findings, and CloudTrail events. Resource monitors (CPU, memory, disk, network) display in each customer's detail view — memory and disk use SSM fallback when CloudWatch agent data is unavailable. Instance logs viewable in the Logs tab per customer.
If an instance is compromised, click "Quarantine" in the dashboard. This snapshots all volumes, replaces the security groups with a deny-all group, and tags the instance. SSM still works for forensics. Never terminate; preserve the instance for investigation.
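A boto3 sketch of the quarantine flow described above, assuming a pre-created deny-all security group in the customer's VPC; the tag key and helper name are illustrative.

```python
import boto3

def quarantine_instance(instance_id: str, deny_all_sg: str) -> None:
    """Isolate a suspected-compromised instance while preserving evidence."""
    ec2 = boto3.client("ec2")
    inst = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]

    # 1. Snapshot every attached EBS volume for forensics.
    for mapping in inst.get("BlockDeviceMappings", []):
        ec2.create_snapshot(
            VolumeId=mapping["Ebs"]["VolumeId"],
            Description=f"quarantine {instance_id}",
        )

    # 2. Swap all security groups for the pre-created deny-all group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[deny_all_sg])

    # 3. Tag so operators know why the instance is isolated (tag key assumed).
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "m8trx:quarantined", "Value": "true"}],
    )
```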
Each customer detail page shows a Shell Access panel. Copy the SSM command to open an interactive terminal, or type commands directly in the dashboard and see output in real time. Commands are validated against a blocklist (no rm -rf /, pipe-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
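A minimal sketch of the validation step behind that endpoint, using the 2000-character cap from the description above; the blocked patterns shown are illustrative examples, not the dashboard's full blocklist.

```python
import re

MAX_LEN = 2000

# Illustrative patterns only -- the real blocklist is longer.
BLOCKED = [
    re.compile(r"rm\s+-rf\s+/(\s|$)"),       # recursive delete of root
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),     # pipe-to-shell
    re.compile(r"wget[^|]*\|\s*(ba)?sh"),
    re.compile(r":\(\)\s*\{.*\};\s*:"),       # fork bomb
]

def validate_command(cmd: str) -> str | None:
    """Return an error string if the command is rejected, else None."""
    if len(cmd) > MAX_LEN:
        return f"command exceeds {MAX_LEN} characters"
    for pattern in BLOCKED:
        if pattern.search(cmd):
            return "command matches a blocked pattern"
    return None
```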
"Backup Now" creates on-demand EBS snapshots of all volumes. The Backups panel shows all snapshots (automated + on-demand) with state, progress, time, and type. DLM runs daily at 03:00 UTC with 14-day retention. Endpoints: POST /api/customers/{name}/snapshot, GET /api/customers/{name}/snapshots.
Click "Resize Disk" in the detail view to expand a customer's root EBS volume (20-500 GB). Works on both running and stopped instances. For running instances, the filesystem is expanded automatically via SSM (growpart + resize2fs/xfs_growfs). Stopped instances auto-expand on next boot via cloud-init. Endpoint: POST /api/customers/{name}/resize-disk.
Click the gear icon in the topbar to configure email alerts via AWS SES. Set a service email address and choose which events trigger alerts: incidents (quarantine), deploy failures, deploy successes, and CloudWatch alarms. Use "Send Test" to verify delivery. The sender address must be verified in SES. Settings stored in dashboard state. Endpoints: GET/PUT /api/platform/alert-settings, POST /api/platform/alert-test. Platform-level only — these alerts are for operators; customer-facing notifications use the channels flow below.
Each existing customer instance runs the Paperclip version baked into its AMI. To avoid re-baking + instance replacement for every Paperclip change, the deployer dashboard has an "Upgrade Paperclip Source" button on the per-customer Paperclip panel. It tars paperclip/ from the dashboard host (excluding venv, caches, local DB), base64-encodes it, ships it inline via SSM, rsyncs into /opt/paperclip-src, re-runs the idempotent bootstrap/install.sh, and restarts paperclip.service. No S3 round-trip, no persistent artifacts. Endpoint: POST /api/customers/{name}/paperclip/upgrade. Size-capped at ~90 KB base64 — if the tree outgrows that, switch to an S3-mediated push.
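A sketch of that tar → base64 → SSM push, with the size cap from the description above. The exclusion list, remote paths, and shell lines are assumptions standing in for the dashboard's actual upgrade handler.

```python
import base64
import io
import tarfile
import boto3

MAX_B64 = 90 * 1024  # ~90 KB cap before switching to an S3-mediated push

def _exclude(ti: tarfile.TarInfo) -> tarfile.TarInfo | None:
    """Skip venv, caches, and the local DB (exclusion names assumed)."""
    parts = ti.name.split("/")
    if any(p in ("venv", "__pycache__", ".pytest_cache") for p in parts) or ti.name.endswith(".db"):
        return None
    return ti

def push_paperclip_source(instance_id: str, src_dir: str = "paperclip") -> str:
    """Ship the local paperclip/ tree to a customer instance inline via SSM."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(src_dir, arcname="paperclip", filter=_exclude)
    payload = base64.b64encode(buf.getvalue()).decode()
    if len(payload) > MAX_B64:
        raise RuntimeError("source tree too large for inline push; use S3 instead")

    commands = [
        f"echo '{payload}' | base64 -d > /tmp/paperclip.tgz",
        "mkdir -p /opt/paperclip-src && tar xzf /tmp/paperclip.tgz -C /opt/paperclip-src --strip-components=1",
        "bash /opt/paperclip-src/bootstrap/install.sh",   # idempotent re-install
        "systemctl restart paperclip.service",
    ]
    resp = boto3.client("ssm").send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    return resp["Command"]["CommandId"]
```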
Each M8trx Agent customer has their own Gmail + Telegram notification channels for task events (approvals needed, task complete, task failed). Configured in two places:
- Deployer dashboard → Paperclip panel → "Notification channels (seed)": operator seeds Gmail SMTP creds (smtp.gmail.com:587 + Google App Password) and/or a Telegram bot token + chat id when onboarding. Written to `/etc/paperclip/config.yaml` via SSM alongside the Anthropic key and admin password.
- Paperclip (customer dashboard) → gear icon → Settings modal: the customer can override any field. Overrides live in the instance's SQLite (`channel_settings` table) and win field-by-field over the deployer seed; blank fields inherit the seed. A "Send test" button fires a live notification to verify.
Outbound notifications are fire-and-forget, sent from a daemon thread inside Paperclip — hooked at dispatcher._settle() (complete/failed) and at POST /api/tasks when needs_approval=true. Failures are logged and never block task execution. Endpoints on Paperclip: GET /api/channels (redacted), PUT /api/channels/{email|telegram}, DELETE /api/channels/{provider} (revert to seed), POST /api/channels/test. Gmail requires 2FA + an App Password, not the account password.
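A minimal sketch of that fire-and-forget hook; `send_notification` is a hypothetical stand-in for the real Gmail/Telegram senders, and the hook call sites shown in the trailing comment follow the description above.

```python
import logging
import threading

log = logging.getLogger("paperclip.notify")

def send_notification(event: str, task: dict) -> None:
    """Placeholder for the real Gmail/Telegram delivery (illustrative only)."""
    ...

def notify_async(event: str, task: dict) -> None:
    """Fire a notification without blocking task execution."""
    def _send() -> None:
        try:
            send_notification(event, task)
        except Exception:
            # Failures are logged, never raised into the task path.
            log.exception("notification failed for task %s", task.get("id"))

    threading.Thread(target=_send, daemon=True).start()

# Hooked (per the description above) at dispatcher._settle() and at task
# creation when needs_approval=true, e.g.:
#   notify_async("task_complete", task)
#   notify_async("approval_needed", task)
```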
Casual chat replies previously took 3–5 s (CLI subprocess spawn) and data-aware prompts like "summarize my inbox" took 30–150 s (Claude tool-use loop). Two changes land the fast-path:
- Intent detection + context pre-fetch (`chat_context.py`): when a chat message mentions inbox / tasks / events / finance / leads, regex rules detect the intent(s), the relevant SQLite rows are fetched directly (≤50 ms) and injected as a plain-text context block prepended to the task prompt. Claude answers from this context in one turn — no tool-use loop needed. (Sketched below.)
- SDK worker (`workers/claude_sdk.py`): calls the Anthropic Messages API directly via the Python SDK, bypassing the `claude` CLI subprocess (~1–2 s overhead). API key required; no OAuth path (OAuth is CLI-audience only). First-token latency in the sub-second range for trivial chat on a t3.medium.
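A sketch of the intent-detection + pre-fetch idea. The regex rules, table names, and query shapes are illustrative assumptions — chat_context.py's real rules and schema may differ.

```python
import re
import sqlite3

# Illustrative intent rules -- the real chat_context.py rules may differ.
INTENT_PATTERNS = {
    "inbox":   re.compile(r"\b(inbox|email|mail)\b", re.I),
    "tasks":   re.compile(r"\btasks?\b", re.I),
    "events":  re.compile(r"\b(events?|calendar|meetings?)\b", re.I),
    "finance": re.compile(r"\b(finance|invoices?|expenses?)\b", re.I),
    "leads":   re.compile(r"\bleads?\b", re.I),
}

# Hypothetical table/column names standing in for the real schema.
INTENT_QUERIES = {
    "inbox": "SELECT subject, sender, received_at FROM emails ORDER BY received_at DESC LIMIT 20",
    "tasks": "SELECT title, status, due FROM tasks ORDER BY due LIMIT 20",
}

def build_context(message: str, db_path: str) -> str:
    """Detect intents in a chat message and pre-fetch matching rows as plain text."""
    intents = [name for name, pat in INTENT_PATTERNS.items() if pat.search(message)]
    blocks = []
    with sqlite3.connect(db_path) as conn:
        for intent in intents:
            sql = INTENT_QUERIES.get(intent)
            if not sql:          # intents without a query contribute no block
                continue
            rows = conn.execute(sql).fetchall()
            body = "\n".join(", ".join(str(col) for col in row) for row in rows)
            blocks.append(f"[{intent} context]\n{body}")
    # Prepended to the task prompt so Claude can answer in a single turn.
    return "\n\n".join(blocks)
```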
Three-tier dispatcher routing for chat-linked tasks (tasks with an assistant placeholder message): Tier 1 — SDK worker (fast, API key required). Tier 2 — CLI worker (OAuth supported, full tool use) on any SDK availability/auth failure. Tier 3 — Hermes/Llama (local, free) on CLI availability failure. Non-chat tasks (draft-reply approval, coding) go straight to the CLI worker unchanged. The meta field on each WorkerResult records which worker answered and any fallback chain (fallback_from).
System prompt is shared via workers/_claude_system_prompt.py — imported by both CLI and SDK workers so the context description stays identical across both paths.
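A sketch of the three-tier routing described above. The worker interface (`available()` / `run()`) and the `WorkerResult` shape are assumptions; only the tier order and the `meta` / `fallback_from` bookkeeping come from the description.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerResult:
    text: str
    meta: dict = field(default_factory=dict)

def run_chat_task(prompt: str, sdk_worker, cli_worker, hermes_worker) -> WorkerResult:
    """Tier 1 SDK -> Tier 2 CLI -> Tier 3 Hermes/Llama, recording any fallback chain."""
    chain = [("sdk", sdk_worker), ("cli", cli_worker), ("hermes", hermes_worker)]
    fallback_from: list[str] = []
    for name, worker in chain:
        if not worker.available():      # missing API key, CLI binary, etc.
            fallback_from.append(name)
            continue
        try:
            result = worker.run(prompt)
            result.meta.update({"worker": name, "fallback_from": fallback_from})
            return result
        except Exception:
            fallback_from.append(name)  # auth/availability failure -> next tier
    raise RuntimeError("all workers failed: " + " -> ".join(fallback_from))
```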
Customer admins manage their own users via the agent API. Each user gets isolated data directories and scoped permissions. All actions audit-logged and shipped to CloudWatch.
Click "Delete Customer" in the detail view. The dashboard refuses if AWS termination protection is on — disable it first (operator-explicit, intentionally a friction point). Pre-destroy: empty S3 (including versioned objects), delete CloudWatch log groups, detach IAM permission boundaries. Run terraform destroy. Post-destroy (Agent template): delete the 6 SSM params under /m8trx/<name>/*, DELETE /admin/customers/<brain_customer_id> on the brain, remove the Cloudflare A record. KMS key keeps a 30-day deletion window.
Per-customer Caddy basic-auth password. Click "Change Agent UI Password" on the customer panel (8–256 chars, validated). Two-step rotation: (1) update /m8trx/<name>/agent-ui-password in SSM (so future re-deploys pick up the new password); (2) SendCommand to the running instance — re-hash with caddy hash-password, write into compose/.env, docker compose up -d --force-recreate caddy. Plaintext is base64-passed in the SSM command (encrypted in transit, not logged at rest). Endpoint: POST /api/customers/{name}/agent-password.
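A sketch of the two-step rotation, assuming the SSM parameter path shown above; the compose file location, `.env` variable name, and shell lines are illustrative, not the dashboard's actual SendCommand payload.

```python
import base64
import boto3

def rotate_agent_ui_password(customer: str, instance_id: str, new_password: str) -> None:
    """Step 1: update SSM so re-deploys pick up the new password.
    Step 2: re-hash on the instance and recreate Caddy with the new hash."""
    ssm = boto3.client("ssm")

    ssm.put_parameter(
        Name=f"/m8trx/{customer}/agent-ui-password",
        Value=new_password,
        Type="SecureString",
        Overwrite=True,
    )

    b64 = base64.b64encode(new_password.encode()).decode()
    commands = [
        # Decode the base64-passed plaintext on the instance (kept out of logs).
        f"PW=$(echo '{b64}' | base64 -d)",
        # Env var name and compose path are assumptions.
        'HASH=$(cd /opt/m8trx-agent && docker compose run --rm caddy caddy hash-password --plaintext "$PW")',
        'sed -i "s|^AGENT_UI_PASSWORD_HASH=.*|AGENT_UI_PASSWORD_HASH=$HASH|" /opt/m8trx-agent/compose/.env',
        "cd /opt/m8trx-agent && docker compose up -d --force-recreate caddy",
    ]
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
```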
Customer auth keys are tag-bound, ephemeral, 90-day. They auto-revoke at expiry — but the device stays joined as long as tailscaled is running. To rotate ahead of expiry: re-deploy the customer (full bootstrap with a fresh key) or run a manual mint + push the new key into the customer's SSM, then restart tailscaled on the instance via the Shell Access panel. Planned: a "Rotate auth key" button on the customer panel.
A background decommission checker runs every 5 minutes, scanning AWS for orphaned resources (EC2 instances, VPCs, S3 buckets, KMS keys, IAM roles) from deleted customers no longer in state. Orphaned instances and VPCs are auto-cleaned to prevent cost leaks. On-demand check: GET /api/decommission-check.
Troubleshooting
| Issue | Solution |
|---|---|
| Deploy job shows "No valid credential sources" | The dashboard host needs AWS credentials. Attach an IAM role to the instance or set AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars before starting the dashboard. |
| Deploy aborts: "Brain telemetry settings are not configured" | Open the 📡 settings cog and fill the brain-telemetry section (5 fields). Required only for M8trx Agent customers. |
| Deploy aborts: "Brain already has a customer with id 'cust_…' (409)" | The brain side has a leftover customer. Either pick a different deployer name, or revoke the existing brain customer via the brain admin UI before retrying. |
| Tailscale ACL upsert fails 412 | Race against another concurrent dashboard run. Retry — the upsert uses ETag-based optimistic locking and the second attempt usually wins. |
| Tailnet Lock approval times out post-apply | Cloud-init may still be running on the instance. Check /var/log/m8trx-agent/first-boot.log via SSM. Once the device appears in Tailscale, approve manually from the admin console — no need to re-run the deploy. |
| Caddy can't get a Let's Encrypt cert | Verify the Cloudflare A record exists (proxied=false), points at the EIP, and SG ingress :80 is open (LE HTTP-01 challenge). docker compose logs caddy via Shell Access for detail. |
| Customer can't reach {name}.m8trx.ai | DNS not yet propagated — ~1 min typical for Cloudflare. Or the Caddy basic-auth password they were given is stale: rotate via the Phase 3 card. |
| Compose stack not starting | Shell Access → cd /opt/m8trx-agent && docker compose ps. Check docker compose logs paperclip for missing env vars or DB init issues. |
| Deep security check shows "unknown" | SSM agent takes 1–2 minutes to register after instance launch. Also verify the instance's IAM role has SSM permissions. |
| Destroy refuses with "termination protection enabled" | Intentional — disable termination protection in the EC2 console (Actions → Instance Settings → Change termination protection) before retrying. Forces operator-explicit consent for destructive actions. |
API Token Usage Tracking
Per-customer Claude API token consumption analytics with cost estimation and configurable alert thresholds. Tracks input, output, cache read, and cache creation tokens across all Claude model tiers.
Architecture Overview
Data Model
Usage data is stored in a local SQLite database (usage.db) with the following schema.
| Column | Type | Description |
|---|---|---|
| customer | TEXT | Customer identifier (e.g. acme-corp) |
| day | TEXT (ISO date) | Date of usage |
| model | TEXT | Claude model used (opus-4-6, sonnet-4-6, haiku-4-5) |
| sessions | INTEGER | Number of API sessions |
| input_tokens | INTEGER | Prompt input tokens consumed |
| output_tokens | INTEGER | Output tokens generated |
| cache_read | INTEGER | Cache read tokens (prompt caching) |
| cache_creation | INTEGER | Cache creation/write tokens |
Indexed on (customer), (day), and unique on (customer, day, model) for efficient querying and upsert operations.
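A sketch of the init + upsert implied by that schema, assuming a table named `usage`; the actual DDL and function names in usage_scanner.py may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS usage (
    customer        TEXT NOT NULL,
    day             TEXT NOT NULL,            -- ISO date
    model           TEXT NOT NULL,
    sessions        INTEGER NOT NULL DEFAULT 0,
    input_tokens    INTEGER NOT NULL DEFAULT 0,
    output_tokens   INTEGER NOT NULL DEFAULT 0,
    cache_read      INTEGER NOT NULL DEFAULT 0,
    cache_creation  INTEGER NOT NULL DEFAULT 0,
    UNIQUE (customer, day, model)
);
CREATE INDEX IF NOT EXISTS idx_usage_customer ON usage (customer);
CREATE INDEX IF NOT EXISTS idx_usage_day ON usage (day);
"""

def record_usage(db: str, customer: str, day: str, model: str,
                 input_tokens: int, output_tokens: int,
                 cache_read: int = 0, cache_creation: int = 0) -> None:
    """Upsert one day's usage for a (customer, day, model) triple."""
    with sqlite3.connect(db) as conn:
        conn.executescript(SCHEMA)
        conn.execute(
            """INSERT INTO usage (customer, day, model, sessions, input_tokens,
                                  output_tokens, cache_read, cache_creation)
               VALUES (?, ?, ?, 1, ?, ?, ?, ?)
               ON CONFLICT (customer, day, model) DO UPDATE SET
                 sessions       = sessions + 1,
                 input_tokens   = input_tokens + excluded.input_tokens,
                 output_tokens  = output_tokens + excluded.output_tokens,
                 cache_read     = cache_read + excluded.cache_read,
                 cache_creation = cache_creation + excluded.cache_creation""",
            (customer, day, model, input_tokens, output_tokens,
             cache_read, cache_creation),
        )
```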
Pricing & Cost Calculation
Costs are estimated using per-model pricing rates (per 1M tokens). The calc_cost() function computes costs across all four token types.
| Model | Input | Output | Cache Write | Cache Read |
|---|---|---|---|---|
| Claude Opus 4.6 | $6.15 | $30.75 | $7.69 | $0.61 |
| Claude Sonnet 4.6 | $3.69 | $18.45 | $4.61 | $0.37 |
| Claude Haiku 4.5 | $1.23 | $6.15 | $1.54 | $0.12 |
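The cost math reduces to a per-million-token dot product over the four token types. A minimal sketch mirroring the table above; the constants and function signature are illustrative, not necessarily calc_cost()'s real shape.

```python
# Per-1M-token rates mirroring the pricing table above.
PRICING = {
    "opus-4-6":   {"input": 6.15, "output": 30.75, "cache_creation": 7.69, "cache_read": 0.61},
    "sonnet-4-6": {"input": 3.69, "output": 18.45, "cache_creation": 4.61, "cache_read": 0.37},
    "haiku-4-5":  {"input": 1.23, "output": 6.15,  "cache_creation": 1.54, "cache_read": 0.12},
}

def calc_cost(model: str, input_tokens: int, output_tokens: int,
              cache_read: int = 0, cache_creation: int = 0) -> float:
    """Estimate USD cost for one usage row across all four token types."""
    rates = PRICING[model]
    return (
        input_tokens * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_creation * rates["cache_creation"]
    ) / 1_000_000

# Example: calc_cost("sonnet-4-6", 800_000, 120_000) ~= 5.17 USD
```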
API Endpoints
All endpoints require HTTP Basic Auth. Mounted under /api/usage/ via FastAPI router.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/usage/data | Full usage dashboard data: daily breakdowns, customer totals, pricing, current costs, and alert thresholds |
| POST | /api/usage/seed | Populate demo/synthetic usage data for POC customers |
| POST | /api/usage/reset | Clear all usage data and re-seed with fresh demo data |
| GET | /api/usage/thresholds | List all configured cost alert thresholds |
| PUT | /api/usage/thresholds | Set or update a per-customer cost alert threshold (period + dollar limit) |
| DELETE | /api/usage/thresholds/{customer} | Remove the cost alert threshold for a customer |
| POST | /api/usage/check-alerts | Check all thresholds against current costs; sends SES email alerts for breaches |
Cost Alert Thresholds
Operators set per-customer cost limits with a period (week or month) and a dollar amount. Thresholds are stored in state.json alongside other dashboard state. When a customer's estimated cost for the current period meets or exceeds the threshold, an alert is triggered.
The /api/usage/check-alerts endpoint evaluates all enabled thresholds against current-period costs. Breaches trigger email alerts via AWS SES (using the alert_email service) containing the customer name, period, threshold, current cost, and overage amount. Returns a list of all breaches for dashboard display.
The get_customer_costs() function aggregates token usage for the current period: week (Monday through today) or month (1st through today). It sums costs across all models per customer using the calc_cost() pricing function, covering input, output, cache read, and cache write tokens.
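A sketch of that period aggregation, reusing the `usage` table and `calc_cost()` from the sketches above; the real get_customer_costs() signature and return shape may differ.

```python
import datetime as dt
import sqlite3
from collections import defaultdict

def get_customer_costs(db: str, period: str = "month") -> dict[str, float]:
    """Aggregate current-period cost per customer.

    week  = Monday through today; month = 1st through today.
    """
    today = dt.date.today()
    if period == "week":
        start = today - dt.timedelta(days=today.weekday())   # Monday
    else:
        start = today.replace(day=1)                          # 1st of the month

    costs: dict[str, float] = defaultdict(float)
    with sqlite3.connect(db) as conn:
        rows = conn.execute(
            """SELECT customer, model, input_tokens, output_tokens,
                      cache_read, cache_creation
               FROM usage WHERE day >= ?""",
            (start.isoformat(),),
        )
        for customer, model, inp, out, cr, cc in rows:
            costs[customer] += calc_cost(model, inp, out, cr, cc)
    return dict(costs)
```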
Key Files
| File | Purpose |
|---|---|
| dashboard/services/usage_scanner.py | Core module: database init, demo data seeding, usage queries, cost calculation |
| dashboard/routers/usage.py | FastAPI router: REST endpoints for usage data, thresholds, and alert checks |
| dashboard/usage.db | SQLite database storing per-customer token usage records |
| dashboard/state.json | Persists alert thresholds alongside other dashboard settings |