m8trxDeployer Architecture

Multi-Tenant AWS Security Platform for AI Agent Deployment. Each customer gets a fully isolated, hardened environment with defense-in-depth security controls.

Production Design v1.4.0 · Updated 2026-05-05 · 9 Terraform Modules · 2 Customer Templates · Brain Telemetry + Tailscale Mesh

System Overview

Multi-tenant AWS platform that deploys hardened, fully isolated customer environments running AI agents. Two customer templates ship today: a baseline environment for custom apps, and the packaged M8trx Agent (Paperclip + Hermes + Claude Code) reachable on a per-customer HTTPS subdomain. A shared Brain Telemetry control plane connects every Agent instance to a central Brain over a Tailscale mesh — for monitoring, key rotation, and operator visibility — without ever exposing a customer instance to another customer.

Executive Summary

m8trxDeployer deploys and manages fully isolated, hardened AWS environments for customers running AI-powered business applications. Today every customer gets their own VPC, KMS key, S3 bucket, IAM role, and EC2 instance — zero shared infrastructure between customer data planes. AI agents that process sensitive data (email, financials, business documents) run in Docker containers managed by the customer's own compose stack, with seccomp, AppArmor, non-root users, and strict resource and network limits.

Two customer templates ship today: baseline (operator-managed Docker compose, single port) and the packaged M8trx Agent (Paperclip orchestrator + Caddy TLS + Postgres + local Llama, reachable at {customer}.{platform-domain} over HTTPS via Let's Encrypt). The Agent template auto-provisions a Cloudflare DNS record, a per-customer Tailscale device on the Brain mesh, and a tag-bound ephemeral auth key on every deploy.

The platform is designed for minimal customer interaction — customers receive a URL and credentials, nothing else. All infrastructure, security, monitoring, and the per-customer Brain key rotation are handled by your team through a unified operations dashboard. Admins reach the dashboard over Tailscale (recommended) or AWS SSM port-forwarding — zero ports open to the public internet. The dashboard supports multi-admin use; any team member can deploy a new customer, rotate the agent UI password, run shell commands via SSM, view logs, take snapshots, or quarantine an instance without ever touching the AWS console.

Where Cloudflare fits. We use Cloudflare for one thing: DNS only (records created with proxied=false). Each agent gets a subdomain under our platform zone (m8trx.ai). The Cloudflare API token, zone ID, and platform domain live in dashboard settings. We do not use Cloudflare's HTTPS proxy, Tunnel, or Access — origin TLS is terminated on the customer instance by Caddy with a Let's Encrypt cert obtained at first boot.

Planned (not yet shipped): a multi-tenant compute mode that packs multiple low-traffic customers onto a single hardened EC2 host. Today every customer is on its own EC2 — but the agent stack is already a Docker compose project (Path B), the project name is namespaced per-customer, the storage is a per-customer KMS-encrypted volume, and each agent already runs its own tailscaled with a tag-bound identity. The remaining work is host-level orchestration of multiple compose projects, a credential broker for per-customer IAM, and a SNI-routing reverse proxy. Goal: drop per-customer cost from ~$37/mo to ~$8–12/mo for small customers without weakening the per-customer trust boundary. See the Compute tab for the full breakdown.

Isolation Model
1 Customer = 1 UID = N Agents
Each customer gets a stable UID (cust_<name>) that drives the brain key, Tailscale tag, SSM path, and subdomain. A customer can run one agent or many — all share the same tag, forming a per-customer Tailscale "VLAN". Cross-customer traffic is denied by ACL; cross-customer AWS access is denied by IAM and KMS.
Security Layers
9 Defense-in-Depth Controls
Network isolation, IAM boundaries, KMS encryption, OS hardening, container sandboxing, AppArmor, auditd, monitoring, SCPs.
Admin Experience
Multi-Admin, No AWS Expertise Needed
Any team member can deploy, monitor, and manage customers from a single dashboard. Connect via Tailscale or AWS SSM port-forwarding (free, zero ports exposed). Add/remove admins instantly.
User Model
Per-User Data Isolation
Multiple users per customer with RBAC, granular permissions, per-user data directories, and full audit trail.
Deployment
One-Click from Dashboard
Fill a form, click deploy. Pre-deploy: dashboard mints a Brain customer key, mints a Tailscale auth key, writes 6 SSM parameters. Terraform applies. Post-deploy: Cloudflare DNS upsert, Tailnet Lock device approval. Customer reachable at {name}.m8trx.ai in ~5 minutes.
Incident Response
One-Click Quarantine
Snapshot volumes, isolate network, tag instance — all automated. SSM still works for forensics.
Infrastructure as Code: Terraform (9 modules, S3+DynamoDB backend)
Dashboard: Python FastAPI
Agent Runtime: Docker compose (Caddy + Paperclip + Postgres + Ollama) + seccomp + AppArmor
Origin TLS: Caddy + Let's Encrypt (per-customer)
DNS: Cloudflare API (proxied=false; origin TLS)
Mesh: Tailscale (tag-bound ephemeral keys)
Monitoring: GuardDuty + CloudTrail + CloudWatch + AWS Config + Brain telemetry

PLATFORM READINESS — current state of one-time setup

Hardened Base AMI DONE
Packer-baked Ubuntu 22.04 with full hardening (kernel, auditd, fail2ban, UFW, AppArmor, AIDE). Used by both customer templates. AMI ID pinned in base.tfvars.json; M8trx Agent template uses the same AMI and bootstraps the agent stack at first boot (Path B — git+compose, no agent-specific AMI bake).
Terraform State Backend DONE
S3 bucket m8trx-terraform-state (versioned, KMS-encrypted, public access blocked) + DynamoDB lock table terraform-locks in us-east-2. Backend configured in terraform/main.tf.
Domain & Cloudflare DONE
Platform domain m8trx.ai is live on Cloudflare. API token + zone ID stored in dashboard settings (or env: CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN). Subdomains created on deploy with proxied=false — Caddy on the customer instance terminates origin TLS via Let's Encrypt.
Brain Telemetry & Tailscale DONE
Brain admin token, Tailscale API token + tailnet, agent fetch PAT, and default agent UI password configured in dashboard settings. Pre-deploy lifecycle mints a Brain customer key, upserts the Tailscale ACL, mints a tag-bound ephemeral 90-day auth key, and writes 6 SSM params (secrets as SecureString) under /m8trx/<name>/*. Destroy reverses all of it. See Brain & Tailscale tab.
Management Access DONE
Dashboard binds to Tailscale IP — not reachable on the public internet. Standalone Tailscale on the dashboard host (mgmt-proxy module retired 2026-04-21). SSM Session Manager covers shell access; sessions land as ubuntu (not root, not ssm-user).
Dashboard Auth DONE
HTTP Basic Auth middleware on every route except /api/health. DASHBOARD_PASS required at start (no default — process refuses to boot without it). PID lock prevents duplicate instances. cron watchdog auto-restarts every minute.
Organization SCPs FILE READY
Policy file policies/scp-guardrails.json is complete. Action: attach to the OU in AWS Organizations console — cannot verify application status from this dashboard.
Multi-Tenant Compute PLANNED
Design in progress: shared EC2 host with multiple per-customer compose stacks, each with its own network namespace, KMS-encrypted volume mount, IAM via metadata token broker, and Tailscale identity. Currently every customer is on their own EC2.
Green = Production-ready Yellow = Action remaining / planned Red = Not started / critical
Per-customer cost ~$37–38/mo (t3.medium, audited 2026-05-05) • Shared platform cost ~$3–6/mo
Last checked: 2026-05-05 Region: us-east-2 across all environments
[Architecture diagram] Your team and each customer sit at the top: the team reaches the operator dashboard over Tailscale; customers reach their instance over HTTPS via Caddy + Let's Encrypt. The operator dashboard (Tailscale-bound, FastAPI) handles deploy / monitor / rotate / SSM exec. The Brain + Tailscale mesh holds per-customer keys and tag-bound auth, carrying customer events, posture, and rotation over REST + Tailscale (tags tag:m8trx-cust-A … tag:m8trx-cust-N). Each customer VPC (10.x.0.0/16, 10.x.1.0/16, … 10.x.N.0/16) contains a public subnet with an IGW and a hardened Ubuntu instance (public IP) running the Dockerized AI agent, SSM via outbound only (no SSH), VPC flow logs enabled, plus a dedicated KMS key, S3 bucket, and IAM role. Account-wide services: CloudTrail, GuardDuty, AWS Config, CloudWatch, SCPs, SNS. No cross-customer paths; no VPC peering.
Per-Customer VPC

Dedicated VPC with public subnet, internet gateway, and VPC flow logs. Cost-optimized (no NAT/ALB). No cross-customer network path exists at the AWS layer.

Per-Customer Encryption

Dedicated KMS CMK per customer with auto-rotation. EBS, S3, and Brain-telemetry SSM SecureStrings encrypted. One customer's key cannot decrypt another's data.

Agent Sandboxing

For the M8trx Agent template, the Paperclip stack runs as a Docker compose project (Caddy + Paperclip + Postgres + local Llama + bridge), each container non-root with seccomp, dropped caps, and resource limits. AppArmor on the host.

Brain & Tailscale Mesh

Each Agent customer joins a Tailscale tailnet under a tag-bound, ephemeral 90-day auth key, and gets a unique Brain customer key for telemetry. Cross-customer ACLs reject all traffic; the Brain reaches each instance only via its tag.

Network Architecture

Each customer is fully isolated at the AWS network layer. There are no VPC peering connections, no shared subnets, and no AWS-internal paths between customer VPCs. Operator and Brain reach each customer only over Tailscale (an encrypted overlay) using customer-specific tags. Each Agent customer is also reachable from the public internet on its own subdomain over HTTPS, with TLS terminated by Caddy on the instance.

[Network diagram] Internet → acme.m8trx.ai on :443 (Caddy + LE), with :80 open for the ACME challenge. Cloudflare DNS (proxied=false) holds an A record {name}.m8trx.ai → instance EIP; traffic enters through the Internet Gateway to the customer instance (Agent template) in a public subnet with an EIP, IMDSv2, and encrypted EBS, running Caddy → Paperclip / Bridge / Postgres / Ollama plus tailscaled (tag:m8trx-cust-acme). Security groups: baseline (custom-app) allows inbound :8080 (single port) and all egress (SSM, updates); M8trx Agent allows inbound :443 (Caddy) and :80 (LE HTTP-01) and all egress (SSM, GHCR, brain); no SSH ingress on either. Tailscale overlay: tag:m8trx-cust-{id}, ephemeral 90-day key, --ssh=false (use SSM), Tailnet Lock-approved; reachable only by the Brain (operator services), the operator dashboard, and same-tag intra-customer devices; other customers are denied. VPC flow logs capture all traffic (60s aggregation, 30-day retention); CloudTrail covers the AWS control plane and auditd covers the OS.

VPC Design (Per Customer)

Component | CIDR / Config | Purpose
VPC | 10.{octet}.0.0/16 (octet allocated stably from a hash of the customer name) | Isolated network per customer; the octet is not reused after deletion until it is explicitly freed
Public Subnet (x1) | /24 in AZ-a | EC2 instance with EIP — outbound for SSM, GHCR pulls, Brain, Tailscale; inbound only on the template-specific port(s)
Internet Gateway | Free, attached to the VPC | Outbound internet (no NAT — saves ~$32/mo per customer)
Flow Logs | CloudWatch, 30-day retention | Full audit trail for all VPC traffic
VPC Peering | None | Operator and Brain reach customers over Tailscale, not VPC peering (mgmt-proxy module retired 2026-04-21)
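
The octet-from-name-hash allocation can be sketched as follows; this is a minimal illustration, not the dashboard's actual implementation, and the function name and collision handling are assumptions:

import hashlib

def allocate_octet(customer_name: str, used_octets: set[int]) -> int:
    # Stable starting point: hash the name into the 1..254 range so the same
    # customer always maps to the same 10.<octet>.0.0/16 on re-deploys.
    start = int(hashlib.sha256(customer_name.encode()).hexdigest(), 16) % 254 + 1
    for offset in range(254):
        octet = (start + offset - 1) % 254 + 1
        if octet not in used_octets:
            return octet
    raise RuntimeError("no free /16 octets left in 10.0.0.0/8")

# allocate_octet("acme", used_octets={12, 37}) returns the same hash-derived
# octet on every run until that octet collides with one already allocated.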

Security Group Per Template

Template | Ingress | Egress | Notes
Baseline (custom-app) | :8080/tcp from 0.0.0.0/0 | All | Single application port. Operator brings their own compose stack.
M8trx Agent | :80/tcp + :443/tcp from 0.0.0.0/0 | All | Caddy serves HTTPS on :443 with a Let's Encrypt cert. :80 stays open for ACME HTTP-01 challenges and direct-IP HTTP fallback (auto-redirect disabled).

DNS & Subdomain Architecture

Every M8trx Agent instance gets its own HTTPS subdomain under the platform zone (m8trx.ai). Subdomains are created and torn down by the dashboard through the Cloudflare API (dashboard/services/domain.py) — DNS only (proxied=false). Cloudflare does not proxy traffic; TLS is terminated on the customer instance by Caddy using a Let's Encrypt cert obtained at first boot via HTTP-01. Endpoints: GET / POST / DELETE /api/platform/domains/{customer_name}.

When a customer has one agent the subdomain label defaults to the customer name (acme.m8trx.ai). When a customer has multiple agents, each instance is deployed with a distinct subdomain label so its DNS record points at its own EIP — the convention is {customer-name}-{role} (e.g. acme-inbox.m8trx.ai, acme-finance.m8trx.ai) or {customer-name}-{n}. The deploy form's Subdomain field is the operator's lever for this — it accepts any DNS-safe label, and the same label is used as the customer's display name on the welcome card.

Customer-end-user traffic flow:
{name}.m8trx.ai → Cloudflare DNS (A record, proxied=false)
  → Customer EIP :443
  → Caddy (origin TLS via Let's Encrypt)
  → Paperclip / Bridge container (basic-auth)
Property | Detail
Domain cost | ~$10/yr for the platform root domain; subdomains free and unlimited
HTTPS | Caddy + Let's Encrypt on the customer instance — auto-renewed every ~60 days. No platform-side cert management.
Why proxied=false | Origin TLS gives end-to-end encryption to the instance and avoids Cloudflare-edge plan limits (cert renewal, payload size, websocket quirks)
On deploy | Dashboard upserts an A record after terraform apply succeeds. auto_https disable_redirects, a customer-hostname block, and basic_auth are appended to compose/Caddyfile on first boot
On destroy | Dashboard removes the A record and rebuilds consolidated Cloudflare rules
Settings | Cloudflare API token + zone ID + platform domain stored in dashboard settings (state.json), with env-var fallback (CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN)
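
A sketch of the DNS-only upsert against the public Cloudflare v4 API; the production logic lives in dashboard/services/domain.py, and the helper below is illustrative:

import requests

CF_API = "https://api.cloudflare.com/client/v4"

def upsert_a_record(token: str, zone_id: str, fqdn: str, eip: str) -> None:
    # Create or update an unproxied A record: {name}.m8trx.ai -> instance EIP.
    headers = {"Authorization": f"Bearer {token}"}
    record = {"type": "A", "name": fqdn, "content": eip, "ttl": 300, "proxied": False}
    existing = requests.get(
        f"{CF_API}/zones/{zone_id}/dns_records",
        headers=headers, params={"type": "A", "name": fqdn}, timeout=10,
    ).json()["result"]
    if existing:
        url = f"{CF_API}/zones/{zone_id}/dns_records/{existing[0]['id']}"
        requests.put(url, headers=headers, json=record, timeout=10).raise_for_status()
    else:
        requests.post(f"{CF_API}/zones/{zone_id}/dns_records",
                      headers=headers, json=record, timeout=10).raise_for_status()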

Tailscale Overlay = Per-Customer "VLAN"

Operator and Brain traffic to customer instances does not traverse the public internet — every M8trx Agent instance joins a private Tailscale tailnet under a customer-specific tag (tag:m8trx-cust-<customer-uid>). All of one customer's agents share the same tag, so an ACL rule that says "src=tag, dst=tag:*" gives them a private network they can use to call each other. There is no equivalent rule across different customer tags, so cross-customer traffic is denied by default. Functionally, this is a per-customer VLAN built in software.

The dashboard mints a tag-bound, ephemeral, reusable, 90-day auth key per customer and writes it to SSM (the same key is reused across that customer's agents — that's why it's reusable). Each customer instance's user_data.sh joins the tailnet at first boot with that key, hostname set to the EC2 instance-id so the dashboard can find it for Tailnet Lock approval.

Multi-agent scenario | What that means on the network
Customer "acme" has one agent | One tailscaled device, one tag, one subdomain. The accept rule on the tag is in place but unused.
Customer "acme" adds a second agent (e.g. acme-finance) | The second instance joins the tailnet under the same tag. The operator picks a distinct subdomain (acme-finance.m8trx.ai). The two agents can now reach each other on Tailscale-internal IPs by hostname (e.g. http://i-0abc...) without any further config — the intra-tag accept rule allows it.
Customer "beta" deploys an agent | Different tag. Cannot see "acme" agents on Tailscale at all. No ACL rule connects the two tags.
Operator (you) | Tagged tag:mgmt, with explicit accept rules to all customer tags. Reaches every agent for monitoring and shell.

Property | Detail
Auth key | Tag-bound (only registers as tag:m8trx-cust-<id>), ephemeral (auto-revoked at expiry), reusable, 90-day TTL
SSH | --ssh=false on the customer device (SSM Session Manager is used for shell access)
ACL | Per-tag tagOwners + intra-customer-only accept rule. No cross-customer rule exists, so Tailscale rejects traffic between different customer tags by default.
Approval | Tailnet Lock — the dashboard polls Tailscale for the device by hostname (= instance-id) and approves it after terraform apply succeeds. Manual approval from the Tailscale admin console remains a fallback.
Decommission | On customer destroy, ephemeral keys auto-revoke at expiry and the device drops off the tailnet when the EC2 is gone (no manual revoke needed in the normal flow)
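
The relevant ACL fragments for two customers look roughly like this (illustrative HuJSON, not the production policy): each customer tag is declared in tagOwners and has exactly one intra-tag accept rule, the operator tag gets explicit access to every customer tag, and no rule connects one customer tag to another, so cross-customer traffic is denied by default.

{
  "tagOwners": {
    "tag:mgmt":            ["autogroup:admin"],
    "tag:m8trx-cust-acme": ["autogroup:admin"],
    "tag:m8trx-cust-beta": ["autogroup:admin"]
  },
  "acls": [
    {"action": "accept", "src": ["tag:mgmt"],            "dst": ["tag:m8trx-cust-acme:*", "tag:m8trx-cust-beta:*"]},
    {"action": "accept", "src": ["tag:m8trx-cust-acme"], "dst": ["tag:m8trx-cust-acme:*"]},
    {"action": "accept", "src": ["tag:m8trx-cust-beta"], "dst": ["tag:m8trx-cust-beta:*"]}
  ]
}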

Compute Architecture

Two customer templates ship today. Both run on the same hardened Ubuntu 22.04 base AMI (Packer-baked) with identical OS controls. They differ in what runs above the OS and how the workload bootstraps. A planned third mode will pack multiple low-traffic customers onto a shared host without weakening per-customer isolation.

[Instance diagram] Customer instance (one per customer today): Ubuntu 22.04 LTS, IMDSv2 required (hop limit 1), encrypted gp3 EBS, EIP, Tailscale-joined. OS hardening layer: kernel hardening, immutable auditd rules, fail2ban (brute force), UFW + AppArmor (firewall + MAC), AIDE, auto-updates, CloudWatch agent. Docker compose stack (non-root, seccomp, dropped caps, resource limits, per-customer KMS-encrypted EBS): caddy (:80/:443, LE), paperclip (FastAPI :3100), m8trx-bridge (API gateway), postgres (internal), ollama (llama3.2). The M8trx Agent template is composed at first boot via git+compose (Path B); the baseline template runs an operator-supplied compose stack on :8080. Both fetch secrets via SSM SecureString (KMS), never stored on disk in plaintext. EBS snapshots: daily automatic, 14-day retention. Logs shipped: syslog, auth.log, audit.log. Key settings: IMDSv2 required, SSH disabled, hop limit 1, core dumps off.

Two Customer Templates

Aspect | Baseline (custom-app) | M8trx Agent
Terraform module | terraform/modules/ec2 | terraform/modules/ec2-m8trx-agent
tfvars map | customers | m8trx_agent_customers
Resource naming suffix | none — m8trx-{name}-* | -agent — m8trx-{name}-agent-* (avoids collisions)
Inbound port(s) | :8080 | :80 + :443
TLS | Operator's responsibility | Caddy + Let's Encrypt on the instance (per-customer subdomain)
Bootstrap | Minimal — operator supplies their own compose / dockerfile / app | Path B — vanilla AMI + first-boot git clone of M8trxAgent + docker compose up
Brain telemetry / Tailscale | Not wired | Required (6 SSM parameters written pre-deploy)
Sample customers | keithenterprises | agent2

M8trx Agent First-Boot Bootstrap (Path B)

The M8trx Agent template uses the same hardened AMI as the baseline — no agent-specific Packer build. The full stack is fetched and started at first boot via cloud-init. This means deploying a new Agent customer always picks up the latest M8trxAgent source (a shallow --depth 1 clone of main); operator AMI re-bakes are not in the critical path.

1

Set hostname to instance-id

So the dashboard's post-deploy Tailnet Lock approval poll can find the device: hostnamectl set-hostname $(IMDSv2 instance-id).
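
The $(IMDSv2 instance-id) shorthand expands to a token-based metadata fetch, roughly:

# IMDSv2: get a session token, read the instance-id, set it as the hostname
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id")
hostnamectl set-hostname "$INSTANCE_ID"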

2

Install runtime deps (idempotent)

awscli, git, jq, curl, docker.io, docker-compose-plugin, tailscale. Wait for the Docker socket to be ready (up to 30s) before continuing.

3

Read 6 SSM SecureString params

brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password — all under /m8trx/<customer-name>/. The instance role can read only its own path; see Brain & Tailscale tab.
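
Roughly what the fetch looks like in user_data (parameter names from the Brain & Tailscale tab; the loop and the CUSTOMER_NAME variable are illustrative, the real name is templated in by Terraform):

# The instance role only has GetParameter on /m8trx/<customer-name>/*,
# so any other path fails with AccessDenied.
for p in brain-key brain-customer-id brain-url tailscale-auth-key agent-fetch-pat agent-ui-password; do
  aws ssm get-parameter \
    --name "/m8trx/${CUSTOMER_NAME}/${p}" \
    --with-decryption \
    --query 'Parameter.Value' --output text
done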

4

Join the tailnet

tailscale up --auth-key=<ssm> --hostname=<instance-id> --ssh=false --accept-routes=false --accept-dns=false --reset --timeout=120s.

5

Write /etc/m8trx/brain.env (mode 0600)

Three lines: BRAIN_URL, BRAIN_API_KEY, BRAIN_CUSTOMER_ID. umask 077 in a subshell prevents the brief 0644 race that an after-the-fact chmod would leave.

6

Clone the agent repo, then strip the PAT

git clone --depth 1 https://oauth2:<PAT>@github.com/M8trxInfra/M8trxAgent.git /opt/m8trx-agent, then git remote set-url origin https://github.com/M8trxInfra/M8trxAgent.git so the PAT is gone from .git/config. unset GH_PAT clears the env var.

7

Generate compose/.env (mode 0700)

Random Postgres credentials, random JWT secret, allowed-hostnames list (localhost + EIP + customer hostname). The agent UI password is bcrypt-hashed by caddy hash-password; every literal $ is doubled so docker-compose interpolation survives.

8

Patch the Caddyfile and compose for hostname-aware TLS

Prepend { auto_https disable_redirects } (so direct-IP HTTP keeps working alongside the hostname block), append a {name}.{platform-domain} { basic_auth … reverse_proxy … } block, and add "443:443" to the paperclip service ports. Idempotent: skips if already applied.
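
The resulting Caddyfile shape is roughly the following (hostname, upstream, and bcrypt hash are placeholders; the exact directives are whatever the bootstrap script appends):

{
    auto_https disable_redirects
}

acme.m8trx.ai {
    basic_auth {
        agent <bcrypt-hash-from-caddy-hash-password>
    }
    reverse_proxy paperclip:3100
}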

9

Bring up the stack

docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --pull always. The .prod.yml override pulls images from GHCR — no local builds in production. First boot ends when caddy obtains its certificate.

Cloud-init output is tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging. Total user-data is kept under the AWS 16 KB user-data hard limit — anything bigger goes into the cloned repo, not user-data.

Audit Rules (auditd) — Immutable, both templates

What's Monitored | Rule Key
Authentication config changes (/etc/pam.d, shadow, passwd) | auth_config
Sudo usage and sudoers changes | sudoers
SSH config modifications | sshd_config
File deletions by non-system users | file_deletion
Privilege escalation (execve as root) | privilege_escalation
Agent data directory access | agent_data_access
Kernel module loading | kernel_modules

Planned: Multi-Tenant Compute (foundations in place, host-level orchestration pending)

Today, every customer is on their own EC2 — strong isolation, but ~$37/mo even when a customer is sending two emails a week (~$30 of that is the t3.medium itself). The planned multi-tenant compute mode packs multiple low-traffic customers onto a single hardened EC2 host, while keeping the per-customer trust boundary intact. Goal: ~$8–12/mo per small customer at the per-customer slice level, with the same external behaviour (separate subdomain, separate Brain customer key, separate Tailscale tag, separate KMS volume).

This is not a redesign — it's an optimization on top of what Path B already gives us. The agent stack is already a Docker compose project, the project name is already namespaced per customer, the storage is already a per-customer KMS-encrypted EBS volume, and each agent already runs its own tailscaled with a tag-bound identity. Most of the per-customer trust boundary is therefore already in place at the compose level; the missing pieces are at the host level.

Concern | Status today (single-tenant per host) | Multi-tenant change
Compute | Done — one Docker compose project per customer, namespaced m8trx-{name}_*. Each stack runs as non-root with seccomp + dropped caps + per-service resource limits. (The agent template already bakes Docker into first boot — Path B.) | Pending — run multiple compose projects on one host; user-namespace remap gives each customer its own UID/GID range; per-project cgroup limits.
Network | Done — per-customer compose creates its own Docker network. Caddy on the instance terminates TLS for one customer subdomain. | Pending — host-level SNI-routing reverse proxy in front of the per-project Caddys (or a per-project ENI with a separate EIP per customer). Each subdomain still terminates with that customer's own Caddy + LE cert.
Storage | Done — per-customer KMS-encrypted EBS volume. Per-customer S3 bucket with TLS + KMS enforced. | Pending — multiple per-customer EBS volumes attached to the same host, each mount-namespaced into one compose project. A leaked container can't read another customer's mount because the KMS grant is on a different key.
IAM | Partial — today the EC2 instance role is the customer role (1:1); multi-tenant changes that. | Pending — the host runs a credential broker bound to localhost. Each compose project authenticates with a short-lived bearer (mounted from a per-project secret); the broker exchanges it for STS creds on that customer's IAM role. Container iptables blocks IMDS as today.
Tailscale identity | Done — per-customer tag-bound auth key, per-customer tagOwner, intra-customer accept rule. The reusable flag means one key onboards multiple agents under the same tag. | Pending — per-customer tailscaled in its own network namespace on the shared host (instead of the host-level tailscaled used today). Each compose project sees only its own customer's tag.
Brain identity | Done — per-customer UID + bearer minted by the brain. The customer's bearer token never leaves that customer's compose project. | No change — same flow, same SSM scope per customer.
Blast radius | EC2 compromise affects one customer. | Host-kernel compromise affects every customer on that host (same as any multi-tenant K8s node). Critical / regulated customers stay on dedicated EC2 — opt-in per customer at deploy.
Migration | Existing customers stay on dedicated EC2; new low-traffic customers can opt in. | Dashboard exposes the choice on the deploy form.

Status: most per-customer guarantees are already enforced at the compose / KMS / Tailscale layer — Path B got us there. The remaining work is host-level: a multi-project compose orchestrator, a credential broker, an SNI-routing reverse proxy, and a deploy-form opt-in. No new Terraform module yet.

IAM & Access Control

Each customer instance assumes a unique IAM role with a permission boundary that hard-caps maximum privileges.

Permission boundary (hard maximum, cannot be overridden). Allowed actions: S3 (own bucket), KMS (own key only), Secrets Manager, SSM (Session Manager), CloudWatch Logs, SES (if enabled), all scoped to the customer's own resources via ARN, tag, or name prefix. Explicit deny: iam:* (cannot modify any IAM), organizations:* (cannot touch the org), cloudtrail:Stop/Delete (cannot disable logging), guardduty:Delete (cannot disable detection), ec2:Create/Delete VPC (cannot modify the network). Even if inline policies grant access, these actions are always denied.

Organization SCPs

Deny Disable CloudTrail

StopLogging, DeleteTrail, UpdateTrail

Deny Disable GuardDuty

DeleteDetector, DisassociateFromMaster

Deny Disable Config

StopConfigurationRecorder, DeleteConfigurationRecorder

Deny Root Account

All actions blocked for root principal

Deny IMDSv1 Instances

RunInstances denied unless HttpTokens=required

Deny Unencrypted EBS

RunInstances denied if ec2:Encrypted=false

Deny Public S3

PutBucketPublicAccessBlock/Policy/ACL blocked

Deny Unapproved Regions

EC2/RDS/Lambda restricted to us-east-1, us-west-2

Brain Telemetry — additional inline scope

Each M8trx Agent customer's instance role carries one extra inline policy, brain_telemetry_ssm_read, scoped to only that customer's SSM path. The permission boundary explicitly permits ssm:GetParameter / ssm:GetParameters so the inline policy isn't blocked at the boundary. Full detail with example policy JSON: Brain & Tailscale tab.

Resource ARN | Allowed actions | Why scoped this way
arn:aws:ssm:*:*:parameter/m8trx/<own-customer-name>/* | ssm:GetParameter, ssm:GetParameters | A leaked customer instance role cannot read another customer's brain key, Tailscale auth key, or agent UI password — the pivot fails at the IAM layer, not just at runtime

Brain Telemetry & Tailscale Mesh

A control plane that connects every M8trx Agent customer instance to a central Brain service over a Tailscale tailnet. The Brain stores per-customer metadata (events, agents, API keys) and is what the operator dashboard reads from when monitoring fleet posture. The mesh is the only network path the Brain ever uses to reach a customer; there is no public-internet path from Brain to customer.

What is the Brain? A separate, internally-hosted service (different repo, different deploy) that holds the customer registry, per-customer API keys, and per-customer event/agent telemetry. The deployer-side dashboard talks to the Brain over HTTP+bearer (the brain_admin_token), and customer instances talk to the Brain over their per-customer bearer (brain-key). This page documents the integration points only — the Brain's internals (storage, indexing, alerting) live in the Brain repo's own docs.

The Customer UID — one identifier ties everything together

Every Agent customer has a single, stable, public identifier — the Customer UID — derived from the deployer-side customer name on first deploy and never reused. The UID drives every other per-customer artifact in the platform; getting it right matters because it is the only string we use to scope cross-system access.

Artifact | Shape | Example (for customer name acme) | Where it's enforced
Customer UID (public id) | cust_<name> (hyphens replaced by underscores) | cust_acme | Brain customer table
Customer bearer token (private) | Random string, returned by Brain on mint | br_xxx… | Brain authenticates every event with it
Tailscale tag | tag:m8trx-cust-<uid> | tag:m8trx-cust-acme | Tailscale ACL tagOwners + intra-customer accept rule
SSM namespace | /m8trx/<name>/* | /m8trx/acme/* | Per-customer IAM brain_telemetry_ssm_read scope
Cloudflare subdomain | {label}.{platform-domain} | acme.m8trx.ai or acme-finance.m8trx.ai (per agent when multiple) | Cloudflare DNS A record + Caddy hostname block
Terraform tags | Customer = <name>-agent | Customer = acme-agent | tfvars + every AWS resource tag

A customer can run multiple agents (e.g. an inbox agent and a finance agent). All of that customer's agents share the same UID, the same Tailscale tag, the same brain customer entry, and the same SSM namespace — but get distinct subdomains and distinct EC2 instances. The shared tag is what gives them a private network to talk to each other on (the per-customer "VLAN" — see Network tab).

Status: the infra layer (Tailscale tagging, brain mint idempotency, SSM scoping, Cloudflare provisioning) is already ready for multiple agents under one UID — the auth key is reusable, the tag's accept rule has no device-count limit, and the SSM IAM scope wildcards over the namespace. The piece that is not yet wired is the deploy form: today it treats each new deploy as a new customer (1 deployer-name = 1 brain mint = 1 EC2). The next step is a "Tenant" dropdown that lets the operator add a new agent under an existing customer UID — re-using the brain customer, the Tailscale tag, and the SSM namespace, while creating a fresh subdomain and EC2.

[Control-plane diagram] The operator dashboard (brain_telemetry.py: provision_for_customer() / delete_for_customer()) talks to three systems. (1) The M8trx Brain (separate service, reachable on Tailscale only) via POST /admin/customers and DELETE /admin/customers/… to mint and delete customer keys; the Brain holds events, agents, and keys. (2) The Tailscale control plane (REST API at api.tailscale.com) to mint tag-bound 90-day auth keys, upsert the ACL, and approve devices under Tailnet Lock. (3) AWS SSM Parameter Store, writing six parameters under /m8trx/<customer-name>/* (SecureString encrypted with alias/aws/ssm): brain-key (bearer, encrypted), brain-customer-id (id only), brain-url, tailscale-auth-key (encrypted), agent-fetch-pat (encrypted), agent-ui-password (encrypted, rotatable). Each per-customer EC2 instance (Agent template, tag:m8trx-cust-acme / -beta / -N) reads the six parameters in user_data (ssm:GetParameter ×6), joins the tailnet with tailscale up --auth-key --tags, and Paperclip POSTs events to the Brain with the per-customer bearer; all customer telemetry rides the Tailscale mesh.

SSM Parameter Layout (per customer)

All parameters live under /m8trx/<deployer-customer-name>/* (the deployer name, not the brain id — so user-data needs only the one name Terraform already templates in). Four are SecureString (KMS-encrypted via alias/aws/ssm); only brain-customer-id and brain-url are stored in the clear.

Parameter | Type | Source | Used by
brain-key | SecureString | Brain mint response (pre-deploy) | Paperclip → Brain bearer
brain-customer-id | String | Derived: cust_ + name (hyphens replaced by underscores) | Paperclip event payloads
brain-url | String | Settings (one value, fleet-wide) | Paperclip → Brain endpoint
tailscale-auth-key | SecureString | Tailscale mint response (pre-deploy) | tailscaled at first boot
agent-fetch-pat | SecureString | Settings (one value, fleet-wide; read-only on M8trxAgent) | git clone at first boot — stripped from .git/config immediately
agent-ui-password | SecureString | Settings default; rotatable per-customer | Caddy basic-auth (bcrypt-hashed at first boot via caddy hash-password)

Pre-Deploy Lifecycle (provision)

Order matters. Each step streams progress to the dashboard's job log via on_output. If any step fails, the deploy aborts before terraform apply touches AWS — so a half-provisioned customer never enters the brain. Source: dashboard/services/brain_telemetry.py.

1

Mint a Brain customer key

POST {brain_url}/admin/customers with bearer = brain_admin_token, body { customer_id: cust_acme, name: acme }. Returns { api_key, key_id }. A 409 Conflict means the brain already has this id — abort, since reusing it would let the new instance read the previous instance's events.
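
Equivalent curl for the mint call, using the endpoint and fields described above (shown for illustration):

curl -sS -X POST "${BRAIN_URL}/admin/customers" \
  -H "Authorization: Bearer ${BRAIN_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"customer_id": "cust_acme", "name": "acme"}'
# 200 -> {"api_key": "...", "key_id": "..."}    409 -> id already exists, abort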

2

Upsert Tailscale ACL entries (must precede key mint)

Tailscale rejects an auth key whose tag isn't already declared in tagOwners. GET the ACL with the If-Match ETag, add tag:m8trx-cust-<id> → autogroup:admin to tagOwners if missing, and add an intra-customer accept rule (src=tag, dst=tag:*) if missing. Single POST back. Idempotent — no write if no change.

3

Mint a tag-bound, ephemeral, reusable, 90-day Tailscale auth key

POST /api/v2/tailnet/{tailnet}/keys with capabilities devices.create.tags=[tag:m8trx-cust-<id>], ephemeral=true, reusable=true, preauthorized=false (Tailnet Lock approval is a separate post-apply step), expirySeconds=90×86400. Returns the bearer auth key.
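
The key mint against the Tailscale v2 API looks roughly like this (90 days = 7,776,000 seconds; tag and tailnet values are examples):

curl -sS -X POST "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/keys" \
  -H "Authorization: Bearer ${TAILSCALE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "capabilities": {
          "devices": {
            "create": {
              "reusable": true,
              "ephemeral": true,
              "preauthorized": false,
              "tags": ["tag:m8trx-cust-acme"]
            }
          }
        },
        "expirySeconds": 7776000
      }'
# The response includes the auth key, which is written to SSM in the next step.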

4

Write 6 SSM parameters

All under /m8trx/<name>/. Done last — if Tailscale failed in step 2 or 3, no SSM is touched, so a re-run starts clean. agent_fetch_pat is the same value across all customers but copied per-customer so it fits the existing per-customer IAM read scope; no fleet-level IAM change needed.

Post-Apply: Tailnet Lock approval

After terraform apply succeeds, the dashboard polls Tailscale for a device with hostname == <instance-id>. When it appears (cloud-init has finished step 4 of bootstrap), the dashboard calls POST /api/v2/device/{device_id}/key with {"keyExpiryDisabled": false} to approve it under Tailnet Lock. This step is best-effort: if cloud-init is slow, the operator can still approve manually from the Tailscale admin console; the deploy job logs a warning rather than failing.
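
A sketch of that post-apply poll, assuming the device-list and approval calls described above (function name and timing are illustrative):

import time
import requests

TS_API = "https://api.tailscale.com/api/v2"

def approve_device(api_token: str, tailnet: str, instance_id: str, timeout_s: int = 600) -> bool:
    # Poll for a device whose hostname equals the EC2 instance-id, then approve it.
    headers = {"Authorization": f"Bearer {api_token}"}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        devices = requests.get(f"{TS_API}/tailnet/{tailnet}/devices",
                               headers=headers, timeout=10).json().get("devices", [])
        match = next((d for d in devices if d.get("hostname") == instance_id), None)
        if match:
            requests.post(f"{TS_API}/device/{match['id']}/key",
                          headers=headers, json={"keyExpiryDisabled": False},
                          timeout=10).raise_for_status()
            return True
        time.sleep(15)
    return False   # best-effort: the deploy job logs a warning instead of failing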

Destroy Lifecycle

Step | Action | Notes
1 | Pre-destroy: confirm termination protection is OFF on the EC2 | Refuses to start if it is on — the operator must explicitly disable it, to avoid leaving SSM/IAM/SG cleanup half-done while the instance survives
2 | Run terraform destroy | Deletes EC2, SG, IAM, KMS (30-day window), VPC, S3, etc. The Cloudflare A record is removed by the dashboard separately
3 | Delete the 6 SSM parameters under /m8trx/<name>/* | Idempotent — missing params are skipped
4 | DELETE {brain_url}/admin/customers/<brain_customer_id> | Cascades on the brain side: events, agents, api_keys, customer row
5 | Tailscale auth key auto-revokes | Ephemeral + 90-day expiry — no explicit revoke API call needed in the normal path

Settings UI (one-time, fleet-wide)

All six fleet-wide brain-telemetry values are entered once via the dashboard's topbar settings cog (📡 modal). GET endpoints return masked values (last 4 chars + bullets); POST writes them to state.json.

Setting | What it is | Used during
tailscale_api_token | OAuth client / API token with ACL + key + device-approval scopes | provision + destroy
tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github) | provision + destroy
brain_url | Base URL of the brain server (Tailscale-internal preferred) | provision + destroy + customer-instance runtime
brain_admin_token | Brain admin bearer — only used by the dashboard, never written to a customer instance | provision + destroy
agent_fetch_pat | GitHub PAT with read-only scope on M8trxInfra/M8trxAgent | provision (copied to per-customer SSM)
agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers (rotatable per-customer afterwards) | provision (copied to per-customer SSM)

IAM Scope (instance-side)

Each customer's EC2 instance role has a scoped brain_telemetry_ssm_read inline policy. The role can read only its own customer's path — a leaked instance role cannot pivot to another customer's brain key, Tailscale auth key, or PAT.

resource "aws_iam_role_policy" "brain_telemetry_ssm_read" {
  policy = jsonencode({
    Statement = [{
      Sid      = "ReadOwnBrainTelemetryParams"
      Effect   = "Allow"
      Action   = ["ssm:GetParameter", "ssm:GetParameters"]
      Resource = "arn:aws:ssm:*:*:parameter/m8trx/${local.ssm_customer_name}/*"
    }]
  })
}

The permission boundary on each instance role also explicitly permits ssm:GetParameter + ssm:GetParameters — without that, the policy above would still be denied at the boundary. local.ssm_customer_name strips any -agent resource-naming suffix back to the deployer name, so SSM paths and tfvars keys stay aligned.
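
One way to express that local value in HCL (illustrative; the module's actual variable names may differ):

locals {
  # "acme-agent" -> "acme": SSM paths and tfvars keys use the deployer name
  ssm_customer_name = trimsuffix(var.customer_name, "-agent")
}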

Brain-Side Monitoring (what the team sees)

Once an Agent customer is online, Paperclip POSTs events (login, task created, task settled, finance/leads activity, errors) to the brain over the customer's bearer. The brain rolls these into per-customer dashboards (covered in the brain repo's own docs). Things the team monitors at the platform layer:

Failure Modes & Operator Actions

What goes wrong | Symptom | Fix
Brain mint returns 409 | Deploy aborts with "Brain already has a customer with id…" | Pick a different deployer name, or revoke the orphaned brain customer first via the brain admin UI
Tailscale ACL upsert 412 (precondition failed) | Race against another concurrent dashboard run | Retry — uses ETag-based optimistic locking, the second attempt usually wins
Tailnet Lock approval times out post-apply | Warning in the deploy log, customer instance can't reach the brain yet | Approve manually in the Tailscale admin console; cloud-init may still be running
SSM put fails after the Tailscale key is minted | Half-provisioned state — auth key exists, SSM is empty | Re-run the deploy: the brain mint will 409 (so abort and fall through to manual re-provision), or revoke the unused auth key from Tailscale and start over
Operator rotates brain_admin_token | All future provisions and destroys use the new token | No customer-instance impact — instances hold their own per-customer brain-key, not the admin token

Data Protection

All data encrypted at rest with per-customer KMS keys and in transit with TLS.

Encryption at rest: EBS volumes with AES-256 (per-customer KMS CMK); S3 objects with SSE-KMS (CMK); EBS snapshots inherit encryption from the volume; CloudTrail logs with SSE-KMS (shared key). Encryption in transit: end-user → customer instance over HTTPS via Caddy + Let's Encrypt; instance → AWS APIs / Brain over TLS via the IGW (outbound) / Tailscale; admin → dashboard over the WireGuard (Tailscale) overlay. The S3 bucket policy denies any request without SecureTransport.

AI Agent Sandboxing

AI agents handle sensitive customer data — email, financial documents, chats, business records. They run inside a Docker compose project with multiple independent restriction layers, so a single failure (a vulnerable Python dep, a leaked secret, a user-data exploit) does not give code-execution on the host or read access to another customer. The mechanisms below apply to both templates; the M8trx Agent template adds Caddy + Postgres + Ollama as additional compose services bound to the same protections.

Host OS (hardened Ubuntu). iptables network rules: allow DNS, HTTP, HTTPS; block the metadata endpoint (169.254.169.254) and all other destinations; rate limit 30 requests/min. Docker runtime: cap_drop: ALL (no capabilities), no-new-privileges (no escalation), seccomp syscall whitelist, read_only: true (immutable FS), user 1000:1000 (non-root), CPU limit 2 cores, RAM limit 4 GB. Docker compose project (one per customer): caddy (TLS), paperclip (api), m8trx-bridge, postgres, ollama (llama). Workspaces: /var/m8trx/workspaces (rw, KMS-encrypted EBS); /tmp is tmpfs with noexec; the host runs only Tailscale and Docker. Secrets: SSM SecureString fetched at first boot into /etc/m8trx/brain.env (0600) and compose/.env (0700); no keys in image layers.

Per-Service Notes

Service | Internal Port | Exposed? | Talks to
caddy | :80, :443 | Yes (host SG) | Internet (LE ACME challenge), m8trx-bridge:3200, paperclip:3100
paperclip | :3100 | No (only via Caddy) | postgres, Anthropic API (per-customer key), Brain (per-customer bearer)
m8trx-bridge | :3200 | No (only via Caddy) | paperclip, Brain
postgres | :5432 | No (compose-internal) | Only the paperclip service in the same project
ollama | :11434 | No (compose-internal) | Local LLM inference fallback when Claude is unreachable

Management Access

Two distinct access paths for the operator team: (1) reaching the operations dashboard over Tailscale, and (2) reaching individual customer instances for shell or telemetry. Neither path uses a dedicated management proxy any more — the mgmt-proxy Terraform module was retired on 2026-04-21 because Tailscale-on-the-dashboard-host plus AWS SSM Session Manager cover both paths at zero extra cost.

[Access diagram] Operations dashboard: bound to its Tailscale IP only, FastAPI on :8443 with HTTP Basic Auth, PID lock + cron watchdog, DASHBOARD_PASS required at boot. Customer instance: no SSH ingress ever, SSM Session Manager lands as ubuntu, Tailscale-joined (M8trx Agent), CloudTrail records every session. Option A (recommended): Tailscale → dashboard, $0/mo, Tailscale on the operator host, WireGuard mesh, device approval. Option B: SSM port-forward, $0, aws ssm start-session, IAM-authenticated and audited in CloudTrail, tunnels local :8443 → instance :8443. Option C: SSM → customer shell, $0, lands as ubuntu, triggered from the dashboard or the AWS CLI, no SSH. All options: zero public ingress on the dashboard, no SSH ingress on customers, every action audited.

Option Comparison

Option | Reaches | Cost | Admin Needs | Add / Remove Admin | Best For
A. Tailscale | Dashboard | $0 (free 100 devices) | Tailscale app + invite to tailnet | Approve / delete device — instant | Daily-driver for the whole team
B. SSM Port-Forward | Dashboard | $0 | AWS CLI + SSM plugin + IAM user/role with ssm:StartSession + MFA | Disable IAM user / role | Break-glass when Tailscale is unavailable
C. SSM Session Manager | Customer instance shell | $0 | Triggered from dashboard "Shell Access" panel, or AWS CLI | Same — IAM-controlled | Forensics, debugging, ad-hoc commands

Setup — Tailscale on the Dashboard Host

1

Install Tailscale on the dashboard EC2 + your laptop

# On the dashboard host (one-time):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorise

# Then on your laptop, install Tailscale and sign in to the same tailnet.
2

Bind the dashboard to its Tailscale IP

export DASHBOARD_HOST=$(tailscale ip -4 | head -1)   # e.g. 100.98.195.91
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>'    # required, no default
cd dashboard && python main.py

A cron watchdog re-runs this command if the process dies — see scripts/cron/watchdog.sh. PID lock at dashboard/dashboard.pid prevents duplicates.

3

Add more admins

Each admin installs Tailscale and joins the tailnet. You approve their device in the Tailscale admin console; they can immediately reach the dashboard at https://<tailscale-ip>:8443 and sign in with the dashboard's HTTP Basic credentials. Revocation: delete their device — instant.

Setup — SSM Port-Forward (break-glass for the dashboard)

# Install SSM plugin: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

aws ssm start-session \
  --target i-DASHBOARD_INSTANCE_ID \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["8443"],"localPortNumber":["8443"]}'

# Open https://localhost:8443 in your browser.

Setup — SSM Session Manager (customer-instance shell)

Triggered from the dashboard's per-customer detail view. The dashboard runs ssm:SendCommand with an inline AWS-RunShellScript document, polling every 3s for output (max 300s). Sessions land as the ubuntu user (not root, not ssm-user) — so paths and ownership match what the customer's compose stack expects. Commands are validated against a blocklist (rm -rf /, mkfs, shutdown, pipes-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
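
A sketch of that validation gate with an illustrative blocklist (the real patterns live in the /exec route handler):

import re

MAX_COMMAND_LEN = 2000
BLOCKED_PATTERNS = [
    r"rm\s+-rf\s+/",              # recursive delete of root
    r"\bmkfs\b",                  # filesystem format
    r"\b(shutdown|reboot|halt)\b",
    r"\|\s*(ba|z)?sh\b",          # pipe-to-shell
]

def validate_command(cmd: str) -> str:
    if len(cmd) > MAX_COMMAND_LEN:
        raise ValueError("command exceeds 2000 characters")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, cmd):
            raise ValueError(f"command matches blocked pattern: {pattern}")
    return cmd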

# Or from the AWS CLI directly:
aws ssm start-session --target i-CUSTOMER_INSTANCE_ID
# lands as ubuntu, full audit in CloudTrail

Security Properties

Property | Detail
Dashboard exposure | Binds to its Tailscale IP only — not accessible on the public IP, nor on AWS-internal IPs from outside the tailnet
Network auth | Tailscale: device approval + Tailnet Lock; SSM: IAM + MFA, audited in CloudTrail
Application auth | HTTP Basic Auth middleware on every route except /api/health
Security headers | X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Cache-Control: no-store on all responses
Password policy | DASHBOARD_PASS env var required — no default. Dashboard refuses to start without it.
PID lock | dashboard.pid prevents duplicate instances from running simultaneously
Audit trail | Tailscale admin console (devices + sessions); CloudTrail (every SSM session); dashboard job log (every dashboard action)
Customer-instance access | SSM Session Manager only — no SSH, no Tailscale-SSH (--ssh=false). Sessions land as ubuntu.
Encryption | WireGuard (Tailscale); TLS (SSM); TLS (Caddy + LE on customer instances)

Monitoring & Detection

Multi-layered monitoring with automated alerting across all customer environments.

Data sources: VPC flow logs, CloudTrail, instance logs, S3 access logs, DNS logs. Detection services: GuardDuty (credential abuse, crypto mining, C2, malware), CloudWatch alarms (CPU > 90%, unauthorized API calls > 5), AWS Config (continuous compliance recording). Alerting: SNS security-alerts topic → email (SES) and the dashboard. Log retention: syslog and agent logs 90 days; auth, audit, and CloudTrail logs 365 days.

Incident Response

One-command quarantine triggered from the dashboard UI or CLI.

1. Snapshot: all EBS volumes, preserving evidence. 2. Quarantine SG: a security group with zero ingress/egress for complete isolation. 3. Apply to instance: replace all security groups for an instant network cutoff. 4. Tag & report: apply the QUARANTINED tag and capture metadata. SSM still works for forensics; do NOT terminate. Post-quarantine: review CloudTrail, check audit.log via SSM, rotate all customer secrets, investigate root cause.
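
A boto3 sketch of the quarantine sequence (function and group names are illustrative; the shipped automation is scripts/incident-response.sh):

import boto3

def quarantine(instance_id: str, region: str = "us-east-2") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    inst = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]

    # 1. Snapshot every attached EBS volume to preserve evidence.
    for mapping in inst.get("BlockDeviceMappings", []):
        ec2.create_snapshot(VolumeId=mapping["Ebs"]["VolumeId"],
                            Description=f"quarantine evidence for {instance_id}")

    # 2. Create a security group with no ingress and no egress (revoke the default egress rule).
    sg_id = ec2.create_security_group(GroupName=f"quarantine-{instance_id}",
                                      Description="zero ingress/egress",
                                      VpcId=inst["VpcId"])["GroupId"]
    ec2.revoke_security_group_egress(GroupId=sg_id, IpPermissions=[
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

    # 3. Replace all security groups on the instance for an instant network cutoff.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[sg_id])

    # 4. Tag for tracking; do not terminate.
    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "QUARANTINED", "Value": "true"}])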

Security Controls Matrix

Every threat has at least 3 independent control layers. If one fails, the others still protect.

Threat | Layer 1 | Layer 2 | Layer 3
SSRF credential theft | IMDSv2 required | Hop limit = 1 | iptables block 169.254.x.x
Lateral movement | Separate VPCs | Per-customer IAM | Per-customer KMS
Agent breakout | Docker + seccomp | AppArmor profile | Non-root + read-only FS
Data exfiltration | Network rate limit | VPC flow logs | S3 TLS + KMS enforced
Credential compromise | No static keys | Permission boundaries | Unauthorized API alerts
Crypto mining | GuardDuty | CPU alarm > 90% | Container CPU limits
Security tampering | SCPs deny disable | IAM deny in boundary | CloudTrail validation + GuardDuty
Unauthorized dashboard | Tailscale-only binding | Auth middleware (all routes) | Security headers + PID lock
Shell injection (dashboard) | shell=False (subprocess) | Command blocklist on /exec | Input regex validation
Credential leakage | No default password | Passwords never logged | /api/health strips AWS info
Audit log tampering | auditd immutable (-e 2) | Shipped to CloudWatch | CloudTrail validation
Cross-customer telemetry leak | Per-customer Brain key | Tailscale tag-only ACL | SSM IAM scoped to /m8trx/<own>/*
Stale agent UI password | Per-customer Caddy basic-auth | Rotatable via dashboard (SSM SendCommand) | Plaintext base64 in transit only

Per-User Access Control

Each customer instance supports multiple users with role-based access, per-user data isolation, granular permissions, and full audit logging. This is the application layer running inside the Docker sandbox — distinct from the deployer-side dashboard auth (HTTP Basic Auth) and from the per-customer Caddy basic-auth that gates the agent UI. The model below applies to the baseline (custom-app) template; the M8trx Agent template uses Paperclip's own user/permission model — see Paperclip docs.

[Per-user model] One instance per customer, multiple users: Admin (all permissions), User A (email, docs, chat), User B (financial, docs), Auditor (readonly, all categories). Auth: JWT tokens (HMAC-SHA256, 30-minute expiry), PBKDF2 passwords (100k iterations), role and permission checks on every request. RBAC: admin sees all data and all users; user sees own data only; readonly can read own data but not write. Path traversal is blocked and uploads are capped at 50 MB. Per-user data directories: admin/ has docs/, emails/, chats/, financial/; user-a/ has docs/, emails/, chats/ (no financial access); user-b/ has docs/ and financial/ (no email or chat access). Audit log (append-only JSON Lines): every login, file read, file write, and user change → CloudWatch.

Roles & Permissions Matrix

Capability | admin | user | readonly
View own files | Yes | Yes | Yes
Upload / write files | Yes | Yes | No
Delete own files | Yes | Yes | No
View other users' files | Yes | No | No
Create / manage users | Yes | No | No
Reset user passwords | Yes | No | No
View full audit log | Yes | Own events only | Own events only
Use agent features | Yes | Per permission | No

Granular Permissions

email

Access email processing features and the emails/ data directory. Required for agents that handle customer email.

financial

Access financial document processing and the financial/ directory. Enables stricter audit logging on these operations.

chat

Access chat/messaging features and the chats/ data directory.

documents

Access general document storage and the documents/ data directory.

web_portal

Access web portal features for the customer's public-facing services.

API Endpoints

Endpoint | Method | Who | What it does
/auth/login | POST | Anyone | Authenticate, receive JWT token
/auth/me | GET | Authenticated | View own profile and permissions
/auth/change-password | POST | Authenticated | Change own password (min 12 chars)
/users | GET/POST | Admin | List all users / create new user
/users/{id} | GET/PATCH/DELETE | Admin | View / update / disable user
/users/{id}/reset-password | POST | Admin | Reset user password (returns temp pw)
/data/{category} | GET/POST | User+ | List own files / upload file
/data/{category}/{file} | GET/DELETE | User+ | Download / delete own file
/data/admin/{user}/{cat} | GET | Admin | List any user's files
/data/admin/{user}/{cat}/{file} | GET | Admin | Read any user's file
/audit | GET | Authenticated | View audit log (scoped to role)
/health | GET | Anyone | Container health check

Security Controls

Control | Implementation
Password hashing | PBKDF2-SHA256 with 100,000 iterations + random salt
Token signing | HMAC-SHA256, secret from AWS Secrets Manager
Token expiry | 30 minutes — forces re-authentication
Path traversal | Filename stripped to basename, resolved path checked against data root
File size | 50 MB maximum per upload
User deletion | Soft-delete only — account disabled, data preserved for audit
Default admin | Created on first boot with temporary password printed to logs
Audit trail | Append-only JSONL, shipped to CloudWatch, tamper-evident
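
A sketch of two of these controls, the PBKDF2 hashing and the path-traversal check (iteration count and size limit from the table; module and directory names are illustrative):

import hashlib
import hmac
import os
from pathlib import Path

def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    # PBKDF2-SHA256, 100,000 iterations, random 16-byte salt.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    return hmac.compare_digest(hash_password(password, salt)[1], digest)

DATA_ROOT = Path("/var/m8trx/workspaces")   # per-user directories live below this

def safe_path(user: str, category: str, filename: str) -> Path:
    # Strip the filename to its basename and refuse anything resolving outside the data root.
    candidate = (DATA_ROOT / user / category / Path(filename).name).resolve()
    if not str(candidate).startswith(str(DATA_ROOT.resolve()) + os.sep):
        raise PermissionError("path traversal blocked")
    return candidate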

Repository Structure

m8trxDeployer/
├── terraform/
│   ├── main.tf                         # Wires both customer maps (customers + m8trx_agent_customers)
│   ├── variables.tf                    # Input variables
│   ├── cloudflare/                     # Zone-level resources (managed separately from customer state)
│   ├── environments/prod/              # base.tfvars.json + per-template customer tfvars
│   └── modules/
│       ├── vpc/ main.tf                            # Per-customer VPC, public subnet, IGW, flow logs, SG (port-driven)
│       ├── ec2/ main.tf, user_data.sh              # Baseline (custom-app) instance — single port :8080
│       ├── ec2-m8trx-agent/ main.tf, user_data.sh  # M8trx Agent (Path B — git+compose at first boot)
│       ├── iam/ main.tf                            # Per-customer role + permission boundary + brain_telemetry_ssm_read
│       ├── kms/ main.tf                            # Per-customer CMK with auto-rotation
│       ├── s3/ main.tf                             # Per-customer bucket (encrypted, versioned, TLS-enforced)
│       └── monitoring/ main.tf                     # GuardDuty, CloudTrail, alarms, AWS Config
│                                       # (jumpbox / mgmt-proxy module retired 2026-04-21)
├── dashboard/                          # FastAPI operations dashboard
│   ├── main.py, config.py, state.json
│   ├── routers/                        # customers, settings (brain-telemetry + cloudflare), platform, usage, incidents
│   ├── services/                       # aws_client, terraform_runner, brain_telemetry, brain_client, tailscale_client, domain, security_checker, decommission_checker
│   ├── models/                         # customer.py (Pydantic validation, including template selector + subdomain)
│   └── static/                         # index.html, app.js, architecture.html (= arch-dashboard/index.html)
├── arch-dashboard/                     # Architecture docs source (this page) — copy to dashboard/static after edits
├── paperclip/                          # git submodule → M8trxInfra/M8trx_agent (pinned SHA = AMI provenance)
├── packer/                             # Hardened base AMI build (Ubuntu 22.04, used by both templates)
├── scripts/
│   ├── hardening/                      # harden-ubuntu.sh, apparmor-agent-profile
│   ├── agent-sandbox/                  # Dockerfile, docker-compose.yml, seccomp-profile.json
│   ├── cron/                           # watchdog.sh (auto-restart dashboard if it dies)
│   └── incident-response.sh
└── policies/                           # scp-guardrails.json, deployer-iam-policy.json

Per-Customer Cost Breakdown (one customer, t3.medium, audited 2026-05-05)

Defaults from terraform/environments/prod/*.tfvars.json: instance_type=t3.medium, volume_size_gb=30. Retention from terraform/modules/vpc/main.tf (flow logs: 30d) and terraform/modules/ec2/user_data.sh (syslog 90d, auth 365d, audit 365d, agent 90d).

Item | Baseline | M8trx Agent | Notes
EC2 (t3.medium, on-demand, us-east-2) | ~$30.00 | ~$30.00 | Same instance class for both templates
EBS root (30 GB gp3) | ~$2.40 | ~$2.40 | $0.08/GB/mo gp3 baseline; encrypted with per-customer KMS CMK
EBS snapshots (DLM, daily 03:00, 14-day retain) | ~$2.00 | ~$2.00 | $0.05/GB-month; only the delta of each snapshot is billed after the first
KMS CMK (1 per customer, auto-rotate) | ~$1.00 | ~$1.00 | $1/key/month + tiny API costs
CloudWatch Logs ingestion + storage | ~$2.00 | ~$0.50 | Baseline ships VPC flow (30d) + syslog (90d) + auth (365d) + audit (365d) + agent (90d) via the CW agent. Agent template ships VPC flow only — Path B's user_data does not install the CW agent (deliberate; keeps user_data under the 16 KB limit). Dashboard fills the gap for memory/disk via SSM RunCommand.
CloudWatch alarms (per-customer high-CPU) | ~$0.10 | ~$0.10 | $0.10 per standard-resolution alarm
GuardDuty marginal (per-customer flow log + CT events) | ~$0.50 | ~$0.50 | Detector is account-wide but ingestion volume is per-customer
Data transfer (egress, IGW use) | ~$0.30 | ~$0.50 | Outbound EC2 → internet at $0.09/GB. Brain telemetry rides Tailscale and is billed once. Agent template has slightly more egress (GHCR pulls, LE renewals, Anthropic).
Total per agent instance | ~$38 / mo | ~$37 / mo | A multi-agent customer pays per agent (2 agents ≈ $74–$76/mo). Heavier traffic means more flow logs and more egress; expect +$2–$5 for a busy customer.

Shared platform cost: ~$3–6 / mo. SNS topic (free tier covers low-volume), Terraform state S3 bucket + DynamoDB lock (~$1/mo combined), CloudTrail (first trail free), AWS Config recorder ($0.003 per recorded item — adds up at scale, but trivial for the small fleet). The mgmt-proxy ($7/mo) was removed on 2026-04-21; Tailscale + SSM cover the same access surface at $0. Tailscale is on the free 100-device tier; the Brain server runs on existing infrastructure.

Cost Audit Findings (2026-05-05)

Deployment Guide

Complete step-by-step instructions for initial platform setup and deploying new customer instances. Follow these in order.

Phase 1 — One-Time Platform Setup

These steps are done once to set up the platform infrastructure. After this, deploying a new customer is a single dashboard click that takes ~5 minutes.

1

Build the hardened base AMI (Packer)

The hardened Ubuntu 22.04 AMI is built by Packer and used by both customer templates — the M8trx Agent template no longer bakes its own AMI (Path B installs the agent stack at first boot via git+compose).

cd packer
./build-ami.sh    # runs `packer build` and prints the new AMI ID

What the build does: kernel parameter lockdown (ASLR, ptrace, SYN flood protection), auditd with immutable rules, fail2ban, UFW firewall, AppArmor profiles, SSH locked down (no password auth, no root login), unattended security updates, AIDE file integrity, core dumps disabled, unnecessary packages removed, cron restricted to root. Note the AMI ID — you'll set it in tfvars in step 3.

2

Terraform state backend (already configured)

The repository ships with the S3 + DynamoDB backend wired up. State lives in m8trx-terraform-state (versioned, KMS-encrypted, public-access blocked) with locking in DynamoDB table terraform-locks in us-east-2. No action needed unless you're running this in a fresh AWS account, in which case create the bucket + table once and update terraform/main.tf.

3

Configure Terraform tfvars

Edit terraform/environments/prod/base.tfvars.json with the platform-wide values. Per-customer maps (customers, m8trx_agent_customers) live in their own tfvars files and are managed by the dashboard — no manual editing on customer add/remove.

{
  "aws_region": "us-east-2",
  "project_name": "m8trx",
  "admin_email": "ops@m8trx.ai",
  "base_ami_id": "ami-XXXXXXXXX",         // from step 1, used by both templates
  "m8trx_agent_ami_id": "ami-XXXXXXXXX",  // same as base_ami_id (Path B)
  "platform_domain": "m8trx.ai"
}
4

Tailscale on the dashboard host (mgmt access)

The dashboard binds to a Tailscale IP — there is no public ingress on the operations dashboard. Install Tailscale on the dashboard EC2, join the same tailnet your operators use, and bind the FastAPI server to the Tailscale IP. Detail under the Mgmt Access tab.

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorize

There is no separate management proxy or management VPC any more — that module was retired on 2026-04-21. Tailscale-on-the-dashboard plus AWS SSM cover the same ground at lower cost.

5

Start the dashboard (with required env)

cd dashboard
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

export DASHBOARD_HOST=$(tailscale ip -4 | head -1)
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>'   # REQUIRED — no default
export AWS_REGION=us-east-2
export PROJECT_NAME=m8trx
python3 main.py

A cron watchdog auto-restarts the process every minute if it dies (see scripts/cron/watchdog.sh). A PID lock at dashboard/dashboard.pid prevents duplicates. After every code change, kill and restart: pkill -f "python3 main.py"; sleep 2; rm -f dashboard/dashboard.pid.

6

Configure Cloudflare (subdomain provisioning)

Open the 📡 settings cog in the dashboard topbar. Enter the Cloudflare API token (zone:DNS:edit + zone:zone:read), zone ID, and platform domain (m8trx.ai). The dashboard validates the token on save. Subdomains are created with proxied=false so origin TLS via Caddy + Let's Encrypt works end-to-end.

7

Configure Brain Telemetry (M8trx Agent customers only)

In the same 📡 settings modal, fill the Brain & Tailscale section — these fields are required only when deploying M8trx Agent customers; baseline (custom-app) deploys ignore them.

Field | What it is
tailscale_api_token | Tailscale OAuth client / API token with ACL + key + device-approval scopes
tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github)
brain_url | Internal URL of the brain server (Tailscale-internal address preferred)
brain_admin_token | Brain admin bearer — used by the dashboard, never written to a customer instance
agent_fetch_pat | GitHub PAT (read-only on M8trxInfra/M8trxAgent) — used at customer first-boot to clone the agent repo
agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers — operators rotate per-customer afterwards

All fields are written to state.json. GET endpoints return masked values; full plaintext is only available to whoever holds DASHBOARD_PASS.

8

Apply Organization SCPs

Apply the guardrail policies at the AWS Organization level — these prevent anyone (including a leaked admin role) from disabling security services, launching unencrypted instances, or making S3 buckets public.

# In AWS Organizations console:
# Policies > Service control policies > Create policy
# Paste contents of policies/scp-guardrails.json
# Attach to the OU containing your account

Test SCPs in a sandbox account first. An overly broad SCP can lock you out of your own account.

Phase 2 — Deploying a New Customer (Repeatable)

After Phase 1, deploying a new customer is a single dashboard click that takes ~5 minutes. The form picks the template; the dashboard handles brain-telemetry provisioning, terraform apply, Cloudflare DNS, and Tailnet Lock approval as one orchestrated job.

1

Fill the Deploy Form

On the dashboard, click "Deploy New". The fields shown depend on the template you pick.

Field | Example | Validation / Notes
Customer Name | acme | 3–32 chars, lowercase, alphanumeric + hyphens. Used in all resource names. Must be unique across both templates.
Template | m8trx_agent / baseline | Picks which Terraform module + tfvars map is used. Determines whether brain-telemetry provisioning runs.
Instance Type | t3.medium (default) | Whitelisted dropdown.
Volume Size | 30 GB | 20–500 GB. Encrypted EBS with the customer's per-customer KMS key.
Subdomain (Agent only) | acme | Optional label. Defaults to the customer name. Final hostname is {label}.{platform-domain}.
Agent UI password (Agent only) | auto-generated | Caddy basic-auth on the customer subdomain. Shown in cleartext in the dashboard with a copy button.

Click "Deploy Secure Instance". The form validates all inputs before submitting.

2

Pre-Deploy: Brain Telemetry Provisioning (Agent template only)

For the M8trx Agent template, the dashboard runs brain_telemetry.provision_for_customer() before Terraform sees anything — so a half-provisioned customer never enters the brain or AWS state. Steps stream into the deploy job log:

  1. Brain mint — POST /admin/customers; aborts on 409 (duplicate brain id).
  2. Tailscale ACL upsert — adds tag:m8trx-cust-<id> tagOwner + intra-customer accept rule (must precede auth-key mint).
  3. Tailscale auth key mint — tag-bound, ephemeral, reusable, 90-day.
  4. SSM put × 6 — /m8trx/<name>/{brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password} as SecureStrings. A sketch of steps 3 and 4 follows this list.
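A minimal sketch of steps 3 and 4, assuming the Tailscale v2 API and boto3; the helper names and exact request layout are illustrative, not the dashboard's real provisioning code:

import boto3
import requests

def mint_customer_auth_key(api_token: str, tailnet: str, customer_tag: str) -> str:
    # Tag-bound, ephemeral, reusable auth key with a 90-day expiry.
    resp = requests.post(
        f"https://api.tailscale.com/api/v2/tailnet/{tailnet}/keys",
        auth=(api_token, ""),
        json={
            "capabilities": {"devices": {"create": {
                "reusable": True,
                "ephemeral": True,
                "preauthorized": True,
                "tags": [customer_tag],
            }}},
            "expirySeconds": 90 * 24 * 3600,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["key"]

def put_customer_secrets(name: str, secrets: dict) -> None:
    # Each value lands under /m8trx/<name>/<key> as a SecureString.
    ssm = boto3.client("ssm", region_name="us-east-2")
    for key, value in secrets.items():
        ssm.put_parameter(Name=f"/m8trx/{name}/{key}", Value=value,
                          Type="SecureString", Overwrite=True)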
3

Terraform Apply (creates AWS resources)

The dashboard runs terraform apply -var-file=<template>.tfvars in a background job (terraform_runner, shell=False). Progress streams to the Jobs tab.

Resource | Detail
VPC | Isolated VPC, single public subnet, IGW. CIDR octet allocated stably from the customer name (re-deploys don't shift). Flow logs to CloudWatch (30-day).
Security Group | Baseline: :8080 only. M8trx Agent: :80 + :443. Outbound: all (SSM, GHCR, Brain, Anthropic, Tailscale).
KMS Key | Customer-dedicated CMK with auto-rotation, key policy scoped to that customer's IAM role.
S3 Bucket | Encrypted, versioned, public-access-blocked, TLS-enforced, wrong-key-rejected.
IAM Role | Instance role with permission boundary. M8trx Agent additionally gets the brain_telemetry_ssm_read inline policy scoped to /m8trx/<own-name>/*.
EC2 Instance | Hardened Ubuntu, IMDSv2 required (hop=1), encrypted EBS, EIP. M8trx Agent template runs the 9-step Path B user-data bootstrap.
DLM Snapshots | Daily EBS snapshots at 03:00 UTC, 14-day retention.
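The apply is driven by a background runner; a minimal sketch of that pattern (terraform invoked with shell=False, output streamed line-by-line into a job log). The function name and the -auto-approve flag are illustrative, not the real terraform_runner:

import subprocess

def run_terraform_apply(workdir: str, var_file: str, log_lines: list) -> int:
    # shell=False: arguments are passed as a list, so customer-supplied
    # values are never interpreted by a shell.
    cmd = ["terraform", "apply", "-auto-approve", f"-var-file={var_file}"]
    proc = subprocess.Popen(cmd, cwd=workdir, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:        # stream progress as it happens
        log_lines.append(line.rstrip())
    return proc.wait()              # non-zero exit marks the job failed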
4

Post-Apply: DNS + Tailnet Lock (Agent template only)

After Terraform apply succeeds, the dashboard runs two best-effort follow-ups:

  • Cloudflare DNS upsert — A record {label}.{platform-domain} → instance EIP, proxied=false, comment m8trx-managed:customer (sketched after this list).
  • Tailnet Lock approval — polls Tailscale for a device with hostname == <instance-id>; once it appears (cloud-init step 4 ran), POSTs {"keyExpiryDisabled": false} to approve. If cloud-init is slow, the operator can approve manually from the Tailscale admin console.
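A minimal sketch of the DNS upsert against the Cloudflare v4 API; the function name and TTL are illustrative, while proxied=false and the comment tag mirror the behaviour described above:

import requests

def upsert_agent_dns(token: str, zone_id: str, hostname: str, eip: str, customer: str) -> None:
    api = f"https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records"
    headers = {"Authorization": f"Bearer {token}"}
    record = {
        "type": "A",
        "name": hostname,       # e.g. acme.m8trx.ai
        "content": eip,
        "proxied": False,       # DNS only; Caddy terminates TLS on the instance
        "ttl": 300,
        "comment": f"m8trx-managed:{customer}",
    }
    # Update the record if it already exists, otherwise create it.
    existing = requests.get(api, headers=headers,
                            params={"type": "A", "name": hostname}, timeout=30)
    existing.raise_for_status()
    matches = existing.json()["result"]
    if matches:
        requests.put(f"{api}/{matches[0]['id']}", headers=headers, json=record, timeout=30).raise_for_status()
    else:
        requests.post(api, headers=headers, json=record, timeout=30).raise_for_status()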

End-user traffic flow (Agent template): {name}.m8trx.ai → Cloudflare DNS → instance EIP :443 → Caddy (LE cert + basic-auth) → m8trx-bridge:3200 / paperclip:3100.

5

First-Boot Bootstrap (automatic, Agent template — full detail in Compute tab)

  1. Set hostname to instance-id (so Tailnet Lock can find the device).
  2. Install docker.io, docker-compose-plugin, tailscale.
  3. Read 6 SSM SecureStrings → /etc/m8trx/brain.env (mode 0600).
  4. tailscale up --auth-key=… --hostname=<instance-id> --ssh=false.
  5. git clone --depth 1 the agent repo (PAT stripped from .git/config immediately).
  6. Generate compose/.env with random Postgres creds + bcrypt-hashed agent UI password.
  7. Patch the Caddyfile for the customer hostname + LE; add :443 to the compose ports.
  8. docker compose up -d --pull always. First boot ends when Caddy obtains its certificate.

Logs tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging via the dashboard's Shell Access panel.

6

Hand the Customer Their Welcome Package

The dashboard generates a per-customer welcome card on the customer detail page: URL (the subdomain), username (admin), and password (the agent UI password — shown cleartext, copy button). The operator copies these into a 1Password share or secure email and sends them to the customer.

If the customer needs the password rotated later, the operator clicks "Change Agent UI Password" on the customer panel — see Phase 3.

7

Verify Security Posture

On the customer's row, expand the Security Posture panel (API-level checks) and click "Run Check" on the Deep Security Check panel (live SSM hardening checks, ~15–30s). Both should be green before declaring the customer ready.

Check | What it verifies
IMDSv2 required (hop=1) | Instance metadata requires token + 1-hop limit (no container forwarding)
EBS encryption | All volumes encrypted with the customer's KMS key
SG rules | Only :8080 (baseline) or :80+:443 (Agent) ingress
auditd / fail2ban / UFW / AppArmor | Each running with immutable / active config
Unattended-upgrades | Automatic security patching active
Agent stack health (Agent template) | Compose project up; Caddy has a valid LE cert; paperclip /health green
Brain reachability (Agent template) | Tailscale device present + Tailnet-Lock approved + last paperclip→Brain event < 5 min ago
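For a flavour of the API-level checks, a minimal boto3 sketch covering the first two rows (IMDSv2 and EBS encryption); the real Security Posture panel checks considerably more than this:

import boto3

def check_instance_posture(instance_id: str, region: str = "us-east-2") -> dict:
    ec2 = boto3.client("ec2", region_name=region)
    inst = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    meta = inst["MetadataOptions"]
    imdsv2_ok = meta["HttpTokens"] == "required" and meta["HttpPutResponseHopLimit"] == 1
    volume_ids = [m["Ebs"]["VolumeId"] for m in inst.get("BlockDeviceMappings", []) if "Ebs" in m]
    volumes = ec2.describe_volumes(VolumeIds=volume_ids)["Volumes"] if volume_ids else []
    ebs_ok = bool(volumes) and all(v["Encrypted"] for v in volumes)
    return {"imdsv2_required_hop1": imdsv2_ok, "ebs_encrypted": ebs_ok}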

Phase 3 — Ongoing Operations

M
Monitoring

Check the Monitoring tab for CloudWatch alarms, GuardDuty findings, and CloudTrail events. Resource monitors (CPU, memory, disk, network) display in each customer's detail view — memory and disk use SSM fallback when CloudWatch agent data is unavailable. Instance logs viewable in the Logs tab per customer.

I
Incident Response

If an instance is compromised, click "Quarantine" in the dashboard. This snapshots all volumes, replaces security groups with deny-all, and tags the instance. SSM still works for forensics. Never terminate; preserve the instance for investigation.
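A minimal sketch of that quarantine flow with boto3; the deny-all security group is assumed to already exist, and the tag key is illustrative:

import boto3

def quarantine_instance(instance_id: str, deny_all_sg_id: str, region: str = "us-east-2") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    inst = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    # 1. Snapshot every attached EBS volume for forensics.
    for mapping in inst.get("BlockDeviceMappings", []):
        if "Ebs" in mapping:
            ec2.create_snapshot(VolumeId=mapping["Ebs"]["VolumeId"],
                                Description=f"quarantine forensic snapshot of {instance_id}")
    # 2. Swap all security groups for one with no ingress/egress rules.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[deny_all_sg_id])
    # 3. Tag so operators and the dashboard can see the state at a glance.
    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "m8trx:quarantined", "Value": "true"}])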

S
Shell Access

Each customer detail page shows a Shell Access panel. Copy the SSM command to open an interactive terminal, or type commands directly in the dashboard and see output in real time. Commands are validated against a blocklist (no rm -rf /, pipe-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
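A minimal sketch of that guardrail; the patterns shown are examples only, not the dashboard's full blocklist:

import re

MAX_COMMAND_LENGTH = 2000
BLOCKED_PATTERNS = [
    r"rm\s+-rf\s+/",            # recursive delete from root
    r"curl[^|]*\|\s*(ba)?sh",   # pipe-to-shell
    r"mkfs",                    # filesystem wipe
]

def validate_exec_command(command: str) -> None:
    if len(command) > MAX_COMMAND_LENGTH:
        raise ValueError("command exceeds 2000 characters")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            raise ValueError(f"command matches blocked pattern: {pattern}")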

B
Backups

"Backup Now" creates on-demand EBS snapshots of all volumes. The Backups panel shows all snapshots (automated + on-demand) with state, progress, time, and type. DLM runs daily at 03:00 UTC with 14-day retention. Endpoints: POST /api/customers/{name}/snapshot, GET /api/customers/{name}/snapshots.

R
Resize Disk

Click "Resize Disk" in the detail view to expand a customer's root EBS volume (20-500 GB). Works on both running and stopped instances. For running instances, the filesystem is expanded automatically via SSM (growpart + resize2fs/xfs_growfs). Stopped instances auto-expand on next boot via cloud-init. Endpoint: POST /api/customers/{name}/resize-disk.

E
Email Alerts (Platform)

Click the gear icon in the topbar to configure email alerts via AWS SES. Set a service email address and choose which events trigger alerts: incidents (quarantine), deploy failures, deploy successes, and CloudWatch alarms. Use "Send Test" to verify delivery. The sender address must be verified in SES. Settings stored in dashboard state. Endpoints: GET/PUT /api/platform/alert-settings, POST /api/platform/alert-test. Platform-level only — these alerts are for operators; customer-facing notifications use the channels flow below.

U
Paperclip Source Upgrade (in-place)

Each existing customer instance runs the Paperclip version baked into its AMI. To avoid re-baking + instance replacement for every Paperclip change, the deployer dashboard has an "Upgrade Paperclip Source" button on the per-customer Paperclip panel. It tars paperclip/ from the dashboard host (excluding venv, caches, local DB), base64-encodes it, ships it inline via SSM, rsyncs into /opt/paperclip-src, re-runs the idempotent bootstrap/install.sh, and restarts paperclip.service. No S3 round-trip, no persistent artifacts. Endpoint: POST /api/customers/{name}/paperclip/upgrade. Size-capped at ~90 KB base64 — if the tree outgrows that, switch to an S3-mediated push.
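A minimal sketch of the inline push; the archive handling is simplified (plain tar extract instead of rsync) and the exclusion filter, paths, and commands are illustrative:

import base64
import io
import tarfile
import boto3

MAX_B64_BYTES = 90 * 1024   # ~90 KB cap before an S3-mediated push is needed

def push_paperclip_source(instance_id: str, src_dir: str = "paperclip", region: str = "us-east-2") -> None:
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # Skip venvs and the local DB, mirroring the dashboard's exclusions.
        tar.add(src_dir, arcname="paperclip",
                filter=lambda t: None if "/venv/" in t.name or t.name.endswith(".db") else t)
    payload = base64.b64encode(buf.getvalue()).decode()
    if len(payload) > MAX_B64_BYTES:
        raise RuntimeError("source tree too large for an inline SSM push; use S3 instead")
    boto3.client("ssm", region_name=region).send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            f"echo '{payload}' | base64 -d > /tmp/paperclip.tar.gz",
            "mkdir -p /opt/paperclip-src && tar -xzf /tmp/paperclip.tar.gz -C /opt/paperclip-src --strip-components=1",
            "bash /opt/paperclip-src/bootstrap/install.sh",
            "systemctl restart paperclip.service",
        ]},
    )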

N
Per-Customer Notifications (Paperclip Channels)

Each M8trx Agent customer has their own Gmail + Telegram notification channels for task events (approvals needed, task complete, task failed). Configured in two places:

  • Deployer dashboard → Paperclip panel → "Notification channels (seed)": operator seeds Gmail SMTP creds (smtp.gmail.com:587 + Google App Password) and/or a Telegram bot token + chat id when onboarding. Written to /etc/paperclip/config.yaml via SSM alongside the Anthropic key and admin password.
  • Paperclip (customer dashboard) → gear icon → Settings modal: the customer can override any field. Overrides live in the instance's SQLite (channel_settings table) and win field-by-field over the deployer seed; blank fields inherit the seed. "Send test" button fires a live notification to verify.

Outbound notifications fire-and-forget from a daemon thread inside Paperclip — hooked at dispatcher._settle() (complete/failed) and POST /api/tasks when needs_approval=true. Failures are logged, never block task execution. Endpoints on Paperclip: GET /api/channels (redacted), PUT /api/channels/{email|telegram}, DELETE /api/channels/{provider} (revert to seed), POST /api/channels/test. Gmail requires 2FA + an App Password, not the account password.
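The field-by-field precedence reduces to a small merge; a sketch with illustrative field names:

def effective_channel_settings(seed: dict, override: dict) -> dict:
    # Customer-entered values win field-by-field; blank fields inherit the seed.
    merged = dict(seed)
    for field, value in override.items():
        if value not in (None, ""):
            merged[field] = value
    return merged

seed = {"smtp_host": "smtp.gmail.com", "smtp_port": 587, "smtp_user": "ops@example.com"}
override = {"smtp_user": "owner@acme.com", "smtp_host": ""}   # blank host inherits the seed
assert effective_channel_settings(seed, override)["smtp_host"] == "smtp.gmail.com"
assert effective_channel_settings(seed, override)["smtp_user"] == "owner@acme.com"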

Chat Fast-Path (Intent-Aware SDK Worker)

Casual chat replies previously took 3–5 s (CLI subprocess spawn) and data-aware prompts like "summarize my inbox" took 30–150 s (Claude tool-use loop). Two changes land the fast-path:

  • Intent detection + context pre-fetch (chat_context.py): when a chat message mentions inbox / tasks / events / finance / leads, regex rules detect the intent(s), the relevant SQLite rows are fetched directly (≤50 ms), and injected as a plain-text context block prepended to the task prompt. Claude answers from this context in one turn — no tool-use loop needed.
  • SDK worker (workers/claude_sdk.py): calls the Anthropic Messages API directly via the Python SDK, bypassing the claude CLI subprocess (~1–2 s overhead). API key required; no OAuth path (OAuth is CLI-audience only). First-token latency in the sub-second range for trivial chat on a t3.medium.

Three-tier dispatcher routing for chat-linked tasks (tasks with an assistant placeholder message): Tier 1 — SDK worker (fast, API key required). Tier 2 — CLI worker (OAuth supported, full tool use) on any SDK availability/auth failure. Tier 3 — Hermes/Llama (local, free) on CLI availability failure. Non-chat tasks (draft-reply approval, coding) go straight to the CLI worker unchanged. The meta field on each WorkerResult records which worker answered and any fallback chain (fallback_from).
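A minimal sketch of that tiered routing; the worker call signatures and meta layout are illustrative, not Paperclip's actual dispatcher interfaces:

def run_chat_task(prompt: str, sdk_worker, cli_worker, hermes_worker) -> dict:
    # Tier 1: Anthropic SDK (API key). Tier 2: claude CLI (OAuth, full tool use).
    # Tier 3: local Hermes/Llama. Each fallback is recorded for the meta field.
    fallbacks = []
    for name, worker in (("sdk", sdk_worker), ("cli", cli_worker), ("hermes", hermes_worker)):
        try:
            text = worker(prompt)
            return {"text": text, "meta": {"worker": name, "fallback_from": fallbacks}}
        except Exception as exc:               # availability or auth failure: try the next tier
            fallbacks.append({"worker": name, "error": str(exc)})
    raise RuntimeError(f"all workers failed: {fallbacks}")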

System prompt is shared via workers/_claude_system_prompt.py — imported by both CLI and SDK workers so the context description stays identical across both paths.

U
Adding Users

Customer admins manage their own users via the agent API. Each user gets isolated data directories and scoped permissions. All actions audit-logged and shipped to CloudWatch.

D
Delete Customer

Click "Delete Customer" in the detail view. The dashboard refuses if AWS termination protection is on — disable it first (operator-explicit, intentionally a friction point). Pre-destroy: empty S3 (including versioned objects), delete CloudWatch log groups, detach IAM permission boundaries. Run terraform destroy. Post-destroy (Agent template): delete the 6 SSM params under /m8trx/<name>/*, DELETE /admin/customers/<brain_customer_id> on the brain, remove the Cloudflare A record. KMS key keeps a 30-day deletion window.

P
Rotate Agent UI Password

Per-customer Caddy basic-auth password. Click "Change Agent UI Password" on the customer panel (8–256 chars, validated). Two-step rotation: (1) update /m8trx/<name>/agent-ui-password in SSM (so future re-deploys pick up the new password); (2) SendCommand to the running instance — re-hash with caddy hash-password, write into compose/.env, docker compose up -d --force-recreate caddy. Plaintext is base64-passed in the SSM command (encrypted in transit, not logged at rest). Endpoint: POST /api/customers/{name}/agent-password.
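A minimal sketch of the two-step rotation; the compose paths, env var name, and hash-password invocation are illustrative stand-ins for the real SendCommand payload:

import base64
import boto3

def rotate_agent_ui_password(name: str, instance_id: str, new_password: str, region: str = "us-east-2") -> None:
    ssm = boto3.client("ssm", region_name=region)
    # Step 1: future re-deploys pick the new password up from SSM.
    ssm.put_parameter(Name=f"/m8trx/{name}/agent-ui-password", Value=new_password,
                      Type="SecureString", Overwrite=True)
    # Step 2: re-hash on the instance and recreate only the caddy service.
    encoded = base64.b64encode(new_password.encode()).decode()   # plaintext never appears in the command log
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "cd /opt/m8trx-agent",
            f"PW=$(echo {encoded} | base64 -d)",
            'HASH=$(docker compose exec -T caddy caddy hash-password --plaintext "$PW")',
            'sed -i "s|^AGENT_UI_PASSWORD_HASH=.*|AGENT_UI_PASSWORD_HASH=$HASH|" compose/.env',
            "docker compose up -d --force-recreate caddy",
        ]},
    )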

T
Tailscale Auth-Key Rotation

Customer auth keys are tag-bound, ephemeral, 90-day. They auto-revoke at expiry — but the device stays joined as long as tailscaled is running. To rotate ahead of expiry: re-deploy the customer (full bootstrap with a fresh key) or run a manual mint + push the new key into the customer's SSM, then restart tailscaled on the instance via the Shell Access panel. Planned: a "Rotate auth key" button on the customer panel.

O
Orphan Detection

A background decommission checker runs every 5 minutes, scanning AWS for orphaned resources (EC2 instances, VPCs, S3 buckets, KMS keys, IAM roles) from deleted customers no longer in state. Orphaned instances and VPCs are auto-cleaned to prevent cost leaks. On-demand check: GET /api/decommission-check.

Troubleshooting

Issue | Solution
Deploy job shows "No valid credential sources" | The dashboard host needs AWS credentials. Attach an IAM role to the instance or set AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars before starting the dashboard.
Deploy aborts: "Brain telemetry settings are not configured" | Open the 📡 settings cog and fill the brain-telemetry section. Required only for M8trx Agent customers.
Deploy aborts: "Brain already has a customer with id 'cust_…' (409)" | The brain side has a leftover customer. Either pick a different deployer name, or revoke the existing brain customer via the brain admin UI before retrying.
Tailscale ACL upsert fails 412 | Race against another concurrent dashboard run. Retry — the upsert uses ETag-based optimistic locking and the second attempt usually wins.
Tailnet Lock approval times out post-apply | Cloud-init may still be running on the instance. Check /var/log/m8trx-agent/first-boot.log via SSM. Once the device appears in Tailscale, approve manually from the admin console — no need to re-run the deploy.
Caddy can't get a Let's Encrypt cert | Verify the Cloudflare A record exists (proxied=false), points at the EIP, and SG ingress :80 is open (LE HTTP-01 challenge). Run docker compose logs caddy via Shell Access for detail.
Customer can't reach {name}.m8trx.ai | DNS not yet propagated — ~1 min typical for Cloudflare. Or the Caddy basic-auth password they were given is stale: rotate via the Phase 3 card.
Compose stack not starting | Shell Access → cd /opt/m8trx-agent && docker compose ps. Check docker compose logs paperclip for missing env vars or DB init issues.
Deep security check shows "unknown" | SSM agent takes 1–2 minutes to register after instance launch. Also verify the instance's IAM role has SSM permissions.
Destroy refuses with "termination protection enabled" | Intentional — disable termination protection in the EC2 console (Actions → Instance Settings → Change termination protection) before retrying. Forces operator-explicit consent for destructive actions.

API Token Usage Tracking

Per-customer Claude API token consumption analytics with cost estimation and configurable alert thresholds. Tracks input, output, cache read, and cache creation tokens across all Claude model tiers.

Architecture Overview

Data flow: customer instances (Claude API calls: token counts, model selection) → usage_scanner.py (token aggregation and cost calculation, per-customer / per-model / per-day) → usage.db (SQLite) → /api/usage/* (FastAPI router, HTTP Basic Auth required) → cost alert engine (threshold monitoring, thresholds in state.json, SES email notifications) → dashboard Token Usage tab.

Data Model

Usage data is stored in a local SQLite database (usage.db) with the following schema.

Column | Type | Description
customer | TEXT | Customer identifier (e.g. acme-corp)
day | TEXT (ISO date) | Date of usage
model | TEXT | Claude model used (opus-4-6, sonnet-4-6, haiku-4-5)
sessions | INTEGER | Number of API sessions
input_tokens | INTEGER | Prompt input tokens consumed
output_tokens | INTEGER | Output tokens generated
cache_read | INTEGER | Cache read tokens (prompt caching)
cache_creation | INTEGER | Cache creation/write tokens

Indexed on (customer) and (day), with a unique constraint on (customer, day, model) to support efficient querying and upsert operations.
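A sketch of that schema and the upsert it enables, assuming sqlite3 and the columns above; the table name and exact DDL in usage_scanner.py may differ:

import sqlite3

conn = sqlite3.connect("usage.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS usage (
    customer        TEXT NOT NULL,
    day             TEXT NOT NULL,              -- ISO date
    model           TEXT NOT NULL,
    sessions        INTEGER NOT NULL DEFAULT 0,
    input_tokens    INTEGER NOT NULL DEFAULT 0,
    output_tokens   INTEGER NOT NULL DEFAULT 0,
    cache_read      INTEGER NOT NULL DEFAULT 0,
    cache_creation  INTEGER NOT NULL DEFAULT 0,
    UNIQUE (customer, day, model)
);
CREATE INDEX IF NOT EXISTS idx_usage_customer ON usage (customer);
CREATE INDEX IF NOT EXISTS idx_usage_day ON usage (day);
""")
# The unique key makes re-scans idempotent: a second pass over the same day updates in place.
conn.execute("""
INSERT INTO usage (customer, day, model, sessions, input_tokens, output_tokens, cache_read, cache_creation)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (customer, day, model) DO UPDATE SET
    sessions = excluded.sessions,
    input_tokens = excluded.input_tokens,
    output_tokens = excluded.output_tokens,
    cache_read = excluded.cache_read,
    cache_creation = excluded.cache_creation
""", ("acme", "2026-05-05", "sonnet-4-6", 12, 150000, 42000, 300000, 25000))
conn.commit()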

Pricing & Cost Calculation

Costs are estimated using per-model pricing rates (per 1M tokens). The calc_cost() function computes costs across all four token types.

Model | Input | Output | Cache Write | Cache Read
Claude Opus 4.6 | $6.15 | $30.75 | $7.69 | $0.61
Claude Sonnet 4.6 | $3.69 | $18.45 | $4.61 | $0.37
Claude Haiku 4.5 | $1.23 | $6.15 | $1.54 | $0.12
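A sketch of the cost math using the rates above (all rates are per 1M tokens); the real calc_cost() in usage_scanner.py may differ in signature:

# $ per 1M tokens: (input, output, cache write, cache read)
PRICING = {
    "opus-4-6":   (6.15, 30.75, 7.69, 0.61),
    "sonnet-4-6": (3.69, 18.45, 4.61, 0.37),
    "haiku-4-5":  (1.23,  6.15, 1.54, 0.12),
}

def calc_cost(model: str, input_tokens: int, output_tokens: int,
              cache_creation: int, cache_read: int) -> float:
    in_rate, out_rate, cw_rate, cr_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate
            + cache_creation * cw_rate + cache_read * cr_rate) / 1_000_000

# Example: 150k input, 42k output, 25k cache-write, 300k cache-read on Sonnet 4.6
print(round(calc_cost("sonnet-4-6", 150_000, 42_000, 25_000, 300_000), 2))   # ~= 1.55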

API Endpoints

All endpoints require HTTP Basic Auth. Mounted under /api/usage/ via FastAPI router.

Method | Endpoint | Description
GET | /api/usage/data | Full usage dashboard data: daily breakdowns, customer totals, pricing, current costs, and alert thresholds
POST | /api/usage/seed | Populate demo/synthetic usage data for POC customers
POST | /api/usage/reset | Clear all usage data and re-seed with fresh demo data
GET | /api/usage/thresholds | List all configured cost alert thresholds
PUT | /api/usage/thresholds | Set or update a per-customer cost alert threshold (period + dollar limit)
DELETE | /api/usage/thresholds/{customer} | Remove the cost alert threshold for a customer
POST | /api/usage/check-alerts | Check all thresholds against current costs; sends SES email alerts for breaches

Cost Alert Thresholds

T
Threshold Configuration

Operators set per-customer cost limits with a period (week or month) and a dollar amount. Thresholds are stored in state.json alongside other dashboard state. When a customer's estimated cost for the current period meets or exceeds the threshold, an alert is triggered.

A
Alert Pipeline

The /api/usage/check-alerts endpoint evaluates all enabled thresholds against current-period costs. Breaches trigger email alerts via AWS SES (using the alert_email service) containing the customer name, period, threshold, current cost, and overage amount. Returns a list of all breaches for dashboard display.

C
Cost Aggregation

The get_customer_costs() function aggregates token usage for the current period: week (Monday through today) or month (1st through today). It sums costs across all models per customer using the calc_cost() pricing function, covering input, output, cache read, and cache write tokens.

Key Files

File | Purpose
dashboard/services/usage_scanner.py | Core module: database init, demo data seeding, usage queries, cost calculation
dashboard/routers/usage.py | FastAPI router: REST endpoints for usage data, thresholds, and alert checks
dashboard/usage.db | SQLite database storing per-customer token usage records
dashboard/state.json | Persists alert thresholds alongside other dashboard settings