System Overview
Multi-tenant AWS platform that deploys hardened, fully isolated customer environments running AI agents. Two customer templates ship today: a baseline environment for custom apps, and the packaged M8trx Agent (Paperclip + Hermes + Claude Code) reachable on a per-customer HTTPS subdomain. A shared Brain Telemetry control plane connects every Agent instance to a central Brain over a Tailscale mesh — for monitoring, key rotation, and operator visibility — without ever exposing a customer instance to another customer.
Executive Summary
m8trxDeployer deploys and manages fully isolated, hardened AWS environments for customers running AI-powered business applications. Today every customer gets their own VPC, KMS key, S3 bucket, IAM role, and EC2 instance — zero shared infrastructure between customer data planes. AI agents that process sensitive data (email, financials, business documents) run in Docker containers managed by the customer's own compose stack, with seccomp, AppArmor, non-root users, and strict resource and network limits.
Two customer templates ship today: baseline (operator-managed Docker compose, single port) and the packaged M8trx Agent (Paperclip orchestrator + Caddy TLS + Postgres + local Llama, reachable at {customer}.{platform-domain} over HTTPS via Let's Encrypt). The Agent template auto-provisions a Cloudflare DNS record, a per-customer Tailscale device on the Brain mesh, and a tag-bound ephemeral auth key on every deploy.
The platform is designed for minimal customer interaction — customers receive a URL and credentials, nothing else. All infrastructure, security, monitoring, and the per-customer Brain key rotation are handled by your team through a unified operations dashboard. Admins reach the dashboard over Tailscale (recommended) or AWS SSM port-forwarding — zero ports open to the public internet. The dashboard supports multi-admin use; any team member can deploy a new customer, rotate the agent UI password, run shell commands via SSM, view logs, take snapshots, or quarantine an instance without ever touching the AWS console.
Where Cloudflare fits. We use Cloudflare for one thing: DNS only (records created with proxied=false). Each agent gets a subdomain under our platform zone (m8trx.ai). The Cloudflare API token, zone ID, and platform domain live in dashboard settings. We do not use Cloudflare's HTTPS proxy, Tunnel, or Access — origin TLS is terminated on the customer instance by Caddy with a Let's Encrypt cert obtained at first boot.
Planned (not yet shipped): a multi-tenant compute mode that packs multiple low-traffic customers onto a single hardened EC2 host. Today every customer is on its own EC2 — but the agent stack is already a Docker compose project (Path B), the project name is namespaced per-customer, the storage is a per-customer KMS-encrypted volume, and each agent already runs its own tailscaled with a tag-bound identity. The remaining work is host-level orchestration of multiple compose projects, a credential broker for per-customer IAM, and a SNI-routing reverse proxy. Goal: drop per-customer cost from ~$37/mo to ~$8–12/mo for small customers without weakening the per-customer trust boundary. See the Compute tab for the full breakdown.
Every Agent customer gets a single, stable Customer UID (cust_<name>) that drives the brain key, Tailscale tag, SSM path, and subdomain. A customer can run one agent or many — all share the same tag, forming a per-customer Tailscale "VLAN". Cross-customer traffic is denied by ACL; cross-customer AWS access is denied by IAM and KMS. A new Agent customer deploys to a live {name}.m8trx.ai in ~5 minutes.
Platform Readiness — current state of one-time setup
- Hardened base AMI: built by Packer and referenced in base.tfvars.json; the M8trx Agent template uses the same AMI and bootstraps the agent stack at first boot (Path B — git+compose, no agent-specific AMI bake).
- Terraform state backend: m8trx-terraform-state (versioned, KMS-encrypted, public access blocked) + DynamoDB lock table terraform-locks in us-east-2. Backend configured in terraform/main.tf.
- Cloudflare DNS: m8trx.ai is live on Cloudflare. API token + zone ID stored in dashboard settings (or env: CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN). Subdomains created on deploy with proxied=false — Caddy on the customer instance terminates origin TLS via Let's Encrypt.
- Brain Telemetry & Tailscale: per-customer keys and settings are written pre-deploy to SSM under /m8trx/<name>/*. Destroy reverses all of it. See Brain & Tailscale tab.
- Customer shell access: SSM Session Manager only; sessions land as ubuntu (not root, not ssm-user).
- Operations dashboard: HTTP Basic Auth on every route except /api/health. DASHBOARD_PASS required at start (no default — process refuses to boot without it). PID lock prevents duplicate instances. cron watchdog auto-restarts every minute.
- Organization SCPs: policies/scp-guardrails.json is complete. Action: attach to the OU in AWS Organizations console — cannot verify application status from this dashboard.
Per-Customer Isolation
Dedicated VPC with public subnet, internet gateway, and VPC flow logs. Cost-optimized (no NAT/ALB). No cross-customer network path exists at the AWS layer.
Dedicated KMS CMK per customer with auto-rotation. EBS, S3, and Brain-telemetry SSM SecureStrings encrypted. One customer's key cannot decrypt another's data.
For the M8trx Agent template, the Paperclip stack runs as a Docker compose project (Caddy + Paperclip + Postgres + local Llama + bridge), each container non-root with seccomp, dropped caps, and resource limits. AppArmor on the host.
Each Agent customer joins a Tailscale tailnet under a tag-bound, ephemeral 90-day auth key, and gets a unique Brain customer key for telemetry. Cross-customer ACLs reject all traffic; the Brain reaches each instance only via its tag.
Network Architecture
Each customer is fully isolated at the AWS network layer. There are no VPC peering connections, no shared subnets, and no AWS-internal paths between customer VPCs. Operator and Brain reach each customer only over Tailscale (an encrypted overlay) using customer-specific tags. Each Agent customer is also reachable from the public internet on its own subdomain over HTTPS, with TLS terminated by Caddy on the instance.
VPC Design (Per Customer)
| Component | CIDR / Config | Purpose |
|---|---|---|
| VPC | 10.{octet}.0.0/16 (octet allocated stably from the customer-name hash — see the sketch below the table) | Isolated network per customer; the octet is not reused after deletion until it is freed |
| Public Subnet (x1) | /24 in AZ-a | EC2 instance with EIP — outbound for SSM, GHCR pulls, Brain, Tailscale; inbound only on the template-specific port(s) |
| Internet Gateway | Free, attached to VPC | Outbound internet (no NAT — saves ~$32/mo per customer) |
| Flow Logs | CloudWatch, 30-day retention | Full audit trail of all VPC traffic |
| VPC Peering | None | Operator and Brain reach customers over Tailscale, not VPC peering. (mgmt-proxy module retired 2026-04-21.) |
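How a name maps to a stable CIDR octet can be pictured with a small sketch. This is illustrative only — the deployer's actual allocator (hash, range, and collision handling) may differ:

# Illustrative: derive a stable second octet from the customer name.
NAME="acme"
OCTET=$(( 0x$(printf '%s' "$NAME" | sha256sum | cut -c1-4) % 241 + 10 ))   # keeps octet in 10–250
echo "customer ${NAME} -> 10.${OCTET}.0.0/16"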
Security Group Per Template
| Template | Ingress | Egress | Notes |
|---|---|---|---|
| Baseline (custom-app) | :8080/tcp from 0.0.0.0/0 | All | Single application port. Operator brings their own compose stack. |
| M8trx Agent | :80/tcp + :443/tcp from 0.0.0.0/0 | All | Caddy serves HTTPS on :443 with a Let's Encrypt cert. :80 stays open for ACME HTTP-01 challenges and direct-IP HTTP fallback (auto-redirect disabled). |
DNS & Subdomain Architecture
Every M8trx Agent instance gets its own HTTPS subdomain under the platform zone (m8trx.ai). Subdomains are created and torn down by the dashboard through the Cloudflare API (dashboard/services/domain.py) — DNS only (proxied=false). Cloudflare does not proxy traffic; TLS is terminated on the customer instance by Caddy using a Let's Encrypt cert obtained at first boot via HTTP-01. Endpoints: GET / POST / DELETE /api/platform/domains/{customer_name}.
When a customer has one agent the subdomain label defaults to the customer name (acme.m8trx.ai). When a customer has multiple agents, each instance is deployed with a distinct subdomain label so its DNS record points at its own EIP — the convention is {customer-name}-{role} (e.g. acme-inbox.m8trx.ai, acme-finance.m8trx.ai) or {customer-name}-{n}. The deploy form's Subdomain field is the operator's lever for this — it accepts any DNS-safe label, and the same label is used as the customer's display name on the welcome card.
{name}.m8trx.ai → Cloudflare DNS (A record, proxied=false)
→ Customer EIP :443
→ Caddy (origin TLS via Let's Encrypt)
→ Paperclip / Bridge container (basic-auth)
| Property | Detail |
|---|---|
| Domain cost | ~$10/yr for the platform root domain; subdomains free and unlimited |
| HTTPS | Caddy + Let's Encrypt on the customer instance — auto-renewed every ~60 days. No platform-side cert management. |
| Why proxied=false | Origin TLS gives end-to-end encryption to the instance and avoids Cloudflare-edge plan limits (cert renewal, payload size, websocket quirks). |
| On deploy | Dashboard upserts an A record after Terraform apply succeeds. auto_https disable_redirects + a customer-hostname block + basic_auth are appended to compose/Caddyfile on first boot. |
| On destroy | Dashboard removes the A record and rebuilds consolidated Cloudflare rules. |
| Settings | Cloudflare API token + zone ID + platform domain stored in dashboard settings (state.json), with env-var fallback (CLOUDFLARE_API_TOKEN, CLOUDFLARE_ZONE_ID, PLATFORM_DOMAIN). |
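For reference, the "On deploy" upsert performed by domain.py is equivalent to this raw Cloudflare API call. Sketch only — the hostname and EIP are placeholders, and the dashboard additionally handles update-vs-create and error reporting:

# Create the DNS-only A record for a new Agent customer.
# CF_API_TOKEN / CF_ZONE_ID come from dashboard settings; the IP is the Terraform-output EIP.
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "type": "A",
    "name": "acme.m8trx.ai",
    "content": "3.12.34.56",
    "ttl": 300,
    "proxied": false,
    "comment": "m8trx-managed:acme"
  }'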
Tailscale Overlay = Per-Customer "VLAN"
Operator and Brain traffic to customer instances does not traverse the public internet — every M8trx Agent instance joins a private Tailscale tailnet under a customer-specific tag (tag:m8trx-cust-<customer-uid>). All of one customer's agents share the same tag, so an ACL rule that says "src=tag, dst=tag:*" gives them a private network they can use to call each other. There is no equivalent rule across different customer tags, so cross-customer traffic is denied by default. Functionally, this is a per-customer VLAN built in software.
The dashboard mints a tag-bound, ephemeral, reusable, 90-day auth key per customer and writes it to SSM (the same key is reused across that customer's agents — that's why it's reusable). Each customer instance's user_data.sh joins the tailnet at first boot with that key, hostname set to the EC2 instance-id so the dashboard can find it for Tailnet Lock approval.
| Multi-agent scenario | What that means on the network |
|---|---|
| Customer "acme" has one agent | One tailscaled device, one tag, one subdomain. The accept rule on the tag is in place but unused. |
Customer "acme" adds a second agent (e.g. acme-finance) | Second instance joins the tailnet under the same tag. Operator picks a distinct subdomain (acme-finance.m8trx.ai). The two agents can now reach each other on Tailscale-internal IPs by hostname (e.g. http://i-0abc...) without any further config — the intra-tag accept rule allows it. |
| Customer "beta" deploys an agent | Different tag. Cannot see "acme" agents on Tailscale at all. No ACL rule connects the two tags. |
| Operator (you) | Tagged tag:mgmt, with explicit accept rules to all customer tags. Reaches every agent for monitoring and shell. |
| Property | Detail |
|---|---|
| Auth key | Tag-bound (only registers as tag:m8trx-cust-<id>), ephemeral (auto-revoked at expiry), reusable, 90-day TTL |
| SSH | --ssh=false on customer device (we use SSM Session Manager for shell access) |
| ACL | Per-tag tagOwners + intra-customer-only accept rule. No cross-customer rule exists, so Tailscale rejects traffic between different customer tags by default. |
| Approval | Tailnet Lock — dashboard polls Tailscale for the device by hostname (=instance-id) and approves it after Terraform apply succeeds. Manual approval still possible from Tailscale admin console as a fallback. |
| Decommission | Customer destroy: ephemeral keys auto-revoke at expiry; the device drops off the tailnet when the EC2 is gone (no manual revoke needed in normal flow). |
Compute Architecture
Two customer templates ship today. Both run on the same hardened Ubuntu 22.04 base AMI (Packer-baked) with identical OS controls. They differ in what runs above the OS and how the workload bootstraps. A planned third mode will pack multiple low-traffic customers onto a shared host without weakening per-customer isolation.
Two Customer Templates
| Aspect | Baseline (custom-app) | M8trx Agent |
|---|---|---|
| Terraform module | terraform/modules/ec2 | terraform/modules/ec2-m8trx-agent |
| tfvars map | customers | m8trx_agent_customers |
| Resource naming suffix | none — m8trx-{name}-* | -agent — m8trx-{name}-agent-* (avoids collisions) |
| Inbound port(s) | :8080 | :80 + :443 |
| TLS | Operator's responsibility | Caddy + Let's Encrypt on instance (per-customer subdomain) |
| Bootstrap | Minimal — operator supplies their own compose / dockerfile / app | Path B — vanilla AMI + first-boot git clone M8trxAgent + docker compose up |
| Brain telemetry / Tailscale | Not wired | Required (6 SSM SecureStrings written pre-deploy) |
| Sample customers | keithenterprises | agent2 |
M8trx Agent First-Boot Bootstrap (Path B)
The M8trx Agent template uses the same hardened AMI as the baseline — no agent-specific Packer build. The full stack is fetched and started at first boot via cloud-init. This means deploying a new Agent customer always picks up the latest M8trxAgent source (subject to a fixed --depth 1 clone of main); operator re-bakes are not in the critical path.
Set hostname to instance-id
So the dashboard's post-deploy Tailnet Lock approval poll can find the device: hostnamectl set-hostname $(IMDSv2 instance-id).
Install runtime deps (idempotent)
awscli, git, jq, curl, docker.io, docker-compose-plugin, tailscale. Wait for the Docker socket to be ready (up to 30s) before continuing.
Read 6 SSM SecureString params
brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password — all under /m8trx/<customer-name>/. The instance role can read only its own path; see Brain & Tailscale tab.
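Each read is a plain SSM GetParameter call; for example (path shown for a hypothetical customer acme):

# Read one of the per-customer SecureStrings (decrypted via alias/aws/ssm).
BRAIN_KEY=$(aws ssm get-parameter \
  --name "/m8trx/acme/brain-key" \
  --with-decryption \
  --query 'Parameter.Value' --output text)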
Join the tailnet
tailscale up --auth-key=<ssm> --hostname=<instance-id> --ssh=false --accept-routes=false --accept-dns=false --reset --timeout=120s.
Write /etc/m8trx/brain.env (mode 0600)
Three lines: BRAIN_URL, BRAIN_API_KEY, BRAIN_CUSTOMER_ID. umask 077 in a subshell prevents the brief 0644 race that an after-the-fact chmod would leave.
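The umask trick looks like this (sketch; variable names are illustrative):

# Create brain.env with 0600 permissions from the moment it exists —
# no brief world-readable window that an after-the-fact chmod would leave.
(
  umask 077
  cat > /etc/m8trx/brain.env <<EOF
BRAIN_URL=${BRAIN_URL}
BRAIN_API_KEY=${BRAIN_API_KEY}
BRAIN_CUSTOMER_ID=${BRAIN_CUSTOMER_ID}
EOF
)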
Clone the agent repo, then strip the PAT
git clone --depth 1 https://oauth2:<PAT>@github.com/M8trxInfra/M8trxAgent.git /opt/m8trx-agent, then git remote set-url origin https://github.com/M8trxInfra/M8trxAgent.git so the PAT is gone from .git/config. unset GH_PAT clears the env var.
Generate compose/.env (mode 0700)
Random Postgres credentials, random JWT secret, allowed-hostnames list (localhost + EIP + customer hostname). The agent UI password is bcrypt-hashed by caddy hash-password; every literal $ is doubled so docker-compose interpolation survives.
Patch the Caddyfile and compose for hostname-aware TLS
Prepend { auto_https disable_redirects } (so direct-IP HTTP keeps working alongside the hostname block), append a {name}.{platform-domain} { basic_auth … reverse_proxy … } block, and add "443:443" to the paperclip service ports. Idempotent: skips if already applied.
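The resulting Caddyfile edits look roughly like this. Sketch only — the real bootstrap also guards for idempotency, and the reverse_proxy upstream shown here (paperclip:3100) is an assumption; the shipped block may target the bridge instead:

# 1. Prepend the global option (a Caddy global options block must be first in the file).
printf '{\n\tauto_https disable_redirects\n}\n\n' | cat - compose/Caddyfile > /tmp/Caddyfile \
  && mv /tmp/Caddyfile compose/Caddyfile

# 2. Append the customer-hostname HTTPS block.
#    HASH = output of `caddy hash-password` (stored in compose/.env with every $ doubled
#    so docker-compose interpolation keeps it intact).
cat >> compose/Caddyfile <<EOF

acme.m8trx.ai {
    basic_auth {
        admin ${HASH}
    }
    reverse_proxy paperclip:3100
}
EOF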
Bring up the stack
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --pull always. The .prod.yml override pulls images from GHCR — no local builds in production. First boot ends when caddy obtains its certificate.
Cloud-init output is tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging. Total user-data is kept under the AWS 16 KB user-data hard limit — anything bigger goes into the cloned repo, not user-data.
Audit Rules (auditd) — Immutable, both templates
| What's Monitored | Rule Key |
|---|---|
| Authentication config changes (/etc/pam.d, shadow, passwd) | auth_config |
| Sudo usage and sudoers changes | sudoers |
| SSH config modifications | sshd_config |
| File deletions by non-system users | file_deletion |
| Privilege escalation (execve as root) | privilege_escalation |
| Agent data directory access | agent_data_access |
| Kernel module loading | kernel_modules |
Planned: Multi-Tenant Compute (foundations in place, host-level orchestration pending)
Today, every customer is on their own EC2 — strong isolation, but ~$37/mo even when a customer is sending two emails a week (~$30 of that is the t3.medium itself). The planned multi-tenant compute mode packs multiple low-traffic customers onto a single hardened EC2 host, while keeping the per-customer trust boundary intact. Goal: ~$8–12/mo per small customer at the per-customer slice level, with the same external behaviour (separate subdomain, separate Brain customer key, separate Tailscale tag, separate KMS volume).
This is not a redesign — it's an optimization on top of what Path B already gives us. The agent stack is already a Docker compose project, the project name is already namespaced per customer, the storage is already a per-customer KMS-encrypted EBS volume, and each agent already runs its own tailscaled with a tag-bound identity. Most of the per-customer trust boundary is therefore already in place at the compose level; the missing pieces are at the host level.
| Concern | Status today (single-tenant per host) | Multi-tenant change |
|---|---|---|
| Compute | Done — One docker compose project per customer, namespaced m8trx-{name}_*. Each stack runs as non-root with seccomp + dropped caps + per-service resource limits. (The agent template already bakes Docker into first-boot — Path B.) | Pending — Run multiple compose projects on one host. User-namespace remap to give each customer its own UID/GID range. Per-project cgroup limits. |
| Network | Done — Per-customer compose creates its own Docker network. Caddy on the instance terminates TLS for one customer subdomain. | Pending — Host-level SNI-routing reverse proxy in front of the per-project Caddys (or per-project ENI with separate EIP per customer). Each subdomain still terminates with that customer's own Caddy + LE cert. |
| Storage | Done — Per-customer KMS-encrypted EBS volume. Per-customer S3 bucket with TLS + KMS enforced. | Pending — Multiple per-customer EBS volumes attached to the same host, each mount-namespaced into one compose project. A leaked container can't read another customer's mount because the KMS grant is on a different key. |
| IAM | Partial — Today the EC2 instance role is the customer role (1:1). Multi-tenant changes that. | Pending — Host runs a credential broker bound to localhost. Each compose project authenticates with a short-lived bearer (mounted from a per-project secret); broker exchanges it for STS creds on that customer's IAM role. Container iptables blocks IMDS as today. |
| Tailscale identity | Done — Per-customer tag-bound auth key, per-customer tagOwner, intra-customer accept rule. The reusable flag means one key onboards multiple agents under the same tag. | Pending — Per-customer tailscaled in its own network namespace on the shared host (instead of the host-level tailscaled used today). Each compose project sees only its own customer's tag. |
| Brain identity | Done — Per-customer UID + bearer minted by the brain. The customer's bearer token never leaves that customer's compose project. | No change — same flow, same SSM scope per customer. |
| Blast radius | EC2 compromise affects one customer. | Host-kernel compromise affects every customer on that host (same as any multi-tenant K8s node). Critical / regulated customers stay on dedicated EC2 — opt-in per customer at deploy. |
| Migration | — | Existing customers stay on dedicated EC2; new low-traffic customers can opt in. Dashboard exposes the choice on the deploy form. |
Status: most per-customer guarantees are already enforced at the compose / KMS / Tailscale layer — Path B got us there. The remaining work is host-level: a multi-project compose orchestrator, a credential broker, an SNI-routing reverse proxy, and a deploy-form opt-in. No new Terraform module yet.
IAM & Access Control
Each customer instance assumes a unique IAM role with a permission boundary that hard-caps maximum privileges.
Organization SCPs
StopLogging, DeleteTrail, UpdateTrail
DeleteDetector, DisassociateFromMaster
StopConfigurationRecorder, DeleteConfigurationRecorder
All actions blocked for root principal
RunInstances denied unless HttpTokens=required
RunInstances denied if ec2:Encrypted=false
PutBucketPublicAccessBlock/Policy/ACL blocked
EC2/RDS/Lambda restricted to us-east-1, us-west-2
Brain Telemetry — additional inline scope
Each M8trx Agent customer's instance role carries one extra inline policy, brain_telemetry_ssm_read, scoped to only that customer's SSM path. The permission boundary explicitly permits ssm:GetParameter / ssm:GetParameters so the inline policy isn't blocked at the boundary. Full detail with example policy JSON: Brain & Tailscale tab.
| Resource ARN | Allowed actions | Why scoped this way |
|---|---|---|
| arn:aws:ssm:*:*:parameter/m8trx/<own-customer-name>/* | ssm:GetParameter, ssm:GetParameters | A leaked customer instance role cannot read another customer's brain key, Tailscale auth key, or agent UI password — pivot fails at the IAM layer, not just at runtime. |
Brain Telemetry & Tailscale Mesh
A control plane that connects every M8trx Agent customer instance to a central Brain service over a Tailscale tailnet. The Brain stores per-customer metadata (events, agents, API keys) and is what the operator dashboard reads from when monitoring fleet posture. The mesh is the only network path the Brain ever uses to reach a customer; there is no public-internet path from Brain to customer.
What is the Brain? A separate, internally-hosted service (different repo, different deploy) that holds the customer registry, per-customer API keys, and per-customer event/agent telemetry. The deployer-side dashboard talks to the Brain over HTTP+bearer (the brain_admin_token), and customer instances talk to the Brain over their per-customer bearer (brain-key). This page documents the integration points only — the Brain's internals (storage, indexing, alerting) live in the Brain repo's own docs.
The Customer UID — one identifier ties everything together
Every Agent customer has a single, stable, public identifier — the Customer UID — derived from the deployer-side customer name on first deploy and never reused. The UID drives every other per-customer artifact in the platform; getting it right matters because it is the only string we use to scope cross-system access.
| Artifact | Shape | Example (for customer name acme) | Where it's enforced |
|---|---|---|---|
| Customer UID (public id) | cust_<name> with -→_ | cust_acme | Brain customer table |
| Customer bearer token (private) | Random string, returned by Brain on mint | br_xxx… | Brain authenticates every event with it |
| Tailscale tag | tag:m8trx-cust-<uid> | tag:m8trx-cust-acme | Tailscale ACL tagOwners + intra-customer accept rule |
| SSM namespace | /m8trx/<name>/* | /m8trx/acme/* | Per-customer IAM brain_telemetry_ssm_read scope |
| Cloudflare subdomain | {label}.{platform-domain} | acme.m8trx.ai or acme-finance.m8trx.ai (per agent when multiple) | Cloudflare DNS A record + Caddy hostname block |
| Terraform tags | Customer = <name>-agent | Customer = acme-agent | tfvars + every AWS resource tag |
A customer can run multiple agents (e.g. an inbox agent and a finance agent). All of that customer's agents share the same UID, the same Tailscale tag, the same brain customer entry, and the same SSM namespace — but get distinct subdomains and distinct EC2 instances. The shared tag is what gives them a private network to talk to each other on (the per-customer "VLAN" — see Network tab).
Status: the infra layer (Tailscale tagging, brain mint idempotency, SSM scoping, Cloudflare provisioning) is already ready for multiple agents under one UID — the auth key is reusable, the tag's accept rule has no device-count limit, and the SSM IAM scope wildcards over the namespace. The piece that is not yet wired is the deploy form: today it treats each new deploy as a new customer (1 deployer-name = 1 brain mint = 1 EC2). The next step is a "Tenant" dropdown that lets the operator add a new agent under an existing customer UID — re-using the brain customer, the Tailscale tag, and the SSM namespace, while creating a fresh subdomain and EC2.
SSM Parameter Layout (per customer)
All parameters live under /m8trx/<deployer-customer-name>/* (the deployer name, not the brain id — so user-data needs only the one name terraform already templates in). Four are SecureString (KMS-encrypted via alias/aws/ssm); only brain-customer-id and brain-url are stored in the clear.
| Parameter | Type | Source | Used by |
|---|---|---|---|
| brain-key | SecureString | Brain mint response (pre-deploy) | Paperclip → Brain bearer |
| brain-customer-id | String | Derived: cust_ + name with -→_ | Paperclip event payloads |
| brain-url | String | Settings (one value, fleet-wide) | Paperclip → Brain endpoint |
| tailscale-auth-key | SecureString | Tailscale mint response (pre-deploy) | tailscaled at first boot |
| agent-fetch-pat | SecureString | Settings (one value, fleet-wide; read-only on M8trxAgent) | git clone at first boot — stripped from .git/config immediately |
| agent-ui-password | SecureString | Settings default; rotatable per-customer | Caddy basic-auth (bcrypt hashed at first boot via caddy hash-password) |
Pre-Deploy Lifecycle (provision)
Order matters. Each step streams progress to the dashboard's job log via on_output. If any step fails, the deploy aborts before terraform apply touches AWS — so a half-provisioned customer never enters the brain. Source: dashboard/services/brain_telemetry.py.
Mint a Brain customer key
POST {brain_url}/admin/customers with bearer = brain_admin_token, body { customer_id: cust_acme, name: acme }. Returns { api_key, key_id }. A 409 Conflict means the brain already has this id — abort, since reusing it would let the new instance read the previous instance's events.
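The mint is a single authenticated POST. Sketch only — the Brain is an internal service, so treat the path and payload below as the shapes this page describes rather than a public API:

# Mint the per-customer Brain key; a 409 here aborts the whole deploy.
curl -s -X POST "${BRAIN_URL}/admin/customers" \
  -H "Authorization: Bearer ${BRAIN_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"customer_id": "cust_acme", "name": "acme"}'
# -> {"api_key": "br_…", "key_id": "…"}   (api_key becomes the customer's brain-key SSM param)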
Upsert Tailscale ACL entries (must precede key mint)
Tailscale rejects an auth key whose tag isn't already declared in tagOwners. GET the ACL with the If-Match ETag, add tag:m8trx-cust-<id> → autogroup:admin to tagOwners if missing, and add an intra-customer accept rule (src=tag, dst=tag:*) if missing. Single POST back. Idempotent — no write if no change.
Mint a tag-bound, ephemeral, reusable, 90-day Tailscale auth key
POST /api/v2/tailnet/{tailnet}/keys with capabilities devices.create.tags=[tag:m8trx-cust-<id>], ephemeral=true, reusable=true, preauthorized=false (Tailnet Lock approval is a separate post-apply step), expirySeconds=90×86400. Returns the bearer auth key.
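Steps 2 and 3 map onto the Tailscale v2 API roughly as follows. Sketch only — the tailnet name and tag are examples, and the real client also merges the tagOwners entry and intra-customer accept rule into the fetched policy before POSTing it back:

TAILNET="M8trxInfra.github"
TAG="tag:m8trx-cust-acme"

# Step 2 — fetch the ACL (the ETag enables optimistic locking), edit, POST back with If-Match.
curl -s -D headers.txt -u "${TAILSCALE_API_TOKEN}:" \
  -H "Accept: application/json" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl" > acl.json
ETAG=$(grep -i '^etag:' headers.txt | cut -d' ' -f2 | tr -d '\r')
# ... add "${TAG}": ["autogroup:admin"] to tagOwners and the intra-customer accept rule ...
curl -s -X POST -u "${TAILSCALE_API_TOKEN}:" \
  -H "If-Match: ${ETAG}" -H "Content-Type: application/json" \
  --data @acl.json \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl"

# Step 3 — mint the tag-bound, ephemeral, reusable, 90-day auth key.
curl -s -X POST -u "${TAILSCALE_API_TOKEN}:" \
  -H "Content-Type: application/json" \
  --data "{\"capabilities\": {\"devices\": {\"create\": {
            \"reusable\": true, \"ephemeral\": true, \"preauthorized\": false,
            \"tags\": [\"${TAG}\"]}}},
          \"expirySeconds\": $((90*86400))}" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/keys"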
Write 6 SSM parameters
All under /m8trx/<name>/. Done last — if Tailscale failed in step 2 or 3, no SSM is touched, so a re-run starts clean. agent_fetch_pat is the same value across all customers but copied per-customer so it fits the existing per-customer IAM read scope; no fleet-level IAM change needed.
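Each write is a plain PutParameter; for example (customer name and value are placeholders):

# Write one of the six per-customer parameters (SecureString = KMS-encrypted via alias/aws/ssm).
aws ssm put-parameter \
  --name "/m8trx/acme/tailscale-auth-key" \
  --type SecureString \
  --value "${TS_AUTH_KEY}" \
  --overwrite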
Post-Apply: Tailnet Lock approval
After terraform apply succeeds, the dashboard polls Tailscale for a device with hostname == <instance-id>. When it appears (cloud-init has finished step 4 of bootstrap), the dashboard calls POST /api/v2/device/{device_id}/key with {"keyExpiryDisabled": false} to approve it under Tailnet Lock. This step is best-effort: if cloud-init is slow, the operator can still approve manually from the Tailscale admin console; the deploy job logs a warning rather than failing.
Destroy Lifecycle
| Step | Action | Notes |
|---|---|---|
| 1 | Pre-destroy: confirm termination protection is OFF on the EC2 | Refuses to start if on — operator must explicitly disable, to avoid leaving SSM/IAM/SG cleanup half-done while the instance survives |
| 2 | Run terraform destroy | Deletes EC2, SG, IAM, KMS (30-day window), VPC, S3, etc. Cloudflare A record removed by dashboard separately |
| 3 | Delete the 6 SSM parameters under /m8trx/<name>/* | Idempotent — missing params are skipped |
| 4 | DELETE {brain_url}/admin/customers/<brain_customer_id> | Cascades on the brain side: events, agents, api_keys, customer row |
| 5 | Tailscale auth key auto-revokes | Ephemeral + 90-day expiry — no explicit revoke API call needed in the normal path |
Settings UI (one-time, fleet-wide)
All fleet-wide brain-telemetry settings are entered once via the dashboard's topbar settings cog (📡 modal). GET endpoints return masked values (last 4 chars + bullets); POST writes them to state.json.
| Setting | What it is | Used during |
|---|---|---|
| tailscale_api_token | OAuth client / API token with ACL + key + device-approval scopes | provision + destroy |
| tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github) | provision + destroy |
| brain_url | Base URL of the brain server (Tailscale-internal preferred) | provision + destroy + customer-instance runtime |
| brain_admin_token | Brain admin bearer — only used by the dashboard, never written to a customer instance | provision + destroy |
| agent_fetch_pat | GitHub PAT with read-only scope on M8trxInfra/M8trxAgent | provision (copied to per-customer SSM) |
| agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers (rotatable per-customer afterwards) | provision (copied to per-customer SSM) |
IAM Scope (instance-side)
Each customer's EC2 instance role has a scoped brain_telemetry_ssm_read inline policy. The role can read only its own customer's path — a leaked instance role cannot pivot to another customer's brain key, Tailscale auth key, or PAT.
resource "aws_iam_role_policy" "brain_telemetry_ssm_read" {
policy = jsonencode({
Statement = [{
Sid = "ReadOwnBrainTelemetryParams"
Effect = "Allow"
Action = ["ssm:GetParameter", "ssm:GetParameters"]
Resource = "arn:aws:ssm:*:*:parameter/m8trx/${local.ssm_customer_name}/*"
}]
})
}
The permission boundary on each instance role also explicitly permits ssm:GetParameter + ssm:GetParameters — without that, the policy above would still be denied at the boundary. local.ssm_customer_name strips any -agent resource-naming suffix back to the deployer name, so SSM paths and tfvars keys stay aligned.
Brain-Side Monitoring (what the team sees)
Once an Agent customer is online, Paperclip POSTs events (login, task created, task settled, finance/leads activity, errors) to the brain over the customer's bearer. The brain rolls these into per-customer dashboards (covered in the brain repo's own docs). Things the team monitors at the platform layer:
- Customer presence on the tailnet — every Agent customer should appear as a single device tagged tag:m8trx-cust-<id>. A missing or duplicate device is a deploy/health red flag.
- Last-event recency per customer — gap > 24h on a paid customer is worth investigating (instance down, brain unreachable, paperclip crash loop).
- Auth-key expiry pressure — keys are 90-day ephemeral; the team rotates ahead of expiry by re-deploying or rotating manually. Operator dashboard has a planned "key TTL" column on the customer list.
- Cross-customer ACL drift — if anyone manually edits the Tailscale ACL outside of tailscale_client.ensure_customer_acl_entries and adds a cross-tag accept rule, that breaks the per-customer trust boundary. Periodic ACL diff against the expected shape is the planned defense (a sketch follows below).
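A minimal version of that periodic drift check, assuming jq over the Tailscale ACL API (the script itself is illustrative, not shipped code):

# Flag any ACL rule that touches more than one distinct customer tag.
TAILNET="M8trxInfra.github"
curl -s -u "${TAILSCALE_API_TOKEN}:" \
  -H "Accept: application/json" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/acl" |
jq -r '.acls[]
  | [.src[]?, .dst[]?]
  | map(select(startswith("tag:m8trx-cust-")))
  | map(sub(":[0-9*]+$"; ""))           # strip :port suffixes on dst entries
  | unique
  | select(length > 1)
  | "cross-customer rule touches: \(join(", "))"'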
Failure Modes & Operator Actions
| What goes wrong | Symptom | Fix |
|---|---|---|
| Brain mint returns 409 | Deploy aborts with "Brain already has a customer with id…" | Pick a different deployer name, or revoke the orphaned brain customer first via the brain admin UI |
| Tailscale ACL upsert 412 (precondition failed) | Race against another concurrent dashboard run | Retry — uses ETag-based optimistic locking, second attempt usually wins |
| Tailnet Lock approval times out post-apply | Warning in deploy log, customer instance can't reach brain yet | Approve manually in Tailscale admin console; cloud-init may still be running |
| SSM put fails after Tailscale key minted | Half-provisioned state — auth key exists, SSM is empty | Re-run deploy: brain mint will 409 (so abort and fall through to manual re-provision), or revoke the unused auth key from Tailscale and start over |
| Operator rotates brain_admin_token | All future provisions and destroys use the new token | No customer-instance impact — instances hold their own per-customer brain-key, not the admin token |
Data Protection
All data encrypted at rest with per-customer KMS keys and in transit with TLS.
AI Agent Sandboxing
AI agents handle sensitive customer data — email, financial documents, chats, business records. They run inside a Docker compose project with multiple independent restriction layers, so a single failure (a vulnerable Python dep, a leaked secret, a user-data exploit) does not give code-execution on the host or read access to another customer. The mechanisms below apply to both templates; the M8trx Agent template adds Caddy + Postgres + Ollama as additional compose services bound to the same protections.
Per-Service Notes
| Service | Internal Port | Exposed? | Talks to |
|---|---|---|---|
| caddy | :80, :443 | Yes (host SG) | Internet (LE ACME challenge), m8trx-bridge:3200, paperclip:3100 |
| paperclip | :3100 | No (only via Caddy) | postgres, Anthropic API (per-customer key), Brain (per-customer bearer) |
| m8trx-bridge | :3200 | No (only via Caddy) | paperclip, Brain |
| postgres | :5432 | No (compose-internal) | Only the paperclip service in the same project |
| ollama | :11434 | No (compose-internal) | Local LLM inference fallback when Claude is unreachable |
Management Access
Two distinct access paths for the operator team: (1) reaching the operations dashboard over Tailscale, and (2) reaching individual customer instances for shell or telemetry. Neither path uses a dedicated management proxy any more — the mgmt-proxy Terraform module was retired on 2026-04-21 because Tailscale-on-the-dashboard-host plus AWS SSM Session Manager cover both paths at zero extra cost.
Option Comparison
| Option | Reaches | Cost | Admin Needs | Add / Remove Admin | Best For |
|---|---|---|---|---|---|
| A. Tailscale | Dashboard | $0 (free 100 devices) | Tailscale app + invite to tailnet | Approve / delete device — instant | Daily-driver for the whole team |
| B. SSM Port-Forward | Dashboard | $0 | AWS CLI + SSM plugin + IAM user/role with ssm:StartSession + MFA | Disable IAM user / role | Break-glass when Tailscale is unavailable |
| C. SSM Session Manager | Customer instance shell | $0 | Triggered from dashboard "Shell Access" panel, or AWS CLI | Same — IAM-controlled | Forensics, debugging, ad-hoc commands |
Setup — Tailscale on the Dashboard Host
Install Tailscale on the dashboard EC2 + your laptop
# On the dashboard host (one-time):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorise
# Then on your laptop, install Tailscale and sign in to the same tailnet.
Bind the dashboard to its Tailscale IP
export DASHBOARD_HOST=$(tailscale ip -4 | head -1) # e.g. 100.98.195.91
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>' # required, no default
cd dashboard && python main.py
A cron watchdog re-runs this command if the process dies — see scripts/cron/watchdog.sh. PID lock at dashboard/dashboard.pid prevents duplicates.
Add more admins
Each admin installs Tailscale and joins the tailnet. You approve their device in the Tailscale admin console; they can immediately reach the dashboard at https://<tailscale-ip>:8443 and sign in with the dashboard's HTTP Basic credentials. Revocation: delete their device — instant.
Setup — SSM Port-Forward (break-glass for the dashboard)
# Install SSM plugin: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
aws ssm start-session \
--target i-DASHBOARD_INSTANCE_ID \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["8443"],"localPortNumber":["8443"]}'
# Open https://localhost:8443 in your browser.
Setup — SSM Session Manager (customer-instance shell)
Triggered from the dashboard's per-customer detail view. The dashboard runs ssm:SendCommand with an inline AWS-RunShellScript document, polling every 3s for output (max 300s). Sessions land as the ubuntu user (not root, not ssm-user) — so paths and ownership match what the customer's compose stack expects. Commands are validated against a blocklist (rm -rf /, mkfs, shutdown, pipes-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
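Under the hood, the panel's commands are ordinary SSM RunCommand invocations; a rough CLI equivalent of one round-trip (instance id is a placeholder):

# Send a command and fetch its output — roughly what the dashboard polls for every 3s.
CMD_ID=$(aws ssm send-command \
  --instance-ids i-0123456789abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker compose ps"]' \
  --query 'Command.CommandId' --output text)

aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id i-0123456789abcdef0 \
  --query 'StandardOutputContent' --output text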
# Or from the AWS CLI directly:
aws ssm start-session --target i-CUSTOMER_INSTANCE_ID
# lands as ubuntu, full audit in CloudTrail
Security Properties
| Property | Detail |
|---|---|
| Dashboard exposure | Binds to its Tailscale IP only — not accessible on the public IP, not on AWS-internal IPs from outside the tailnet |
| Network auth | Tailscale: device approval + Tailnet Lock; SSM: IAM + MFA, audited in CloudTrail |
| Application auth | HTTP Basic Auth middleware on every route except /api/health |
| Security headers | X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Cache-Control: no-store on all responses |
| Password policy | DASHBOARD_PASS env var required — no default. Dashboard refuses to start without it. |
| PID lock | dashboard.pid prevents duplicate instances from running simultaneously |
| Audit trail | Tailscale admin console (devices + sessions); CloudTrail (every SSM session); dashboard job log (every dashboard action) |
| Customer-instance access | SSM Session Manager only — no SSH, no Tailscale-SSH (--ssh=false). Sessions land as ubuntu. |
| Encryption | WireGuard (Tailscale); TLS (SSM); TLS (Caddy + LE on customer instances) |
Monitoring & Detection
Multi-layered monitoring with automated alerting across all customer environments.
Incident Response
One-command quarantine triggered from the dashboard UI or CLI.
Security Controls Matrix
Every threat has at least 3 independent control layers. If one fails, the others still protect.
| Threat | Layer 1 | Layer 2 | Layer 3 |
|---|---|---|---|
| SSRF credential theft | IMDSv2 required | Hop limit = 1 | iptables block 169.254.x.x |
| Lateral movement | Separate VPCs | Per-customer IAM | Per-customer KMS |
| Agent breakout | Docker + seccomp | AppArmor profile | Non-root + read-only FS |
| Data exfiltration | Network rate limit | VPC flow logs | S3 TLS + KMS enforced |
| Credential compromise | No static keys | Permission boundaries | Unauthorized API alerts |
| Crypto mining | GuardDuty | CPU alarm > 90% | Container CPU limits |
| Security tampering | SCPs deny disable | IAM deny in boundary | CloudTrail validation + GuardDuty |
| Unauthorized dashboard | Tailscale-only binding | Auth middleware (all routes) | Security headers + PID lock |
| Shell injection (dashboard) | shell=False (subprocess) | Command blocklist on /exec | Input regex validation |
| Credential leakage | No default password | Passwords never logged | /api/health strips AWS info |
| Audit log tampering | auditd immutable (-e 2) | Shipped to CloudWatch | CloudTrail validation |
| Cross-customer telemetry leak | Per-customer Brain key | Tailscale tag-only ACL | SSM IAM scoped to /m8trx/<own>/* |
| Stale agent UI password | Per-customer Caddy basic-auth | Rotatable via dashboard (SSM SendCommand) | Plaintext base64 in transit only |
Per-User Access Control
Each customer instance supports multiple users with role-based access, per-user data isolation, granular permissions, and full audit logging. This is the application layer running inside the Docker sandbox — distinct from the deployer-side dashboard auth (HTTP Basic Auth) and from the per-customer Caddy basic-auth that gates the agent UI. The model below applies to the baseline (custom-app) template; the M8trx Agent template uses Paperclip's own user/permission model — see Paperclip docs.
Roles & Permissions Matrix
| Capability | admin | user | readonly |
|---|---|---|---|
| View own files | Yes | Yes | Yes |
| Upload / write files | Yes | Yes | No |
| Delete own files | Yes | Yes | No |
| View other users' files | Yes | No | No |
| Create / manage users | Yes | No | No |
| Reset user passwords | Yes | No | No |
| View full audit log | Yes | Own events only | Own events only |
| Use agent features | Yes | Per permission | No |
Granular Permissions
Access email processing features and the emails/ data directory. Required for agents that handle customer email.
Access financial document processing and the financial/ directory. Enables stricter audit logging on these operations.
Access chat/messaging features and the chats/ data directory.
Access general document storage and the documents/ data directory.
Access web portal features for the customer's public-facing services.
API Endpoints
| Endpoint | Method | Who | What it does |
|---|---|---|---|
| /auth/login | POST | Anyone | Authenticate, receive JWT token |
| /auth/me | GET | Authenticated | View own profile and permissions |
| /auth/change-password | POST | Authenticated | Change own password (min 12 chars) |
| /users | GET/POST | Admin | List all users / create new user |
| /users/{id} | GET/PATCH/DELETE | Admin | View / update / disable user |
| /users/{id}/reset-password | POST | Admin | Reset user password (returns temp pw) |
| /data/{category} | GET/POST | User+ | List own files / upload file |
| /data/{category}/{file} | GET/DELETE | User+ | Download / delete own file |
| /data/admin/{user}/{cat} | GET | Admin | List any user's files |
| /data/admin/{user}/{cat}/{file} | GET | Admin | Read any user's file |
| /audit | GET | Authenticated | View audit log (scoped to role) |
| /health | GET | Anyone | Container health check |
Security Controls
| Control | Implementation |
|---|---|
| Password hashing | PBKDF2-SHA256 with 100,000 iterations + random salt |
| Token signing | HMAC-SHA256, secret from AWS Secrets Manager |
| Token expiry | 30 minutes — forces re-authentication |
| Path traversal | Filename stripped to basename, resolved path checked against data root |
| File size | 50 MB maximum per upload |
| User deletion | Soft-delete only — account disabled, data preserved for audit |
| Default admin | Created on first boot with temporary password printed to logs |
| Audit trail | Append-only JSONL, shipped to CloudWatch, tamper-evident |
Repository Structure
Per-Customer Cost Breakdown (one customer, t3.medium, audited 2026-05-05)
Defaults from terraform/environments/prod/*.tfvars.json: instance_type=t3.medium, volume_size_gb=30. Retention from terraform/modules/vpc/main.tf (flow logs: 30d) and terraform/modules/ec2/user_data.sh (syslog 90d, auth 365d, audit 365d, agent 90d).
| Item | Baseline | M8trx Agent | Notes |
|---|---|---|---|
| EC2 (t3.medium, on-demand, us-east-2) | ~$30.00 | ~$30.00 | Same instance class for both templates |
| EBS root (30 GB gp3) | ~$2.40 | ~$2.40 | $0.08/GB/mo gp3 baseline; encrypted with per-customer KMS CMK |
| EBS snapshots (DLM, daily 03:00, 14-day retain) | ~$2.00 | ~$2.00 | $0.05/GB-month; only the delta of each snapshot is billed after the first |
| KMS CMK (1 per customer, auto-rotate) | ~$1.00 | ~$1.00 | $1/key/month + tiny API costs |
| CloudWatch Logs ingestion + storage | ~$2.00 | ~$0.50 | Baseline ships VPC flow (30d) + syslog (90d) + auth (365d) + audit (365d) + agent (90d) via the CW agent. Agent template ships VPC flow only — Path B's user_data does not install the CW agent (deliberate, keeps user_data under the 16 KB limit). Dashboard fills the gap for memory/disk via SSM RunCommand. |
| CloudWatch alarms (per-customer high-CPU) | ~$0.10 | ~$0.10 | $0.10 per standard-resolution alarm |
| GuardDuty marginal (per-customer flow log + CT events) | ~$0.50 | ~$0.50 | Detector is account-wide but ingestion volume is per-customer |
| Data transfer (egress, IGW use) | ~$0.30 | ~$0.50 | Outbound EC2 → internet at $0.09/GB. Brain telemetry rides Tailscale and is billed once. Agent template has slightly more egress (GHCR pulls, LE renewals, Anthropic). |
| Total per agent instance | ~$38 / mo | ~$37 / mo | A multi-agent customer pays per agent (2 agents ≈ $74–$76/mo). Heavier traffic = more flow logs + more egress; expect +$2–$5 for a busy customer. |
Shared platform cost: ~$3–6 / mo. SNS topic (free tier covers low-volume), Terraform state S3 bucket + DynamoDB lock (~$1/mo combined), CloudTrail (first trail free), AWS Config recorder ($0.003 per recorded item — adds up at scale, but trivial for the small fleet). The mgmt-proxy ($7/mo) was removed on 2026-04-21; Tailscale + SSM cover the same access surface at $0. Tailscale is on the free 100-device tier; the Brain server runs on existing infrastructure.
Cost Audit Findings (2026-05-05)
- Flow-log retention drift. 3 of 5 customer flow-log groups in production have retentionInDays = None (never expire) — m8trx-cleantest4-agent, m8trx-dustin-bot, m8trx-smoke-test-2. The VPC module sets retention_in_days = 30, so these are likely log groups created by VPC Flow Logs auto-provisioning before the Terraform-managed group existed. Action: set retention manually with aws logs put-retention-policy --log-group-name <name> --retention-in-days 30 for the three drifted groups, then audit the VPC module to ensure the log group is always created by Terraform (not auto-provisioned by AWS) so retention is enforced from day one.
- CW agent custom metrics are empty in practice. The baseline template's user_data installs amazon-cloudwatch-agent and configures cpu/mem/disk/net publishing under namespace m8trx/<customer>, but aws cloudwatch list-metrics returns empty for both m8trx/keithenterprises and m8trx/agent2. The user_data has || echo "Warning" fallbacks so failures don't abort boot — they just hide. Action: SSM into the baseline instance and run amazon-cloudwatch-agent-ctl -a status to see whether the agent is running and authorized to push metrics. If the install is failing silently, decide: fix it, or remove the metrics block from user_data and lean on the dashboard's SSM fallback for memory/disk, like the Agent template already does.
- Path B's missing CW agent is intentional. Not a finding — surfaced here for clarity. The Agent template's user_data is deliberately minimal (the 16 KB AWS user-data hard limit drove a slim-down in 94a1d04). The dashboard already handles this — Phase 3 → Monitoring notes that memory/disk are fetched on demand via SSM RunCommand when CW agent data is unavailable.
Deployment Guide
Complete step-by-step instructions for initial platform setup and deploying new customer instances. Follow these in order.
Phase 1 — One-Time Platform Setup
These steps are done once to set up the platform infrastructure. After this, deploying a new customer is a single dashboard click that takes ~5 minutes.
Build the hardened base AMI (Packer)
The hardened Ubuntu 22.04 AMI is built by Packer and used by both customer templates — the M8trx Agent template no longer bakes its own AMI (Path B installs the agent stack at first boot via git+compose).
cd packer
./build-ami.sh # runs `packer build` and prints the new AMI ID
What the build does: kernel parameter lockdown (ASLR, ptrace, SYN flood protection), auditd with immutable rules, fail2ban, UFW firewall, AppArmor profiles, SSH locked down (no password auth, no root login), unattended security updates, AIDE file integrity, core dumps disabled, unnecessary packages removed, cron restricted to root. Note the AMI ID — you'll set it in tfvars in step 3.
Terraform state backend (already configured)
The repository ships with the S3 + DynamoDB backend wired up. State lives in m8trx-terraform-state (versioned, KMS-encrypted, public-access blocked) with locking in DynamoDB table terraform-locks in us-east-2. No action needed unless you're running this in a fresh AWS account, in which case create the bucket + table once and update terraform/main.tf.
Configure Terraform tfvars
Edit terraform/environments/prod/base.tfvars.json with the platform-wide values. Per-customer maps (customers, m8trx_agent_customers) live in their own tfvars files and are managed by the dashboard — no manual editing on customer add/remove.
{
"aws_region": "us-east-2",
"project_name": "m8trx",
"admin_email": "ops@m8trx.ai",
"base_ami_id": "ami-XXXXXXXXX", // from step 1, used by both templates
"m8trx_agent_ami_id": "ami-XXXXXXXXX", // same as base_ami_id (Path B)
"platform_domain": "m8trx.ai"
}
Tailscale on the dashboard host (mgmt access)
The dashboard binds to a Tailscale IP — there is no public ingress on the operations dashboard. Install Tailscale on the dashboard EC2, join the same tailnet your operators use, and bind the FastAPI server to the Tailscale IP. Detail under the Mgmt Access tab.
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=m8trx-deployer-dashboard --advertise-tags=tag:mgmt
# follow the printed URL to authorize
There is no separate management proxy or management VPC any more — that module was retired on 2026-04-21. Tailscale-on-the-dashboard plus AWS SSM cover the same ground at lower cost.
Start the dashboard (with required env)
cd dashboard
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export DASHBOARD_HOST=$(tailscale ip -4 | head -1)
export DASHBOARD_PORT=8443
export DASHBOARD_USER=admin
export DASHBOARD_PASS='<strong-password>' # REQUIRED — no default
export AWS_REGION=us-east-2
export PROJECT_NAME=m8trx
python3 main.py
A cron watchdog auto-restarts the process every minute if it dies (see scripts/cron/watchdog.sh). PID lock at dashboard/dashboard.pid prevents duplicates. After every code change, kill and restart: pkill -f "python main.py"; sleep 2; rm -f dashboard/dashboard.pid.
Configure Cloudflare (subdomain provisioning)
Open the 📡 settings cog in the dashboard topbar. Enter the Cloudflare API token (zone:DNS:edit + zone:zone:read), zone ID, and platform domain (m8trx.ai). The dashboard validates the token on save. Subdomains are created with proxied=false so origin TLS via Caddy + Let's Encrypt works end-to-end.
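Token validation on save maps to Cloudflare's verify endpoint; checking it by hand looks like this:

# Confirm the token is valid and active before trusting it for deploys.
curl -s "https://api.cloudflare.com/client/v4/user/tokens/verify" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json"
# -> {"result": {"status": "active", ...}, "success": true}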
Configure Brain Telemetry (M8trx Agent customers only)
In the same 📡 settings modal, fill the Brain & Tailscale section — these fields are required only for deploying M8trx Agent customers; baseline (custom-app) deploys ignore them.
| Field | What it is |
|---|---|
| tailscale_api_token | Tailscale OAuth client / API token with ACL + key + device-approval scopes |
| tailscale_tailnet | Tailnet name (e.g. M8trxInfra.github) |
| brain_url | Internal URL of the brain server (Tailscale-internal address preferred) |
| brain_admin_token | Brain admin bearer — used by the dashboard, never written to a customer instance |
| agent_fetch_pat | GitHub PAT (read-only on M8trxInfra/M8trxAgent) — used at customer first-boot to clone the agent repo |
| agent_ui_default_password | Initial Caddy basic-auth password seeded for new customers — operators rotate per-customer afterwards |
All fields are written to state.json. GET endpoints return masked values; full plaintext is only available to whoever holds DASHBOARD_PASS.
Apply Organization SCPs
Apply the guardrail policies at the AWS Organization level — these prevent anyone (including a leaked admin role) from disabling security services, launching unencrypted instances, or making S3 buckets public.
# In AWS Organizations console:
# Policies > Service control policies > Create policy
# Paste contents of policies/scp-guardrails.json
# Attach to the OU containing your account
Test SCPs in a sandbox account first. An overly broad SCP can lock you out of your own account.
Phase 2 — Deploying a New Customer (Repeatable)
After Phase 1, deploying a new customer is a single dashboard click that takes ~5 minutes. The form picks the template; the dashboard handles brain-telemetry provisioning, terraform apply, Cloudflare DNS, and Tailnet Lock approval as one orchestrated job.
Fill the Deploy Form
On the dashboard, click "Deploy New". The fields shown depend on the template you pick.
| Field | Example | Validation / Notes |
|---|---|---|
| Customer Name | acme | 3–32 chars, lowercase, alphanumeric + hyphens. Used in all resource names. Must be unique across both templates. |
| Template | m8trx_agent or baseline | Picks which Terraform module + tfvars map is used. Determines whether brain-telemetry provisioning runs. |
| Instance Type | t3.medium (default) | Whitelisted dropdown. |
| Volume Size | 30 GB | 20–500 GB. Encrypted EBS with the customer's per-customer KMS key. |
| Subdomain (Agent only) | acme | Optional label. Defaults to the customer name. Final hostname is {label}.{platform-domain}. |
| Agent UI password (Agent only) | auto-generated | Caddy basic-auth on the customer subdomain. Shown in cleartext in the dashboard with a copy button. |
Click "Deploy Secure Instance". The form validates all inputs before submitting.
Pre-Deploy: Brain Telemetry Provisioning (Agent template only)
For the M8trx Agent template, the dashboard runs brain_telemetry.provision_for_customer() before Terraform sees anything — so a half-provisioned customer never enters the brain or AWS state. Steps stream into the deploy job log:
- Brain mint — POST /admin/customers; aborts on 409 (duplicate brain id).
- Tailscale ACL upsert — adds tag:m8trx-cust-<id> tagOwner + intra-customer accept rule (must precede auth-key mint).
- Tailscale auth key mint — tag-bound, ephemeral, reusable, 90-day.
- SSM put × 6 — /m8trx/<name>/{brain-key, brain-customer-id, brain-url, tailscale-auth-key, agent-fetch-pat, agent-ui-password} as SecureStrings.
Terraform Apply (creates AWS resources)
The dashboard runs terraform apply -var-file=<template>.tfvars in a background job (terraform_runner, shell=False). Progress streams to the Jobs tab.
| Resource | Detail |
|---|---|
| VPC | Isolated VPC, single public subnet, IGW. CIDR octet allocated stably from the customer name (re-deploys don't shift). Flow logs to CloudWatch (30-day). |
| Security Group | Baseline: :8080 only. M8trx Agent: :80 + :443. Outbound: all (SSM, GHCR, Brain, Anthropic, Tailscale). |
| KMS Key | Customer-dedicated CMK with auto-rotation, key policy scoped to that customer's IAM role. |
| S3 Bucket | Encrypted, versioned, public-access-blocked, TLS-enforced, wrong-key-rejected. |
| IAM Role | Instance role with permission boundary. M8trx Agent additionally gets the brain_telemetry_ssm_read inline policy scoped to /m8trx/<own-name>/*. |
| EC2 Instance | Hardened Ubuntu, IMDSv2 required (hop=1), encrypted EBS, EIP. M8trx Agent template runs the 9-step Path B user-data bootstrap. |
| DLM Snapshots | Daily EBS snapshots at 03:00 UTC, 14-day retention. |
Post-Apply: DNS + Tailnet Lock (Agent template only)
After Terraform apply succeeds, the dashboard runs two best-effort follow-ups:
- Cloudflare DNS upsert — A record {label}.{platform-domain} → instance EIP, proxied=false, comment m8trx-managed:customer.
- Tailnet Lock approval — polls Tailscale for a device with hostname == <instance-id>; once it appears (cloud-init step 4 ran), POSTs {"keyExpiryDisabled": false} to approve. If cloud-init is slow, the operator can approve manually from the Tailscale admin console.
First-Boot Bootstrap (automatic, Agent template — full detail in Compute tab)
- Set hostname to instance-id (so Tailnet Lock can find the device).
- Install
docker.io,docker-compose-plugin,tailscale. - Read 6 SSM SecureStrings →
/etc/m8trx/brain.env(mode 0600). tailscale up --auth-key=… --hostname=<instance-id> --ssh=false.git clone --depth 1the agent repo (PAT stripped from.git/configimmediately).- Generate
compose/.envwith random Postgres creds + bcrypt-hashed agent UI password. - Patch the Caddyfile for the customer hostname + LE; add
:443to the compose ports. docker compose up -d --pull always. First boot ends when Caddy obtains its certificate.
Logs tee'd to /var/log/m8trx-agent/first-boot.log for SSM-side debugging via the dashboard's Shell Access panel.
Hand the Customer Their Welcome Package
The dashboard generates a per-customer welcome card on the customer detail page: URL (the subdomain), username (admin), and password (the agent UI password — shown cleartext, copy button). Operator copies into a 1Password share / secure email and sends.
If the customer needs the password rotated later, the operator clicks "Change Agent UI Password" on the customer panel — see Phase 3.
Verify Security Posture
On the customer's row, expand the Security Posture panel (API-level checks) and click "Run Check" on the Deep Security Check panel (live SSM hardening checks, ~15–30s). Both should be green before declaring the customer ready.
| Check | What it verifies |
|---|---|
| IMDSv2 required (hop=1) | Instance metadata requires token + 1-hop limit (no container forwarding) |
| EBS encryption | All volumes encrypted with the customer's KMS key |
| SG rules | Only :8080 (baseline) or :80+:443 (Agent) ingress |
| auditd / fail2ban / UFW / AppArmor | Each running with immutable / active config |
| Unattended-upgrades | Automatic security patching active |
| Agent stack health (Agent template) | Compose project up; Caddy has a valid LE cert; paperclip /health green |
| Brain reachability (Agent template) | Tailscale device present + Tailnet-Lock approved + last paperclip→Brain event < 5 min ago |
Phase 3 — Ongoing Operations
Check the Monitoring tab for CloudWatch alarms, GuardDuty findings, and CloudTrail events. Resource monitors (CPU, memory, disk, network) display in each customer's detail view — memory and disk use SSM fallback when CloudWatch agent data is unavailable. Instance logs viewable in the Logs tab per customer.
If an instance is compromised, click "Quarantine" in the dashboard. This snapshots all volumes, replaces the security groups with a deny-all group, and tags the instance. SSM still works for forensics. Never terminate; preserve the instance for investigation.
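A boto3 sketch of the quarantine flow described above, assuming a pre-created deny-all security group in the customer's VPC; the tag key and helper name are illustrative.

```python
import boto3

def quarantine_instance(instance_id: str, deny_all_sg: str) -> None:
    """Isolate a suspected-compromised instance while preserving evidence."""
    ec2 = boto3.client("ec2")
    inst = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]

    # 1. Snapshot every attached EBS volume for forensics.
    for mapping in inst.get("BlockDeviceMappings", []):
        ec2.create_snapshot(
            VolumeId=mapping["Ebs"]["VolumeId"],
            Description=f"quarantine {instance_id}",
        )

    # 2. Swap all security groups for the pre-created deny-all group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[deny_all_sg])

    # 3. Tag so operators know why the instance is isolated (tag key assumed).
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "m8trx:quarantined", "Value": "true"}],
    )
```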
Each customer detail page shows a Shell Access panel. Copy the SSM command to open an interactive terminal, or type commands directly in the dashboard and see output in real time. Commands are validated against a blocklist (no rm -rf /, pipe-to-shell, etc.) and capped at 2000 chars. Endpoint: POST /api/customers/{name}/exec.
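A minimal sketch of the validation step behind that endpoint, using the 2000-character cap from the description above; the blocked patterns shown are illustrative examples, not the dashboard's full blocklist.

```python
import re

MAX_LEN = 2000

# Illustrative patterns only -- the real blocklist is longer.
BLOCKED = [
    re.compile(r"rm\s+-rf\s+/(\s|$)"),       # recursive delete of root
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),     # pipe-to-shell
    re.compile(r"wget[^|]*\|\s*(ba)?sh"),
    re.compile(r":\(\)\s*\{.*\};\s*:"),       # fork bomb
]

def validate_command(cmd: str) -> str | None:
    """Return an error string if the command is rejected, else None."""
    if len(cmd) > MAX_LEN:
        return f"command exceeds {MAX_LEN} characters"
    for pattern in BLOCKED:
        if pattern.search(cmd):
            return "command matches a blocked pattern"
    return None
```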
"Backup Now" creates on-demand EBS snapshots of all volumes. The Backups panel shows all snapshots (automated + on-demand) with state, progress, time, and type. DLM runs daily at 03:00 UTC with 14-day retention. Endpoints: POST /api/customers/{name}/snapshot, GET /api/customers/{name}/snapshots.
Click "Resize Disk" in the detail view to expand a customer's root EBS volume (20-500 GB). Works on both running and stopped instances. For running instances, the filesystem is expanded automatically via SSM (growpart + resize2fs/xfs_growfs). Stopped instances auto-expand on next boot via cloud-init. Endpoint: POST /api/customers/{name}/resize-disk.
Click the gear icon in the topbar to configure email alerts via AWS SES. Set a service email address and choose which events trigger alerts: incidents (quarantine), deploy failures, deploy successes, and CloudWatch alarms. Use "Send Test" to verify delivery. The sender address must be verified in SES. Settings stored in dashboard state. Endpoints: GET/PUT /api/platform/alert-settings, POST /api/platform/alert-test. Platform-level only — these alerts are for operators; customer-facing notifications use the channels flow below.
Each existing customer instance runs the Paperclip version baked into its AMI. To avoid re-baking + instance replacement for every Paperclip change, the deployer dashboard has an "Upgrade Paperclip Source" button on the per-customer Paperclip panel. It tars paperclip/ from the dashboard host (excluding venv, caches, local DB), base64-encodes it, ships it inline via SSM, rsyncs into /opt/paperclip-src, re-runs the idempotent bootstrap/install.sh, and restarts paperclip.service. No S3 round-trip, no persistent artifacts. Endpoint: POST /api/customers/{name}/paperclip/upgrade. Size-capped at ~90 KB base64 — if the tree outgrows that, switch to an S3-mediated push.
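A sketch of that tar → base64 → SSM push, with the size cap from the description above. The exclusion list, remote paths, and shell lines are assumptions standing in for the dashboard's actual upgrade handler.

```python
import base64
import io
import tarfile
import boto3

MAX_B64 = 90 * 1024  # ~90 KB cap before switching to an S3-mediated push

def _exclude(ti: tarfile.TarInfo) -> tarfile.TarInfo | None:
    """Skip venv, caches, and the local DB (exclusion names assumed)."""
    parts = ti.name.split("/")
    if any(p in ("venv", "__pycache__", ".pytest_cache") for p in parts) or ti.name.endswith(".db"):
        return None
    return ti

def push_paperclip_source(instance_id: str, src_dir: str = "paperclip") -> str:
    """Ship the local paperclip/ tree to a customer instance inline via SSM."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(src_dir, arcname="paperclip", filter=_exclude)
    payload = base64.b64encode(buf.getvalue()).decode()
    if len(payload) > MAX_B64:
        raise RuntimeError("source tree too large for inline push; use S3 instead")

    commands = [
        f"echo '{payload}' | base64 -d > /tmp/paperclip.tgz",
        "mkdir -p /opt/paperclip-src && tar xzf /tmp/paperclip.tgz -C /opt/paperclip-src --strip-components=1",
        "bash /opt/paperclip-src/bootstrap/install.sh",   # idempotent re-install
        "systemctl restart paperclip.service",
    ]
    resp = boto3.client("ssm").send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    return resp["Command"]["CommandId"]
```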
Each M8trx Agent customer has their own Gmail + Telegram notification channels for task events (approvals needed, task complete, task failed). Configured in two places:
- Deployer dashboard → Paperclip panel → "Notification channels (seed)": operator seeds Gmail SMTP creds (smtp.gmail.com:587 + Google App Password) and/or a Telegram bot token + chat id when onboarding. Written to `/etc/paperclip/config.yaml` via SSM alongside the Anthropic key and admin password.
- Paperclip (customer dashboard) → gear icon → Settings modal: the customer can override any field. Overrides live in the instance's SQLite (`channel_settings` table) and win field-by-field over the deployer seed; blank fields inherit the seed. A "Send test" button fires a live notification to verify.
Outbound notifications are fire-and-forget, sent from a daemon thread inside Paperclip — hooked at dispatcher._settle() (complete/failed) and at POST /api/tasks when needs_approval=true. Failures are logged and never block task execution. Endpoints on Paperclip: GET /api/channels (redacted), PUT /api/channels/{email|telegram}, DELETE /api/channels/{provider} (revert to seed), POST /api/channels/test. Gmail requires 2FA + an App Password, not the account password.
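A minimal sketch of that fire-and-forget hook; `send_notification` is a hypothetical stand-in for the real Gmail/Telegram senders, and the hook call sites shown in the trailing comment follow the description above.

```python
import logging
import threading

log = logging.getLogger("paperclip.notify")

def send_notification(event: str, task: dict) -> None:
    """Placeholder for the real Gmail/Telegram delivery (illustrative only)."""
    ...

def notify_async(event: str, task: dict) -> None:
    """Fire a notification without blocking task execution."""
    def _send() -> None:
        try:
            send_notification(event, task)
        except Exception:
            # Failures are logged, never raised into the task path.
            log.exception("notification failed for task %s", task.get("id"))

    threading.Thread(target=_send, daemon=True).start()

# Hooked (per the description above) at dispatcher._settle() and at task
# creation when needs_approval=true, e.g.:
#   notify_async("task_complete", task)
#   notify_async("approval_needed", task)
```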
Casual chat replies previously took 3–5 s (CLI subprocess spawn) and data-aware prompts like "summarize my inbox" took 30–150 s (Claude tool-use loop). Two changes land the fast-path:
- Intent detection + context pre-fetch (`chat_context.py`): when a chat message mentions inbox / tasks / events / finance / leads, regex rules detect the intent(s), the relevant SQLite rows are fetched directly (≤50 ms) and injected as a plain-text context block prepended to the task prompt. Claude answers from this context in one turn — no tool-use loop needed. (Sketched below.)
- SDK worker (`workers/claude_sdk.py`): calls the Anthropic Messages API directly via the Python SDK, bypassing the `claude` CLI subprocess (~1–2 s overhead). API key required; no OAuth path (OAuth is CLI-audience only). First-token latency in the sub-second range for trivial chat on a t3.medium.
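A sketch of the intent-detection + pre-fetch idea. The regex rules, table names, and query shapes are illustrative assumptions — chat_context.py's real rules and schema may differ.

```python
import re
import sqlite3

# Illustrative intent rules -- the real chat_context.py rules may differ.
INTENT_PATTERNS = {
    "inbox":   re.compile(r"\b(inbox|email|mail)\b", re.I),
    "tasks":   re.compile(r"\btasks?\b", re.I),
    "events":  re.compile(r"\b(events?|calendar|meetings?)\b", re.I),
    "finance": re.compile(r"\b(finance|invoices?|expenses?)\b", re.I),
    "leads":   re.compile(r"\bleads?\b", re.I),
}

# Hypothetical table/column names standing in for the real schema.
INTENT_QUERIES = {
    "inbox": "SELECT subject, sender, received_at FROM emails ORDER BY received_at DESC LIMIT 20",
    "tasks": "SELECT title, status, due FROM tasks ORDER BY due LIMIT 20",
}

def build_context(message: str, db_path: str) -> str:
    """Detect intents in a chat message and pre-fetch matching rows as plain text."""
    intents = [name for name, pat in INTENT_PATTERNS.items() if pat.search(message)]
    blocks = []
    with sqlite3.connect(db_path) as conn:
        for intent in intents:
            sql = INTENT_QUERIES.get(intent)
            if not sql:          # intents without a query contribute no block
                continue
            rows = conn.execute(sql).fetchall()
            body = "\n".join(", ".join(str(col) for col in row) for row in rows)
            blocks.append(f"[{intent} context]\n{body}")
    # Prepended to the task prompt so Claude can answer in a single turn.
    return "\n\n".join(blocks)
```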
Three-tier dispatcher routing for chat-linked tasks (tasks with an assistant placeholder message): Tier 1 — SDK worker (fast, API key required). Tier 2 — CLI worker (OAuth supported, full tool use) on any SDK availability/auth failure. Tier 3 — Hermes/Llama (local, free) on CLI availability failure. Non-chat tasks (draft-reply approval, coding) go straight to the CLI worker unchanged. The meta field on each WorkerResult records which worker answered and any fallback chain (fallback_from).
System prompt is shared via workers/_claude_system_prompt.py — imported by both CLI and SDK workers so the context description stays identical across both paths.
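A sketch of the three-tier routing described above. The worker interface (`available()` / `run()`) and the `WorkerResult` shape are assumptions; only the tier order and the `meta` / `fallback_from` bookkeeping come from the description.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerResult:
    text: str
    meta: dict = field(default_factory=dict)

def run_chat_task(prompt: str, sdk_worker, cli_worker, hermes_worker) -> WorkerResult:
    """Tier 1 SDK -> Tier 2 CLI -> Tier 3 Hermes/Llama, recording any fallback chain."""
    chain = [("sdk", sdk_worker), ("cli", cli_worker), ("hermes", hermes_worker)]
    fallback_from: list[str] = []
    for name, worker in chain:
        if not worker.available():      # missing API key, CLI binary, etc.
            fallback_from.append(name)
            continue
        try:
            result = worker.run(prompt)
            result.meta.update({"worker": name, "fallback_from": fallback_from})
            return result
        except Exception:
            fallback_from.append(name)  # auth/availability failure -> next tier
    raise RuntimeError("all workers failed: " + " -> ".join(fallback_from))
```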
Customer admins manage their own users via the agent API. Each user gets isolated data directories and scoped permissions. All actions audit-logged and shipped to CloudWatch.
Click "Delete Customer" in the detail view. The dashboard refuses if AWS termination protection is on — disable it first (operator-explicit, intentionally a friction point). Pre-destroy: empty S3 (including versioned objects), delete CloudWatch log groups, detach IAM permission boundaries. Run terraform destroy. Post-destroy (Agent template): delete the 6 SSM params under /m8trx/<name>/*, DELETE /admin/customers/<brain_customer_id> on the brain, remove the Cloudflare A record. KMS key keeps a 30-day deletion window.
Per-customer Caddy basic-auth password. Click "Change Agent UI Password" on the customer panel (8–256 chars, validated). Two-step rotation: (1) update /m8trx/<name>/agent-ui-password in SSM (so future re-deploys pick up the new password); (2) SendCommand to the running instance — re-hash with caddy hash-password, write into compose/.env, docker compose up -d --force-recreate caddy. Plaintext is base64-passed in the SSM command (encrypted in transit, not logged at rest). Endpoint: POST /api/customers/{name}/agent-password.
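A sketch of the two-step rotation, assuming the SSM parameter path shown above; the compose file location, `.env` variable name, and shell lines are illustrative, not the dashboard's actual SendCommand payload.

```python
import base64
import boto3

def rotate_agent_ui_password(customer: str, instance_id: str, new_password: str) -> None:
    """Step 1: update SSM so re-deploys pick up the new password.
    Step 2: re-hash on the instance and recreate Caddy with the new hash."""
    ssm = boto3.client("ssm")

    ssm.put_parameter(
        Name=f"/m8trx/{customer}/agent-ui-password",
        Value=new_password,
        Type="SecureString",
        Overwrite=True,
    )

    b64 = base64.b64encode(new_password.encode()).decode()
    commands = [
        # Decode the base64-passed plaintext on the instance (kept out of logs).
        f"PW=$(echo '{b64}' | base64 -d)",
        # Env var name and compose path are assumptions.
        'HASH=$(cd /opt/m8trx-agent && docker compose run --rm caddy caddy hash-password --plaintext "$PW")',
        'sed -i "s|^AGENT_UI_PASSWORD_HASH=.*|AGENT_UI_PASSWORD_HASH=$HASH|" /opt/m8trx-agent/compose/.env',
        "cd /opt/m8trx-agent && docker compose up -d --force-recreate caddy",
    ]
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
```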
Customer auth keys are tag-bound, ephemeral, 90-day. They auto-revoke at expiry — but the device stays joined as long as tailscaled is running. To rotate ahead of expiry: re-deploy the customer (full bootstrap with a fresh key) or run a manual mint + push the new key into the customer's SSM, then restart tailscaled on the instance via the Shell Access panel. Planned: a "Rotate auth key" button on the customer panel.
A background decommission checker runs every 5 minutes, scanning AWS for orphaned resources (EC2 instances, VPCs, S3 buckets, KMS keys, IAM roles) from deleted customers no longer in state. Orphaned instances and VPCs are auto-cleaned to prevent cost leaks. On-demand check: GET /api/decommission-check.
Troubleshooting
| Issue | Solution |
|---|---|
| Deploy job shows "No valid credential sources" | The dashboard host needs AWS credentials. Attach an IAM role to the instance or set AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars before starting the dashboard. |
| Deploy aborts: "Brain telemetry settings are not configured" | Open the 📡 settings cog and fill the brain-telemetry section (5 fields). Required only for M8trx Agent customers. |
| Deploy aborts: "Brain already has a customer with id 'cust_…' (409)" | The brain side has a leftover customer. Either pick a different deployer name, or revoke the existing brain customer via the brain admin UI before retrying. |
| Tailscale ACL upsert fails 412 | Race against another concurrent dashboard run. Retry — the upsert uses ETag-based optimistic locking and the second attempt usually wins. |
| Tailnet Lock approval times out post-apply | Cloud-init may still be running on the instance. Check /var/log/m8trx-agent/first-boot.log via SSM. Once the device appears in Tailscale, approve manually from the admin console — no need to re-run the deploy. |
| Caddy can't get a Let's Encrypt cert | Verify the Cloudflare A record exists (proxied=false), points at the EIP, and SG ingress :80 is open (LE HTTP-01 challenge). docker compose logs caddy via Shell Access for detail. |
| Customer can't reach {name}.m8trx.ai | DNS not yet propagated — ~1 min typical for Cloudflare. Or the Caddy basic-auth password they were given is stale: rotate via the Phase 3 card. |
| Compose stack not starting | Shell Access → cd /opt/m8trx-agent && docker compose ps. Check docker compose logs paperclip for missing env vars or DB init issues. |
| Deep security check shows "unknown" | SSM agent takes 1–2 minutes to register after instance launch. Also verify the instance's IAM role has SSM permissions. |
| Destroy refuses with "termination protection enabled" | Intentional — disable termination protection in the EC2 console (Actions → Instance Settings → Change termination protection) before retrying. Forces operator-explicit consent for destructive actions. |
API Token Usage Tracking
Per-customer Claude API token consumption analytics with cost estimation and configurable alert thresholds. Tracks input, output, cache read, and cache creation tokens across all Claude model tiers.
Architecture Overview
Data Model
Usage data is stored in a local SQLite database (usage.db) with the following schema.
| Column | Type | Description |
|---|---|---|
| customer | TEXT | Customer identifier (e.g. acme-corp) |
| day | TEXT (ISO date) | Date of usage |
| model | TEXT | Claude model used (opus-4-6, sonnet-4-6, haiku-4-5) |
| sessions | INTEGER | Number of API sessions |
| input_tokens | INTEGER | Prompt input tokens consumed |
| output_tokens | INTEGER | Output tokens generated |
| cache_read | INTEGER | Cache read tokens (prompt caching) |
| cache_creation | INTEGER | Cache creation/write tokens |
Indexed on (customer), (day), and unique on (customer, day, model) for efficient querying and upsert operations.
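A sketch of the init + upsert implied by that schema, assuming a table named `usage`; the actual DDL and function names in usage_scanner.py may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS usage (
    customer        TEXT NOT NULL,
    day             TEXT NOT NULL,            -- ISO date
    model           TEXT NOT NULL,
    sessions        INTEGER NOT NULL DEFAULT 0,
    input_tokens    INTEGER NOT NULL DEFAULT 0,
    output_tokens   INTEGER NOT NULL DEFAULT 0,
    cache_read      INTEGER NOT NULL DEFAULT 0,
    cache_creation  INTEGER NOT NULL DEFAULT 0,
    UNIQUE (customer, day, model)
);
CREATE INDEX IF NOT EXISTS idx_usage_customer ON usage (customer);
CREATE INDEX IF NOT EXISTS idx_usage_day ON usage (day);
"""

def record_usage(db: str, customer: str, day: str, model: str,
                 input_tokens: int, output_tokens: int,
                 cache_read: int = 0, cache_creation: int = 0) -> None:
    """Upsert one day's usage for a (customer, day, model) triple."""
    with sqlite3.connect(db) as conn:
        conn.executescript(SCHEMA)
        conn.execute(
            """INSERT INTO usage (customer, day, model, sessions, input_tokens,
                                  output_tokens, cache_read, cache_creation)
               VALUES (?, ?, ?, 1, ?, ?, ?, ?)
               ON CONFLICT (customer, day, model) DO UPDATE SET
                 sessions       = sessions + 1,
                 input_tokens   = input_tokens + excluded.input_tokens,
                 output_tokens  = output_tokens + excluded.output_tokens,
                 cache_read     = cache_read + excluded.cache_read,
                 cache_creation = cache_creation + excluded.cache_creation""",
            (customer, day, model, input_tokens, output_tokens,
             cache_read, cache_creation),
        )
```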
Pricing & Cost Calculation
Costs are estimated using per-model pricing rates (per 1M tokens). The calc_cost() function computes costs across all four token types.
| Model | Input | Output | Cache Write | Cache Read |
|---|---|---|---|---|
| Claude Opus 4.6 | $6.15 | $30.75 | $7.69 | $0.61 |
| Claude Sonnet 4.6 | $3.69 | $18.45 | $4.61 | $0.37 |
| Claude Haiku 4.5 | $1.23 | $6.15 | $1.54 | $0.12 |
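The cost math reduces to a per-million-token dot product over the four token types. A minimal sketch mirroring the table above; the constants and function signature are illustrative, not necessarily calc_cost()'s real shape.

```python
# Per-1M-token rates mirroring the pricing table above.
PRICING = {
    "opus-4-6":   {"input": 6.15, "output": 30.75, "cache_creation": 7.69, "cache_read": 0.61},
    "sonnet-4-6": {"input": 3.69, "output": 18.45, "cache_creation": 4.61, "cache_read": 0.37},
    "haiku-4-5":  {"input": 1.23, "output": 6.15,  "cache_creation": 1.54, "cache_read": 0.12},
}

def calc_cost(model: str, input_tokens: int, output_tokens: int,
              cache_read: int = 0, cache_creation: int = 0) -> float:
    """Estimate USD cost for one usage row across all four token types."""
    rates = PRICING[model]
    return (
        input_tokens * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_creation * rates["cache_creation"]
    ) / 1_000_000

# Example: calc_cost("sonnet-4-6", 800_000, 120_000) ~= 5.17 USD
```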
API Endpoints
All endpoints require HTTP Basic Auth. Mounted under /api/usage/ via FastAPI router.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/usage/data | Full usage dashboard data: daily breakdowns, customer totals, pricing, current costs, and alert thresholds |
| POST | /api/usage/seed | Populate demo/synthetic usage data for POC customers |
| POST | /api/usage/reset | Clear all usage data and re-seed with fresh demo data |
| GET | /api/usage/thresholds | List all configured cost alert thresholds |
| PUT | /api/usage/thresholds | Set or update a per-customer cost alert threshold (period + dollar limit) |
| DELETE | /api/usage/thresholds/{customer} | Remove the cost alert threshold for a customer |
| POST | /api/usage/check-alerts | Check all thresholds against current costs; sends SES email alerts for breaches |
Cost Alert Thresholds
Operators set per-customer cost limits with a period (week or month) and a dollar amount. Thresholds are stored in state.json alongside other dashboard state. When a customer's estimated cost for the current period meets or exceeds the threshold, an alert is triggered.
The /api/usage/check-alerts endpoint evaluates all enabled thresholds against current-period costs. Breaches trigger email alerts via AWS SES (using the alert_email service) containing the customer name, period, threshold, current cost, and overage amount. Returns a list of all breaches for dashboard display.
The get_customer_costs() function aggregates token usage for the current period: week (Monday through today) or month (1st through today). It sums costs across all models per customer using the calc_cost() pricing function, covering input, output, cache read, and cache write tokens.
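A sketch of that period aggregation, reusing the `usage` table and `calc_cost()` from the sketches above; the real get_customer_costs() signature and return shape may differ.

```python
import datetime as dt
import sqlite3
from collections import defaultdict

def get_customer_costs(db: str, period: str = "month") -> dict[str, float]:
    """Aggregate current-period cost per customer.

    week  = Monday through today; month = 1st through today.
    """
    today = dt.date.today()
    if period == "week":
        start = today - dt.timedelta(days=today.weekday())   # Monday
    else:
        start = today.replace(day=1)                          # 1st of the month

    costs: dict[str, float] = defaultdict(float)
    with sqlite3.connect(db) as conn:
        rows = conn.execute(
            """SELECT customer, model, input_tokens, output_tokens,
                      cache_read, cache_creation
               FROM usage WHERE day >= ?""",
            (start.isoformat(),),
        )
        for customer, model, inp, out, cr, cc in rows:
            costs[customer] += calc_cost(model, inp, out, cr, cc)
    return dict(costs)
```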
Key Files
| File | Purpose |
|---|---|
| dashboard/services/usage_scanner.py | Core module: database init, demo data seeding, usage queries, cost calculation |
| dashboard/routers/usage.py | FastAPI router: REST endpoints for usage data, thresholds, and alert checks |
| dashboard/usage.db | SQLite database storing per-customer token usage records |
| dashboard/state.json | Persists alert thresholds alongside other dashboard settings |