Date: 2026-05-03
Phase: B.3 (telemetry — host-side liveness + stats)
Predecessor: B.2 (Claude Code hooks for agent-runtime)
Successor: B.4 (Tailscale + cloud-init bootstrap)
Install a systemd timer + service on each customer EC2 host that POSTs
a heartbeat event to the brain ingestion API every ~5 minutes. The
event carries host metadata + system stats so operators get fleet-wide
liveness ("this host is alive") plus light capacity-trend visibility
(load avg, memory, disk %), without depending on the customer
actively running agents.
Phase B.2 instrumented Claude itself (per-tool-call events from inside
the agent-runtime container). But a customer with no recent agent
activity disappears from the dashboard's "last seen" tracking. The
heartbeat closes that gap with a periodic, agent-independent liveness
signal — and while we're at it, attaches cheap host stats so a
customer trending toward disk-full or thrashing on memory shows up in
the brain before they page support.
                     ┌─ customer EC2 host ──────────────────────┐
                     │                                          │
    systemd timer ──→│ m8trx-brain-heartbeat.timer              │
    (every 5 min)    │   OnBootSec=30s, OnUnitActiveSec=5min    │
                     │   RandomizedDelaySec=30s ◄── fleet-wide  │
                     │       │                      spread      │
                     │       ▼                                  │
                     │ m8trx-brain-heartbeat.service            │
                     │   Type=oneshot, User=root                │
                     │   EnvironmentFile=/etc/m8trx/brain.env   │
                     │       │                                  │
                     │       ▼                                  │
                     │ /usr/local/bin/m8trx-brain-heartbeat     │
                     │   (POSIX sh + jq + curl)                 │
                     │   - reads /proc/loadavg, /proc/meminfo,  │
                     │     /proc/uptime, uname -r, hostname     │
                     │   - df -P / for root disk %              │
                     │   - jq builds payload + event JSON       │
                     │   - curl POST to ${BRAIN_URL}/v1/events  │
                     │       │                                  │
                     └───────┼──────────────────────────────────┘
                             │ HTTPS over Tailscale
                             ▼
                     ┌─ brain ──────────────────────────────────┐
                     │ /v1/events                               │
                     │   agent_id="_host" → filtered out of     │
                     │   agent rollups by dashboard queries     │
                     └──────────────────────────────────────────┘
RandomizedDelaySec=30s spreads the fleet so 100 customers don't all
fire at XX:00:00. Customer ID is implicit from the bearer key in
/etc/m8trx/brain.env, which B.4 cloud-init writes during host
provisioning.
Timer fires every 5 minutes (OnUnitActiveSec=5min), with a
30-second startup delay (OnBootSec=30s) and a 30-second
fleet-wide jitter (RandomizedDelaySec=30s).
5 minutes was chosen against the dashboard's statusFromLastSeenMin
thresholds (server/src/routes/dashboard.js:9–15):
A single missed beat is well below the warning threshold, and 12
beats/hour × N customers is a manageable write rate (~1.2k rows/hr at
100 customers).
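
As a quick check of the write-rate figure, the arithmetic can be reproduced in POSIX sh. The variable names below are illustrative only and are not part of the heartbeat script.

```shell
#!/bin/sh
# Back-of-envelope write rate for the heartbeat (illustrative names only).
interval_min=5      # matches OnUnitActiveSec=5min
customers=100       # fleet size used in the estimate

beats_per_hour=$((60 / interval_min))
rows_per_hour=$((beats_per_hour * customers))
rows_per_day=$((rows_per_hour * 24))

echo "beats/hour per host: $beats_per_hour"
echo "rows/hour at $customers customers: $rows_per_hour"
echo "rows/day at $customers customers: $rows_per_day"
```

Changing `customers` or `interval_min` shows how the row count scales if the fleet grows or the cadence is tightened.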
Four files, all under agent-artifacts/heartbeat/:
m8trx-brain-heartbeat.sh

POSIX sh script. Installed at /usr/local/bin/m8trx-brain-heartbeat,
mode 0755, root-owned.

Behaviour:

1. Require BRAIN_URL and BRAIN_API_KEY. If either is unset, exit 1. (Non-zero exit: this […] systemctl status / journalctl.)
2. Gather host stats:
   - hostname from hostname (or /etc/hostname fallback).
   - uptime_sec from /proc/uptime (first field, integer-truncated).
   - kernel from uname -r.
   - load1, load5, load15 from /proc/loadavg (first three, […]
   - mem_total_mb, mem_avail_mb from /proc/meminfo […] MemTotal, MemAvailable are reported in kibibytes; divide by […] _mb for brevity but the value is […] free -m shows on the same host).
   - disk_root_pct from df -P / | tail -1 | awk '{print $5}', […] %.
3. Build the event JSON with jq -nc using --arg / --argjson so […]
   - event_id from cat /proc/sys/kernel/random/uuid.
   - ts from date -u +"%Y-%m-%dT%H:%M:%S.000Z".
   - event_type=heartbeat, agent_id=_host.
4. POST to ${BRAIN_URL%/}/v1/events with curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null, with Authorization: Bearer ${BRAIN_API_KEY} and Content-Type: application/json.

set -e is on throughout, so any of (1)–(4) failing exits non-zero […]

Implementation notes:

- set -u. Optional vars handled with ${VAR:-} defaults where […]
- No trap 'exit 0' (different from the brain-hook). The hook […] systemctl status.

m8trx-brain-heartbeat.service

systemd unit. Installed at
/etc/systemd/system/m8trx-brain-heartbeat.service, mode 0644.
[Unit]
Description=M8trx brain host heartbeat
After=network-online.target tailscaled.service
Wants=network-online.target
[Service]
Type=oneshot
User=root
EnvironmentFile=/etc/m8trx/brain.env
ExecStart=/usr/local/bin/m8trx-brain-heartbeat
After=tailscaled.service because brain is reachable only over
Tailscale.
m8trx-brain-heartbeat.timer

systemd timer. Installed at
/etc/systemd/system/m8trx-brain-heartbeat.timer, mode 0644.
[Unit]
Description=Fire M8trx brain host heartbeat every 5 minutes
[Timer]
OnBootSec=30s
OnUnitActiveSec=5min
RandomizedDelaySec=30s
Unit=m8trx-brain-heartbeat.service
[Install]
WantedBy=timers.target
README.md

Short integration doc covering:

- Where each artifact installs to on a customer EC2.
- The expected /etc/m8trx/brain.env format (mode 0600 root:root,
  BRAIN_URL= and BRAIN_API_KEY= lines).
- Install steps:

      cp m8trx-brain-heartbeat.sh /usr/local/bin/m8trx-brain-heartbeat
      cp m8trx-brain-heartbeat.service /etc/systemd/system/
      cp m8trx-brain-heartbeat.timer /etc/systemd/system/
      chmod 0755 /usr/local/bin/m8trx-brain-heartbeat
      systemctl daemon-reload
      systemctl enable --now m8trx-brain-heartbeat.timer

- Operator debug recipe: systemctl status m8trx-brain-heartbeat.timer, journalctl -u m8trx-brain-heartbeat --since "10 min ago", plus a one-shot manual run via systemctl start m8trx-brain-heartbeat.service.
- Required deps (curl, jq), installed by B.4 cloud-init alongside
  this stack.
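
Before enabling the timer, an operator might sanity-check the env file against the format described above. The following is a minimal sketch, not one of the four shipped artifacts: `check_brain_env` is a hypothetical helper, it assumes GNU `stat` is available on the host, and it checks only the mode and the two keys (ownership is left out so the helper can run unprivileged).

```shell
#!/bin/sh
# Sketch: verify an env file has mode 600 and both required keys.
# check_brain_env is a hypothetical name, not part of the heartbeat stack.

check_brain_env() {  # $1 = path, default /etc/m8trx/brain.env
    f=${1:-/etc/m8trx/brain.env}
    [ -r "$f" ] || { echo "missing or unreadable: $f" >&2; return 1; }

    # GNU coreutils stat; prints the octal permission bits, e.g. 600.
    mode=$(stat -c '%a' "$f")
    [ "$mode" = 600 ] || { echo "$f: mode $mode, expected 600" >&2; return 1; }

    # Each key must be present with a non-empty value.
    for key in BRAIN_URL BRAIN_API_KEY; do
        grep -Eq "^${key}=.+" "$f" || { echo "$f: $key not set" >&2; return 1; }
    done
}
```

Run as root on the host, `check_brain_env && systemctl enable --now m8trx-brain-heartbeat.timer` would refuse to enable the timer when the file is missing, too permissive, or incomplete.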
{
"event_id": "<uuid v4>",
"ts": "2026-05-03T21:42:11.000Z",
"event_type": "heartbeat",
"agent_id": "_host",
"payload": {
"hostname": "ip-10-0-1-42",
"uptime_sec": 1234567,
"kernel": "6.8.0-1052-aws",
"load1": 0.42,
"load5": 0.51,
"load15": 0.55,
"mem_total_mb": 16384,
"mem_avail_mb": 12000,
"disk_root_pct": 38
}
}
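
The event shape above could be assembled along these lines. This is a minimal sketch rather than the shipped script: the helper names (`meminfo_mb`, `uptime_sec`, `df_pct`, `build_event`) and the optional file-path arguments are assumptions introduced so the parsers can be run against fixture files, and the KiB-to-MiB divisor of 1024 is likewise assumed.

```shell
#!/bin/sh
# Sketch only. Helper names and the optional path arguments are hypothetical;
# they exist so the parsers can be exercised against fixture files.
set -eu

# One /proc/meminfo field, converted KiB -> MiB (assumes a divisor of 1024).
meminfo_mb() {  # $1 = field name (e.g. MemTotal), $2 = file, default /proc/meminfo
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024; found = 1 }
                      END { exit !found }' "${2:-/proc/meminfo}"
}

# First field of /proc/uptime with the fractional part dropped.
uptime_sec() {  # $1 = file, default /proc/uptime
    read -r up rest < "${1:-/proc/uptime}"
    printf '%s\n' "${up%%.*}"
}

# Capacity column of `df -P` output on stdin, without the trailing %.
df_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Emit one compact heartbeat event: strings via --arg, numbers via --argjson.
build_event() {  # hostname uptime kernel load1 load5 load15 mem_total mem_avail disk_pct
    jq -nc \
        --arg event_id "$(cat /proc/sys/kernel/random/uuid)" \
        --arg ts "$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")" \
        --arg hostname "$1" --argjson uptime_sec "$2" --arg kernel "$3" \
        --argjson load1 "$4" --argjson load5 "$5" --argjson load15 "$6" \
        --argjson mem_total_mb "$7" --argjson mem_avail_mb "$8" \
        --argjson disk_root_pct "$9" \
        '{event_id: $event_id, ts: $ts, event_type: "heartbeat", agent_id: "_host",
          payload: {hostname: $hostname, uptime_sec: $uptime_sec, kernel: $kernel,
                    load1: $load1, load5: $load5, load15: $load15,
                    mem_total_mb: $mem_total_mb, mem_avail_mb: $mem_avail_mb,
                    disk_root_pct: $disk_root_pct}}'
}

# Example wiring (commented out so sourcing this file has no side effects):
#   read -r l1 l5 l15 rest < /proc/loadavg
#   build_event "$(hostname)" "$(uptime_sec)" "$(uname -r)" "$l1" "$l5" "$l15" \
#       "$(meminfo_mb MemTotal)" "$(meminfo_mb MemAvailable)" "$(df -P / | df_pct)"
```

Passing the numeric fields through `--argjson` is what keeps them as JSON numbers in the output; with `--arg` they would arrive at the brain as strings.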
- event_type=heartbeat is in brain's VALID_TYPES […] (server/src/routes/events.js:9).
- agent_id="_host" is the convention for non-agent telemetry. […] agent_id != '_host' in the […] (server/src/routes/dashboard.js:23, 38, 151, 162), so heartbeats won't pollute agent-count or […]
- run_id is omitted: heartbeats have no concept of a run.
- customer_id is not in the payload: brain derives it from the bearer key.
- […] --argjson), not strings.
- […] agents table auto-upsert […] (server/src/routes/events.js:46–53) will create an agents row […] id="_host" per customer. That gives operators a per-customer […] /v1/dashboard/agents queries: harmless, […]

| Var | Set by | Required | Behaviour if missing |
|---|---|---|---|
| BRAIN_URL | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |
| BRAIN_API_KEY | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |
No optional env vars. Heartbeats need no AGENT_ID (uses fixed
_host), no RUN_ID, no BRAIN_DEBUG (systemd journal already
captures stderr — debug is operator-visible by default).
| Failure | Behaviour |
|---|---|
| BRAIN_URL or BRAIN_API_KEY unset | echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2; exit 1. systemd marks unit failed → visible in systemctl status and journalctl -u. |
| /proc/* read fails | set -e exits non-zero. (Should never happen on Linux.) |
| df -P / fails | set -e exits non-zero. |
| jq missing or fails | set -e exits non-zero. (B.4 cloud-init installs jq alongside.) |
| curl transport failure (Tailscale down, brain unreachable) | curl --fail --max-time 5 exits non-zero; script exits non-zero; systemd marks failed. Operator sees curl's stderr in journal. |
| Brain returns non-2xx (401 wrong key, 4xx bad event, 5xx) | --fail makes curl exit non-zero (exit code 22) and, with --show-error, report the HTTP status on stderr; same systemd failure path. |
| Script crashes (syntax error, etc.) | The shell exits non-zero; same systemd failure path. |
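
The first and last rows of the table can be illustrated with a short sketch of the fail-loud path. The function names `require_env` and `post_event` are hypothetical; the error message and the curl flags are the ones given in this spec.

```shell
#!/bin/sh
# Sketch of the fail-loud path; require_env and post_event are hypothetical names.
set -eu

# Refuse to run without both settings. The message goes to stderr, which
# systemd captures in the journal.
require_env() {
    if [ -z "${BRAIN_URL:-}" ] || [ -z "${BRAIN_API_KEY:-}" ]; then
        echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2
        return 1
    fi
}

# POST one event. --fail turns any HTTP status >= 400 into a non-zero exit,
# and --show-error keeps curl's error text on stderr despite --silent.
post_event() {  # $1 = event JSON
    curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null \
        -H "Authorization: Bearer ${BRAIN_API_KEY}" \
        -H "Content-Type: application/json" \
        --data "$1" \
        "${BRAIN_URL%/}/v1/events"
}

# Example wiring (commented out; under set -e either failure ends the script):
#   require_env
#   post_event "$event_json"
```

Because `set -e` is active, a non-zero return from either function ends the script with that status, which is what lets systemd mark the unit as failed.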
The heartbeat is fundamentally different from the brain-hook
(B.2) in its failure philosophy:
- […] BRAIN_DEBUG for […] systemctl status / journalctl. systemd journal is operator-only by design, so no BRAIN_DEBUG gate is needed.

No Restart= on the service unit: without one, a failed Type=oneshot run
is simply not retried, and the timer fires every 5 min regardless of the
previous run's exit. A transient failure self-heals on the next beat. A
persistent failure stays loud via systemctl status m8trx-brain-heartbeat.timer
which surfaces the last result.
A bin/test-brain-heartbeat.sh script in the brain repo, runnable
against the local brain server, covering:
- […] BRAIN_URL + freshly-minted KEY, […] event_type=heartbeat, agent_id=_host, all 9 payload […] disk_root_pct 0–100, […] mem_total_mb >= mem_avail_mb, load1/5/15 >= 0, […] uptime_sec > 0, kernel and hostname non-empty strings).
- […] BRAIN_URL, run, verify exit 1, […]
- […] BRAIN_API_KEY, verify exit non-zero […] curl --fail, journal-bound stderr present, no new row.
- […] BRAIN_URL=http://192.0.2.1:1, […] --max-time 5 + slack), […]

Bootstrap (same as the B.2 hook test): mint a fresh test key via
docker compose exec brain-api node bin/mint-key.js cust_m8trx_test "M8TRX Test" at the top of the runner.
Real-world test (cp files into /etc/systemd/system/, daemon-reload,
enable + start the timer, wait, check systemctl status and
journalctl) requires mutating this EC2's actual systemd state.
That's heavier than the hook tests' "no host mutation" model.
Instead, the README documents the smoke procedure for an operator
deploying to a real customer EC2 in B.4. The standalone script test
above gets ≥90% of the value with zero system-state side effects.
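
For the exit-code cases in the standalone test, a runner could lean on a small assertion helper like the one below. This is a sketch under assumptions: `expect_exit` and its calling convention are hypothetical and are not taken from bin/test-brain-heartbeat.sh.

```shell
#!/bin/sh
# Sketch of an assertion helper for a script-level test runner.
# expect_exit and its arguments are hypothetical, not the real runner's API.

failures=0

# expect_exit <label> <expected-status|nonzero> <command...>
expect_exit() {
    label=$1
    want=$2
    shift 2

    # Run the command quietly and capture its exit status.
    status=0
    "$@" >/dev/null 2>&1 || status=$?

    if [ "$want" = nonzero ]; then
        [ "$status" -ne 0 ] && ok=1 || ok=0
    else
        [ "$status" -eq "$want" ] && ok=1 || ok=0
    fi

    if [ "$ok" -eq 1 ]; then
        echo "ok   - $label (exit $status)"
    else
        echo "FAIL - $label (exit $status, wanted $want)"
        failures=$((failures + 1))
    fi
}

# Example (commented out): the missing-BRAIN_URL case, run in a clean environment.
#   expect_exit "missing BRAIN_URL" 1 \
#       env -i PATH="$PATH" BRAIN_API_KEY=k sh ./m8trx-brain-heartbeat.sh
```

A runner would finish with something like `[ "$failures" -eq 0 ]` so that its own exit status reflects the overall result.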
- […] agents row auto-upsert when […] agent_id="_host". Revisit only if the per-customer "_host" […]
- […] / for MVP).

None at design-approval time. All four clarifying questions were
resolved interactively before this spec was written.