Cluster Alert Driver — Design

Date: 2026-03-29

Problem

The app runs as a single instance. There’s no way to deploy it across multiple servers where each monitors locally and reports to a central hub. All 8 existing alert drivers are for external tools (Grafana, AlertManager, etc.) — none speak the system’s own protocol.

Goal

Create a 9th alert driver (“cluster”) that enables multi-instance deployment. Agent instances run checkers locally, create alerts, and push them to a hub instance via the existing webhook endpoint. The same codebase runs in any role — standalone, agent, hub, or relay — determined purely by environment variables.

Configuration

Two env vars control behavior (both optional, both unset by default):

Env var	Default	Purpose
`HUB_URL`	(unset)	If set, `push_to_hub` sends alerts to this URL
`CLUSTER_ENABLED`	`0`	If `1`, cluster driver is registered, hub accepts agent payloads
`WEBHOOK_SECRET_CLUSTER`	(unset)	HMAC-SHA256 shared secret for signing
`INSTANCE_ID`	`socket.gethostname()`	Human-readable instance identifier

Behavior matrix:

HUB_URL	CLUSTER_ENABLED	Role
unset	0	Standalone — current behavior, zero changes
set	0	Agent — runs checkers, pushes to hub
unset	1	Hub — accepts cluster payloads from agents
set	1	Hub + Agent — accepts payloads AND runs `push_to_hub` (each role is independent)

Existing installs with neither var set behave exactly as today. Zero breaking changes.

Note: Transparent relay forwarding (automatically re-POSTing received cluster payloads upstream) is not implemented in this version. Each instance is configured independently via its own HUB_URL.

Cluster Alert Driver

New ClusterDriver at apps/alerts/drivers/cluster.py, registered as "cluster" in the driver registry (conditionally, when CLUSTER_ENABLED=1).

Payload format (what agents POST to HUB_URL/alerts/webhook/cluster/):

{
    "source": "cluster",
    "instance_id": "web-server-03",
    "hostname": "ip-10-0-1-42",
    "version": "1.0",
    "alerts": [
        {
            "fingerprint": "cpu-check-ip-10-0-1-42",
            "name": "CPU usage critical",
            "status": "firing",
            "severity": "critical",
            "started_at": "2026-03-29T12:00:00Z",
            "labels": {
                "checker": "cpu",
                "hostname": "ip-10-0-1-42",
                "instance_id": "web-server-03"
            },
            "annotations": {
                "message": "CPU at 95.2%"
            },
            "metrics": {
                "cpu_percent": 95.2
            }
        }
    ]
}

Driver behavior:

validate() — checks for source == "cluster" and instance_id field
parse() — maps payload to ParsedPayload (same normalized structure as every driver)
signature_header = "X-Cluster-Signature" — HMAC-SHA256 using WEBHOOK_SECRET_CLUSTER
Agent hostname and instance_id in labels on every alert — hub identifies which server
metrics preserved in annotations for intelligence/analysis

Hub treats these alerts identically to Grafana/AlertManager — same deduplication, incident grouping, pipeline flow.

`push_to_hub` Management Command

New command at apps/alerts/management/commands/push_to_hub.py.

Flow:

Run all enabled checkers locally (reuses existing checker infrastructure)
Format results as cluster driver payload
Sign with WEBHOOK_SECRET_CLUSTER (HMAC-SHA256)
POST to HUB_URL/alerts/webhook/cluster/
Report success/failure

Audit trail: Local check results are not persisted as Alert records. The hub creates incident and alert records when it processes the incoming cluster payload via the normal pipeline.

Flags:

--dry-run — run checkers, show payload, don’t POST
--json — JSON output (for cron logging)
--checkers cpu,memory — specific checkers only

Error handling:

Hub unreachable → log error, exit 1, local alerts preserved
Hub returns 4xx/5xx → log response, exit 1
No HUB_URL → error message, exit 1

Cron integration: Users schedule push_to_hub instead of (or alongside) run_pipeline --checks-only. No changes to setup_cron.sh needed.

File Changes

New files:

apps/alerts/drivers/cluster.py — ClusterDriver
apps/alerts/management/commands/push_to_hub.py — management command
apps/alerts/_tests/drivers/test_cluster.py — driver unit tests
apps/alerts/_tests/commands/test_push_to_hub.py — command tests

Modified files:

apps/alerts/drivers/__init__.py — register "cluster" conditionally on CLUSTER_ENABLED
.env.sample — add cluster section (commented out)
docs/Deployment.md — add multi-instance section
config/settings.py — read CLUSTER_ENABLED, HUB_URL, INSTANCE_ID

Unaffected: All existing drivers, install.sh, bin/ scripts, existing tests.

Approach

Reuse existing webhook infrastructure — the cluster driver is “just another driver.” No new endpoints, no new middleware, no new auth. The push_to_hub command is the agent-side glue that ties checkers to the webhook POST.

Rejected alternatives:

Dedicated /alerts/cluster/ endpoint — unnecessary, existing webhook handles it
Shared Celery broker — couples instances at infrastructure level, harder to deploy
Peer-to-peer mesh — over-engineered for the use case