Architecture Documentation Restructure Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Create a centralized Architecture doc, merge orchestration-pipelines.md into it, clean up app READMEs to remove duplication, and trim the root README to overview + links.

Architecture: Create docs/Architecture.md as the single source of truth for system architecture and entry points. App READMEs become app-specific deep dives. Root README becomes a lightweight hub.

Tech Stack: Markdown documentation only — no code changes.


Task 1: Create docs/Architecture.md

Files:

  • Create: docs/Architecture.md

Step 1: Write the Architecture doc

Create docs/Architecture.md with the following content. This is the complete file — copy it verbatim.

# Architecture

## Overview

Server-maintanence is a Django-based server monitoring and alerting system. It ingests alerts from external sources, runs health checks, generates AI-powered recommendations, and dispatches notifications — all coordinated through a strict 4-stage pipeline.

**Tech stack:** Django 5.2, Celery (async tasks), Redis (broker), psutil (system metrics), Jinja2 (notification templates).

## Pipeline Stages

The core pipeline processes events through four sequential stages, each owned by a dedicated Django app:

```
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ INGEST  │───▶│  CHECK  │───▶│ ANALYZE │───▶│ NOTIFY  │
│ alerts  │    │checkers │    │  intel  │    │ notify  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
```

| Stage | App | What it does | Input | Output |
|-------|-----|-------------|-------|--------|
| **INGEST** | `apps.alerts` | Parse webhook payloads, create Alert + Incident records | Raw JSON payload | `IngestResult` (incident, alerts) |
| **CHECK** | `apps.checkers` | Run system health checks (CPU, memory, disk, network, process) | Incident context | `CheckResult` (status, metrics) |
| **ANALYZE** | `apps.intelligence` | Generate AI recommendations via provider pattern (local/OpenAI) | Incident + check results | `AnalyzeResult` (recommendations) |
| **NOTIFY** | `apps.notify` | Dispatch notifications via driver pattern (email, Slack, PagerDuty) | Analysis results | `NotifyResult` (delivery status) |

The **orchestration app** (`apps.orchestration`) controls all stage transitions. Stages never call downstream stages directly.
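The orchestrator-owned sequencing can be pictured as a simple loop. This is a conceptual sketch only (the `run` helper and `executors` mapping are illustrative, not the project's API); the real executor classes live in `apps/orchestration/orchestrator.py`:

```python
# Conceptual sketch: the orchestrator drives every transition; stages
# only transform data and never invoke each other.
STAGES = ["INGEST", "CHECK", "ANALYZE", "NOTIFY"]

def run(executors: dict, payload):
    """Pass the payload through each stage executor in fixed order."""
    result = payload
    for stage in STAGES:
        result = executors[stage](result)  # stages never call downstream stages
    return result
```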

### Skip Controls

Any stage can be skipped via environment variables:

```bash
CHECKERS_SKIP_ALL=1            # Skip all health checks
CHECKERS_SKIP=cpu,memory       # Skip specific checkers
NOTIFY_SKIP_ALL=1              # Skip all notifications
```
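A hedged sketch of how a stage could interpret these variables — the `skipped_checkers` helper and the `"*"` sentinel are hypothetical; the real parsing lives in `apps.checkers` and may differ:

```python
import os

def skipped_checkers(env=None) -> set[str]:
    """Return the set of checker names to skip, per the env vars above."""
    env = os.environ if env is None else env
    if env.get("CHECKERS_SKIP_ALL") == "1":
        return {"*"}  # hypothetical sentinel meaning "skip everything"
    raw = env.get("CHECKERS_SKIP", "")
    return {name.strip() for name in raw.split(",") if name.strip()}
```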

## Entry Points

### Management Commands

| Command | App | Purpose |
|---------|-----|---------|
| `check_health [checkers...]` | checkers | Run health checks, display summary. Flags: `--list`, `--json`, `--fail-on-warning`, `--fail-on-critical` |
| `run_check <checker>` | checkers | Run a single checker with checker-specific options (`--samples`, `--per-cpu`, `--paths`, `--hosts`, `--names`) |
| `check_and_alert` | alerts | Run checks and create alerts from results. Flags: `--dry-run`, `--no-incidents`, `--include-skipped` |
| `get_recommendations` | intelligence | Get system recommendations. Flags: `--incident-id`, `--memory`, `--disk`, `--provider`, `--json`, `--list-providers` |
| `list_notify_drivers` | notify | List available notification drivers. Flag: `--verbose` |
| `test_notify [driver]` | notify | Test notification delivery. Flags: per-driver config (`--webhook-url`, `--smtp-host`, etc.) |
| `run_pipeline` | orchestration | Run pipeline end-to-end. Flags: `--sample`, `--payload`, `--dry-run`, `--definition`, `--checks-only` |
| `monitor_pipeline` | orchestration | View pipeline run history. Flags: `--limit`, `--status`, `--run-id` |

### HTTP Endpoints

**Alerts** (`/alerts/`)

| Method | Path | Description |
|--------|------|-------------|
| POST | `/alerts/webhook/` | Receive alert (auto-detect driver) |
| POST | `/alerts/webhook/<driver>/` | Receive alert (specific driver: alertmanager, grafana, pagerduty, datadog, newrelic, opsgenie, zabbix, generic) |

**Intelligence** (`/intelligence/`)

| Method | Path | Description |
|--------|------|-------------|
| GET | `/intelligence/health/` | Health check |
| GET | `/intelligence/providers/` | List available AI providers |
| POST | `/intelligence/recommendations/` | Get recommendations for an incident |
| POST | `/intelligence/memory/` | Memory-specific analysis |
| POST | `/intelligence/disk/` | Disk-specific analysis |

**Notify** (`/notify/`)

| Method | Path | Description |
|--------|------|-------------|
| POST | `/notify/send/` | Send notification (auto-detect driver) |
| POST | `/notify/send/<driver>/` | Send notification (specific driver) |
| POST | `/notify/batch/` | Batch send multiple notifications |
| GET | `/notify/drivers/` | List available drivers |
| GET | `/notify/drivers/<driver>/` | Driver detail and config requirements |

**Orchestration** (`/orchestration/`)

| Method | Path | Description |
|--------|------|-------------|
| POST | `/orchestration/pipeline/` | Trigger pipeline (async via Celery) |
| POST | `/orchestration/pipeline/sync/` | Trigger pipeline (sync, waits for completion) |
| GET | `/orchestration/pipelines/` | List pipeline runs |
| GET | `/orchestration/pipeline/<run_id>/` | Get pipeline run status |
| POST | `/orchestration/pipeline/<run_id>/resume/` | Resume a failed pipeline |
| GET | `/orchestration/definitions/` | List pipeline definitions |
| GET | `/orchestration/definitions/<name>/` | Get definition detail |
| POST | `/orchestration/definitions/<name>/validate/` | Validate a definition |
| POST | `/orchestration/definitions/<name>/execute/` | Execute a definition |

### Celery Tasks

**Alert processing chain** (`apps.alerts.tasks`):

```
orchestrate_event → alerts_ingest → run_diagnostics → analyze_incident → notify_channels
```

Each stage task (except `orchestrate_event`) has `max_retries=3`.

**Pipeline tasks** (`apps.orchestration.tasks`):

| Task | Purpose |
|------|---------|
| `run_pipeline_task` | Run pipeline asynchronously, return result |
| `resume_pipeline_task` | Resume a failed pipeline from last successful stage |
| `start_pipeline_task` | Queue pipeline for async execution, return immediately |

### Django Admin

All apps register their models at `/admin/`:

| Admin Path | Models |
|------------|--------|
| `/admin/alerts/` | Alert, Incident, AlertHistory |
| `/admin/checkers/` | CheckRun |
| `/admin/intelligence/` | AnalysisRun |
| `/admin/notify/` | NotificationChannel |
| `/admin/orchestration/` | PipelineRun, StageExecution, PipelineDefinition |

## Orchestration Systems

The project provides two pipeline execution systems:

### Hardcoded Pipeline

**Location:** `apps/orchestration/orchestrator.py`

Fixed 4-stage sequence: INGEST → CHECK → ANALYZE → NOTIFY. Each stage has a dedicated executor class.

- **Endpoints:** `POST /orchestration/pipeline/` (async) and `/pipeline/sync/` (sync)
- **Celery support:** Yes — async mode queues via Celery
- **Resume:** Yes — failed pipelines can be resumed from the last successful stage
- **Use when:** Standard alert processing, existing webhook integrations

### Definition-Based Pipeline

**Location:** `apps/orchestration/definition_orchestrator.py`

Dynamic stages configured via JSON stored in `PipelineDefinition` model. Supports any combination and ordering of node types.

- **Endpoints:** `POST /orchestration/definitions/<name>/execute/`
- **Celery support:** Not yet (sync only)
- **Resume:** Not yet

**Available node types:**

| Type | Handler | Purpose |
|------|---------|---------|
| `ingest` | IngestNodeHandler | Process incoming alerts, create incidents |
| `context` | ContextNodeHandler | Gather system metrics (CPU, memory, disk) |
| `intelligence` | IntelligenceNodeHandler | AI analysis (local or OpenAI) |
| `notify` | NotifyNodeHandler | Send notifications |
| `transform` | TransformNodeHandler | Transform data between nodes |

**Example definition (standalone health check):**

```json
{
  "version": "1.0",
  "nodes": [
    {"id": "metrics", "type": "context", "config": {"include": ["cpu", "memory", "disk"]}, "next": "analyze"},
    {"id": "analyze", "type": "intelligence", "config": {"provider": "local"}, "next": "notify"},
    {"id": "notify", "type": "notify", "config": {"driver": "slack"}}
  ]
}
```

Definitions are created via Django Admin at `/admin/orchestration/pipelinedefinition/`.
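The `next` links in a definition fully determine execution order. A small sketch that walks the example definition above (validation logic here is illustrative; the real validator in `apps.orchestration` may check more):

```python
import json

definition = json.loads("""
{
  "version": "1.0",
  "nodes": [
    {"id": "metrics", "type": "context", "config": {"include": ["cpu", "memory", "disk"]}, "next": "analyze"},
    {"id": "analyze", "type": "intelligence", "config": {"provider": "local"}, "next": "notify"},
    {"id": "notify", "type": "notify", "config": {"driver": "slack"}}
  ]
}
""")

# Follow the "next" links from the first node to recover execution order.
nodes = {n["id"]: n for n in definition["nodes"]}
order, current = [], definition["nodes"][0]["id"]
while current:
    order.append(current)
    current = nodes[current].get("next")
print(" -> ".join(order))  # metrics -> analyze -> notify
```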

### Comparison

| Feature | Hardcoded | Definition-based |
|---------|-----------|------------------|
| Configuration | Python code | JSON in database |
| Stages | Fixed 4 stages | Any combination of nodes |
| Deployment | Code deploy required | Admin UI |
| Retry logic | Built-in per stage | Built-in per node |
| Celery support | Yes (async mode) | Not yet (sync only) |
| Resume failed | Yes | Not yet |

## Data Models

### Core Models

```
Alert ──────┐
AlertHistory│──▶ Incident ──▶ PipelineRun ──▶ StageExecution
            │                      │
CheckRun ◀──┘                      │
AnalysisRun ◀──────────────────────┘
NotificationChannel (standalone config)
PipelineDefinition (standalone config)
```

| Model | App | Purpose |
|-------|-----|---------|
| `Alert` | alerts | Normalized alert record (fingerprint, status, severity, labels, raw payload) |
| `Incident` | alerts | Groups related alerts, tracks lifecycle (open → ack → resolved → closed) |
| `AlertHistory` | alerts | Audit trail of alert state transitions |
| `CheckRun` | checkers | Health check execution log (status, metrics, timing, trace_id) |
| `AnalysisRun` | intelligence | AI analysis execution log (provider, status, timing, recommendations) |
| `PipelineRun` | orchestration | Pipeline execution tracking (status, timing, correlation IDs) |
| `StageExecution` | orchestration | Per-stage execution within a pipeline (input/output snapshots) |
| `NotificationChannel` | notify | Persistent channel configuration (driver, config, enabled) |
| `PipelineDefinition` | orchestration | JSON pipeline definition for definition-based orchestration |

### State Machine

Pipeline runs progress through:

```
PENDING → INGESTED → CHECKED → ANALYZED → NOTIFIED (success)
                                    └──→ FAILED (terminal)
                                    └──→ RETRYING → (resume from last stage)
```
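The diagram can be expressed as an adjacency map. This sketch assumes (not confirmed by the source) that any non-terminal state can transition to `FAILED`; the authoritative state machine lives in `apps.orchestration`:

```python
# Transition map mirroring the diagram above; names match the states shown.
TRANSITIONS = {
    "PENDING": {"INGESTED", "FAILED"},
    "INGESTED": {"CHECKED", "FAILED"},
    "CHECKED": {"ANALYZED", "FAILED"},
    "ANALYZED": {"NOTIFIED", "FAILED", "RETRYING"},
    "RETRYING": {"CHECKED", "ANALYZED", "NOTIFIED"},  # resume from last stage
    "NOTIFIED": set(),  # terminal success
    "FAILED": set(),    # terminal failure
}

def can_transition(src: str, dst: str) -> bool:
    """True if the state machine permits moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())
```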

### Correlation IDs

Every pipeline run carries:
- `trace_id` — Correlation ID for tracing across all stages, logs, and DB records
- `run_id` — Unique ID for the specific pipeline run
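One hypothetical way to stamp `trace_id` onto every log line so stages can be correlated — the `TraceFilter` class here is illustrative; the project may wire this differently:

```python
import logging

class TraceFilter(logging.Filter):
    """Attach the pipeline run's trace_id to every log record."""
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True  # never drop records; only annotate them

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(trace_id)s] %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceFilter("a1b2c3"))
logger.warning("CHECK stage finished")  # log line carries [a1b2c3]
```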

## Configuration

### Key Environment Variables

| Variable | Purpose | Default |
|----------|---------|---------|
| `DJANGO_SECRET_KEY` | Django secret key | Required in production |
| `DJANGO_DEBUG` | Debug mode | `0` |
| `DJANGO_ALLOWED_HOSTS` | Comma-separated allowed hosts | `*` |
| `CELERY_BROKER_URL` | Redis broker URL | `redis://localhost:6379/0` |
| `CELERY_TASK_ALWAYS_EAGER` | Run tasks synchronously (dev) | `False` |
| `CHECKERS_SKIP_ALL` | Skip all health checks | `False` |
| `CHECKERS_SKIP` | Comma-separated checkers to skip | Empty |
| `NOTIFY_SKIP_ALL` | Skip all notifications | `False` |

### Settings

Django settings live in `config/settings.py`. Copy `.env.sample` to `.env` for local development.
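A hedged sketch of reading these variables with the defaults from the table — the `env_flag` helper is illustrative; `config/settings.py` is the authoritative implementation:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    """Interpret a boolean-ish environment variable, defaulting to off."""
    return os.environ.get(name, default).lower() in ("1", "true", "yes")

BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")
DEBUG = env_flag("DJANGO_DEBUG")
SKIP_ALL_CHECKS = env_flag("CHECKERS_SKIP_ALL")
```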

Step 2: Verify the file exists and is complete

```bash
# Check file exists and has content
wc -l docs/Architecture.md
```

Expected: roughly 250 lines.

Step 3: Commit

```bash
git add docs/Architecture.md
git commit -m "docs: add centralized Architecture doc with all entry points"
```

Task 2: Delete docs/orchestration-pipelines.md

Files:

  • Delete: docs/orchestration-pipelines.md

All content from this file has been merged into docs/Architecture.md (Orchestration Systems section).

Step 1: Delete the file

```bash
git rm docs/orchestration-pipelines.md
```

Step 2: Commit

```bash
git commit -m "docs: remove orchestration-pipelines.md (merged into Architecture.md)"
```

Task 3: Update root README.md

Files:

  • Modify: README.md

Remove the detailed pipeline usage sections (lines 46-224 — “Usage modes” through the end of the standalone monitor section). These are now covered by docs/Architecture.md. Replace with a short summary linking to Architecture.md.

Step 1: Edit README.md

Replace the “Usage modes” section (everything between the ## Install section and ## Environment configuration) with:

## Usage modes

This project supports two modes — see [Architecture](docs/Architecture.md) for full details:

1. **Pipeline controller**: Ingest alerts and route through intelligence + notify stages.
2. **Individual server monitor**: Run health checks locally and optionally generate alerts.

Quick examples:

```bash
# Pipeline mode (sync, with sample alert)
uv run python manage.py run_pipeline --sample

# Standalone health checks
uv run python manage.py check_health

# Run checks and generate alerts
uv run python manage.py check_and_alert
```

Also update the documentation map to include Architecture.md:

```markdown
## Documentation map

- Architecture: [`docs/Architecture.md`](docs/Architecture.md)
- Installation: [`docs/Installation.md`](docs/Installation.md)
- Security: [`docs/Security.md`](docs/Security.md)
- Health checks (checkers): [`apps/checkers/README.md`](apps/checkers/README.md)
- Alert ingestion: [`apps/alerts/README.md`](apps/alerts/README.md)
- Notifications: [`apps/notify/README.md`](apps/notify/README.md)
- Intelligence/recommendations: [`apps/intelligence/README.md`](apps/intelligence/README.md)
- Pipeline orchestration: [`apps/orchestration/README.md`](apps/orchestration/README.md)
- Shell scripts & CLI: [`bin/README.md`](bin/README.md)
- Working with repo AI agents / conventions: [`agents.md`](agents.md)
```

Remove the Templates link (it’s a stub) and add Architecture as the first entry.

Step 2: Verify links

```bash
# Check all linked files exist
ls docs/Architecture.md docs/Installation.md docs/Security.md apps/*/README.md agents.md bin/README.md
```

Step 3: Commit

```bash
git add README.md
git commit -m "docs: trim root README to overview + links to Architecture.md"
```

Task 4: Clean up apps/alerts/README.md

Files:

  • Modify: apps/alerts/README.md

What to remove: The “Note” link to root README development section (line 7) — it adds no value. No pipeline duplication to remove; alerts README is already app-focused.

What to verify stays: Webhook endpoints, 8 drivers, data model, Django Admin, services, creating alerts from checks.

Step 1: Replace line 7

Replace:

> **Note:** For development setup (formatting, linting, testing), see the main [README](../../README.md#development).

With:

> See [Architecture](../../docs/Architecture.md) for how this app fits in the pipeline.

Step 2: Commit

```bash
git add apps/alerts/README.md
git commit -m "docs: update alerts README cross-reference to Architecture.md"
```

Task 5: Clean up apps/checkers/README.md

Files:

  • Modify: apps/checkers/README.md

What to remove:

  • Lines 5-6: “Note” link to root README
  • Lines 7-20: “Orchestration Integration” section describing the pipeline and two modes — this is now in Architecture.md

What to replace with:

> See [Architecture](../../docs/Architecture.md) for how this app fits in the pipeline (CHECK stage).

Keep everything from “## What’s included” onward — that’s all app-specific.

Step 1: Edit the file

Replace lines 5-20 (from the Note through “Pipeline mode” section) with the single cross-reference line above.

Step 2: Commit

```bash
git add apps/checkers/README.md
git commit -m "docs: remove pipeline duplication from checkers README"
```

Task 6: Clean up apps/intelligence/README.md

Files:

  • Modify: apps/intelligence/README.md

What to remove:

  • Lines 5-6: “Note” link to root README
  • Lines 7-42: “Orchestration Integration” section (pipeline description, stage execution details, viewing analysis history) — all now in Architecture.md

What to replace with:

> See [Architecture](../../docs/Architecture.md) for how this app fits in the pipeline (ANALYZE stage).

Keep everything from “## Features” onward.

Step 1: Edit the file

Replace lines 5-42 with the single cross-reference line.

Step 2: Commit

```bash
git add apps/intelligence/README.md
git commit -m "docs: remove pipeline duplication from intelligence README"
```

Task 7: Clean up apps/notify/README.md

Files:

  • Modify: apps/notify/README.md

What to remove:

  • Lines 6-9: Two “Note” paragraphs (orchestration tracking note + dev setup link)

What to replace with:

> See [Architecture](../../docs/Architecture.md) for how this app fits in the pipeline (NOTIFY stage).

Keep everything from “## What’s included” onward.

Step 1: Edit the file

Replace lines 6-9 with the single cross-reference line.

Step 2: Commit

```bash
git add apps/notify/README.md
git commit -m "docs: update notify README cross-reference to Architecture.md"
```

Task 8: Clean up apps/orchestration/README.md

Files:

  • Modify: apps/orchestration/README.md

What to remove:

  • Lines 1-7: Pipeline diagram and intro that duplicates Architecture.md

What to replace with:

```markdown
# Orchestration App

> See [Architecture](../../docs/Architecture.md) for the full pipeline overview, entry points, and orchestration system comparison.

This app controls the lifecycle of pipeline runs through a strict state machine.
```

Keep everything from “## Key Concepts” onward — state machine, correlation IDs, stage contracts, monitoring signals, API endpoints, configuration, management commands, models.

Step 1: Edit the file

Replace lines 1-7 with the replacement above.

Step 2: Commit

```bash
git add apps/orchestration/README.md
git commit -m "docs: remove pipeline duplication from orchestration README"
```

Task 9: Update CLAUDE.md references

Files:

  • Modify: CLAUDE.md

Step 1: Update the “Key Documentation” section

In CLAUDE.md, find the “Key Documentation” section and update it:

Replace:

```markdown
## Key Documentation

- `agents.md` — AI agent roles and pipeline contracts (read this for any significant work)
- `apps/<app>/README.md` — App-specific documentation
- `apps/<app>/agents.md` — App-specific AI guidance
- `docs/orchestration-pipelines.md` — Pipeline architecture details
```

With:

```markdown
## Key Documentation

- `docs/Architecture.md` — System architecture, all entry points, pipeline stages, data models
- `agents.md` — AI agent roles and pipeline contracts (read this for any significant work)
- `apps/<app>/README.md` — App-specific documentation
- `apps/<app>/agents.md` — App-specific AI guidance
```

Step 2: Commit

```bash
git add CLAUDE.md
git commit -m "docs: update CLAUDE.md references to Architecture.md"
```

Task 10: Full verification

Step 1: Check all cross-references resolve

```bash
# All linked files exist
ls docs/Architecture.md docs/Installation.md docs/Security.md
ls apps/alerts/README.md apps/checkers/README.md apps/intelligence/README.md apps/notify/README.md apps/orchestration/README.md
ls agents.md bin/README.md
```

Step 2: Verify orchestration-pipelines.md is gone

```bash
ls docs/orchestration-pipelines.md 2>&1  # Should say "No such file"
```

Step 3: Grep for stale references

```bash
# No remaining references to the deleted file
grep -r "orchestration-pipelines" . --include="*.md" | grep -v docs/plans/
```

Expected: no output (plan files under `docs/plans/` are already excluded by the grep as historical).

Step 4: Run tests to verify nothing is broken

```bash
uv run pytest -v
uv run python manage.py check
```

Step 5: Lint

```bash
uv run black --check .
uv run ruff check .
```

Files Summary

| Action | File |
|--------|------|
| Create | `docs/Architecture.md` |
| Delete | `docs/orchestration-pipelines.md` |
| Edit | `README.md` — trim to overview + links |
| Edit | `apps/alerts/README.md` — update cross-reference |
| Edit | `apps/checkers/README.md` — remove pipeline duplication |
| Edit | `apps/intelligence/README.md` — remove pipeline duplication |
| Edit | `apps/notify/README.md` — update cross-reference |
| Edit | `apps/orchestration/README.md` — remove pipeline intro duplication |
| Edit | `CLAUDE.md` — update doc references |
