Architecture
Overview
Server-maintanence is a Django-based server monitoring and alerting system. It ingests alerts from external sources, runs health checks, generates AI-powered recommendations, and dispatches notifications — all coordinated through a strict 4-stage pipeline.
Tech stack: Django 5.2, Celery (async tasks), Redis (broker), psutil (system metrics), Jinja2 (notification templates).
Pipeline Stages
The core pipeline processes events through four sequential stages, each owned by a dedicated Django app:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ INGEST │───▶│ CHECK │───▶│ ANALYZE │───▶│ NOTIFY │
│ alerts │ │checkers │ │ intel │ │ notify │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
| Stage | App | What it does | Input | Output |
|---|---|---|---|---|
| INGEST | apps.alerts | Parse webhook payloads, create Alert + Incident records | Raw JSON payload | IngestResult (incident, alerts) |
| CHECK | apps.checkers | Run system health checks (CPU, memory, disk, network, process) | Incident context | CheckResult (status, metrics) |
| ANALYZE | apps.intelligence | Generate AI recommendations via provider pattern (local/OpenAI) | Incident + check results | AnalyzeResult (recommendations) |
| NOTIFY | apps.notify | Dispatch notifications via driver pattern (email, Slack, PagerDuty) | Analysis results | NotifyResult (delivery status) |
The orchestration app (apps.orchestration) controls all stage transitions. Stages never call downstream stages directly.
Use Cases
Not every deployment uses all four stages. The pipeline is composable — pick the stages you need:
Local server monitoring — You want to monitor CPU, memory, and disk on this machine and get notified when something is wrong. No external monitoring tools required. Health checks run on a cron schedule, generate alerts locally, and dispatch notifications.
Checkers -> Notify (local-monitor)
Checkers -> Intelligence -> Notify (local-smart, adds AI analysis)
External alert processing — You already use Grafana, AlertManager, PagerDuty, or other monitoring tools. This system receives their webhooks, optionally enriches them with local health checks and AI analysis, and forwards notifications to your preferred channels.
Alert -> Notify (direct)
Alert -> Checkers -> Notify (health-checked)
Alert -> Intelligence -> Notify (ai-analyzed)
Alert -> Checkers -> Intelligence -> Notify (full pipeline)
Central alert hub — This server acts as an aggregation point for multiple monitored servers. It receives webhooks from various sources, runs AI analysis, and dispatches notifications. No local health checks needed.
Alert -> Intelligence -> Notify (ai-analyzed)
See the Setup Guide for step-by-step walkthroughs and the setup_instance wizard.
Stage Configuration
Stage behavior is controlled through pipeline definitions and Django Admin — not environment variables:
- Checkers: Pipeline definitions specify which checkers to run via
checker_namesin the context node config. - Intelligence: The
IntelligenceProvidermodel (Django Admin) controls which AI provider is active. - Notify: The
NotificationChannelmodel (Django Admin) controls which channels are active viais_active.
Entry Points
Management Commands
| Command | App | Purpose |
|---|---|---|
check_health [checkers...] | checkers | Run health checks, display summary. Flags: --list, --json, --fail-on-warning, --fail-on-critical |
run_check <checker> | checkers | Run a single checker with checker-specific options (--samples, --per-cpu, --paths, --hosts, --names) |
run_pipeline --checks-only | orchestration | Run checks through pipeline. Additional flags: --checkers, --no-incidents, --hostname, --label, --warning-threshold, --critical-threshold |
get_recommendations | intelligence | Get system recommendations. Flags: --incident-id, --memory, --disk, --provider, --json, --list-providers |
test_notify [driver] | notify | Test notification delivery. Flags: per-driver config (--webhook-url, --smtp-host, etc.) |
run_pipeline | orchestration | Run pipeline end-to-end. Flags: --sample, --payload, --dry-run, --definition, --checks-only |
monitor_pipeline | orchestration | View pipeline run history. Flags: --limit, --status, --run-id |
HTTP Endpoints
Alerts (/alerts/)
| Method | Path | Description |
|---|---|---|
| POST | /alerts/webhook/ | Receive alert (auto-detect driver) |
| POST | /alerts/webhook/<driver>/ | Receive alert (specific driver: alertmanager, grafana, pagerduty, datadog, newrelic, opsgenie, zabbix, generic) |
Intelligence (/intelligence/)
| Method | Path | Description |
|---|---|---|
| GET | /intelligence/health/ | Health check |
| GET | /intelligence/providers/ | List available AI providers |
| POST | /intelligence/recommendations/ | Get recommendations for an incident |
| POST | /intelligence/memory/ | Memory-specific analysis |
| POST | /intelligence/disk/ | Disk-specific analysis |
Notify (/notify/)
| Method | Path | Description |
|---|---|---|
| POST | /notify/send/ | Send notification (auto-detect driver) |
| POST | /notify/send/<driver>/ | Send notification (specific driver) |
| POST | /notify/batch/ | Batch send multiple notifications |
| GET | /notify/drivers/ | List available drivers |
| GET | /notify/drivers/<driver>/ | Driver detail and config requirements |
Orchestration (/orchestration/)
| Method | Path | Description |
|---|---|---|
| POST | /orchestration/pipeline/ | Trigger pipeline (async via Celery) |
| POST | /orchestration/pipeline/sync/ | Trigger pipeline (sync, waits for completion) |
| GET | /orchestration/pipelines/ | List pipeline runs |
| GET | /orchestration/pipeline/<run_id>/ | Get pipeline run status |
| POST | /orchestration/pipeline/<run_id>/resume/ | Resume a failed pipeline |
| GET | /orchestration/definitions/ | List pipeline definitions |
| GET | /orchestration/definitions/<name>/ | Get definition detail |
| POST | /orchestration/definitions/<name>/validate/ | Validate a definition |
| POST | /orchestration/definitions/<name>/execute/ | Execute a definition |
Celery Tasks
Pipeline tasks (apps.orchestration.tasks):
| Task | Purpose |
|---|---|
run_pipeline_task | Run pipeline asynchronously, return result |
resume_pipeline_task | Resume a failed pipeline from last successful stage |
start_pipeline_task | Queue pipeline for async execution, return immediately |
Django Admin
All apps register their models at /admin/:
| Admin Path | Models |
|---|---|
/admin/alerts/ | Alert, Incident, AlertHistory |
/admin/checkers/ | CheckRun |
/admin/intelligence/ | AnalysisRun |
/admin/notify/ | NotificationChannel |
/admin/orchestration/ | PipelineRun, StageExecution, PipelineDefinition |
Orchestration Systems
The project provides two pipeline execution systems:
Hardcoded Pipeline
Location: apps/orchestration/orchestrator.py
Fixed 4-stage sequence: INGEST → CHECK → ANALYZE → NOTIFY. Each stage has a dedicated executor class.
- Endpoints:
POST /orchestration/pipeline/(async) and/pipeline/sync/(sync) - Celery support: Yes — async mode queues via Celery
- Resume: Yes — failed pipelines can be resumed from the last successful stage
- Use when: Standard alert processing, existing webhook integrations
Definition-Based Pipeline
Location: apps/orchestration/definition_orchestrator.py
Dynamic stages configured via JSON stored in PipelineDefinition model. Supports any combination and ordering of node types.
- Endpoints:
POST /orchestration/definitions/<name>/execute/ - CLI:
python manage.py run_pipeline --definition <name>or--config path/to/file.json - Celery support: Not yet (sync only)
- Resume: Not yet
Available node types:
| Type | Handler | Purpose | Config Keys |
|---|---|---|---|
ingest | IngestNodeHandler | Parse alert webhooks, create Incident + Alert records | driver (optional) |
context | ContextNodeHandler | Run real system health checkers (CPU, memory, disk, etc.) | checker_names (list, optional — defaults to all enabled) |
intelligence | IntelligenceNodeHandler | AI analysis via provider pattern (local or OpenAI) | provider (required), provider_config (optional) |
notify | NotifyNodeHandler | Send notifications via DB-configured channels | drivers (list) or driver (string) |
transform | TransformNodeHandler | Extract, filter, or map data between nodes | source_node (required), extract, mapping, filter_priority |
Node output chaining: Each node’s output is stored in NodeContext.previous_outputs[node_id] and available to all downstream nodes. For example, the notify node reads checker results from previous context node output to build notification messages with appropriate severity (critical/warning/info).
Example definition (local health check → notify):
{
"version": "1.0",
"nodes": [
{"id": "check_health", "type": "context", "config": {"checker_names": ["cpu", "memory", "disk"]}, "next": "notify"},
{"id": "notify", "type": "notify", "config": {"drivers": ["slack"]}}
]
}
Definitions can be created via:
- Django Admin:
/admin/orchestration/pipelinedefinition/ - Setup wizard:
python manage.py setup_instance
Comparison
| Feature | Hardcoded | Definition-based |
|---|---|---|
| Configuration | Python code | JSON in database |
| Stages | Fixed 4 stages | Any combination of nodes |
| Deployment | Code deploy required | Admin UI |
| Retry logic | Built-in per stage | Built-in per node |
| Celery support | Yes (async mode) | Not yet (sync only) |
| Resume failed | Yes | Not yet |
Data Models
Core Models
Alert ──────┐
AlertHistory│──▶ Incident ──▶ PipelineRun ──▶ StageExecution
│ │
CheckRun ◀──┘ │
AnalysisRun ◀──────────────────────┘
NotificationChannel (standalone config)
PipelineDefinition (standalone config)
| Model | App | Purpose |
|---|---|---|
Alert | alerts | Normalized alert record (fingerprint, status, severity, labels, raw payload) |
Incident | alerts | Groups related alerts, tracks lifecycle (open → ack → resolved → closed) |
AlertHistory | alerts | Audit trail of alert state transitions |
CheckRun | checkers | Health check execution log (status, metrics, timing, trace_id) |
AnalysisRun | intelligence | AI analysis execution log (provider, status, timing, recommendations) |
PipelineRun | orchestration | Pipeline execution tracking (status, timing, correlation IDs) |
StageExecution | orchestration | Per-stage execution within a pipeline (input/output snapshots) |
NotificationChannel | notify | Persistent channel configuration (driver, config, enabled) |
PipelineDefinition | orchestration | JSON pipeline definition for definition-based orchestration |
State Machine
Pipeline runs progress through:
PENDING → INGESTED → CHECKED → ANALYZED → NOTIFIED (success)
└──→ FAILED (terminal)
└──→ RETRYING → (resume from last stage)
Correlation IDs
Every pipeline run carries:
trace_id— Correlation ID for tracing across all stages, logs, and DB recordsrun_id— Unique ID for the specific pipeline run
Configuration
Key Environment Variables
Environment variables configure infrastructure only. Application behavior (which checkers to run, intelligence provider, notification channels) is managed through Django Admin and pipeline definitions.
| Variable | Purpose | Default |
|---|---|---|
DJANGO_SECRET_KEY | Django secret key | Required in production |
DJANGO_DEBUG | Debug mode | 0 |
DJANGO_ALLOWED_HOSTS | Comma-separated allowed hosts | * |
CELERY_BROKER_URL | Redis broker URL | redis://localhost:6379/0 |
CELERY_TASK_ALWAYS_EAGER | Run tasks synchronously (dev) | False |
ORCHESTRATION_MAX_RETRIES_PER_STAGE | Retries before pipeline failure | 3 |
ORCHESTRATION_BACKOFF_FACTOR | Exponential backoff multiplier | 2.0 |
ORCHESTRATION_INTELLIGENCE_FALLBACK_ENABLED | Continue pipeline when AI fails | 1 |
ORCHESTRATION_METRICS_BACKEND | Metrics backend (logging or statsd) | logging |
STATSD_HOST | StatsD server host | localhost |
STATSD_PORT | StatsD server port | 8125 |
STATSD_PREFIX | StatsD metric prefix | pipeline |
Settings
Django settings live in config/settings.py. Copy .env.sample to .env for local development.