Architecture

Overview

Server-maintanence is a Django-based server monitoring and alerting system. It ingests alerts from external sources, runs health checks, generates AI-powered recommendations, and dispatches notifications — all coordinated through a strict 4-stage pipeline.

Tech stack: Django 5.2, Celery (async tasks), Redis (broker), psutil (system metrics), Jinja2 (notification templates).

Pipeline Stages

The core pipeline processes events through four sequential stages, each owned by a dedicated Django app:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ INGEST  │───▶│  CHECK  │───▶│ ANALYZE │───▶│ NOTIFY  │
│ alerts  │    │checkers │    │  intel  │    │ notify  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘

| Stage | App | What it does | Input | Output |
|-------|-----|--------------|-------|--------|
| INGEST | apps.alerts | Parse webhook payloads, create Alert + Incident records | Raw JSON payload | IngestResult (incident, alerts) |
| CHECK | apps.checkers | Run system health checks (CPU, memory, disk, network, process) | Incident context | CheckResult (status, metrics) |
| ANALYZE | apps.intelligence | Generate AI recommendations via provider pattern (local/OpenAI) | Incident + check results | AnalyzeResult (recommendations) |
| NOTIFY | apps.notify | Dispatch notifications via driver pattern (email, Slack, PagerDuty) | Analysis results | NotifyResult (delivery status) |

The orchestration app (apps.orchestration) controls all stage transitions. Stages never call downstream stages directly.
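
The rule that stages never call each other can be illustrated with a minimal sketch. This is not the project's orchestrator: MiniOrchestrator, the stage functions, and the shape of the context dictionary are all hypothetical, chosen only to show that the loop in the orchestrator, not the stage code, decides what runs next.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical stage signature: a stage reads the accumulated context and
# returns its own result. It never invokes a downstream stage itself.
Stage = Callable[[dict[str, Any]], dict[str, Any]]


@dataclass
class MiniOrchestrator:
    """Illustrative stand-in for apps.orchestration (not the real class)."""

    stages: list[tuple[str, Stage]]
    history: list[str] = field(default_factory=list)

    def run(self, payload: dict[str, Any]) -> dict[str, Any]:
        context: dict[str, Any] = {"payload": payload}
        for name, stage in self.stages:
            # The orchestrator owns the transition to the next stage.
            context[name] = stage(context)
            self.history.append(name)
        return context


def ingest(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"incident": ctx["payload"].get("title", "untitled")}


def check(ctx: dict[str, Any]) -> dict[str, Any]:
    # Hard-coded metrics; a real checker would measure the system.
    return {"status": "warning", "metrics": {"cpu_percent": 91.0}}


def analyze(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"recommendations": [f"Investigate {ctx['ingest']['incident']}"]}


def notify(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"delivered": bool(ctx["analyze"]["recommendations"])}


orchestrator = MiniOrchestrator(
    [("ingest", ingest), ("check", check), ("analyze", analyze), ("notify", notify)]
)
result = orchestrator.run({"title": "High CPU"})
```

Because each stage only sees the context, dropping or reordering stages is a change to the list passed to the orchestrator rather than to any stage.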

Use Cases

Not every deployment uses all four stages. The pipeline is composable — pick the stages you need:

Local server monitoring — You want to monitor CPU, memory, and disk on this machine and get notified when something is wrong. No external monitoring tools required. Health checks run on a cron schedule, generate alerts locally, and dispatch notifications.

Checkers -> Notify                      (local-monitor)
Checkers -> Intelligence -> Notify      (local-smart, adds AI analysis)

External alert processing — You already use Grafana, AlertManager, PagerDuty, or other monitoring tools. This system receives their webhooks, optionally enriches them with local health checks and AI analysis, and forwards notifications to your preferred channels.

Alert -> Notify                                         (direct)
Alert -> Checkers -> Notify                             (health-checked)
Alert -> Intelligence -> Notify                         (ai-analyzed)
Alert -> Checkers -> Intelligence -> Notify             (full pipeline)

Central alert hub — This server acts as an aggregation point for multiple monitored servers. It receives webhooks from various sources, runs AI analysis, and dispatches notifications. No local health checks needed.

Alert -> Intelligence -> Notify         (ai-analyzed)

See the Setup Guide for step-by-step walkthroughs and the setup_instance wizard.

Stage Configuration

Stage behavior is controlled through pipeline definitions and Django Admin — not environment variables:

  • Checkers: Pipeline definitions specify which checkers to run via checker_names in the context node config.
  • Intelligence: The IntelligenceProvider model (Django Admin) controls which AI provider is active.
  • Notify: The NotificationChannel model (Django Admin) controls which channels are active via is_active.

Entry Points

Management Commands

| Command | App | Purpose |
|---------|-----|---------|
| check_health [checkers...] | checkers | Run health checks, display summary. Flags: --list, --json, --fail-on-warning, --fail-on-critical |
| run_check <checker> | checkers | Run a single checker with checker-specific options (--samples, --per-cpu, --paths, --hosts, --names) |
| run_pipeline --checks-only | orchestration | Run checks through pipeline. Additional flags: --checkers, --no-incidents, --hostname, --label, --warning-threshold, --critical-threshold |
| get_recommendations | intelligence | Get system recommendations. Flags: --incident-id, --memory, --disk, --provider, --json, --list-providers |
| test_notify [driver] | notify | Test notification delivery. Flags: per-driver config (--webhook-url, --smtp-host, etc.) |
| run_pipeline | orchestration | Run pipeline end-to-end. Flags: --sample, --payload, --dry-run, --definition, --checks-only |
| monitor_pipeline | orchestration | View pipeline run history. Flags: --limit, --status, --run-id |

HTTP Endpoints

Alerts (/alerts/)

| Method | Path | Description |
|--------|------|-------------|
| POST | /alerts/webhook/ | Receive alert (auto-detect driver) |
| POST | /alerts/webhook/<driver>/ | Receive alert (specific driver: alertmanager, grafana, pagerduty, datadog, newrelic, opsgenie, zabbix, generic) |
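
As a rough illustration of calling these endpoints, the sketch below builds (but does not send) a POST request using only the standard library. The base URL, the JSON content type, and the payload fields are assumptions for the example; the schema each driver actually expects is not described here.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: a local development server


def build_webhook_request(payload: dict, driver: Optional[str] = None) -> Request:
    """Build a POST request for the alerts webhook without sending it."""
    # No driver -> auto-detect endpoint; otherwise the driver-specific one.
    path = "/alerts/webhook/" if driver is None else f"/alerts/webhook/{driver}/"
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; field names are placeholders, not a documented schema.
request = build_webhook_request(
    {"title": "Disk almost full", "severity": "warning"}, driver="generic"
)
# To send it against a running instance: urllib.request.urlopen(request)
```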

Intelligence (/intelligence/)

| Method | Path | Description |
|--------|------|-------------|
| GET | /intelligence/health/ | Health check |
| GET | /intelligence/providers/ | List available AI providers |
| POST | /intelligence/recommendations/ | Get recommendations for an incident |
| POST | /intelligence/memory/ | Memory-specific analysis |
| POST | /intelligence/disk/ | Disk-specific analysis |

Notify (/notify/)

| Method | Path | Description |
|--------|------|-------------|
| POST | /notify/send/ | Send notification (auto-detect driver) |
| POST | /notify/send/<driver>/ | Send notification (specific driver) |
| POST | /notify/batch/ | Batch send multiple notifications |
| GET | /notify/drivers/ | List available drivers |
| GET | /notify/drivers/<driver>/ | Driver detail and config requirements |

Orchestration (/orchestration/)

| Method | Path | Description |
|--------|------|-------------|
| POST | /orchestration/pipeline/ | Trigger pipeline (async via Celery) |
| POST | /orchestration/pipeline/sync/ | Trigger pipeline (sync, waits for completion) |
| GET | /orchestration/pipelines/ | List pipeline runs |
| GET | /orchestration/pipeline/<run_id>/ | Get pipeline run status |
| POST | /orchestration/pipeline/<run_id>/resume/ | Resume a failed pipeline |
| GET | /orchestration/definitions/ | List pipeline definitions |
| GET | /orchestration/definitions/<name>/ | Get definition detail |
| POST | /orchestration/definitions/<name>/validate/ | Validate a definition |
| POST | /orchestration/definitions/<name>/execute/ | Execute a definition |

Celery Tasks

Pipeline tasks (apps.orchestration.tasks):

| Task | Purpose |
|------|---------|
| run_pipeline_task | Run pipeline asynchronously, return result |
| resume_pipeline_task | Resume a failed pipeline from last successful stage |
| start_pipeline_task | Queue pipeline for async execution, return immediately |

Django Admin

All apps register their models at /admin/:

| Admin Path | Models |
|------------|--------|
| /admin/alerts/ | Alert, Incident, AlertHistory |
| /admin/checkers/ | CheckRun |
| /admin/intelligence/ | AnalysisRun |
| /admin/notify/ | NotificationChannel |
| /admin/orchestration/ | PipelineRun, StageExecution, PipelineDefinition |

Orchestration Systems

The project provides two pipeline execution systems:

Hardcoded Pipeline

Location: apps/orchestration/orchestrator.py

Fixed 4-stage sequence: INGEST → CHECK → ANALYZE → NOTIFY. Each stage has a dedicated executor class.

  • Endpoints: POST /orchestration/pipeline/ (async) and POST /orchestration/pipeline/sync/ (sync)
  • Celery support: Yes — async mode queues via Celery
  • Resume: Yes — failed pipelines can be resumed from the last successful stage
  • Use when: Standard alert processing, existing webhook integrations

Definition-Based Pipeline

Location: apps/orchestration/definition_orchestrator.py

Stages are dynamic, configured via JSON stored in the PipelineDefinition model. Any combination and ordering of node types is supported.

  • Endpoints: POST /orchestration/definitions/<name>/execute/
  • CLI: python manage.py run_pipeline --definition <name> or --config path/to/file.json
  • Celery support: Not yet (sync only)
  • Resume: Not yet

Available node types:

| Type | Handler | Purpose | Config Keys |
|------|---------|---------|-------------|
| ingest | IngestNodeHandler | Parse alert webhooks, create Incident + Alert records | driver (optional) |
| context | ContextNodeHandler | Run real system health checkers (CPU, memory, disk, etc.) | checker_names (list, optional; defaults to all enabled) |
| intelligence | IntelligenceNodeHandler | AI analysis via provider pattern (local or OpenAI) | provider (required), provider_config (optional) |
| notify | NotifyNodeHandler | Send notifications via DB-configured channels | drivers (list) or driver (string) |
| transform | TransformNodeHandler | Extract, filter, or map data between nodes | source_node (required), extract, mapping, filter_priority |

Node output chaining: Each node’s output is stored in NodeContext.previous_outputs[node_id] and is available to all downstream nodes. For example, the notify node reads checker results from an upstream context node’s output to build notification messages with the appropriate severity (critical/warning/info).
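
A minimal sketch of how a downstream node might derive severity from an upstream node's output follows. Only the previous_outputs attribute name comes from the description above; the NodeContext class body, the "checks" key, the status values, and the mapping of a healthy result to "info" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumed checker statuses, ordered from least to most severe.
SEVERITY_ORDER = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class NodeContext:
    # Attribute name mirrors the documentation; everything else is illustrative.
    previous_outputs: dict[str, dict[str, Any]] = field(default_factory=dict)


def notification_severity(context: NodeContext, source_node: str) -> str:
    """Pick the worst checker status from an upstream node's output."""
    checks = context.previous_outputs.get(source_node, {}).get("checks", {})
    statuses = [result["status"] for result in checks.values()]
    worst = max(statuses, key=lambda s: SEVERITY_ORDER.get(s, 0), default="ok")
    # Assumption: a healthy result is reported as an "info" notification.
    return "info" if worst == "ok" else worst


context = NodeContext()
# The (hypothetical) context node stores its output under its own node id.
context.previous_outputs["check_health"] = {
    "checks": {
        "cpu": {"status": "ok"},
        "memory": {"status": "warning"},
        "disk": {"status": "critical"},
    }
}
severity = notification_severity(context, "check_health")
```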

Example definition (local health check → notify):

{
  "version": "1.0",
  "nodes": [
    {"id": "check_health", "type": "context", "config": {"checker_names": ["cpu", "memory", "disk"]}, "next": "notify"},
    {"id": "notify", "type": "notify", "config": {"drivers": ["slack"]}}
  ]
}
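
Before storing a definition like the one above, it can help to check that every "next" pointer refers to a real node. The validator below is a hypothetical sketch, not the logic behind the project's validate endpoint; it only assumes the "nodes", "id", "type", and "next" keys and the five node types listed earlier.

```python
import json

# Node types taken from the table of available node types.
KNOWN_TYPES = {"ingest", "context", "intelligence", "notify", "transform"}


def validate_definition(definition: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: list[str] = []
    nodes = definition.get("nodes", [])
    ids = [node.get("id") for node in nodes]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node ids")
    for node in nodes:
        if node.get("type") not in KNOWN_TYPES:
            errors.append(f"{node.get('id')}: unknown type {node.get('type')!r}")
        target = node.get("next")
        if target is not None and target not in ids:
            errors.append(f"{node.get('id')}: next points to missing node {target!r}")
    return errors


definition = json.loads("""
{
  "version": "1.0",
  "nodes": [
    {"id": "check_health", "type": "context",
     "config": {"checker_names": ["cpu", "memory", "disk"]}, "next": "notify"},
    {"id": "notify", "type": "notify", "config": {"drivers": ["slack"]}}
  ]
}
""")
problems = validate_definition(definition)
```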

Definitions can be created via:

  • Django Admin: /admin/orchestration/pipelinedefinition/
  • Setup wizard: python manage.py setup_instance

Comparison

| Feature | Hardcoded | Definition-based |
|---------|-----------|------------------|
| Configuration | Python code | JSON in database |
| Stages | Fixed 4 stages | Any combination of nodes |
| Deployment | Code deploy required | Admin UI |
| Retry logic | Built-in per stage | Built-in per node |
| Celery support | Yes (async mode) | Not yet (sync only) |
| Resume failed | Yes | Not yet |

Data Models

Core Models

Alert ──────┐
AlertHistory│──▶ Incident ──▶ PipelineRun ──▶ StageExecution
            │                      │
CheckRun ◀──┘                      │
AnalysisRun ◀──────────────────────┘
NotificationChannel (standalone config)
PipelineDefinition (standalone config)

| Model | App | Purpose |
|-------|-----|---------|
| Alert | alerts | Normalized alert record (fingerprint, status, severity, labels, raw payload) |
| Incident | alerts | Groups related alerts, tracks lifecycle (open → ack → resolved → closed) |
| AlertHistory | alerts | Audit trail of alert state transitions |
| CheckRun | checkers | Health check execution log (status, metrics, timing, trace_id) |
| AnalysisRun | intelligence | AI analysis execution log (provider, status, timing, recommendations) |
| PipelineRun | orchestration | Pipeline execution tracking (status, timing, correlation IDs) |
| StageExecution | orchestration | Per-stage execution within a pipeline (input/output snapshots) |
| NotificationChannel | notify | Persistent channel configuration (driver, config, enabled) |
| PipelineDefinition | orchestration | JSON pipeline definition for definition-based orchestration |

State Machine

Pipeline runs progress through:

PENDING → INGESTED → CHECKED → ANALYZED → NOTIFIED (success)
                                    └──→ FAILED (terminal)
                                    └──→ RETRYING → (resume from last stage)
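
One way to read the diagram above is as a transition table. The sketch below encodes that reading. The status names come from the diagram, but the exact transition rules are an interpretation: it assumes every non-terminal stage status may move to FAILED or RETRYING, that RETRYING may resume into any later stage status, and that FAILED and NOTIFIED allow no further transitions. The project's real model may differ.

```python
from enum import Enum


class RunStatus(str, Enum):
    PENDING = "pending"
    INGESTED = "ingested"
    CHECKED = "checked"
    ANALYZED = "analyzed"
    NOTIFIED = "notified"
    FAILED = "failed"
    RETRYING = "retrying"


HAPPY_PATH = [
    RunStatus.PENDING,
    RunStatus.INGESTED,
    RunStatus.CHECKED,
    RunStatus.ANALYZED,
    RunStatus.NOTIFIED,
]

# Start with no outgoing transitions (NOTIFIED and FAILED stay that way).
ALLOWED: dict[RunStatus, set[RunStatus]] = {status: set() for status in RunStatus}
for current, following in zip(HAPPY_PATH, HAPPY_PATH[1:]):
    # Assumed rule: advance one step, fail, or enter a retry.
    ALLOWED[current] = {following, RunStatus.FAILED, RunStatus.RETRYING}
# Assumed rule: a retry resumes into a stage status or gives up.
ALLOWED[RunStatus.RETRYING] = set(HAPPY_PATH[1:]) | {RunStatus.FAILED}


def can_transition(current: RunStatus, target: RunStatus) -> bool:
    return target in ALLOWED[current]
```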

Correlation IDs

Every pipeline run carries:

  • trace_id — Correlation ID for tracing across all stages, logs, and DB records
  • run_id — Unique ID for the specific pipeline run
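
The two identifiers can be sketched as follows. This is an illustration, not the project's implementation: the use of UUIDs, the helper name new_correlation_ids, the CorrelationAdapter class, and the idea that a retried run keeps its trace_id while receiving a new run_id are all assumptions.

```python
import logging
import uuid
from typing import Optional


def new_correlation_ids(trace_id: Optional[str] = None) -> dict:
    """Create IDs for a run; pass trace_id to continue an existing trace."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,
        "run_id": uuid.uuid4().hex,
    }


class CorrelationAdapter(logging.LoggerAdapter):
    """Prefix every log message with the correlation IDs."""

    def process(self, msg, kwargs):
        prefix = f"[trace={self.extra['trace_id']} run={self.extra['run_id']}]"
        return f"{prefix} {msg}", kwargs


ids = new_correlation_ids()
log = CorrelationAdapter(logging.getLogger("pipeline"), ids)
log.info("stage started")  # emitted only if logging is configured for INFO
```

Searching logs for a single trace_id would then return every stage and every run that belongs to the same originating event.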

Configuration

Key Environment Variables

Environment variables configure infrastructure only. Application behavior (which checkers to run, intelligence provider, notification channels) is managed through Django Admin and pipeline definitions.

| Variable | Purpose | Default |
|----------|---------|---------|
| DJANGO_SECRET_KEY | Django secret key | Required in production |
| DJANGO_DEBUG | Debug mode | 0 |
| DJANGO_ALLOWED_HOSTS | Comma-separated allowed hosts | * |
| CELERY_BROKER_URL | Redis broker URL | redis://localhost:6379/0 |
| CELERY_TASK_ALWAYS_EAGER | Run tasks synchronously (dev) | False |
| ORCHESTRATION_MAX_RETRIES_PER_STAGE | Retries before pipeline failure | 3 |
| ORCHESTRATION_BACKOFF_FACTOR | Exponential backoff multiplier | 2.0 |
| ORCHESTRATION_INTELLIGENCE_FALLBACK_ENABLED | Continue pipeline when AI fails | 1 |
| ORCHESTRATION_METRICS_BACKEND | Metrics backend (logging or statsd) | logging |
| STATSD_HOST | StatsD server host | localhost |
| STATSD_PORT | StatsD server port | 8125 |
| STATSD_PREFIX | StatsD metric prefix | pipeline |
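
To make the retry settings concrete, the sketch below reads the two retry variables with the defaults from the table and computes a delay schedule. The variable names and defaults come from the table; the one-second base delay and the formula (base delay multiplied by the factor raised to the attempt number) are a common convention assumed here, not confirmed behavior of the project.

```python
import os
from typing import Mapping, Optional


def load_retry_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read retry settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "max_retries": int(env.get("ORCHESTRATION_MAX_RETRIES_PER_STAGE", "3")),
        "backoff_factor": float(env.get("ORCHESTRATION_BACKOFF_FACTOR", "2.0")),
    }


def backoff_delays(
    max_retries: int, backoff_factor: float, base_delay: float = 1.0
) -> list:
    """Delay in seconds before each retry (assumed formula, see lead-in)."""
    return [base_delay * backoff_factor**attempt for attempt in range(max_retries)]


# An empty mapping stands in for an environment with nothing set.
settings = load_retry_settings({})
delays = backoff_delays(settings["max_retries"], settings["backoff_factor"])
```

Under these assumptions the defaults give three retries per stage, with each wait twice as long as the previous one.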

Settings

Django settings live in config/settings.py. Copy .env.sample to .env for local development.

