Monitoring

FreeSDN exposes structured observability through three channels: health/readiness endpoints for container orchestration, a Prometheus /metrics endpoint for time-series metrics, and Flower for Celery task visibility.

Health endpoints

Four endpoints with distinct semantics - do not use them interchangeably.

Endpoint	Purpose	200 when	503 when
`GET /api/v1/health/live`	Liveness probe (is the process alive?)	Always, as long as the HTTP layer responds	Never - use readiness for traffic gating
`GET /api/v1/health/ready`	Readiness probe (can the app serve traffic?)	DB reachable AND `modules`/`event_bus` not degraded	DB unreachable OR critical subsystem degraded
`GET /api/v1/health/`	Public status snapshot (unauthenticated)	Always 200; read the `status` field in the body	Never 503; payload shows per-component `{status}` only - no latencies, versions, or uptime (FSDN-SEC-008)
`GET /api/v1/health/detail`	Full health snapshot (authenticated)	Always 200 when authenticated	Never 503; payload includes per-component `latency_ms`, app `version`, `uptime_seconds`, and `platform` versions. Requires `settings:read` permission.

# Quick smoke-test from the host
curl -fsS http://localhost:8000/api/v1/health/live
curl -fsS http://localhost:8000/api/v1/health/ready
curl -fsS http://localhost:8000/api/v1/health/ | jq '.status, .components'

# Full health detail - requires a valid session token with settings:read
curl -fsS http://localhost:8000/api/v1/health/detail \
  -H "Authorization: Bearer <your-token>" | jq '.status, .uptime_seconds, .platform, .components'

The /health/ endpoint runs three checks concurrently: database (Postgres SELECT 1), redis (Valkey PING), and celery (TTL heartbeat key refreshed by the beat scheduler every 30 s). After those complete, any subsystems that failed during startup (modules, event_bus, etc.) are appended to the component map in a sequential pass - they are not part of the concurrent gather. The celery component being degraded does not gate readiness - the API can serve read traffic without the worker.
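The two-phase behaviour described above (a concurrent gather followed by a sequential append) can be sketched roughly as follows. This is a minimal illustration, not the actual FreeSDN code: the function name `run_health_checks`, the shape of its arguments, and the `"unreachable"` / `"degraded"` status strings used for failures are assumptions made for the example.

```python
import asyncio


async def run_health_checks(checks, startup_failures):
    """Run probe coroutines concurrently, then append startup failures.

    `checks` maps a component name to a zero-argument coroutine function that
    returns a status string. `startup_failures` lists subsystems that failed
    during startup. Both shapes are assumptions made for this sketch.
    """
    names = list(checks)
    # Concurrent phase: all probes start together; an exception raised by one
    # probe is captured as a result instead of cancelling the others.
    results = await asyncio.gather(
        *(checks[name]() for name in names), return_exceptions=True
    )
    components = {}
    for name, result in zip(names, results):
        status = "unreachable" if isinstance(result, Exception) else result
        components[name] = {"status": status}

    # Sequential phase: subsystems that failed at startup are appended after
    # the gather completes; they are not probed again here.
    for name in startup_failures:
        components[name] = {"status": "degraded"}
    return components


async def _ok():
    return "ok"


async def _boom():
    raise ConnectionError("simulated outage")


demo = asyncio.run(
    run_health_checks(
        {"database": _ok, "redis": _ok, "celery": _boom},
        startup_failures=["modules"],
    )
)
print(demo)
```

Using `return_exceptions=True` keeps one failing probe from hiding the results of the others, which matches the per-component reporting the endpoint provides.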

When /ready returns 503, the body has the following structure:

{
  "status": "not_ready",
  "failed": ["database"],
  "degraded_subsystems": ["modules"],
  "checks": {
    "database": "unreachable",
    "redis": "ok",
    "logdb": "ok"
  }
}

failed: ["database"] - the primary DB is unreachable; check the postgres container and connection pool. checks.database will be "unreachable".
degraded_subsystems: ["modules"] - a module failed to load at startup; check API logs.
degraded_subsystems: ["event_bus"] - the pub/sub failed to initialize; restart the API.
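For scripted triage, the 503 body can be turned into the hints listed above. The sketch below is illustrative only: `triage_not_ready` and the two hint tables are hypothetical names, and the hint text simply paraphrases the guidance in this section.

```python
import json

# Hints paraphrase the guidance in this section; the mapping is illustrative.
FAILED_HINTS = {
    "database": "primary DB unreachable: check the postgres container and connection pool",
}
DEGRADED_HINTS = {
    "modules": "a module failed to load at startup: check API logs",
    "event_bus": "the pub/sub failed to initialize: restart the API",
}


def triage_not_ready(body_text):
    """Return a list of human-readable hints for a /ready 503 body."""
    body = json.loads(body_text)
    hints = []
    for name in body.get("failed", []):
        hints.append(FAILED_HINTS.get(name, f"{name}: failed check, no hint recorded"))
    for name in body.get("degraded_subsystems", []):
        hints.append(DEGRADED_HINTS.get(name, f"{name}: degraded, no hint recorded"))
    return hints


sample = """
{
  "status": "not_ready",
  "failed": ["database"],
  "degraded_subsystems": ["modules"],
  "checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"}
}
"""
for hint in triage_not_ready(sample):
    print(hint)
```

Unknown component names fall through to a generic message rather than raising, so the helper keeps working if new checks are added.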

Prometheus metrics (`/metrics`)

The API exposes a Prometheus-compatible /metrics endpoint (no /api/v1/ prefix - it is on the internal API port, not published through the edge by default).

Fail-closed gate: `METRICS_AUTH_TOKEN`

Outside development, the /metrics route is registered only when METRICS_AUTH_TOKEN is set. Without the token, the route is never registered at all, so a scrape request receives 404, not an auth error. This is intentional fail-closed behaviour (CAN-019): metrics can leak internal topology information, so unauthenticated exposure is never available outside local/development. When the token is configured, requests that omit the Authorization: Bearer header or send a wrong token receive 401.

Outer on/off switch: `ENABLE_METRICS`

Metrics collection is also controlled by ENABLE_METRICS (default true). Setting ENABLE_METRICS=false disables all instrumentation middleware and skips route registration entirely, regardless of METRICS_AUTH_TOKEN - useful on memory-constrained hosts where the Prometheus client overhead is unwanted. The two settings interact as follows:

`ENABLE_METRICS`	`METRICS_AUTH_TOKEN`	Result
`true`	set	Authenticated `/metrics` endpoint - recommended for production
`true`	not set, non-development env	Route not registered (fail-closed, CAN-019) - 404 on scrape
`true`	not set, `ENVIRONMENT=development`	Unauthenticated `/metrics` exposed - fine for local use
`false`	any	No route, no instrumentation - `/metrics` is always 404

If you set ENABLE_METRICS=false and later add METRICS_AUTH_TOKEN expecting scraping to re-enable, it will not - you must also set ENABLE_METRICS=true.
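The interaction table can also be read as a small decision function. The following is a sketch of the documented behaviour, not the real registration code: `metrics_route_mode`, `expected_scrape_status`, and the mode strings are hypothetical, and only `ENVIRONMENT=development` is treated as the development case, as in the table.

```python
import hmac


def metrics_route_mode(enable_metrics, auth_token, environment):
    """Map the two settings plus the environment to a route mode."""
    if not enable_metrics:
        return "disabled"          # no route, no instrumentation
    if auth_token:
        return "authenticated"     # route registered, Bearer token required
    if environment == "development":
        return "unauthenticated"   # local use only
    return "not_registered"        # fail-closed: route is never registered


def expected_scrape_status(mode, auth_token=None, authorization_header=None):
    """Return the HTTP status a scrape would be expected to receive."""
    if mode in ("disabled", "not_registered"):
        return 404
    if mode == "unauthenticated":
        return 200
    # Constant-time comparison is a choice made for this sketch.
    expected = f"Bearer {auth_token}".encode()
    supplied = (authorization_header or "").encode()
    return 200 if hmac.compare_digest(supplied, expected) else 401


mode = metrics_route_mode(True, "s3cret", "production")
print(mode, expected_scrape_status(mode, "s3cret", "Bearer s3cret"))
```

Checking `enable_metrics` first mirrors the note above: a token alone never re-enables scraping.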

# In your env file:
METRICS_AUTH_TOKEN=<openssl rand -hex 32>

Prometheus scrapers must send the token as a Bearer header:

# In prometheus.yml scrape config:
- job_name: freesdn
  static_configs:
    - targets: ['freesdn-api:8000']
  authorization:
    credentials: '<your-METRICS_AUTH_TOKEN>'
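Before wiring up Prometheus, it can help to confirm by hand that the token is accepted. The sketch below uses only the Python standard library; `build_scrape_request` and `scrape_status` are hypothetical helper names, and the host and token are placeholders.

```python
import urllib.error
import urllib.request


def build_scrape_request(base_url, token):
    """Build a GET request for /metrics carrying the Bearer token."""
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def scrape_status(request, timeout=5.0):
    """Send the request and return the HTTP status code (performs network I/O)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 401 and 404 arrive as HTTPError; the status code is what matters here.
        return err.code


req = build_scrape_request("http://freesdn-api:8000", "example-token")
print(req.full_url)
```

Calling `scrape_status(req)` from a host that can reach the API port should then return 200 with a valid token, 401 with a wrong one, or 404 if the route is not registered.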

Enabling the bundled Prometheus stack

The metrics compose profile starts a Prometheus instance plus three exporters (Valkey, Postgres primary, Postgres logdb):

# In .env.pro or .env.max:
COMPOSE_PROFILES=io-worker,monitoring,metrics
METRICS_AUTH_TOKEN=<openssl rand -hex 32>

Starter dashboards and alert rules are in docs/grafana/ and docs/prometheus/ in the repo.

Key custom metrics

All custom metrics are defined in backend/app/core/metrics.py.

Adapter and controller health

Metric	Labels	What to watch
`freesdn_adapter_circuit_state`	`adapter`, `host`	`0` = closed (healthy), `1` = open (failing fast), `2` = half-open. Alert when `== 1` for > 5 minutes.
`freesdn_adapter_request_duration_seconds`	`adapter`, `method`	p99 vendor-API latency - identify slow controllers before they trip the circuit breaker
`freesdn_adapter_errors_total`	`adapter`, `error_type`	Rate by class: `timeout`, `5xx`, `auth_failed`
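To make the circuit-state encoding and the "open for more than 5 minutes" condition concrete, here is a small sketch. In practice a Prometheus alert rule with a `for` clause would normally do this job; the `CircuitState` enum, the `open_longer_than` helper, and the sample format are assumptions made for illustration.

```python
from enum import IntEnum


class CircuitState(IntEnum):
    CLOSED = 0     # healthy
    OPEN = 1       # failing fast
    HALF_OPEN = 2  # probing for recovery


def open_longer_than(samples, threshold_seconds=300):
    """Check whether the trailing run of OPEN samples exceeds the threshold.

    `samples` is a list of (unix_timestamp, gauge_value) pairs sorted by
    ascending timestamp, e.g. values scraped for one adapter/host pair.
    """
    run_start = None
    for timestamp, value in samples:
        if CircuitState(int(value)) is CircuitState.OPEN:
            if run_start is None:
                run_start = timestamp
        else:
            # Any closed or half-open sample resets the run.
            run_start = None
    if run_start is None:
        return False
    return samples[-1][0] - run_start > threshold_seconds


history = [(0, 0), (60, 1), (120, 1), (420, 1)]
print(open_longer_than(history))
```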

Celery / background tasks

Metric	Labels	What to watch
`freesdn_celery_queue_depth`	`queue`	Backlog > 200 on `sync` for > 10 min → worker starved or stuck
`freesdn_celery_tasks_total`	`task`, `status`	Counter by task name × `success`/`failure`/`retry`
`freesdn_celery_task_duration_seconds`	`task`	p99 histogram - watch `discover_all_devices`, `sync_all_device_statuses`

Application layer

Metric	Labels	What to watch
`freesdn_websocket_connections`	-	Sudden drop to 0 with active sessions = LB / proxy timeout issue
`freesdn_audit_write_failures_total`	`resource_type`	Any non-zero rate is a P0 - audit trail is incomplete
`freesdn_auth_events_total`	`event_type`, `status`	Spike in `failure` = brute force attempt
`freesdn_rate_limit_hits_total`	`endpoint`	Correlates with auth failure spikes
`freesdn_devices_total`	`type`, `status`	Inventory gauge

Standard HTTP metrics (from the FastAPI instrumentation)

Metric	Description
`http_requests_total`	Request counter by method, handler, status
`http_request_duration_seconds`	Latency histogram by method, handler, status

Recommended Grafana panels

These are the most impactful panels for a minimum viable operations dashboard:

5xx rate - sum(rate(http_requests_total{status="5xx"}[5m])) by (handler). Alert when > 0.5 req/s sustained for 5 min.
p99 API latency - histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)). Alert when > 2 s for 10 min on non-/ws handlers.
WebSocket connections - freesdn_websocket_connections. Sudden drop to 0 = LB problem.
Celery queue depth - sum(freesdn_celery_queue_depth) by (queue). Alert when sync queue > 200 for 10 min.
Adapter circuit breakers - max by (adapter, host) (freesdn_adapter_circuit_state). Single-stat per controller, color 0/1/2. Page on == 1 for 5 min.
Auth failure rate - sum(rate(freesdn_auth_events_total{status="failure"}[5m])) by (event_type). Spikes indicate brute force or SSO breakage.
Audit write failures - rate(freesdn_audit_write_failures_total[5m]). Any non-zero rate means the audit trail has gaps.

Flower (Celery task monitoring)

When the monitoring profile is enabled, freesdn-flower runs on internal port 5555 and provides a web UI showing worker status, queue depths, task history, and manual task revocation.

Flower is internal-only in production - it is not published through the edge. Access it via an SSH tunnel:

# Forward Flower to localhost:5555 for ad-hoc access
ssh -L 5555:<docker-host-ip>:5555 user@your-host
# Then open http://localhost:5555 in a browser

Set FLOWER_BASIC_AUTH=user:password in the env file; if omitted, Flower starts but the UI is completely unauthenticated.