Monitoring

FreeSDN exposes structured observability through three channels: health/readiness endpoints for container orchestration, a Prometheus /metrics endpoint for time-series metrics, and Flower for Celery task visibility.

Four endpoints with distinct semantics - do not use them interchangeably.

| Endpoint | Purpose | 200 when | 503 when |
|---|---|---|---|
| GET /api/v1/health/live | Liveness probe (is the process alive?) | Always, as long as the HTTP layer responds | Never - use readiness for traffic gating |
| GET /api/v1/health/ready | Readiness probe (can the app serve traffic?) | DB reachable AND modules/event_bus not degraded | DB unreachable OR critical subsystem degraded |
| GET /api/v1/health/ | Public status snapshot (unauthenticated) | Always 200; read the status field in the body | Never 503; payload shows per-component {status} only - no latencies, versions, or uptime (FSDN-SEC-008) |
| GET /api/v1/health/detail | Full health snapshot (authenticated) | Always 200 when authenticated | Never 503; payload includes per-component latency_ms, app version, uptime_seconds, and platform versions. Requires settings:read permission. |

# Quick smoke-test from the host
curl -fsS http://localhost:8000/api/v1/health/live
curl -fsS http://localhost:8000/api/v1/health/ready
curl -fsS http://localhost:8000/api/v1/health/ | jq '.status, .components'
# Full health detail - requires a valid session token with settings:read
curl -fsS http://localhost:8000/api/v1/health/detail \
-H "Authorization: Bearer <your-token>" | jq '.status, .uptime_seconds, .platform, .components'

The /health/ endpoint runs three checks concurrently: database (Postgres SELECT 1), redis (Valkey PING), and celery (TTL heartbeat key refreshed by the beat scheduler every 30 s). After those complete, any subsystems that failed during startup (modules, event_bus, etc.) are appended to the component map in a sequential pass - they are not part of the concurrent gather. The celery component being degraded does not gate readiness - the API can serve read traffic without the worker.
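The two-phase collection described above can be sketched roughly as follows. This is an illustration only, not the FreeSDN implementation: the check functions are stubs, and the names `collect_components` and `startup_failures`, as well as the "degraded" status string for startup failures, are assumptions made for the example.

```python
import asyncio


async def check_database() -> dict:
    # Stand-in for the Postgres `SELECT 1` probe (hypothetical stub).
    return {"status": "ok"}


async def check_redis() -> dict:
    # Stand-in for the Valkey `PING` probe (hypothetical stub).
    return {"status": "ok"}


async def check_celery() -> dict:
    # Stand-in for reading the TTL heartbeat key; pretend it has expired.
    return {"status": "degraded"}


async def collect_components(startup_failures: list) -> dict:
    # Phase 1: the three live checks run concurrently.
    names = ["database", "redis", "celery"]
    results = await asyncio.gather(
        check_database(), check_redis(), check_celery()
    )
    components = dict(zip(names, results))

    # Phase 2: subsystems that failed at startup are appended one by one,
    # only after the concurrent gather has finished.
    for name in startup_failures:
        components[name] = {"status": "degraded"}  # assumed status value
    return components


components = asyncio.run(collect_components(["modules"]))
print(components)
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping them with the name list keeps each result attached to the right component.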

When /ready returns 503, the body has the following structure:

{
  "status": "not_ready",
  "failed": ["database"],
  "degraded_subsystems": ["modules"],
  "checks": {
    "database": "unreachable",
    "redis": "ok",
    "logdb": "ok"
  }
}
  • failed: ["database"] - the primary DB is unreachable; check the postgres container and connection pool. checks.database will be "unreachable".
  • degraded_subsystems: ["modules"] - a module failed to load at startup; check API logs.
  • degraded_subsystems: ["event_bus"] - the pub/sub failed to initialize; restart the API.
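For scripted triage, the mapping from a 503 body to the hints above might look like the sketch below. The function name `diagnose_not_ready` is hypothetical, and the code assumes only the three fields shown in the sample body (`failed`, `degraded_subsystems`, `checks`).

```python
def diagnose_not_ready(body: dict) -> list:
    """Translate a /ready 503 body into the operator hints listed above."""
    hints = []
    if "database" in body.get("failed", []):
        hints.append("database: check the postgres container and connection pool")
    degraded = body.get("degraded_subsystems", [])
    if "modules" in degraded:
        hints.append("modules: a module failed to load at startup; check API logs")
    if "event_bus" in degraded:
        hints.append("event_bus: pub/sub failed to initialize; restart the API")
    return hints


sample = {
    "status": "not_ready",
    "failed": ["database"],
    "degraded_subsystems": ["modules"],
    "checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"},
}

for hint in diagnose_not_ready(sample):
    print(hint)
```

Missing keys are treated as empty lists, so a body without `failed` or `degraded_subsystems` simply yields no hints.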

The API exposes a Prometheus-compatible /metrics endpoint (no /api/v1/ prefix - it is on the internal API port, not published through the edge by default).

In non-development environments, /metrics is served only when METRICS_AUTH_TOKEN is set. Without the token the route is never registered, so a scrape request receives 404 rather than an auth error. This fail-closed behaviour is intentional (CAN-019): metrics can leak internal topology information, so unauthenticated exposure is never available outside local/development. When the token is configured, requests that omit the Authorization: Bearer header or send a wrong token receive 401.

Metrics collection is also controlled by ENABLE_METRICS (default true). Setting ENABLE_METRICS=false disables all instrumentation middleware and skips route registration entirely, regardless of METRICS_AUTH_TOKEN - useful on memory-constrained hosts where the Prometheus client overhead is unwanted. The two settings interact as follows:

| ENABLE_METRICS | METRICS_AUTH_TOKEN | Result |
|---|---|---|
| true | set | Authenticated /metrics endpoint - recommended for production |
| true | not set, non-development env | Route not registered (fail-closed, CAN-019) - 404 on scrape |
| true | not set, ENVIRONMENT=development | Unauthenticated /metrics exposed - fine for local use |
| false | any | No route, no instrumentation - /metrics is always 404 |

If you set ENABLE_METRICS=false and later add METRICS_AUTH_TOKEN expecting scraping to re-enable, it will not - you must also set ENABLE_METRICS=true.
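The interaction between the two settings can be expressed as a small decision function. This is a sketch of the documented behaviour, not the actual registration code: the name `metrics_mode` and the returned labels are invented for illustration, and treating an empty token as "not set" is an assumption.

```python
from typing import Optional


def metrics_mode(enable_metrics: bool, auth_token: Optional[str], environment: str) -> str:
    """Return how /metrics behaves for a given combination of settings."""
    if not enable_metrics:
        # ENABLE_METRICS=false wins regardless of the token.
        return "disabled"
    if auth_token:
        # Token present: route registered, Bearer auth enforced.
        return "authenticated"
    if environment == "development":
        # No token, development only: open endpoint for local use.
        return "unauthenticated"
    # No token outside development: fail closed, route not registered.
    return "not_registered"


cases = [
    (True, "s3cret", "production"),
    (True, None, "production"),
    (True, None, "development"),
    (False, "s3cret", "production"),
]
for enable, token, env in cases:
    print(enable, bool(token), env, "->", metrics_mode(enable, token, env))
```

Checking `enable_metrics` first mirrors the rule that a token alone never re-enables scraping.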

# In your env file:
METRICS_AUTH_TOKEN=<openssl rand -hex 32>

Prometheus scrapers must send the token as a Bearer header:

# In prometheus.yml scrape config:
- job_name: freesdn
  static_configs:
    - targets: ['freesdn-api:8000']
  authorization:
    credentials: '<your-METRICS_AUTH_TOKEN>'
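To check the token by hand without a Prometheus instance, a scrape request with the same Bearer header can be built using only the Python standard library. The helper name `build_scrape_request` and the example token are placeholders; the host and port are taken from the scrape config above.

```python
import urllib.request


def build_scrape_request(base_url: str, token: str) -> urllib.request.Request:
    # /metrics sits at the root of the API port, with no /api/v1/ prefix.
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


req = build_scrape_request("http://freesdn-api:8000", "example-token")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing `req` to `urllib.request.urlopen` performs the scrape; a 401 (wrong token) or 404 (route not registered) surfaces as `urllib.error.HTTPError`, which makes the two failure modes easy to tell apart.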

The metrics compose profile starts a Prometheus instance plus three exporters (Valkey, Postgres primary, Postgres logdb):

# In .env.pro or .env.max:
COMPOSE_PROFILES=io-worker,monitoring,metrics
METRICS_AUTH_TOKEN=<openssl rand -hex 32>

Starter dashboards and alert rules are in docs/grafana/ and docs/prometheus/ in the repo.

All custom metrics are defined in backend/app/core/metrics.py.

| Metric | Labels | What to watch |
|---|---|---|
| freesdn_adapter_circuit_state | adapter, host | 0 = closed (healthy), 1 = open (failing fast), 2 = half-open. Alert when == 1 for > 5 minutes. |
| freesdn_adapter_request_duration_seconds | adapter, method | p99 vendor-API latency - identify slow controllers before they trip the circuit breaker |
| freesdn_adapter_errors_total | adapter, error_type | Rate by class: timeout, 5xx, auth_failed |

| Metric | Labels | What to watch |
|---|---|---|
| freesdn_celery_queue_depth | queue | Backlog > 200 on sync for > 10 min → worker starved or stuck |
| freesdn_celery_tasks_total | task, status | Counter by task name × success/failure/retry |
| freesdn_celery_task_duration_seconds | task | p99 histogram - watch discover_all_devices, sync_all_device_statuses |

| Metric | Labels | What to watch |
|---|---|---|
| freesdn_websocket_connections | - | Sudden drop to 0 with active sessions = LB / proxy timeout issue |
| freesdn_audit_write_failures_total | resource_type | Any non-zero rate is a P0 - audit trail is incomplete |
| freesdn_auth_events_total | event_type, status | Spike in failure = brute force attempt |
| freesdn_rate_limit_hits_total | endpoint | Correlates with auth failure spikes |
| freesdn_devices_total | type, status | Inventory gauge |

Standard HTTP metrics (from the FastAPI instrumentation)

| Metric | Description |
|---|---|
| http_requests_total | Request counter by method, handler, status |
| http_request_duration_seconds | Latency histogram by method, handler, status |

These are the most impactful panels for a minimum viable operations dashboard:

  1. 5xx rate - sum(rate(http_requests_total{status="5xx"}[5m])) by (handler). Alert when > 0.5 req/s sustained for 5 min.
  2. p99 API latency - histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)). Alert when > 2 s for 10 min on non-/ws handlers.
  3. WebSocket connections - freesdn_websocket_connections. Sudden drop to 0 = LB problem.
  4. Celery queue depth - sum(freesdn_celery_queue_depth) by (queue). Alert when sync queue > 200 for 10 min.
  5. Adapter circuit breakers - max by (adapter, host) (freesdn_adapter_circuit_state). Single-stat per controller, color 0/1/2. Page on == 1 for 5 min.
  6. Auth failure rate - sum(rate(freesdn_auth_events_total{status="failure"}[5m])) by (event_type). Spikes indicate brute force or SSO breakage.
  7. Audit write failures - sum(rate(freesdn_audit_write_failures_total[5m])) by (resource_type). Any non-zero rate means the audit trail has gaps.
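Several of the thresholds above only fire when a condition holds for a sustained period ("for 5 min", "for 10 min"). In Prometheus this is normally handled by the `for:` clause of an alert rule; the sketch below approximates the same idea on already-collected samples, which can be handy when testing thresholds offline. The function name `sustained` and the sample data are invented for the example.

```python
def sustained(samples, predicate, min_duration_s):
    """Return True if predicate has held for every sample in the trailing
    run, and that run spans at least min_duration_s seconds.

    samples: list of (unix_timestamp, value) pairs sorted by timestamp.
    """
    run_start = None
    for ts, value in samples:
        if predicate(value):
            if run_start is None:
                run_start = ts  # condition just became true
        else:
            run_start = None  # any false sample resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_duration_s


# Circuit state sampled every 60 s: closed at t=0, then open from t=60 onward.
circuit = [(0, 0), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1), (360, 1)]
print(sustained(circuit, lambda v: v == 1, 300))

# Sync queue depth that dips below the threshold midway.
queue = [(0, 250), (300, 260), (600, 150), (900, 240)]
print(sustained(queue, lambda v: v > 200, 600))
```

A single sample below the threshold resets the run, which is why short dips keep an alert from firing.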

When the monitoring profile is enabled, freesdn-flower runs on internal port 5555 and provides a web UI showing worker status, queue depths, task history, and manual task revocation.

Flower is internal-only in production - it is not published through the edge. Access it via an SSH tunnel:

# Forward Flower to localhost:5555 for ad-hoc access
ssh -L 5555:<docker-host-ip>:5555 user@your-host
# Then open http://localhost:5555 in a browser

Set FLOWER_BASIC_AUTH=user:password in the env file; if omitted, Flower starts but the UI is completely unauthenticated.