Monitoring
FreeSDN exposes structured observability through three channels: health/readiness endpoints for container orchestration, a Prometheus /metrics endpoint for time-series metrics, and Flower for Celery task visibility.
Health endpoints
The endpoints below have distinct semantics - do not use them interchangeably.
| Endpoint | Purpose | 200 when | 503 when |
|---|---|---|---|
| GET /api/v1/health/live | Liveness probe (is the process alive?) | Always, as long as the HTTP layer responds | Never - use readiness for traffic gating |
| GET /api/v1/health/ready | Readiness probe (can the app serve traffic?) | DB reachable AND modules/event_bus not degraded | DB unreachable OR critical subsystem degraded |
| GET /api/v1/health/ | Public status snapshot (unauthenticated) | Always 200; read the status field in the body | Never 503; payload shows per-component {status} only - no latencies, versions, or uptime (FSDN-SEC-008) |
| GET /api/v1/health/detail | Full health snapshot (authenticated) | Always 200 when authenticated | Never 503; payload includes per-component latency_ms, app version, uptime_seconds, and platform versions. Requires settings:read permission. |
```bash
# Quick smoke-test from the host
curl -fsS http://localhost:8000/api/v1/health/live
curl -fsS http://localhost:8000/api/v1/health/ready
curl -fsS http://localhost:8000/api/v1/health/ | jq '.status, .components'
```

```bash
# Full health detail - requires a valid session token with settings:read
curl -fsS http://localhost:8000/api/v1/health/detail \
  -H "Authorization: Bearer <your-token>" | jq '.status, .uptime_seconds, .platform, .components'
```

The /health/ endpoint runs three checks concurrently: database (Postgres SELECT 1), redis (Valkey PING), and celery (TTL heartbeat key refreshed by the beat scheduler every 30 s). After those complete, any subsystems that failed during startup (modules, event_bus, etc.) are appended to the component map in a sequential pass - they are not part of the concurrent gather. A degraded celery component does not gate readiness - the API can serve read traffic without the worker.
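The two-phase flow described above (a concurrent gather, then a sequential append of startup failures) can be sketched in Python roughly as follows. This is an illustration only, not the FreeSDN implementation: the three probe functions are stand-ins that sleep instead of touching Postgres or Valkey, and the status strings `"ok"`, `"degraded"`, and `"unhealthy"` as well as the name `health_snapshot` are assumptions.

```python
import asyncio


async def check_database() -> str:
    # Stand-in for a Postgres "SELECT 1" round trip.
    await asyncio.sleep(0.01)
    return "ok"


async def check_redis() -> str:
    # Stand-in for a Valkey PING.
    await asyncio.sleep(0.01)
    return "ok"


async def check_celery() -> str:
    # Stand-in for reading the TTL heartbeat key; here we pretend it expired.
    await asyncio.sleep(0.01)
    return "degraded"


async def health_snapshot(startup_failures: list) -> dict:
    names = ["database", "redis", "celery"]
    # Phase 1: the three probes run concurrently.
    results = await asyncio.gather(
        check_database(), check_redis(), check_celery(),
        return_exceptions=True,
    )
    components = {}
    for name, result in zip(names, results):
        # A probe that raised is reported as unhealthy instead of failing the request.
        status = "unhealthy" if isinstance(result, BaseException) else result
        components[name] = {"status": status}
    # Phase 2: subsystems that failed at startup are appended one by one.
    for name in startup_failures:
        components[name] = {"status": "degraded"}
    healthy = all(c["status"] == "ok" for c in components.values())
    return {"status": "ok" if healthy else "degraded", "components": components}


snapshot = asyncio.run(health_snapshot(["modules"]))
```

Because the startup failures are added after the gather, they always appear after the three probed components in the resulting map.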
When /ready returns 503, the body has the following structure:
```json
{
  "status": "not_ready",
  "failed": ["database"],
  "degraded_subsystems": ["modules"],
  "checks": {
    "database": "unreachable",
    "redis": "ok",
    "logdb": "ok"
  }
}
```

- `failed: ["database"]` - the primary DB is unreachable; check the `postgres` container and connection pool. `checks.database` will be `"unreachable"`.
- `degraded_subsystems: ["modules"]` - a module failed to load at startup; check API logs.
- `degraded_subsystems: ["event_bus"]` - the pub/sub failed to initialize; restart the API.
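For scripted triage, the 503 body can be mapped to the operator actions listed above. The sketch below assumes the body shape shown in this section; the function name `triage` and the fallback wording for undocumented values are hypothetical.

```python
import json

# Hints paraphrased from the list above, keyed by the values seen in the 503 body.
FAILED_HINTS = {
    "database": "primary DB unreachable: check the postgres container and connection pool",
}
DEGRADED_HINTS = {
    "modules": "a module failed to load at startup: check API logs",
    "event_bus": "the pub/sub failed to initialize: restart the API",
}


def triage(body: str) -> list:
    """Translate a /ready 503 body into a list of human-readable hints."""
    payload = json.loads(body)
    hints = []
    for name in payload.get("failed", []):
        hints.append(FAILED_HINTS.get(name, f"{name}: failed check, no documented hint"))
    for name in payload.get("degraded_subsystems", []):
        hints.append(DEGRADED_HINTS.get(name, f"{name}: degraded, no documented hint"))
    return hints


sample = (
    '{"status": "not_ready", "failed": ["database"], '
    '"degraded_subsystems": ["modules"], '
    '"checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"}}'
)
hints = triage(sample)
```

Values that this page does not document fall through to a generic message rather than being guessed at.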
Prometheus metrics (/metrics)
The API exposes a Prometheus-compatible /metrics endpoint (no /api/v1/ prefix - it is on the internal API port, not published through the edge by default).
Fail-closed gate: METRICS_AUTH_TOKEN
Outside development, the /metrics endpoint is served only when METRICS_AUTH_TOKEN is set. Without the token in a non-development environment, the /metrics route is never registered at all - a scrape request receives 404, not an auth error. This fail-closed behaviour is intentional (CAN-019): metrics can leak internal topology information, so unauthenticated exposure is never available outside local/development. When the token is configured, requests that omit the Authorization: Bearer header or send a wrong token receive 401.
Outer on/off switch: ENABLE_METRICS
Metrics collection is also controlled by ENABLE_METRICS (default true). Setting ENABLE_METRICS=false disables all instrumentation middleware and skips route registration entirely, regardless of METRICS_AUTH_TOKEN - useful on memory-constrained hosts where the Prometheus client overhead is unwanted. The two settings interact as follows:
| ENABLE_METRICS | METRICS_AUTH_TOKEN | Result |
|---|---|---|
| true | set | Authenticated /metrics endpoint - recommended for production |
| true | not set, non-development env | Route not registered (fail-closed, CAN-019) - 404 on scrape |
| true | not set, ENVIRONMENT=development | Unauthenticated /metrics exposed - fine for local use |
| false | any | No route, no instrumentation - /metrics is always 404 |
If you set ENABLE_METRICS=false and later add METRICS_AUTH_TOKEN expecting scraping to re-enable, it will not - you must also set ENABLE_METRICS=true.
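The interaction between the two settings can be read as a small decision function. The sketch below models the status code a scraper would see under the rules stated in this section; it is not the actual route-registration code. The function name `scrape_status` and the constant-time comparison via `hmac.compare_digest` are assumptions made for illustration.

```python
import hmac
from typing import Optional


def scrape_status(enable_metrics: bool, token: Optional[str],
                  environment: str, auth_header: Optional[str]) -> int:
    """Return the HTTP status a GET /metrics would receive."""
    if not enable_metrics:
        return 404  # no route, no instrumentation, regardless of the token
    if not token:
        # Fail-closed: only development exposes unauthenticated metrics.
        return 200 if environment == "development" else 404
    expected = f"Bearer {token}"
    # Constant-time comparison is an illustrative choice, not a documented detail.
    if auth_header is None or not hmac.compare_digest(
        auth_header.encode(), expected.encode()
    ):
        return 401
    return 200
```

Note how the last case of the table falls out of the first branch: with `enable_metrics=False`, adding a token later changes nothing.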
```bash
# In your env file:
METRICS_AUTH_TOKEN=<openssl rand -hex 32>
```

Prometheus scrapers must send the token as a Bearer header:

```yaml
# In prometheus.yml scrape config:
- job_name: freesdn
  static_configs:
    - targets: ['freesdn-api:8000']
  authorization:
    credentials: '<your-METRICS_AUTH_TOKEN>'
```

Enabling the bundled Prometheus stack
The metrics compose profile starts a Prometheus instance plus three exporters (Valkey, Postgres primary, Postgres logdb):

```bash
# In .env.pro or .env.max:
COMPOSE_PROFILES=io-worker,monitoring,metrics
METRICS_AUTH_TOKEN=<openssl rand -hex 32>
```

Starter dashboards and alert rules are in docs/grafana/ and docs/prometheus/ in the repo.
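If `openssl` is not at hand, a token of the same shape can be generated from Python, and the same script can prepare an authenticated scrape request for a quick manual check. This is a sketch: the target `freesdn-api:8000` is taken from the scrape config on this page and may differ in your deployment, and the request is only constructed here, not sent.

```python
import secrets
import urllib.request

# Equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex characters.
token = secrets.token_hex(32)

# Build (but do not send) a scrape request carrying the Bearer token.
request = urllib.request.Request(
    "http://freesdn-api:8000/metrics",
    headers={"Authorization": f"Bearer {token}"},
)
```

Passing `request` to `urllib.request.urlopen(request, timeout=5)` from a host that can reach the internal API port would perform the actual scrape.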
Key custom metrics
All custom metrics are defined in backend/app/core/metrics.py.
Adapter and controller health
| Metric | Labels | What to watch |
|---|---|---|
| freesdn_adapter_circuit_state | adapter, host | 0 = closed (healthy), 1 = open (failing fast), 2 = half-open. Alert when == 1 for > 5 minutes. |
| freesdn_adapter_request_duration_seconds | adapter, method | p99 vendor-API latency - identify slow controllers before they trip the circuit breaker |
| freesdn_adapter_errors_total | adapter, error_type | Rate by class: timeout, 5xx, auth_failed |
Celery / background tasks
| Metric | Labels | What to watch |
|---|---|---|
| freesdn_celery_queue_depth | queue | Backlog > 200 on sync for > 10 min → worker starved or stuck |
| freesdn_celery_tasks_total | task, status | Counter by task name × success/failure/retry |
| freesdn_celery_task_duration_seconds | task | p99 histogram - watch discover_all_devices, sync_all_device_statuses |
Application layer
| Metric | Labels | What to watch |
|---|---|---|
| freesdn_websocket_connections | - | Sudden drop to 0 with active sessions = LB / proxy timeout issue |
| freesdn_audit_write_failures_total | resource_type | Any non-zero rate is a P0 - audit trail is incomplete |
| freesdn_auth_events_total | event_type, status | Spike in failure = brute force attempt |
| freesdn_rate_limit_hits_total | endpoint | Correlates with auth failure spikes |
| freesdn_devices_total | type, status | Inventory gauge |
Standard HTTP metrics (from the FastAPI instrumentation)
| Metric | Description |
|---|---|
| http_requests_total | Request counter by method, handler, status |
| http_request_duration_seconds | Latency histogram by method, handler, status |
Recommended Grafana panels
These are the most impactful panels for a minimum viable operations dashboard:

- 5xx rate - `sum(rate(http_requests_total{status="5xx"}[5m]))` by handler. Alert when > 0.5 req/s sustained for 5 min.
- p99 API latency - `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))`. Alert when > 2 s for 10 min on non-`/ws` handlers.
- WebSocket connections - `freesdn_websocket_connections`. Sudden drop to 0 = LB problem.
- Celery queue depth - `sum(freesdn_celery_queue_depth) by (queue)`. Alert when `sync` queue > 200 for 10 min.
- Adapter circuit breakers - `max by (adapter, host) (freesdn_adapter_circuit_state)`. Single-stat per controller, color 0/1/2. Page on `== 1` for 5 min.
- Auth failure rate - `sum(rate(freesdn_auth_events_total{status="failure"}[5m])) by (event_type)`. Spikes indicate brute force or SSO breakage.
- Audit write failures - `freesdn_audit_write_failures_total` - non-zero rate means the audit trail has gaps.
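Several of the thresholds above are of the form "condition held for N minutes". As a rough illustration of that logic (similar in spirit to a Prometheus alert rule's `for:` clause, but not a replacement for one), the sketch below checks whether the most recent unbroken run of breaching samples spans a minimum duration. The `sustained` helper and the sample data are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def sustained(samples: Sequence[Sample],
              breached: Callable[[float], bool],
              min_seconds: float) -> bool:
    """True when the latest unbroken run of breaching samples spans min_seconds."""
    run_start = None
    for ts, value in samples:  # samples are expected in ascending time order
        if breached(value):
            if run_start is None:
                run_start = ts  # a new breaching run begins here
        else:
            run_start = None  # any healthy sample resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_seconds


# Circuit breaker open (== 1) from t=60 s through t=360 s: a 5-minute run.
circuit = [(0, 0), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1), (360, 1)]
page_on_circuit = sustained(circuit, lambda v: v == 1, 300)

# Queue depth dips below 200 at t=300 s, so the 10-minute condition resets.
queue = [(0, 250), (300, 150), (600, 260)]
alert_on_queue = sustained(queue, lambda v: v > 200, 600)
```

The reset on any healthy sample is what separates a sustained breach from a series of short spikes.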
Flower (Celery task monitoring)
When the monitoring profile is enabled, freesdn-flower runs on internal port 5555 and provides a web UI showing worker status, queue depths, task history, and manual task revocation.
Flower is internal-only in production - it is not published through the edge. Access it via an SSH tunnel:
```bash
# Forward Flower to localhost:5555 for ad-hoc access
ssh -L 5555:<docker-host-ip>:5555 user@your-host
# Then open http://localhost:5555 in a browser
```

Set FLOWER_BASIC_AUTH=user:password in the env file; if omitted, Flower starts but the UI is completely unauthenticated.