Troubleshooting
This page covers the most common issues operators encounter, following the pattern: symptom → cause → fix.
For deep incident playbooks (circuit breakers, queue backlog, disk full, audit chain failure) see the Operator Runbook.
App refuses to boot in production
Symptom
The api container exits immediately. Logs show a ValueError with CRITICAL: (or SECURITY:) in the message:
ValueError: CRITICAL: SECRET_KEY is set to a default/insecure value.
Or one of:
ValueError: CRITICAL: ENCRYPTION_SALT must be changed from the default …
ValueError: CRITICAL: POSTGRES_PASSWORD is set to a default/insecure value.
ValueError: CRITICAL: LOGDB_URL is required. …
ValueError: SECURITY: Redis password is required in production.
FreeSDN enforces secure-by-default in ENVIRONMENT=production or ENVIRONMENT=staging. The startup validators check each secret against a blocklist of known-insecure template strings (CHANGE_ME, changeme, secret, password, freesdn_dev, etc.) and reject them before the process reaches the HTTP layer.
A related variant: ENVIRONMENT is set to an unrecognised abbreviation such as prod or PRODUCTION. The normaliser rejects it:
ValueError: Invalid ENVIRONMENT 'prod' (normalized 'prod'): must be one of ['development', 'production', 'staging']. Refusing to start to avoid silently running with development fail-open security in a deployment that intended production hardening.
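The two checks described above can be pictured with a minimal sketch. This is not FreeSDN's actual validator: the function names, the substring matching, and the exact-match environment check are assumptions made for illustration, and the blocklist holds only the strings named on this page.

```python
# Hypothetical sketch of the startup checks; names and matching rules are assumptions.
INSECURE_MARKERS = ("change_me", "changeme", "secret", "password", "freesdn_dev")
VALID_ENVIRONMENTS = ("development", "production", "staging")
HARDENED_ENVIRONMENTS = ("production", "staging")


def check_environment(raw: str) -> str:
    """Accept only the exact spellings the page lists; anything else refuses to start."""
    if raw not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"Invalid ENVIRONMENT {raw!r}: must be one of {sorted(VALID_ENVIRONMENTS)}."
        )
    return raw


def validate_secret(name: str, value: str, environment: str) -> None:
    """Raise ValueError when a secret is empty or contains a blocklisted template string."""
    if environment not in HARDENED_ENVIRONMENTS:
        return  # development keeps template values so local setups still boot
    lowered = value.lower()
    if not value or any(marker in lowered for marker in INSECURE_MARKERS):
        raise ValueError(f"CRITICAL: {name} is set to a default/insecure value.")
```

Running both checks before the HTTP layer is constructed is what makes the failure visible as an immediate container exit rather than a running but insecure API.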
- Generate strong secrets:
  python -c "import secrets; print(secrets.token_urlsafe(64))"
  Run this separately for SECRET_KEY, ENCRYPTION_SALT, POSTGRES_PASSWORD, LOGDB_PASSWORD, and REDIS_PASSWORD.
- Set each in your tier env file (.env.pro, .env.max, etc.). Never commit env files to version control.
- Set ENVIRONMENT=production (exact, lowercase). Do not use prod, PRODUCTION, or any other variant.
- Restart:
  docker compose --env-file .env.pro up -d api
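If running the one-liner five times is tedious, a short script can emit all five values in one pass. This is a convenience sketch, not a tool shipped with FreeSDN; the KEY=value output format and the function name are assumptions, while the variable names come from the first step above.

```python
import secrets

# Variable names from the first step above; the output format is an assumption.
SECRET_NAMES = [
    "SECRET_KEY",
    "ENCRYPTION_SALT",
    "POSTGRES_PASSWORD",
    "LOGDB_PASSWORD",
    "REDIS_PASSWORD",
]


def generate_env_lines(names=SECRET_NAMES, nbytes=64):
    """Return one KEY=value line per name, each with an independent random value."""
    return [f"{name}={secrets.token_urlsafe(nbytes)}" for name in names]


if __name__ == "__main__":
    # Paste the output into your tier env file; do not commit it.
    print("\n".join(generate_env_lines()))
```

token_urlsafe draws from the URL-safe base64 alphabet (letters, digits, - and _), so the generated passwords can be embedded in connection URLs without escaping.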
See the Configuration Reference for the full list of required production variables.
/metrics returns 404 in production
Symptom
Prometheus scrapes fail; curl http://<host>:8000/metrics returns a 404.
In production without METRICS_AUTH_TOKEN set, the /metrics endpoint is not served at all (fail-closed). This prevents route inventory, in-progress-request gauges, and auth-failure counters from leaking to anyone who can reach the API port.
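The fail-closed rule can be sketched as a small decision function. Only the first branch (production without a token returns 404) is stated on this page; the 401 for a wrong token, the open behaviour outside production, and the function name are assumptions for illustration.

```python
import hmac
from typing import Optional


def metrics_status(environment: str,
                   configured_token: Optional[str],
                   auth_header: Optional[str]) -> int:
    """Return the HTTP status a /metrics request would receive under this sketch."""
    if environment == "production" and not configured_token:
        return 404  # documented: the endpoint is not served at all
    if not configured_token:
        return 200  # assumption: non-production without a token serves metrics openly
    expected = f"Bearer {configured_token}"
    # Constant-time comparison avoids leaking the token through timing differences.
    if auth_header is not None and hmac.compare_digest(
        auth_header.encode(), expected.encode()
    ):
        return 200
    return 401  # assumption: the real status for a bad token is not documented here
```

Returning 404 rather than 401 in the unconfigured case means a probe cannot even confirm that a metrics route exists.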
Generate a token and configure both sides:
# 1. Generate a token
openssl rand -hex 32
# → e.g. a3f8b2...

# 2. Add to your env file
METRICS_AUTH_TOKEN=a3f8b2...

# 3. Restart the API
docker compose --env-file .env.pro restart api

# 4. Configure Prometheus (docker/prometheus/prometheus.yml):
scrape_configs:
  - job_name: freesdn
    scrape_interval: 30s
    bearer_token: a3f8b2...
    static_configs:
      - targets: ['freesdn-api:8000']

After the restart, verify:
curl -H "Authorization: Bearer a3f8b2..." http://localhost:8000/metrics | head -5
The endpoint is only reachable from within the Docker network by default (no host-port publish). Make sure your Prometheus container is on the same compose network.
Marketplace sync refused - no publisher key
Symptom
POST /api/v1/marketplace/plugins/sync returns 403:
{"detail": "Marketplace catalog is unsigned and no publisher key is pinned. Set MARKETPLACE_PUBLISHER_PUBLIC_KEY (recommended) or, for a fully-trusted private/dev registry, MARKETPLACE_ALLOW_UNSIGNED=1."}
The marketplace catalog is Ed25519-signed. Without a pinned publisher public key, POST /api/v1/marketplace/plugins/sync is refused to prevent installation of unsigned or tampered plugins.
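The refusal can be read as a three-way decision on the two environment variables named in the error. The sketch below is a guess at that ordering, not the backend's code: the function name and return labels are hypothetical, and the Ed25519 verification itself is left out.

```python
from typing import Optional


def sync_decision(publisher_key: Optional[str], allow_unsigned: bool) -> str:
    """Decide how a marketplace sync request is handled in this sketch.

    publisher_key  -- value of MARKETPLACE_PUBLISHER_PUBLIC_KEY, if any
    allow_unsigned -- True when MARKETPLACE_ALLOW_UNSIGNED=1
    """
    if publisher_key:
        # A pinned key means the catalog signature must verify before anything installs.
        return "verify_signature"
    if allow_unsigned:
        # Explicit opt-out for a fully trusted private or dev registry.
        return "allow_unsigned"
    # Neither set: refuse (the 403 shown above).
    return "refuse"
```

Checking the pinned key first is an assumption; it reflects the error message's recommendation to prefer pinning over the unsigned opt-out.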
For the official public marketplace, pin the FreeSDN publisher’s public key in your env file:
MARKETPLACE_PUBLISHER_PUBLIC_KEY=<hex-encoded Ed25519 public key>
For a private / dev registry (fully trusted, no public distribution):
MARKETPLACE_ALLOW_UNSIGNED=1
Restart the API after changing either variable:
docker compose --env-file .env.pro restart api
Writes not applying to devices (ADAPTER_READ_ONLY)
Symptom
Config changes made in the UI show as “pending” and are never pushed to the controller. Or the API returns:
{"detail": "ADAPTER_READ_ONLY (or the legacy OMADA_READ_ONLY) is set - staged changes cannot be pushed to the live controller. Set ADAPTER_READ_ONLY=false in the environment AND pass force=true to apply."}
ADAPTER_READ_ONLY defaults to true. In this mode all writes are staged to the local database (visible as pending changes in the UI) and the live device is never touched. This is intentional safe-by-default behaviour for new deployments.
The dual gate requires both conditions to push a live write:
- ADAPTER_READ_ONLY=false in the environment.
- force=true on the per-call API request (or via the UI Apply confirmation).
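As a truth table, the dual gate is a logical AND of the two conditions. The function name below is hypothetical; the logic is the one described above.

```python
def can_push_live(adapter_read_only: bool, force: bool) -> bool:
    """A live write needs the environment gate open AND the per-call confirmation."""
    return (not adapter_read_only) and force


if __name__ == "__main__":
    # Print all four combinations of the two gates.
    for read_only in (True, False):
        for force in (False, True):
            outcome = "live write" if can_push_live(read_only, force) else "not pushed"
            print(f"ADAPTER_READ_ONLY={read_only!s:5} force={force!s:5} -> {outcome}")
```

Only one of the four combinations reaches the device, which is why flipping the environment variable alone does not change what operators see in the UI.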
Verify the current setting from the running container:
docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print('ADAPTER_READ_ONLY =', settings.ADAPTER_READ_ONLY)"
To enable live writes:
- Set ADAPTER_READ_ONLY=false in your env file.
- Restart the API and workers:
  docker compose --env-file .env.pro restart api worker worker-io
- In the UI, navigate to the pending change and click Apply (which sends force=true). Or pass force=true in the API body.
API calls return 307 Temporary Redirect (trailing slash)
Background
Early FreeSDN deployments used FastAPI’s default redirect_slashes=True behaviour. This emitted a 307 Temporary Redirect with the internal Docker hostname (http://api:8000/…) in the Location header. Axios configured with withCredentials: true cannot follow a cross-origin redirect without re-sending credentials, so the browser dropped the follow-through and the SPA stayed on the skeleton loader.
Current behaviour (current release)
The backend sets redirect_slashes=False on the FastAPI application and installs TrailingSlashNormalizeMiddleware (backend/app/core/middleware.py). The middleware rewrites the incoming path to whichever spelling is registered before routing - no client-visible 307 is emitted. Both spellings reach the same handler.
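To illustrate the rewrite-before-routing idea, here is a minimal pure-ASGI sketch. It is not the contents of backend/app/core/middleware.py: the class name, the constructor argument, and the static set of registered paths are assumptions (a real router also has parameterised routes, which this sketch ignores).

```python
import asyncio


class TrailingSlashNormalizer:
    """Rewrite scope['path'] to the registered spelling instead of redirecting."""

    def __init__(self, app, registered_paths):
        self.app = app
        self.registered = set(registered_paths)

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            path = scope["path"]
            if path not in self.registered:
                # Try the other spelling: drop or add the trailing slash.
                if path.endswith("/") and path != "/":
                    alternate = path[:-1]
                else:
                    alternate = path + "/"
                if alternate in self.registered:
                    # Copy the scope so the caller's dict is left untouched.
                    scope = dict(scope, path=alternate)
        await self.app(scope, receive, send)


async def _echo_app(scope, receive, send):
    # Stand-in for the routed application: just report the path it received.
    print("handler saw:", scope["path"])


if __name__ == "__main__":
    mw = TrailingSlashNormalizer(_echo_app, ["/api/v1/devices/"])
    asyncio.run(mw({"type": "http", "path": "/api/v1/devices"}, None, None))
```

Because the path is rewritten in place, no Location header is produced, so the internal Docker hostname never reaches the browser.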
If you still see 307s
If you are running a custom or forked build and observe trailing-slash 307s, verify that:
- The FastAPI app is constructed with redirect_slashes=False.
- TrailingSlashNormalizeMiddleware is registered in setup_middleware.
- No intermediary (nginx, Traefik, k8s ingress) is injecting its own slash-redirect rule.
Readiness probe returns 503
Symptom
GET /api/v1/health/ready returns 503:
{"status": "not_ready", "failed": ["database"], "degraded_subsystems": [], "checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"}}
Or:
{"status": "not_ready", "failed": [], "degraded_subsystems": ["modules"], "checks": {"database": "ok", "redis": "ok", "logdb": "ok"}}
Load balancers stop sending traffic; the container may restart in a loop.
Cause - database_unreachable
The API cannot reach the primary PostgreSQL instance. Common causes:
- The postgres container OOMed and restarted.
- The postgres_data volume ran out of disk.
- The connection pool is exhausted (too many WEB_CONCURRENCY workers for DB_POOL_SIZE + DB_MAX_OVERFLOW).
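The pool-exhaustion case is simple arithmetic. The sketch below assumes each API worker process holds its own pool, which is the usual arrangement for per-process connection pools; the example numbers are illustrative and are not FreeSDN defaults. PostgreSQL's stock max_connections is typically 100, with 3 slots reserved for superusers.

```python
def peak_connections(web_concurrency: int, pool_size: int, max_overflow: int) -> int:
    """Worst-case connections from the api container if every worker fills its pool."""
    return web_concurrency * (pool_size + max_overflow)


def fits(web_concurrency: int, pool_size: int, max_overflow: int,
         max_connections: int = 100, reserved: int = 3) -> bool:
    """True if the worst case stays within what PostgreSQL will accept from normal roles."""
    peak = peak_connections(web_concurrency, pool_size, max_overflow)
    return peak <= max_connections - reserved


if __name__ == "__main__":
    # Illustrative values only: 8 workers, pool of 10, overflow of 20.
    print(peak_connections(8, 10, 20), fits(8, 10, 20))
    print(peak_connections(4, 10, 10), fits(4, 10, 10))
```

Celery workers and other services draw from the same server limit, so the real headroom is smaller than this single-container figure suggests.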
Fix - database_unreachable
# 1. Check container health
docker compose --env-file .env.pro ps postgres

# 2. Check recent postgres logs
docker compose --env-file .env.pro logs --tail=200 postgres

# 3. Test connectivity from inside the api container
#    (asyncpg.connect is a coroutine, so it must be awaited before close())
docker compose --env-file .env.pro exec api \
  python -c "
import asyncio, asyncpg, os

async def main():
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
    await conn.close()

asyncio.run(main())
print('DB connection OK')"

# 4. Check disk usage
docker system df -v
If PostgreSQL is healthy but connections are exhausted, add PgBouncer:
# In your .env file, add pooling to COMPOSE_PROFILES:
COMPOSE_PROFILES=io-worker,monitoring,pooling
DB_HOST=pgbouncer
DB_PORT=6432
LOGDB_HOST=pgbouncer-logdb
LOGDB_PORT=6432

docker compose --env-file .env.pro up -d
Cause - degraded_subsystems
A critical subsystem (modules or event_bus) failed to initialise at startup. Non-critical subsystems (automation, plugins) always degrade gracefully.
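The two 503 payloads in the symptom suggest how the probe result is assembled. The sketch below reproduces that shape under stated assumptions: the function name, the "ready" status string, the 200 code, and the choice to leave non-critical subsystems out of degraded_subsystems are all guesses, not the backend's implementation.

```python
# Critical subsystems as named on this page; everything else is assumed non-critical.
CRITICAL_SUBSYSTEMS = {"modules", "event_bus"}


def readiness(checks: dict, degraded: list) -> tuple:
    """Return (http_status, body) for a readiness probe in this sketch."""
    failed = [name for name, state in checks.items() if state != "ok"]
    # Only a degraded critical subsystem blocks readiness.
    critical_degraded = [name for name in degraded if name in CRITICAL_SUBSYSTEMS]
    ready = not failed and not critical_degraded
    body = {
        "status": "ready" if ready else "not_ready",
        "failed": failed,
        "degraded_subsystems": critical_degraded,
        "checks": checks,
    }
    return (200 if ready else 503), body
```

Under this model a degraded automation or plugins subsystem leaves the probe green, which matches the statement that non-critical subsystems degrade gracefully.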
Fix - degraded_subsystems
# Search api logs for the failure
docker compose --env-file .env.pro logs --tail=500 api | \
  grep -E "(modules|event_bus).*(FAIL|degraded|exception)"

# Restart the api container
docker compose --env-file .env.pro restart api
If the failure is persistent, check that LOGDB_URL is reachable (the event bus depends on TimescaleDB) and that no migration is pending:
docker compose --env-file .env.pro exec api python scripts/migrate.py
Agent shows as offline
Symptom
A registered freesdn-agent appears as offline or “never seen” in the Agents page despite the agent process running.
Common causes in order of frequency:
- Wrong WebSocket URL. The agent was registered with a URL containing localhost or 127.0.0.1 but is running on a different machine.
- PUBLIC_BASE_URL misconfigured. The backend constructs the agent WebSocket URL from PUBLIC_BASE_URL. If set to localhost, agents on remote machines connect to the wrong address.
- Firewall / port. The agent WebSocket endpoint at wss://<your-domain>/api/v1/agents/ws/<agent-id> must be reachable from the agent host. Caddy handles TLS termination; port 443 must be open inbound. (Note: /api/v1/ws is the separate browser SPA WebSocket endpoint - testing that path will not confirm agent connectivity.)
- Staleness detection delay. The agent sends a heartbeat over its persistent WebSocket connection every 30 seconds. The backend processes heartbeats in-process (in the api container, inside the asyncio event loop) and persists last_heartbeat to the database immediately - no Celery involvement. A separate Celery task (agents.cleanup_stale, scheduled every 2 minutes) compares last_heartbeat to offline_threshold_seconds (default 180 s, configurable per-agent) and flips the agent to offline. If the Celery worker is down or the default queue is backed up, staleness detection is delayed but heartbeat reception itself is unaffected.
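The timing in the last item can be worked through with a short sketch. The constants come from the text above; the function names and the strict greater-than comparison are assumptions about how agents.cleanup_stale evaluates the threshold.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_S = 30   # agent heartbeat period
CLEANUP_INTERVAL_S = 120    # agents.cleanup_stale schedule (every 2 minutes)
DEFAULT_THRESHOLD_S = 180   # offline_threshold_seconds default


def is_stale(last_heartbeat: datetime, now: datetime,
             threshold_s: int = DEFAULT_THRESHOLD_S) -> bool:
    """True when the last heartbeat is older than the offline threshold."""
    return (now - last_heartbeat).total_seconds() > threshold_s


def worst_case_detection_delay(threshold_s: int = DEFAULT_THRESHOLD_S,
                               cleanup_interval_s: int = CLEANUP_INTERVAL_S) -> int:
    """Upper bound, in seconds, from the last heartbeat to the offline flip.

    The agent becomes stale after threshold_s, and the next cleanup run can be
    up to one full interval later. Assumes the Celery worker is healthy.
    """
    return threshold_s + cleanup_interval_s


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(seconds=200), now))
    print(worst_case_detection_delay(), "seconds")
```

With the defaults, an agent that dies right after a heartbeat can keep showing as online for up to about five minutes, and longer if the default queue is backed up.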
Check PUBLIC_BASE_URL:
docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print(settings.PUBLIC_BASE_URL)"
It should be https://your-real-domain.com, not localhost. Update it in your env file and restart:
docker compose --env-file .env.pro restart api
Check agent logs (on the agent host):
# Headless daemon mode
journalctl -u freesdn-agent -n 100 --no-pager
# Desktop app: check the status badge in the system tray → View Log
Look for WebSocket connection failed or SSL errors.
Check the Celery worker heartbeat:
# The worker refreshes freesdn:worker:heartbeat every 30s.
# If this key is missing, the worker is down.
docker compose --env-file .env.pro exec redis \
  redis-cli -a "$REDIS_PASSWORD" GET freesdn:worker:heartbeat

docker compose --env-file .env.pro ps worker
docker compose --env-file .env.pro logs --tail=100 worker
If the worker is down, restart it:
docker compose --env-file .env.pro restart worker
Flower dashboard unreachable
Symptom
Browsing to https://<your-domain>/flower returns a 404 or connection refused.
Flower requires the monitoring compose profile and is an internal-only service on port 5555 - it is not exposed through the Caddy edge by default.
Ensure monitoring is in COMPOSE_PROFILES:
COMPOSE_PROFILES=io-worker,monitoring
Access Flower through an SSH tunnel or behind your VPN:
# SSH tunnel from local machine:
ssh -L 5555:localhost:5555 user@your-freesdn-host
# Then open http://localhost:5555 in your browser
Authenticate with FLOWER_BASIC_AUTH=admin:<password> from your env file.
Swagger UI not available
Symptom
GET /api/v1/docs returns a 404 in production.
Swagger and ReDoc are unconditionally disabled when ENVIRONMENT=production. The backend evaluates enable_docs = settings.ENABLE_DOCS and settings.ENVIRONMENT != "production", so the ENABLE_DOCS flag is short-circuited and has no effect in production - the 404 will persist regardless of its value.
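The expression quoted above can be tabulated to show why the flag has no effect in production. The function name is hypothetical; the boolean logic is copied from the expression.

```python
def docs_enabled(enable_docs_flag: bool, environment: str) -> bool:
    """Mirror of: settings.ENABLE_DOCS and settings.ENVIRONMENT != "production"."""
    return enable_docs_flag and environment != "production"


if __name__ == "__main__":
    # Show every combination of flag and environment.
    for env in ("development", "staging", "production"):
        for flag in (True, False):
            print(f"ENVIRONMENT={env:<11} ENABLE_DOCS={flag!s:<5} -> docs {docs_enabled(flag, env)}")
```

Both production rows evaluate to False, while staging with ENABLE_DOCS=true evaluates to True.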
ENABLE_DOCS=true cannot re-enable docs in a production environment. To access the API schema you have two options:
- Change the environment: Set ENVIRONMENT=staging in your env file and restart the API. Staging still enforces all secret-validation guards but lifts the docs lockout. Restrict access at the Caddy or nginx layer to your office IP range.
- Export from a dev instance: Fetch GET /api/v1/openapi.json from a development or staging deploy and serve it through a separate documentation tool (e.g., Redocly, Stoplight).
TimescaleDB continuous aggregate fails during migration
Symptom
The migration script fails with:
ERROR: cannot create continuous aggregate in transaction block
TimescaleDB continuous aggregates (CREATE MATERIALIZED VIEW … WITH (timescaledb.continuous)) cannot run inside a transaction. Alembic wraps migrations in transactions by default.
The migrate script detects TimescaleDB objects and runs them outside a transaction block automatically. If you are writing a custom migration that includes a continuous aggregate, mark it as non-transactional:
# In your Alembic migration file:
import sqlalchemy as sa
from alembic import op

def upgrade():
    op.execute(sa.text("SET LOCAL lock_timeout = '0'"))
    # ... your DDL ...
Set transactional_ddl = False in the migration’s Alembic context when creating continuous aggregates.
SSO redirect loop after IdP change
Symptom
After reconfiguring OIDC or LDAP, users complete the IdP flow but bounce between the IdP and the FreeSDN login page, or land on /login?error=....
Typical causes are clock skew between FreeSDN and the IdP, a certificate rotation on the IdP that was not re-uploaded to FreeSDN, or a stale metadata URL.
# 1. List the active SSO providers and note the id of the one to check
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "SELECT id, protocol, status FROM core.sso_providers WHERE status = 'active';"

# 2. If broken, temporarily disable SSO so users can log in with a password:
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "UPDATE core.sso_providers SET status = 'inactive' WHERE id = '<sso_id>';"
Then update the IdP configuration in the UI (Settings → SSO) with the correct metadata URL and re-enable.
Next steps
- Configuration Reference - full env-var reference including all security variables
- Operator Runbook - deep incident playbooks