Troubleshooting

This page covers the most common issues operators encounter, following the pattern: symptom → cause → fix.

For deep incident playbooks (circuit breakers, queue backlog, disk full, audit chain failure) see the Operator Runbook.

App refuses to boot in production

Symptom

The api container exits immediately. Logs show a ValueError whose message starts with CRITICAL: or SECURITY:, for example:

ValueError: CRITICAL: SECRET_KEY is set to a default/insecure value.

Or one of:

ValueError: CRITICAL: ENCRYPTION_SALT must be changed from the default …
ValueError: CRITICAL: POSTGRES_PASSWORD is set to a default/insecure value.
ValueError: CRITICAL: LOGDB_URL is required. …
ValueError: SECURITY: Redis password is required in production.

Cause

FreeSDN enforces secure-by-default settings when ENVIRONMENT=production or ENVIRONMENT=staging. The startup validators check each secret against a blocklist of known-insecure template strings (CHANGE_ME, changeme, secret, password, freesdn_dev, etc.) and reject them before the process reaches the HTTP layer.

A related variant: ENVIRONMENT is set to an unrecognised value such as prod or PRODUCTION. The normaliser rejects it:

ValueError: Invalid ENVIRONMENT 'prod' (normalized 'prod'): must be one of ['development', 'production', 'staging']. Refusing to start to avoid silently running with development fail-open security in a deployment that intended production hardening.
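
The blocklist check described above can be pictured with a minimal sketch. It is not FreeSDN's actual validator: the function name check_secret, the exact-match comparison, and the skip in development are assumptions for illustration, and the blocklist holds only the values named in the text.

```python
# Hypothetical sketch only; names and matching rules are assumptions.
INSECURE_VALUES = {"CHANGE_ME", "changeme", "secret", "password", "freesdn_dev"}
ENFORCED_ENVIRONMENTS = {"production", "staging"}


def check_secret(name: str, value: str, environment: str) -> None:
    """Raise ValueError when an enforced environment uses a blocklisted secret."""
    if environment not in ENFORCED_ENVIRONMENTS:
        return  # assumed: development does not enforce the blocklist
    if not value or value in INSECURE_VALUES:
        raise ValueError(f"CRITICAL: {name} is set to a default/insecure value.")


if __name__ == "__main__":
    try:
        check_secret("SECRET_KEY", "changeme", "production")
    except ValueError as exc:
        print(exc)
```

Because the check runs at startup, a rejected value stops the container before any request is served, which is why the symptom is an immediate exit rather than a runtime error.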

Fix

Generate strong secrets:
```
python -c "import secrets; print(secrets.token_urlsafe(64))"
```
Run this separately for SECRET_KEY, ENCRYPTION_SALT, POSTGRES_PASSWORD, LOGDB_PASSWORD, and REDIS_PASSWORD.
Set each in your tier env file (.env.pro, .env.max, etc.). Never commit env files to version control.
Set ENVIRONMENT=production (exact, lowercase). Do not use prod, PRODUCTION, or any other variant.

Restart:

docker compose --env-file .env.pro up -d api

See the Configuration Reference for the full list of required production variables.

`/metrics` returns 404 in production

Symptom

Prometheus scrapes fail; curl http://<host>:8000/metrics returns a 404.

Cause

In production without METRICS_AUTH_TOKEN set, the /metrics endpoint is not served at all (fail-closed). This prevents route inventory, in-progress-request gauges, and auth-failure counters from leaking to anyone who can reach the API port.
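
As a rough illustration of the fail-closed behaviour, the sketch below models which status a /metrics request would receive. Only the production-without-token 404 comes from the text; the function name metrics_status, the 401 for a wrong or missing bearer token, and the unauthenticated 200 outside production are assumptions.

```python
import hmac
from typing import Optional


def metrics_status(environment: str, configured: Optional[str],
                   presented: Optional[str]) -> int:
    """Return the HTTP status a /metrics request would get in this sketch."""
    if not configured:
        # Fail-closed: production without a token does not serve the route.
        # Assumed: other environments serve it without authentication.
        return 404 if environment == "production" else 200
    if presented is not None and hmac.compare_digest(
        configured.encode(), presented.encode()
    ):
        return 200
    return 401  # assumed status for a missing or wrong bearer token


if __name__ == "__main__":
    print(metrics_status("production", None, None))
    print(metrics_status("production", "a3f8b2", "a3f8b2"))
```

The constant-time comparison via hmac.compare_digest is a common choice for token checks; whether FreeSDN uses it is not stated in this page.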

Fix

Generate a token and configure both sides:

# 1. Generate a token
openssl rand -hex 32
# → e.g. a3f8b2...

# 2. Add to your env file
METRICS_AUTH_TOKEN=a3f8b2...

# 3. Restart the API
docker compose --env-file .env.pro restart api

# 4. Configure Prometheus (docker/prometheus/prometheus.yml):
scrape_configs:
  - job_name: freesdn
    scrape_interval: 30s
    bearer_token: a3f8b2...
    static_configs:
      - targets: ['freesdn-api:8000']

After the restart, verify:

curl -H "Authorization: Bearer a3f8b2..." http://localhost:8000/metrics | head -5

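By default the endpoint is reachable only from within the Docker network (the API port is not published to the host), so if localhost:8000 does not respond from the host, run the verification command from a container on that network. Make sure your Prometheus container is on the same compose network.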

Marketplace sync refused - no publisher key

Symptom

POST /api/v1/marketplace/plugins/sync returns 403:

{"detail": "Marketplace catalog is unsigned and no publisher key is pinned. Set MARKETPLACE_PUBLISHER_PUBLIC_KEY (recommended) or, for a fully-trusted private/dev registry, MARKETPLACE_ALLOW_UNSIGNED=1."}

Cause

The marketplace catalog is Ed25519-signed. Without a pinned publisher public key, POST /api/v1/marketplace/plugins/sync is refused to prevent installation of unsigned or tampered plugins.

Fix

For the official public marketplace - pin the FreeSDN publisher’s public key in your env file:

MARKETPLACE_PUBLISHER_PUBLIC_KEY=<hex-encoded Ed25519 public key>

For a private / dev registry (fully trusted, no public distribution):

MARKETPLACE_ALLOW_UNSIGNED=1

Restart the API after changing either variable:

docker compose --env-file .env.pro restart api

Writes not applying to devices (`ADAPTER_READ_ONLY`)

Symptom

Config changes made in the UI show as “pending” and are never pushed to the controller. Or the API returns:

{"detail": "ADAPTER_READ_ONLY (or the legacy OMADA_READ_ONLY) is set  -  staged changes cannot be pushed to the live controller. Set ADAPTER_READ_ONLY=false in the environment AND pass force=true to apply."}

Cause

ADAPTER_READ_ONLY defaults to true. In this mode all writes are staged to the local database (visible as pending changes in the UI) and the live device is never touched. This is intentional safe-by-default behaviour for new deployments.

The dual gate requires both conditions to push a live write:

ADAPTER_READ_ONLY=false in the environment.
force=true on the per-call API request (or via the UI Apply confirmation).
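
The dual gate amounts to a logical AND of the two conditions. The sketch below prints the full truth table; can_push_live is a hypothetical name, not a FreeSDN function.

```python
def can_push_live(adapter_read_only: bool, force: bool) -> bool:
    """Both gates must be open: read-only disabled AND force set on this call."""
    return (not adapter_read_only) and force


if __name__ == "__main__":
    for read_only in (True, False):
        for force in (True, False):
            result = can_push_live(read_only, force)
            print(f"ADAPTER_READ_ONLY={read_only!s:<5} force={force!s:<5} -> push={result}")
```

Only the combination ADAPTER_READ_ONLY=false with force=true yields a live write; every other combination leaves the change staged.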

Fix

Verify the current setting from the running container:

docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print('ADAPTER_READ_ONLY =', settings.ADAPTER_READ_ONLY)"

To enable live writes:

Set ADAPTER_READ_ONLY=false in your env file.

Restart the API and workers:

docker compose --env-file .env.pro restart api worker worker-io

In the UI, navigate to the pending change and click Apply (which sends force=true). Or pass force=true in the API body.

API calls return 307 Temporary Redirect (trailing slash)

Background

Early FreeSDN deployments used FastAPI’s default redirect_slashes=True behaviour. This emitted a 307 Temporary Redirect with the internal Docker hostname (http://api:8000/…) in the Location header. That made the redirect cross-origin from the browser's point of view, and a request sent by Axios with withCredentials: true could not follow it, so the browser abandoned the request and the SPA stayed on the skeleton loader.

Current behaviour (current release)

The backend sets redirect_slashes=False on the FastAPI application and installs TrailingSlashNormalizeMiddleware (backend/app/core/middleware.py). The middleware rewrites the incoming path to whichever spelling is registered before routing - no client-visible 307 is emitted. Both spellings reach the same handler.
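
To make the rewrite-before-routing idea concrete, here is a minimal dependency-free sketch of an ASGI middleware that maps a request path to whichever spelling is registered. It is not the contents of backend/app/core/middleware.py: the class name TrailingSlashSketch, the normalize_path helper, and the /items/ route are hypothetical, and a real middleware would likely also need to handle raw_path.

```python
import asyncio
from typing import Iterable


def normalize_path(path: str, registered: Iterable[str]) -> str:
    """Return the registered spelling of `path`, toggling one trailing slash."""
    routes = set(registered)
    if path in routes or path == "/":
        return path
    alternate = path[:-1] if path.endswith("/") else path + "/"
    return alternate if alternate in routes else path


class TrailingSlashSketch:
    """Minimal ASGI middleware that rewrites the path before routing."""

    def __init__(self, app, registered: Iterable[str]):
        self.app = app
        self.registered = set(registered)

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # Copy the scope so the rewrite does not mutate the caller's dict.
            scope = dict(scope, path=normalize_path(scope["path"], self.registered))
        await self.app(scope, receive, send)


if __name__ == "__main__":
    async def echo_app(scope, receive, send):
        print("routed path:", scope["path"])

    routes = ["/api/v1/health/ready", "/items/"]  # "/items/" is illustrative
    wrapped = TrailingSlashSketch(echo_app, routes)
    asyncio.run(wrapped({"type": "http", "path": "/api/v1/health/ready/"}, None, None))
```

Because the path is rewritten in place, the client never sees a redirect and no Location header is generated.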

If you still see 307s

If you are running a custom or forked build and observe trailing-slash 307s, verify that:

The FastAPI app is constructed with redirect_slashes=False.
TrailingSlashNormalizeMiddleware is registered in setup_middleware.
No intermediary (nginx, Traefik, k8s ingress) is injecting its own slash-redirect rule.

Readiness probe returns 503

Symptom

GET /api/v1/health/ready returns 503:

{"status": "not_ready", "failed": ["database"], "degraded_subsystems": [], "checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"}}

Or:

{"status": "not_ready", "failed": [], "degraded_subsystems": ["modules"], "checks": {"database": "ok", "redis": "ok", "logdb": "ok"}}

Load balancers stop sending traffic; the container may restart in a loop.

Cause - `database_unreachable`

The API cannot reach the primary PostgreSQL instance. Common causes:

The postgres container OOMed and restarted.
The postgres_data volume ran out of disk.
The connection pool is exhausted (too many WEB_CONCURRENCY workers for DB_POOL_SIZE + DB_MAX_OVERFLOW).
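
A quick back-of-the-envelope check helps decide whether pool exhaustion is plausible. The sketch assumes each API worker process keeps its own pool, so the worst case is WEB_CONCURRENCY × (DB_POOL_SIZE + DB_MAX_OVERFLOW). The numbers are illustrative, not FreeSDN defaults, and Celery workers or other clients add connections on top.

```python
def peak_connections(web_concurrency: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on connections opened by the API workers alone."""
    return web_concurrency * (pool_size + max_overflow)


# Illustrative values only; substitute the ones from your env file.
needed = peak_connections(web_concurrency=8, pool_size=10, max_overflow=5)
max_connections = 100  # PostgreSQL's stock default for max_connections

print(f"API workers may open up to {needed} connections (server limit {max_connections})")
print("over budget" if needed > max_connections else "within budget")
```

If the result exceeds the server limit, either lower the per-worker pool settings or put a pooler such as PgBouncer in front of PostgreSQL.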

Fix - `database_unreachable`

# 1. Check container health
docker compose --env-file .env.pro ps postgres

# 2. Check recent postgres logs
docker compose --env-file .env.pro logs --tail=200 postgres

# 3. Test connectivity from inside the api container
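docker compose --env-file .env.pro exec api \
  python -c "
import asyncio, asyncpg, os

async def main():
    # asyncpg expects a plain postgresql:// DSN, so drop the SQLAlchemy driver suffix
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
    await conn.close()

asyncio.run(main())
print('DB connection OK')
"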

# 4. Check disk usage
docker system df -v

If PostgreSQL is healthy but connections are exhausted, add PgBouncer:

# In your .env file, add pooling to COMPOSE_PROFILES:
COMPOSE_PROFILES=io-worker,monitoring,pooling
DB_HOST=pgbouncer
DB_PORT=6432
LOGDB_HOST=pgbouncer-logdb
LOGDB_PORT=6432

docker compose --env-file .env.pro up -d

Cause - `degraded_subsystems`

A critical subsystem (modules or event_bus) failed to initialise at startup. Non-critical subsystems (automation, plugins) always degrade gracefully.

Fix - `degraded_subsystems`

# Search api logs for the failure
docker compose --env-file .env.pro logs --tail=500 api | \
  grep -E "(modules|event_bus).*(FAIL|degraded|exception)"

# Restart the api container
docker compose --env-file .env.pro restart api

If the failure is persistent, check that LOGDB_URL is reachable (the event bus depends on TimescaleDB) and that no migration is pending:

docker compose --env-file .env.pro exec api python scripts/migrate.py

Agent shows as offline

Symptom

A registered freesdn-agent appears as offline or “never seen” in the Agents page despite the agent process running.

Cause

Common causes in order of frequency:

Wrong WebSocket URL. The agent was registered with a URL containing localhost or 127.0.0.1 but is running on a different machine.
PUBLIC_BASE_URL misconfigured. The backend constructs the agent WebSocket URL from PUBLIC_BASE_URL. If set to localhost, agents on remote machines connect to the wrong address.
Firewall / port. The agent WebSocket endpoint at wss://<your-domain>/api/v1/agents/ws/<agent-id> must be reachable from the agent host. Caddy handles TLS termination; port 443 must be open inbound. (Note: /api/v1/ws is the separate browser SPA WebSocket endpoint - testing that path will not confirm agent connectivity.)
Staleness detection delay. The agent sends a heartbeat over its persistent WebSocket connection every 30 seconds. The backend processes heartbeats in-process (in the api container, inside the asyncio event loop) and persists last_heartbeat to the database immediately - no Celery involvement. A separate Celery task (agents.cleanup_stale, scheduled every 2 minutes) compares last_heartbeat to offline_threshold_seconds (default 180 s, configurable per-agent) and flips the agent to offline. If the Celery worker is down or the default queue is backed up, staleness detection is delayed but heartbeat reception itself is unaffected.
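
The timing figures above imply a window in which a dead agent still shows as online. The sketch below works it out under the assumption that the Celery worker is healthy and that staleness means "older than the threshold". The function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 30    # agent heartbeat period
OFFLINE_THRESHOLD_S = 180    # default offline_threshold_seconds
CLEANUP_INTERVAL_S = 120     # agents.cleanup_stale runs every 2 minutes


def is_stale(seconds_since_heartbeat: float,
             threshold: float = OFFLINE_THRESHOLD_S) -> bool:
    """True once the last heartbeat is older than the threshold (assumed rule)."""
    return seconds_since_heartbeat > threshold


def offline_flip_window(threshold: int = OFFLINE_THRESHOLD_S,
                        cleanup_interval: int = CLEANUP_INTERVAL_S) -> tuple:
    """Earliest and latest delay between the last heartbeat and the offline flip."""
    return threshold, threshold + cleanup_interval


earliest, latest = offline_flip_window()
print(f"offline flip happens {earliest}-{latest} s after the last heartbeat")
print(f"heartbeat intervals covered by the threshold: {OFFLINE_THRESHOLD_S // HEARTBEAT_INTERVAL_S}")
```

With the defaults, an agent that stops sending heartbeats is marked offline between 3 and 5 minutes later; a backed-up default queue pushes the upper bound out further.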

Fix

Check PUBLIC_BASE_URL:

docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print(settings.PUBLIC_BASE_URL)"

It should be https://your-real-domain.com, not localhost. Update it in your env file and restart:

docker compose --env-file .env.pro restart api

Check agent logs (on the agent host):

# Headless daemon mode
journalctl -u freesdn-agent -n 100 --no-pager
# Desktop app: check the status badge in the system tray → View Log

Look for WebSocket connection failed or SSL errors.

Check the Celery worker heartbeat:

# The worker refreshes freesdn:worker:heartbeat every 30s.
# If this key is missing, the worker is down.
# "$REDIS_PASSWORD" is expanded by your host shell, not read from --env-file; export it first.
docker compose --env-file .env.pro exec redis \
  redis-cli -a "$REDIS_PASSWORD" GET freesdn:worker:heartbeat

docker compose --env-file .env.pro ps worker
docker compose --env-file .env.pro logs --tail=100 worker

If the worker is down, restart it:

docker compose --env-file .env.pro restart worker

Flower dashboard unreachable

Symptom

Browsing to https://<your-domain>/flower returns a 404 or connection refused.

Cause

Flower requires the monitoring compose profile and is an internal-only service on port 5555 - it is not exposed through the Caddy edge by default.

Fix

Ensure monitoring is in COMPOSE_PROFILES:

COMPOSE_PROFILES=io-worker,monitoring

Access Flower through an SSH tunnel or behind your VPN:

# SSH tunnel from local machine:
ssh -L 5555:localhost:5555 user@your-freesdn-host
# Then open http://localhost:5555 in your browser

Authenticate with the credentials set in FLOWER_BASIC_AUTH (format admin:<password>) in your env file.

Swagger UI not available

Symptom

GET /api/v1/docs returns a 404 in production.

Cause

Swagger and ReDoc are unconditionally disabled when ENVIRONMENT=production. The backend evaluates enable_docs = settings.ENABLE_DOCS and settings.ENVIRONMENT != "production". In production the second operand is always False, so enable_docs is False whatever ENABLE_DOCS is set to, and the 404 persists.
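
The effect of that expression is easiest to see as a truth table. The sketch wraps the quoted expression in a hypothetical helper, docs_enabled, and evaluates it for each environment.

```python
def docs_enabled(enable_docs_flag: bool, environment: str) -> bool:
    """Same boolean expression as the one quoted above."""
    return enable_docs_flag and environment != "production"


if __name__ == "__main__":
    for environment in ("development", "staging", "production"):
        for flag in (True, False):
            print(f"ENVIRONMENT={environment:<11} ENABLE_DOCS={flag!s:<5} "
                  f"-> docs served: {docs_enabled(flag, environment)}")
```

Both production rows come out False, which is why toggling ENABLE_DOCS alone never restores /api/v1/docs there.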

Fix

ENABLE_DOCS=true cannot re-enable docs in a production environment. To access the API schema you have two options:

Change the environment: Set ENVIRONMENT=staging in your env file and restart the API. Staging still enforces all secret-validation guards but lifts the docs lockout. Restrict access at the Caddy or nginx layer to your office IP range.
Export from a dev instance: Fetch GET /api/v1/openapi.json from a development or staging deploy and serve it through a separate documentation tool (e.g., Redocly, Stoplight).

TimescaleDB continuous aggregate fails during migration

Symptom

The migration script fails with:

ERROR: cannot create continuous aggregate in transaction block

Cause

TimescaleDB continuous aggregates (CREATE MATERIALIZED VIEW … WITH (timescaledb.continuous)) cannot run inside a transaction. Alembic wraps migrations in transactions by default.

Fix

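The migrate script detects TimescaleDB objects and runs them outside a transaction block automatically. If you are writing a custom migration that includes a continuous aggregate, run that statement outside the migration's transaction. One way to do this, assuming Alembic 1.2 or later, is an autocommit block:

# In your Alembic migration file:
import sqlalchemy as sa
from alembic import op

def upgrade():
    # autocommit_block() commits the open transaction and runs this block without one
    with op.get_context().autocommit_block():
        op.execute(sa.text("CREATE MATERIALIZED VIEW … WITH (timescaledb.continuous) …"))  # ... your DDL ...

Alembic's transactional_ddl option is set through context.configure() in env.py and applies to the whole migration run, so prefer the per-statement autocommit block unless you intend to change transaction handling for every migration.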

SSO redirect loop after IdP change

Symptom

After reconfiguring OIDC or LDAP, users complete the IdP flow but bounce between the IdP and the FreeSDN login page, or land on /login?error=....

Cause

Clock skew between FreeSDN and the IdP, a certificate rotation on the IdP that was not re-uploaded to FreeSDN, or a stale metadata URL.
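
Clock skew can be estimated without extra tooling by comparing the Date header of any HTTP response from the IdP with the local clock. The sketch below shows only the arithmetic. The helper name skew_seconds and the sample timestamps are illustrative, and the acceptable tolerance depends on your IdP and protocol.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def skew_seconds(idp_date_header: str, local_now: datetime) -> float:
    """Local clock minus the IdP's clock, in seconds (positive = local is ahead)."""
    idp_time = parsedate_to_datetime(idp_date_header)
    return (local_now - idp_time).total_seconds()


# Sample values for illustration; in practice read the Date header from an
# IdP response and pass datetime.now(timezone.utc) as the local time.
header = "Tue, 04 Mar 2025 10:00:00 GMT"
local = datetime(2025, 3, 4, 10, 5, 30, tzinfo=timezone.utc)
print(f"skew: {skew_seconds(header, local):.0f} s")
```

A skew of several minutes, as in this sample, is a strong hint to fix time synchronisation on the FreeSDN host before changing any SSO settings.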

Fix

# 1. List the active SSO providers and note the id of the one that is failing
# ("$POSTGRES_USER" and "$POSTGRES_DB" are expanded by your host shell; export them first)
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "SELECT id, protocol, status FROM core.sso_providers WHERE status = 'active';"

# 2. If broken, temporarily disable SSO so users can log in with a password:
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "UPDATE core.sso_providers SET status = 'inactive' WHERE id = '<sso_id>';"

Then update the IdP configuration in the UI (Settings → SSO) with the correct metadata URL and re-enable.

Next steps

Configuration Reference - full env-var reference including all security variables
Operator Runbook - deep incident playbooks
