Troubleshooting

This page covers the most common issues operators encounter, following the pattern: symptom → cause → fix.

For deep incident playbooks (circuit breakers, queue backlog, disk full, audit chain failure) see the Operator Runbook.


The api container exits immediately. Logs show a ValueError whose message starts with CRITICAL: or SECURITY:, for example:

ValueError: CRITICAL: SECRET_KEY is set to a default/insecure value.

Or one of:

ValueError: CRITICAL: ENCRYPTION_SALT must be changed from the default …
ValueError: CRITICAL: POSTGRES_PASSWORD is set to a default/insecure value.
ValueError: CRITICAL: LOGDB_URL is required. …
ValueError: SECURITY: Redis password is required in production.

FreeSDN enforces secure-by-default in ENVIRONMENT=production or ENVIRONMENT=staging. The startup validators check each secret against a blocklist of known-insecure template strings (CHANGE_ME, changeme, secret, password, freesdn_dev, etc.) and reject them before the process reaches the HTTP layer.

A related variant: ENVIRONMENT is set to an unrecognised abbreviation such as prod or PRODUCTION. The normaliser rejects it:

ValueError: Invalid ENVIRONMENT 'prod' (normalized 'prod'): must be one of ['development', 'production', 'staging']. Refusing to start to avoid silently running with development fail-open security in a deployment that intended production hardening.
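The two checks described above can be approximated in a few lines. This is a minimal sketch, not FreeSDN's actual validator: the function names, the case-insensitive exact-match comparison, and the short blocklist are assumptions made for illustration (the real blocklist is longer and its matching rules are not shown on this page).

```python
# Hypothetical sketch of a fail-fast startup check; not the real FreeSDN code.
ALLOWED_ENVIRONMENTS = ("development", "production", "staging")

# Assumed subset of the blocklist named in the text; the real list is longer.
INSECURE_VALUES = {"change_me", "changeme", "secret", "password", "freesdn_dev"}


def check_environment(value: str) -> str:
    """Accept only the three exact, lowercase environment names."""
    if value not in ALLOWED_ENVIRONMENTS:
        raise ValueError(
            f"Invalid ENVIRONMENT {value!r}: must be one of {sorted(ALLOWED_ENVIRONMENTS)}"
        )
    return value


def check_secret(name: str, value: str, environment: str) -> None:
    """Reject empty or template secrets in production and staging."""
    if environment == "development":
        return
    if not value or value.lower() in INSECURE_VALUES:
        raise ValueError(f"CRITICAL: {name} is set to a default/insecure value.")
```

Both functions raise before any server would start, which is the point of the pattern: a misconfigured deployment fails loudly instead of running with weak secrets.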

  1. Generate strong secrets:

    python -c "import secrets; print(secrets.token_urlsafe(64))"

    Run this separately for SECRET_KEY, ENCRYPTION_SALT, POSTGRES_PASSWORD, LOGDB_PASSWORD, and REDIS_PASSWORD.

  2. Set each in your tier env file (.env.pro, .env.max, etc.). Never commit env files to version control.

  3. Set ENVIRONMENT=production (exact, lowercase). Do not use prod, PRODUCTION, or any other variant.

  4. Restart:

    docker compose --env-file .env.pro up -d api
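Step 1 can also be scripted so that all five values are generated in one pass. A minimal sketch using only the standard library; the function name is hypothetical, and the output is meant to be pasted into your tier env file, never committed.

```python
import secrets

# Variable names taken from step 1 above.
SECRET_NAMES = (
    "SECRET_KEY",
    "ENCRYPTION_SALT",
    "POSTGRES_PASSWORD",
    "LOGDB_PASSWORD",
    "REDIS_PASSWORD",
)


def generate_env_lines(names=SECRET_NAMES, nbytes=64):
    """Return NAME=value lines, each with an independent random value."""
    return [f"{name}={secrets.token_urlsafe(nbytes)}" for name in names]


if __name__ == "__main__":
    print("\n".join(generate_env_lines()))
```

Each value comes from a separate call to secrets.token_urlsafe, so no two variables share a secret.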

See the Configuration Reference for the full list of required production variables.


Prometheus scrapes fail; curl http://<host>:8000/metrics returns a 404.

In production without METRICS_AUTH_TOKEN set, the /metrics endpoint is not served at all (fail-closed). This prevents route inventory, in-progress-request gauges, and auth-failure counters from leaking to anyone who can reach the API port.

Generate a token and configure both sides:

# 1. Generate a token
openssl rand -hex 32
# → e.g. a3f8b2...
# 2. Add to your env file
METRICS_AUTH_TOKEN=a3f8b2...
# 3. Restart the API
docker compose --env-file .env.pro restart api
# 4. Configure Prometheus (docker/prometheus/prometheus.yml):
scrape_configs:
  - job_name: freesdn
    scrape_interval: 30s
    bearer_token: a3f8b2...
    static_configs:
      - targets: ['freesdn-api:8000']

After the restart, verify:

curl -H "Authorization: Bearer a3f8b2..." http://localhost:8000/metrics | head -5

The endpoint is only reachable from within the Docker network by default (no host-port publish). Make sure your Prometheus container is on the same compose network.
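The same verification can be done from Python, for example inside a container on the compose network. This is a minimal standard-library sketch; the function names and hint strings are hypothetical, and only the 200 and 404 interpretations come from this page.

```python
import urllib.error
import urllib.request


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for /metrics carrying the bearer token."""
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def explain_status(status: int) -> str:
    """Map an HTTP status to a troubleshooting hint."""
    if status == 200:
        return "ok: endpoint is served and the token was accepted"
    if status == 404:
        return "not served: check that METRICS_AUTH_TOKEN is set and the API was restarted"
    return f"unexpected status {status}: check the token and the API logs"


def probe_metrics(base_url: str, token: str, timeout: float = 5.0) -> str:
    """Send the request; HTTP error statuses are mapped to hints, not raised."""
    request = build_metrics_request(base_url, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return explain_status(response.status)
    except urllib.error.HTTPError as exc:
        return explain_status(exc.code)
```

Connection-level failures (urllib.error.URLError) are deliberately left unhandled here; they usually mean the caller is not on the same network as the API.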


Marketplace sync refused - no publisher key

POST /api/v1/marketplace/plugins/sync returns 403:

{"detail": "Marketplace catalog is unsigned and no publisher key is pinned. Set MARKETPLACE_PUBLISHER_PUBLIC_KEY (recommended) or, for a fully-trusted private/dev registry, MARKETPLACE_ALLOW_UNSIGNED=1."}

The marketplace catalog is Ed25519-signed. Without a pinned publisher public key, POST /marketplace/plugins/sync is refused to prevent installation of unsigned or tampered plugins.

For the official public marketplace, pin the FreeSDN publisher’s public key in your env file:

MARKETPLACE_PUBLISHER_PUBLIC_KEY=<hex-encoded Ed25519 public key>
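Before restarting, it can help to confirm that the pasted value has the right shape. Ed25519 public keys are 32 bytes, so a hex-encoded key is 64 hexadecimal characters. The helper below is a hypothetical sanity check, not FreeSDN's own validation, and it cannot tell you whether the key is the correct publisher key.

```python
def is_plausible_ed25519_hex_key(value: str) -> bool:
    """Return True if value decodes from hex to exactly 32 bytes."""
    try:
        raw = bytes.fromhex(value.strip())
    except ValueError:
        # Non-hex characters or an odd number of digits.
        return False
    return len(raw) == 32
```

A value that fails this check was most likely truncated or copied with surrounding quotes.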

For a private / dev registry (fully trusted, no public distribution):

MARKETPLACE_ALLOW_UNSIGNED=1

Restart the API after changing either variable:

docker compose --env-file .env.pro restart api

Writes not applying to devices (ADAPTER_READ_ONLY)

Config changes made in the UI show as “pending” and are never pushed to the controller. Or the API returns:

{"detail": "ADAPTER_READ_ONLY (or the legacy OMADA_READ_ONLY) is set - staged changes cannot be pushed to the live controller. Set ADAPTER_READ_ONLY=false in the environment AND pass force=true to apply."}

ADAPTER_READ_ONLY defaults to true. In this mode all writes are staged to the local database (visible as pending changes in the UI) and the live device is never touched. This is intentional safe-by-default behaviour for new deployments.

The dual gate requires both conditions to push a live write:

  1. ADAPTER_READ_ONLY=false in the environment.
  2. force=true on the per-call API request (or via the UI Apply confirmation).
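The dual gate is a logical AND of the two conditions. A minimal sketch with a hypothetical function name, shown only to make the four possible combinations explicit:

```python
def may_push_live(adapter_read_only: bool, force: bool) -> bool:
    """Both gates must be open: read-only disabled AND force requested."""
    return (not adapter_read_only) and force


if __name__ == "__main__":
    for read_only in (True, False):
        for force in (False, True):
            outcome = "push to controller" if may_push_live(read_only, force) else "stage only"
            print(f"ADAPTER_READ_ONLY={read_only!s:5} force={force!s:5} -> {outcome}")
```

Only the combination ADAPTER_READ_ONLY=false with force=true reaches the live controller; the other three leave the change staged.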

Verify the current setting from the running container:

docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print('ADAPTER_READ_ONLY =', settings.ADAPTER_READ_ONLY)"

To enable live writes:

  1. Set ADAPTER_READ_ONLY=false in your env file.
  2. Restart the API and workers:
    docker compose --env-file .env.pro restart api worker worker-io
  3. In the UI, navigate to the pending change and click Apply (which sends force=true), or pass force=true in the API request body.

API calls return 307 Temporary Redirect (trailing slash)

Early FreeSDN deployments used FastAPI’s default redirect_slashes=True behaviour, which answered a request for the non-registered spelling of a path with a 307 Temporary Redirect whose Location header contained the internal Docker hostname (http://api:8000/…). Axios configured with withCredentials: true cannot follow a cross-origin redirect without re-sending credentials, so the browser abandoned the redirect and the SPA stayed on the skeleton loader.

The backend now sets redirect_slashes=False on the FastAPI application and installs TrailingSlashNormalizeMiddleware (backend/app/core/middleware.py). The middleware rewrites the incoming path to whichever spelling is registered before routing, so no client-visible 307 is emitted and both spellings reach the same handler.
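The rewrite can be pictured as a lookup against the registered routes. The sketch below is a simplification, not the code in backend/app/core/middleware.py: the function name and the example paths are hypothetical, and real routing matches path patterns with parameters and does not use a flat set of strings.

```python
def normalize_path(path: str, registered: set[str]) -> str:
    """Return the registered spelling of path, or path unchanged if neither spelling matches."""
    if path == "/" or path in registered:
        return path
    # Try the other spelling: drop or add exactly one trailing slash.
    alternate = path[:-1] if path.endswith("/") else path + "/"
    return alternate if alternate in registered else path


# Hypothetical route table for illustration.
ROUTES = {"/api/v1/devices", "/api/v1/sites/"}
print(normalize_path("/api/v1/devices/", ROUTES))  # /api/v1/devices
print(normalize_path("/api/v1/sites", ROUTES))     # /api/v1/sites/
```

Because the path is corrected in place, the client never sees a Location header and never learns the internal hostname.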

If you are running a custom or forked build and observe trailing-slash 307s, verify that:

  1. The FastAPI app is constructed with redirect_slashes=False.
  2. TrailingSlashNormalizeMiddleware is registered in setup_middleware.
  3. No intermediary (nginx, Traefik, k8s ingress) is injecting its own slash-redirect rule.

GET /api/v1/health/ready returns 503:

{"status": "not_ready", "failed": ["database"], "degraded_subsystems": [], "checks": {"database": "unreachable", "redis": "ok", "logdb": "ok"}}

Or:

{"status": "not_ready", "failed": [], "degraded_subsystems": ["modules"], "checks": {"database": "ok", "redis": "ok", "logdb": "ok"}}

Load balancers stop sending traffic; the container may restart in a loop.
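The two payloads point at different problems, and a short script can separate them when triaging. A minimal sketch, assuming the response body has the shape shown above; the function name and the wording of the findings are hypothetical.

```python
import json


def summarize_readiness(body: str) -> list[str]:
    """Turn a /health/ready response body into a list of findings."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    findings = []
    for name in payload.get("failed", []):
        # A failed dependency, e.g. database: unreachable.
        findings.append(f"dependency {name}: {checks.get(name, 'unknown')}")
    for name in payload.get("degraded_subsystems", []):
        # A subsystem that did not initialise at startup.
        findings.append(f"subsystem {name}: degraded")
    return findings
```

An entry under failed sends you to the dependency checks; an entry under degraded_subsystems sends you to the api startup logs.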

When failed lists database, the API cannot reach the primary PostgreSQL instance. Common causes:

  • The postgres container OOMed and restarted.
  • The postgres_data volume ran out of disk.
  • The connection pool is exhausted (too many WEB_CONCURRENCY workers for DB_POOL_SIZE + DB_MAX_OVERFLOW).
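The pool-exhaustion case is simple arithmetic. The sketch below assumes each API worker process keeps its own pool, so the upper bound is the worker count multiplied by the per-worker pool ceiling; the numbers are hypothetical, and 100 is PostgreSQL's stock max_connections default, which your deployment may override.

```python
POSTGRES_DEFAULT_MAX_CONNECTIONS = 100  # PostgreSQL's stock default


def max_api_connections(web_concurrency: int, db_pool_size: int, db_max_overflow: int) -> int:
    """Upper bound on connections if every worker fills its own pool."""
    return web_concurrency * (db_pool_size + db_max_overflow)


# Hypothetical values, for illustration only.
demand = max_api_connections(web_concurrency=4, db_pool_size=20, db_max_overflow=10)
print(demand, demand > POSTGRES_DEFAULT_MAX_CONNECTIONS)  # 120 True
```

When the bound exceeds what the server allows, either lower WEB_CONCURRENCY or the pool settings, or put a pooler in front of PostgreSQL.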
Check the container, its logs, connectivity from the api container, and disk usage:

# 1. Check container health
docker compose --env-file .env.pro ps postgres
# 2. Check recent postgres logs
docker compose --env-file .env.pro logs --tail=200 postgres
# 3. Test connectivity from inside the api container
docker compose --env-file .env.pro exec api \
  python -c "
import asyncio, asyncpg, os

async def main():
    # asyncpg expects a plain postgresql:// DSN, so drop the SQLAlchemy driver suffix
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
    await conn.close()

asyncio.run(main())
print('DB connection OK')
"
# 4. Check disk usage
docker system df -v

If PostgreSQL is healthy but connections are exhausted, add PgBouncer:

# In your env file, add pooling to COMPOSE_PROFILES and point both databases at PgBouncer:
COMPOSE_PROFILES=io-worker,monitoring,pooling
DB_HOST=pgbouncer
DB_PORT=6432
LOGDB_HOST=pgbouncer-logdb
LOGDB_PORT=6432
# Then apply the change:
docker compose --env-file .env.pro up -d

When degraded_subsystems is non-empty, a critical subsystem (modules or event_bus) failed to initialise at startup. Non-critical subsystems (automation, plugins) always degrade gracefully.

# Search api logs for the failure
docker compose --env-file .env.pro logs --tail=500 api | \
  grep -E "(modules|event_bus).*(FAIL|degraded|exception)"
# Restart the api container
docker compose --env-file .env.pro restart api

If the failure persists, check that LOGDB_URL is reachable (the event bus depends on TimescaleDB) and that no migration is pending:

docker compose --env-file .env.pro exec api python scripts/migrate.py

A registered freesdn-agent appears as offline or “never seen” in the Agents page despite the agent process running.

Common causes in order of frequency:

  1. Wrong WebSocket URL. The agent was registered with a URL containing localhost or 127.0.0.1 but is running on a different machine.
  2. PUBLIC_BASE_URL misconfigured. The backend constructs the agent WebSocket URL from PUBLIC_BASE_URL. If set to localhost, agents on remote machines connect to the wrong address.
  3. Firewall / port. The agent WebSocket endpoint at wss://<your-domain>/api/v1/agents/ws/<agent-id> must be reachable from the agent host. Caddy handles TLS termination; port 443 must be open inbound. (Note: /api/v1/ws is the separate browser SPA WebSocket endpoint; testing that path will not confirm agent connectivity.)
  4. Staleness detection delay. The agent sends a heartbeat over its persistent WebSocket connection every 30 seconds. The backend processes heartbeats in-process (in the api container, inside the asyncio event loop) and persists last_heartbeat to the database immediately, with no Celery involvement. A separate Celery task (agents.cleanup_stale, scheduled every 2 minutes) compares last_heartbeat to offline_threshold_seconds (default 180 s, configurable per-agent) and flips the agent to offline. If the Celery worker is down or the default queue is backed up, staleness detection is delayed but heartbeat reception itself is unaffected.
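The timing in item 4 can be worked through with the stated numbers. The sketch below is illustrative only: the function names are hypothetical, and whether the real task uses a strict or non-strict comparison is an assumption.

```python
from datetime import datetime, timedelta, timezone

SWEEP_INTERVAL_S = 120             # agents.cleanup_stale schedule (from the text)
DEFAULT_OFFLINE_THRESHOLD_S = 180  # default offline_threshold_seconds (from the text)


def is_stale(last_heartbeat: datetime, now: datetime,
             offline_threshold_seconds: int = DEFAULT_OFFLINE_THRESHOLD_S) -> bool:
    """True once the last heartbeat is older than the threshold (strict '>' is assumed)."""
    return (now - last_heartbeat) > timedelta(seconds=offline_threshold_seconds)


def worst_case_detection_delay(threshold_s: int = DEFAULT_OFFLINE_THRESHOLD_S,
                               sweep_s: int = SWEEP_INTERVAL_S) -> int:
    """Longest time from the last heartbeat until a healthy sweep marks the agent offline."""
    # The agent can cross the threshold just after a sweep and wait a full interval for the next.
    return threshold_s + sweep_s


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_stale(now - timedelta(seconds=200), now))  # True
print(worst_case_detection_delay())                 # 300
```

With the defaults, an agent that stops sending heartbeats is shown as offline between 180 and 300 seconds later, provided the Celery worker is running; a longer delay points at the worker or its queue.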

Check PUBLIC_BASE_URL:

docker compose --env-file .env.pro exec api \
  python -c "from app.core.config import settings; print(settings.PUBLIC_BASE_URL)"

It should be https://your-real-domain.com, not localhost. Update it in your env file and restart:

docker compose --env-file .env.pro restart api

Check agent logs (on the agent host):

# Headless daemon mode
journalctl -u freesdn-agent -n 100 --no-pager
# Desktop app: check the status badge in the system tray → View Log

Look for WebSocket connection failed or SSL errors.

Check the Celery worker heartbeat:

# The worker refreshes freesdn:worker:heartbeat every 30s.
# If this key is missing, the worker is down.
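# Note: $REDIS_PASSWORD is expanded by your local shell, so export it first or substitute the value.
docker compose --env-file .env.pro exec redis \
  redis-cli -a "$REDIS_PASSWORD" GET freesdn:worker:heartbeat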
docker compose --env-file .env.pro ps worker
docker compose --env-file .env.pro logs --tail=100 worker

If the worker is down, restart it:

docker compose --env-file .env.pro restart worker

Browsing to https://<your-domain>/flower returns a 404 or connection refused.

Flower requires the monitoring compose profile and is an internal-only service on port 5555; it is not exposed through the Caddy edge by default.

Ensure monitoring is in COMPOSE_PROFILES:

COMPOSE_PROFILES=io-worker,monitoring

Access Flower through an SSH tunnel or behind your VPN:

# SSH tunnel from local machine:
ssh -L 5555:localhost:5555 user@your-freesdn-host
# Then open http://localhost:5555 in your browser

Authenticate with the credentials set in FLOWER_BASIC_AUTH (format admin:<password>) in your env file.


GET /api/v1/docs returns a 404 in production.

Swagger and ReDoc are unconditionally disabled when ENVIRONMENT=production. The backend evaluates enable_docs = settings.ENABLE_DOCS and settings.ENVIRONMENT != "production", so the result is always false in production and the 404 persists regardless of the ENABLE_DOCS value.

To access the API schema you have two options:

  • Change the environment: Set ENVIRONMENT=staging in your env file and restart the API. Staging still enforces all secret-validation guards but lifts the docs lockout. Restrict access at the Caddy or nginx layer to your office IP range.
  • Export from a dev instance: Fetch GET /api/v1/openapi.json from a development or staging deploy and serve it through a separate documentation tool (e.g., Redocly, Stoplight).
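The gating expression quoted above can be checked in isolation. This sketch wraps it in a hypothetical function with plain arguments in place of the settings object, purely to show why the flag has no effect in production:

```python
def docs_enabled(enable_docs: bool, environment: str) -> bool:
    """Same boolean expression as quoted in the text, with plain arguments."""
    return enable_docs and environment != "production"


print(docs_enabled(True, "production"))  # False
print(docs_enabled(True, "staging"))     # True
print(docs_enabled(False, "staging"))    # False
```

The second operand is false whenever the environment is production, so no value of ENABLE_DOCS can make the whole expression true there.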

TimescaleDB continuous aggregate fails during migration

The migration script fails with:

ERROR: cannot create continuous aggregate in transaction block

TimescaleDB continuous aggregates (CREATE MATERIALIZED VIEW … WITH (timescaledb.continuous)) cannot run inside a transaction. Alembic wraps migrations in transactions by default.

The migrate script detects TimescaleDB objects and runs them outside a transaction block automatically. If you are writing a custom migration that includes a continuous aggregate, mark it as non-transactional:

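# In your Alembic migration file:
import sqlalchemy as sa
from alembic import op

def upgrade():
    # Run the DDL outside the migration transaction.
    with op.get_context().autocommit_block():
        op.execute(sa.text("..."))  # ... your DDL ...

autocommit_block() is Alembic’s mechanism for statements that must run outside a transaction. Alternatively, set transactional_ddl = False in the migration’s Alembic context when creating continuous aggregates.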


After reconfiguring OIDC or LDAP, users complete the IdP flow but bounce between the IdP and the FreeSDN login page, or land on /login?error=....

Likely causes are clock skew between FreeSDN and the IdP, a certificate rotation on the IdP that was not re-uploaded to FreeSDN, or a stale metadata URL.

# Note: $POSTGRES_USER and $POSTGRES_DB are expanded by your local shell, so export them first.
# 1. List the active SSO providers
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "SELECT id, protocol, status FROM core.sso_providers WHERE status = 'active';"
# 2. If a provider is broken, temporarily disable it so users can log in with a password:
docker compose --env-file .env.pro exec postgres \
  psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c \
  "UPDATE core.sso_providers SET status = 'inactive' WHERE id = '<sso_id>';"

Then update the IdP configuration in the UI (Settings → SSO) with the correct metadata URL and re-enable.