
Architecture

This page describes how FreeSDN’s pieces fit together: what runs in each process, how a request travels through the stack, what makes writes safe by default, and how the background workers, agent, and storage tier interact. Read this before you start tuning deployment options or building integrations.

Browser / API client
       │ HTTPS
┌─────────────┐
│    Caddy    │  Edge: automatic HTTPS, TLS termination, static SPA files
└──────┬──────┘
       │ HTTP (internal)
┌──────────────────────────────────────────────────────────────────┐
│ FastAPI (gunicorn + uvicorn workers)                             │
│                                                                  │
│ Middleware (outermost → innermost):                              │
│   RequestID + security headers → Request logging →               │
│   Trailing-slash normalize → Body-size limit (1 MiB) →           │
│   CSRF double-submit → Rate limiting (Valkey sliding window)     │
│                                                                  │
│ Core endpoints  /api/v1/auth  /api/v1/users  /api/v1/sites …     │
│ Module routes   /api/v1/{module-id}/…  (filesystem-discovered)   │
│ Vendor "gateway" surface  /api/v1/gateway-*/…                    │
│ WebSocket       /api/v1/ws  (real-time event stream)             │
│                                                                  │
│ Tenant context  org_id from JWT / API key → org scope            │
│ RBAC            7-tier roles + per-user site grants              │
└──────┬───────────────────────────────────┬───────────────────────┘
       │                                   │
       ▼                                   ▼
┌──────────────┐                   ┌───────────────┐
│ Core +       │  Fabric           │ Adapter       │
│ Modules      │  Negotiator ─────►│ Registry      │
│ (10 modules) │                   │ (13 adapters) │
└──────────────┘                   └───────┬───────┘
                                           │ vendor protocol
                                    Network devices /
                                    cameras / PBX /
                                    firewalls / hypervisors
┌──────────────────────────────────────────────────────────────────┐
│ Celery workers                                                   │
│   quick-worker  - API-side tasks (device sync, firmware check)   │
│   io-worker     - long I/O (backup, forensic export, scans)      │
│   scheduler     - Celery beat (cron: SLA eval, DPI roll-ups)     │
└──────────────────────────────────────────────────────────────────┘
┌──────────────┐  ┌────────────────────────┐  ┌────────────────┐
│ PostgreSQL   │  │ TimescaleDB (logdb)    │  │ Valkey 8.1     │
│ 19 (primary) │  │ metrics / events /     │  │ cache, broker, │
│ schemas      │  │ heartbeats / TSDB      │  │ rate-limit,    │
└──────────────┘  └────────────────────────┘  │ WS pubsub      │
                                              └────────────────┘
freesdn-agent (desktop / headless daemon - optional, MIT)
  └─ WS ──► /api/v1/ws (command / heartbeat / scan-result channel)

Caddy sits at the network boundary. It:

  • Terminates TLS. Set CADDY_SITE_ADDRESS to control the mode:
    • :80 - plain HTTP (use behind an existing load balancer)
    • localhost - HTTPS with Caddy’s automatic internal CA
    • freesdn.example.com - HTTPS via Let’s Encrypt
  • Serves the compiled React SPA from the frontend/dist/ directory.
  • Proxies everything under /api/ and /ws to the FastAPI process.
  • Publishes no internal data-tier ports on the host - PostgreSQL, TimescaleDB, and Valkey are reachable only on the Docker-internal network.

An nginx edge escape-hatch is available as a Compose profile for environments that require it, but Caddy is the default.

FastAPI runs under gunicorn with multiple uvicorn async workers. The application factory (create_application()) builds the app, attaches the middleware stack, mounts all routers, then runs the lifespan startup sequence.

Middleware executes in the order below (outermost first). Every inbound request passes all layers before reaching endpoint logic.

| Layer | What it does |
| --- | --- |
| RequestID | Reads or generates X-Request-ID; prefixes client-supplied IDs with ext- to flag log poisoning. Injects security headers on every response: X-Content-Type-Options, X-Frame-Options: DENY, a strict Content-Security-Policy, Permissions-Policy, and HSTS when the connection is HTTPS. |
| Request logging | Structured start / complete log per request; adds X-Response-Time to responses. |
| Trailing-slash normalize | Rewrites the path internally so /sites and /sites/ both route correctly - no 307 redirect that would leak the internal proxy host in Location. |
| Body-size limit | Rejects requests where Content-Length exceeds 1 MiB (413). Guards against DoS via oversized stage payloads. |
| CSRF | Double-submit cookie check on all state-changing methods. GET, HEAD, OPTIONS, the public auth flow, and API-key-only requests (no cookie present) are exempt. When both a cookie and an X-API-Key / Bearer token are present, CSRF is still enforced. |
| Rate limiting | Valkey sorted-set sliding window. Per-user bucket keyed from the JWT sub claim (local HMAC verify, no DB round-trip), falling back to rl:ip:<ip>. Auth endpoints fail closed (503) when Valkey is unavailable; all other endpoints fail open. Adds X-RateLimit-Limit / X-RateLimit-Remaining headers. |

Default limits: 600 requests/minute per principal, burst 120/second. Auth endpoints are limited separately at 5/minute per IP.

When the process starts, the lifespan hook wires subsystems in order. Each subsystem reports into SUBSYSTEM_STATUS which drives GET /health:

  1. Event bus connect + subscribe
  2. Adapter connection pool start
  3. Cross-pod WebSocket pubsub (Valkey channel; single_pod status when Valkey is absent)
  4. Module loader - discover_modules() scans the filesystem, loads all 10 modules, registers their routers at /api/v1/{module-id}/
  5. Automation engine start
  6. Fabric negotiator wire_and_start()
  7. Third-party plugin loader - loads each plugin, registers routes, starts per-org (respecting PluginOrganizationState.is_enabled)
  8. Background initial device sync + firmware check (not awaited - must not block boot)
  9. DPI built-in rule seeding
  10. In-process HLS session reaper (15-second loop - must run in the API process, not Celery)

Shutdown reverses this sequence: drain WebSocket sessions, unload modules and plugins, stop automation and Fabric, disconnect Valkey pubsub and the event bus.
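The ordered-start, reverse-order-stop pattern described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the `Subsystem` class and `run_lifespan` function are hypothetical, and recording a failed subsystem as an error while continuing to boot is a guess about behaviour, not something this page states. Only the SUBSYSTEM_STATUS name comes from the text.

```python
import asyncio

# Status map of the kind a health endpoint could report from.
SUBSYSTEM_STATUS: dict[str, str] = {}


class Subsystem:
    """Hypothetical subsystem that records its start/stop calls."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self._log = log

    async def start(self) -> None:
        self._log.append(f"start:{self.name}")

    async def stop(self) -> None:
        self._log.append(f"stop:{self.name}")


async def run_lifespan(subsystems: list[Subsystem]) -> None:
    started: list[Subsystem] = []
    try:
        for sub in subsystems:
            try:
                await sub.start()
                SUBSYSTEM_STATUS[sub.name] = "ok"
                started.append(sub)
            except Exception as exc:  # assumption: degrade rather than abort boot
                SUBSYSTEM_STATUS[sub.name] = f"error: {exc}"
        # The application would serve requests at this point.
    finally:
        # Shutdown walks the started subsystems in reverse order.
        for sub in reversed(started):
            await sub.stop()
```

In a FastAPI application this logic would typically live inside the lifespan context manager, with the serving phase at the `yield`.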

The sequence for a typical authenticated REST request (for example, GET /api/v1/switches/ports?site_id=...):

  1. Caddy receives the HTTPS request, terminates TLS, proxies to FastAPI.
  2. Middleware stack runs in order: request ID assigned, security headers queued, body-size checked, CSRF skipped (GET), rate-limit token consumed.
  3. Dependency injection - FastAPI resolves get_current_active_user:
    • Tries Bearer header, then freesdn_access httpOnly cookie.
    • verify_token() validates JWT signature, expiry, aud=freesdn-api, iss=freesdn, and the jti revocation blacklist. After the user record is loaded from the database, get_current_user_optional() compares the JWT tv (token version) claim against user.token_version - a stale token minted before a password change or logout-all event is rejected at that point (the DB lookup is required to read token_version).
    • Builds a CurrentUser principal carrying user, permissions, accessible_site_ids, and the scoped flag (set when an API key with an explicit scope list is used).
  4. Permission check - the route dependency (e.g. require_permissions("network:read")) evaluates against the principal. For a scoped API key, super_admin implicit grants do not bypass the explicit scope ceiling.
  5. Tenant context - site_id from the query param is validated against the user’s accessible_site_ids + org membership. Service methods receive the org-scoped session; queries add WHERE organization_id = :org_id.
  6. Service / adapter - the service fetches data from PostgreSQL or calls the adapter to read live state from the controller.
  7. Response - serialized through Pydantic v2 schemas, returned as JSON. Secrets are stripped by redact_secrets before any adapter response reaches endpoint logic.

Reads and writes are org-scoped at the application layer, in every service method. There is no PostgreSQL Row-Level Security.

For per-user scoping within an organization, FreeSDN uses hybrid site grants: a user who has one or more UserSiteAccess rows becomes site-limited and can only see the sites explicitly granted. A user with no grants sees all sites in the organization (backward compatibility for small teams). Unknown access levels fail closed (deny).
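The hybrid site-grant rule reduces to a small function. This sketch assumes grants are available as a mapping of site ID to access level; the function name and the level names `read` and `write` are hypothetical placeholders, since this page does not list the actual levels.

```python
# Hypothetical access levels; the real set is not given on this page.
KNOWN_LEVELS = {"read", "write"}


def visible_site_ids(org_site_ids: set[str], grants: dict[str, str]) -> set[str]:
    """Return the sites a user may see within their organization.

    `grants` maps site_id -> access level, one entry per UserSiteAccess row.
    """
    if not grants:
        # No grants at all: the user sees every site in the organization.
        return set(org_site_ids)
    # One or more grants: the user is site-limited. Unknown levels fail closed.
    return {
        site_id
        for site_id, level in grants.items()
        if level in KNOWN_LEVELS and site_id in org_site_ids
    }
```

Note the consequence of failing closed: a user whose only grant carries an unrecognized level sees no sites, rather than falling back to the all-sites default.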

Modules are not hard-coded into the router. The module loader scans the filesystem, discovers all 10 module packages, and registers each module’s router at startup. This means:

  • Module routes appear in the OpenAPI schema only after startup.
  • Enabling or disabling a module for an organization is a runtime toggle; the routes are registered regardless, but the service layer enforces the per-org enable flag.
  • The full module API surface is assembled at runtime rather than declared in one static router file.

Each module mounts at /api/v1/{module-id}/. The vendor adapter “gateway” surface (Omada, OPNsense, pfSense, MikroTik, Proxmox, UniFi, OpenWrt) registers additional routers at /api/v1/gateway-{area}/... (e.g. /api/v1/gateway-vpn/, /api/v1/gateway-opnsense-firewall/, /api/v1/gateway-mikrotik-routing/) plus /api/v1/unifi/... for the UniFi-specific surface.

An Adapter is a typed vendor driver. All 13 adapters are auto-registered at startup and pooled. When API logic needs to talk to a device, it resolves the adapter for that controller from the registry, calls the normalized operation, and the adapter translates it into the vendor protocol (REST, SOAP/ONVIF, WebSocket JSON-RPC, AMI, CLI).

Every adapter response passes through redact_secrets - a ~120-key camelCase-aware filter - before leaving the adapter layer. This strips credentials, PSKs, RADIUS secrets, and similar sensitive values regardless of which adapter returned them.

FreeSDN’s most important safety property: writes do not touch live devices by default.

The dual gate has two independent conditions that must both be true for a change to reach a controller:

  1. Both ADAPTER_READ_ONLY=false and OMADA_READ_ONLY=false (environment variables, both default true). The staging service uses OR logic: if either is true, all writes are staged. OMADA_READ_ONLY is a legacy per-Omada alias kept for clarity - both must be explicitly cleared for live writes to be dispatched.
  2. The apply call carries force=true in the request body

If either condition is false, the change is accepted and staged, but never dispatched to the device. This means you can connect FreeSDN to a live production controller in read-only mode and explore its state without any risk of accidental changes.
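The dual gate can be written as one boolean function. This is a sketch of the rule as described, not the staging service itself: the function name is hypothetical, and treating only the literal string `false` (case-insensitive) as clearing a flag is an assumption about how the environment variables are parsed.

```python
def should_dispatch(env: dict[str, str], force: bool) -> bool:
    """Return True only when a change may be pushed to a live controller."""

    def read_only(name: str) -> bool:
        # Both flags default to true; anything other than "false" keeps the gate shut.
        return env.get(name, "true").strip().lower() != "false"

    # Condition 1: OR logic, so either flag being set forces staging.
    gate_open = not (read_only("ADAPTER_READ_ONLY") or read_only("OMADA_READ_ONLY"))
    # Condition 2: the apply call must carry force=true.
    return gate_open and force
```

With an empty environment the function returns False for every input, which is the read-only default the paragraph above relies on.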

The staged-write flow:

  1. Stage - operator authors a change via the UI or API. FreeSDN writes a PendingChange row to the database. The controller is not contacted.
  2. Review - the pending change is visible as a diff. A second authorized user (or the same user, depending on your workflow) can inspect it.
  3. Apply - an explicit POST /api/v1/gateway-vpn/changes/{change_id}/apply with {"force": true} pushes the change to the controller via the adapter. (The change row already carries the controller and site associations - the apply endpoint takes only change_id.)
  4. Discard - a POST .../changes/{change_id}/discard removes the staged change without applying it.

The apply endpoint resolves the required permission from the change.feature field after fetching the row - one endpoint covers all feature domains:

| Feature prefix | Required permission |
| --- | --- |
| vpn.* | vpn:write |
| firewall.* / opnsense.* / pfsense.* | firewall:write |
| proxmox.* | hypervisor:write |
| mikrotik.* | network:write (controller:write for destructive subsets) |
| unifi.* | network:write (controller:write for destructive subset) |
| system.* / monitoring.* | controller:write |
| (default) | network:write |
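The prefix-to-permission mapping above can be expressed as a lookup on the first dotted segment of change.feature. In this sketch the function name and the `destructive` flag are hypothetical; how the apply endpoint actually identifies the destructive MikroTik and UniFi subsets is not described here, and the feature names in the usage line are made up for illustration.

```python
def required_permission(feature: str, destructive: bool = False) -> str:
    """Resolve the permission needed to apply a staged change."""
    prefix = feature.split(".", 1)[0]
    if prefix == "vpn":
        return "vpn:write"
    if prefix in {"firewall", "opnsense", "pfsense"}:
        return "firewall:write"
    if prefix == "proxmox":
        return "hypervisor:write"
    if prefix in {"mikrotik", "unifi"}:
        # Destructive subsets are escalated to controller:write.
        return "controller:write" if destructive else "network:write"
    if prefix in {"system", "monitoring"}:
        return "controller:write"
    return "network:write"  # default row of the table
```

For example, `required_permission("opnsense.alias")` resolves to `firewall:write`, while an unrecognized prefix falls through to `network:write`.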

Catastrophic operations (VM destroy, node reboot/shutdown, snapshot rollback, backup restore, firmware installs, factory reset, config restore) additionally require has_min_role("site_admin") at both stage time and apply time. The stage-time gate closes the “queue-poison” window where a lower-privileged user stages a destructive change for a higher-privileged user to unknowingly apply.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/gateway-{omada-area}/{controller_id}/sites/{site_id}/changes/{feature} | Stage a change (Omada areas: vpn, firewall, wifi, bulk, firmware, hotspot, profiles, routing, switch-advanced, system) |
| POST | /api/v1/gateway-{area}/{controller_id}/changes/{feature} | Stage a change (non-Omada: mikrotik-vpn, opnsense-vpn, opnsense-firewall, pfsense-vpn, pfsense-firewall, proxmox-firewall, openwrt-firewall, unifi-networks) |
| GET | /api/v1/gateway-{omada-area}/{controller_id}/sites/{site_id}/changes | List pending changes (Omada areas) |
| GET | /api/v1/gateway-{area}/{controller_id}/changes | List pending changes (non-Omada areas) |
| POST | /api/v1/gateway-vpn/changes/{change_id}/apply | Apply (push to device) |
| POST | /api/v1/gateway-vpn/changes/{change_id}/discard | Discard without applying |
| GET | /api/v1/gateway-vpn/changes/by-gateway/{gateway_id} | Fanout pending-changes view |

The Fabric is FreeSDN’s event-driven integration layer. It exposes a single tier-tagged catalog at GET /api/v1/fabric/catalog that lists every operation and event across all modules - native and plugin alike.

Operators author Connections: an inbound event (from any of the 7 event sources) triggers a step chain. Steps can invoke operations from any of the 6 modules that declare operations, send notifications, write log records, or call outbound webhooks.

The in-process Negotiator drives step execution. It uses Valkey SET-NX for at-most-once delivery under multi-worker fan-out, so the same event does not trigger a Connection twice when multiple API workers receive it.

Key safety properties of the Fabric:

  • Write steps are staged and require per-action sign-off.
  • Inbound ingestion (POST /api/v1/fabric/ingest) requires an org-scoped API key.
  • Outbound webhook targets are SSRF-validated: RFC 1918, CGNAT, loopback, and IPv4-mapped addresses are denied; redirects are not followed.
  • The n8n community node (n8n-nodes-freesdn) integrates FreeSDN with n8n workflows using the same ingest/webhook surface.
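The address categories denied for outbound webhook targets can be checked with the standard-library ipaddress module. This sketch covers only the categories named in the list above and is not the project's validator; the function and constant names are hypothetical, and a real check would also need to run against the resolved IP rather than the hostname.

```python
import ipaddress

# RFC 1918 private ranges, CGNAT shared address space, and IPv4 loopback.
DENIED_V4_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "100.64.0.0/10",
        "127.0.0.0/8",
    )
]


def is_denied_target(ip_text: str) -> bool:
    """Return True when an outbound request to this address should be refused."""
    ip = ipaddress.ip_address(ip_text)
    if ip.version == 6:
        # IPv6 loopback, plus any IPv4-mapped address (::ffff:a.b.c.d).
        return ip.is_loopback or ip.ipv4_mapped is not None
    return any(ip in network for network in DENIED_V4_NETWORKS)
```

Denying every IPv4-mapped address, rather than unwrapping it and re-checking, follows the wording above and avoids a class of bypasses where a private IPv4 address is smuggled in IPv6 form.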

The WebSocket endpoint at /api/v1/ws provides a real-time event stream to browser clients and the desktop agent.

Authentication uses the freesdn_access httpOnly cookie or an auth-message frame sent within 10 seconds of connect. Query-string ?token= auth is deprecated (leaks tokens into server logs) and logs a warning.

Server-to-client events are filtered:

  • Org filter - drops events whose org_id does not match the connection’s organization. Fails closed in both directions: if either the receiver or the event has no org_id, the event is dropped.
  • Site scope - for site-limited users, drops site-tagged events outside their UserSiteAccess grants.
  • Payload sanitization - strips password, api_key, token, secret, refresh_token, and encryption_key fields before delivery.
  • Session revalidation - every 5 minutes the server checks is_active, token_version, and deleted_at for each live connection. Revoked sessions receive a session_revoked message and are closed.

Connection limits: 25 WebSocket connections per user, 5,000 globally, 200 event subscriptions per connection.

Cross-pod delivery - when multiple API replicas run, targeted send_to_user publishes via a Valkey pubsub channel so any pod can deliver to a user regardless of which pod holds the connection. No-op in single-pod deployments.

Background work runs in dedicated worker processes. Two worker types and a separate scheduler are defined:

| Worker | Purpose |
| --- | --- |
| quick-worker | Short API-side tasks: initial device sync on startup, firmware availability checks, notification dispatch |
| io-worker | Long-running I/O: configuration backup jobs, forensic video export, large discovery scans, off-site DR transfers |
| scheduler | Celery beat - cron-driven: SLA evaluation, DPI metric roll-ups, stale-agent cleanup, backup pruning |

The scheduler (Celery beat) runs as a separate container to avoid clock-skew issues in multi-worker deployments. Workers use Valkey as both the broker and result backend.

The primary database holds all configuration and operational state across 19 schemas: core, devices, events, enterprise, analytics, agents, vpn, network, audit, ai, collector, cameras, firewall, voip, access, backup, gateway, hypervisor, and fabric.

The fabric schema (migration 039) stores Fabric Connection definitions and their per-firing audit runs (connection_runs).

Database access uses SQLAlchemy 2.0 async with asyncpg. The connection pool is 20 connections plus 30 overflow per worker process. Alembic manages schema migrations; the first boot runs them automatically.

TimescaleDB (logdb - time-series database)


A separate TimescaleDB instance holds all time-series data: SNMP trap events, syslog records, NetFlow samples, device heartbeats, SLA metrics, and camera event records. Continuous aggregates roll up metrics for dashboards without scanning raw tables.

LOGDB_URL is required in production and staging. The app refuses to boot without it.

Valkey 8.1 (cache, broker, rate-limit, pubsub)


Valkey is a drop-in Redis replacement. FreeSDN retains the redis:// URL scheme and redis service name for compatibility.

Valkey serves four distinct roles:

| Role | Details |
| --- | --- |
| Celery broker + results | Task queue for quick-worker, io-worker, and scheduler |
| Rate-limit windows | Sorted-set sliding windows per-user and per-IP |
| WebSocket pubsub | Cross-pod targeted delivery channel |
| Session cache | JWT blacklist (jti revocation), auth rate-limit counters |

The high-availability configuration (docker-compose.ha.yml) runs one Valkey master, one replica, and three Sentinels. Valkey failover is automatic via Sentinel promotion. The API’s Redis client factory (app/core/redis_client.py) resolves the current master on every connection so it follows Sentinel promotions without restart.

The freesdn-agent package (MIT license, v1.0.0, alpha) is an optional desktop application and headless daemon that runs on Windows, Linux, and macOS (Python >= 3.11).

The agent connects to FreeSDN over WebSocket at /api/v1/ws and provides:

  • 14 active discovery scanners - network topology, device fingerprinting, and service detection
  • 5 passive listeners - monitors traffic and system events locally
  • Capability advertisement - the agent reports what it can do; the platform issues commands via the WebSocket command set
  • Cron scheduled scans - configurable scan schedules managed via /api/v1/agents/{agent_id}/schedules
  • ECDSA-P256 signed auto-update - updates are signature-verified and fail closed (a bad signature blocks the update rather than applying it)

The agent is useful for reaching networks where the FreeSDN server has no direct layer-3 path to the devices - for example, a remote branch with NAT between the branch LAN and the FreeSDN host.

Browse the full API surface in the interactive OpenAPI docs at /api/v1/docs on a running non-production instance. Module, plugin, and vendor adapter routers are registered at runtime.

Key platform-level endpoint groups:

| Area | Base path | Notes |
| --- | --- | --- |
| Auth | /api/v1/auth/ | Login, MFA, refresh, sessions, password management |
| SSO | /api/v1/auth/sso/ | OIDC (working), LDAP (working); SAML 501-gated |
| API keys | /api/v1/api-keys/ | Scoped keys, 50-key per-user ceiling |
| Users / orgs / sites | /api/v1/users/, /organizations/, /sites/ | Core admin |
| Controllers | /api/v1/controllers/ | Add/remove/sync controllers |
| Discovery | /api/v1/discovery/ | 4-phase scan pipeline; adopt discovered devices |
| Switches / APs | /api/v1/switches/, /access-points/ | Switch and access point management |
| VPN | /api/v1/vpn/ | VPN management and orchestration |
| Fabric | /api/v1/fabric/ | Catalog, connections, ingest, webhooks |
| Agents | /api/v1/agents/ | Register, heartbeat, tasks, schedules, releases |
| WebSocket | /api/v1/ws | Real-time event stream |
| Health | /health, /api/v1/health/ | Liveness, readiness, subsystem status |

Module-specific routes (cameras, VoIP, firewall, hypervisor, etc.) mount under their own prefixes and are registered at startup by the module loader.

Several security mechanisms apply at the platform layer, not per-route:

  • JWT validation - signature, expiry, aud, iss, and token-version claims are verified on every authenticated request. Role is read from the database, not the JWT claim, so a stolen JWT cannot convey a promoted role.
  • CSRF - double-submit cookie on all mutations; API-key-only requests without a session cookie are exempt.
  • Scoped API keys - a key with an explicit scope list marks the principal as scoped=True. Even a super_admin owner’s key cannot exceed the declared scope.
  • Secret redaction - redact_secrets strips ~120 sensitive field names (camelCase-aware) from every adapter response before it leaves the adapter layer.
  • SSRF - safe_http_request resolves DNS once, pins the IP, follows no redirects, and blocks RFC 1918 / CGNAT / loopback / IPv4-mapped targets.
  • Rate-limit fail modes - auth endpoints fail closed (503) on Valkey outage; non-auth endpoints fail open to avoid service disruption from a Valkey blip.
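The CSRF rule in the list above, including the case where a session cookie and an API key arrive together, can be sketched as a single decision function. The CSRF cookie and header names here are assumptions (the page names only the freesdn_access session cookie), the function name is hypothetical, and the public auth flow exemption is omitted for brevity.

```python
import hmac

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
SESSION_COOKIE = "freesdn_access"  # session cookie named in this document
CSRF_COOKIE = "csrf_token"         # assumed cookie name
CSRF_HEADER = "x-csrf-token"       # assumed header name (lower-cased)


def csrf_check_passes(method: str, cookies: dict[str, str],
                      headers: dict[str, str]) -> bool:
    """Double-submit check: the header value must echo the cookie value."""
    if method.upper() in SAFE_METHODS:
        return True
    # API-key-only request: no session cookie, so there is nothing for a
    # cross-site page to ride on and the check is skipped.
    if SESSION_COOKIE not in cookies:
        return True
    # A session cookie is present, so CSRF is enforced even if an API key
    # or Bearer token is also supplied.
    cookie_token = cookies.get(CSRF_COOKIE, "")
    header_token = headers.get(CSRF_HEADER, "")
    if not cookie_token or not header_token:
        return False
    return hmac.compare_digest(cookie_token.encode(), header_token.encode())
```

Keying the exemption on the absence of the session cookie, rather than the presence of an API key, is what keeps a browser session from bypassing the check by adding a header.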

See Security Model and Roles and Permissions for the full treatment.

  • Deployment Tiers - sizing the worker and database tier for Lite, Pro, Max, and HA
  • Configuration - every environment variable
  • Security Model - threat model, security controls, and what “application-layer isolation” means in practice
  • Roles and Permissions - the 7-tier hierarchy and per-user site grants in detail
  • Fabric - Connections, the catalog, and n8n integration
  • Adapters Overview - per-adapter capability matrix and maturity tiers