
High Availability

The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability: Valkey (the cache/broker) and PostgreSQL (the primary database). It also adds a second API replica behind a load balancer. Failover behavior differs per service - know which failovers are automatic and which are manual before an incident.

| Service | HA mechanism | Failover type | Typical RTO |
| --- | --- | --- | --- |
| Valkey (redis service) | Master + replica + 3 Sentinels | Automatic - Sentinel elects a new master when the primary is unreachable | ~5-15 seconds |
| PostgreSQL | Streaming standby | Manual - operator must promote the standby | ~1-2 minutes (promotion) + config update |
| API | Second replica + nginx round-robin LB | Automatic - nginx LB drains the dead upstream | ~5 seconds |

Layer the HA overlay on top of the Max tier with a second compose file:

docker compose --env-file .env.max \
  -f docker-compose.yml \
  -f docker-compose.ha.yml \
  up -d

The HA overlay adds:

  • freesdn-postgres-standby - streaming replica of the primary, bootstrapped via pg_basebackup
  • freesdn-redis-sentinel-1/2/3 - three Sentinel nodes watching the Valkey master (quorum = 2, down-after-milliseconds = 5000)
  • freesdn-redis-replica - a Valkey replica of the master
  • freesdn-api-2 - a second API container running the same image and env
  • freesdn-ha-lb - nginx round-robin load balancer across the two API replicas
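The Sentinel settings in the list above (quorum = 2, down-after-milliseconds = 5000) can be read as a two-step rule: each Sentinel decides on its own that the master is down once it has had no valid reply for longer than the threshold, and a failover can only start once at least quorum Sentinels agree. The following is a minimal Python sketch of that rule, not Sentinel's actual implementation; the function names are hypothetical.

```python
from typing import Iterable

DOWN_AFTER_MS = 5000  # down-after-milliseconds from the HA overlay
QUORUM = 2            # Sentinels that must agree before a failover can start


def subjectively_down(ms_since_last_reply: int,
                      down_after_ms: int = DOWN_AFTER_MS) -> bool:
    """One Sentinel's local view: no valid reply for longer than the threshold."""
    return ms_since_last_reply > down_after_ms


def objectively_down(ms_since_last_reply_per_sentinel: Iterable[int],
                     quorum: int = QUORUM,
                     down_after_ms: int = DOWN_AFTER_MS) -> bool:
    """Cluster view: at least `quorum` Sentinels consider the master down."""
    votes = sum(subjectively_down(ms, down_after_ms)
                for ms in ms_since_last_reply_per_sentinel)
    return votes >= quorum
```

With three Sentinels and a quorum of 2, a single Sentinel losing sight of the master is not enough to trigger a failover, while the loss of one Sentinel does not block one either.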

Add these env vars to .env.max:

POSTGRES_REPL_USER=repl
POSTGRES_REPL_PASSWORD=<strong-random-value>

Valkey Sentinel monitors the master. When the master has been unreachable for down-after-milliseconds (5000 ms in this overlay; Sentinel's own default is 30 seconds), a quorum of Sentinels (2 of 3) agrees that it is down and the replica is promoted to master. The application's redis_client.py factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.

Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
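The "re-resolve on reconnect" behavior described above can be sketched as a retry loop that asks for the current master address before every attempt. This is an illustration of the pattern only, not the contents of redis_client.py; call_with_master_failover, resolve_master and operation are hypothetical names. With redis-py, the resolver would typically wrap Sentinel.discover_master(service_name), which returns a (host, port) pair.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")
Address = Tuple[str, int]


def call_with_master_failover(
    resolve_master: Callable[[], Address],
    operation: Callable[[Address], T],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation` against the current master, re-resolving after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        address = resolve_master()  # ask Sentinel again on every attempt
        try:
            return operation(address)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)  # give Sentinel time to finish the promotion
    raise ConnectionError(
        f"no reachable master after {attempts} attempts") from last_error
```

Because the address is looked up on each attempt rather than cached, the first call after a promotion lands on the new master without any process restart.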

The standby is a hot streaming replica that receives WAL continuously from the primary. When the primary fails:

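  1. Promote the standby. pg_ctl refuses to run as root, so if the container's default user is root, add -u postgres to the docker exec call:

    docker exec freesdn-postgres-standby pg_ctl promote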
  2. Verify promotion - the result must be f (not in recovery):

    docker exec freesdn-postgres-standby \
      bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'
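  3. Update DB_HOST in .env.max to point at the promoted standby, then recreate the application layer so the containers pick up the new value. A plain docker compose restart reuses the existing containers and does not apply environment changes made through Compose:

    docker compose --env-file .env.max \
      -f docker-compose.yml -f docker-compose.ha.yml \
      up -d --no-deps --force-recreate api api-2 worker worker-io scheduler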
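The verification in step 2 is easy to script so that a runbook or wrapper can wait for the promotion to take effect. The sketch below is one possible approach, not part of FreeSDN: it adds psql's -At flags (unaligned, tuples only) so the output is a bare t or f, and the helper names are hypothetical. The command runner is injectable so the parsing logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, List

STANDBY = "freesdn-postgres-standby"  # container name from the HA overlay

# Same query as step 2; -At makes psql print only the bare value.
VERIFY_CMD: List[str] = [
    "docker", "exec", STANDBY, "bash", "-c",
    'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -At -c "SELECT pg_is_in_recovery();"',
]


def parse_in_recovery(psql_output: str) -> bool:
    """Interpret psql -At output: 't' means still a standby, 'f' means promoted."""
    value = psql_output.strip()
    if value not in ("t", "f"):
        raise ValueError(f"unexpected psql output: {psql_output!r}")
    return value == "t"


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def standby_is_promoted(runner: Callable[[List[str]], str] = run) -> bool:
    """True once the standby reports that it has left recovery."""
    return not parse_in_recovery(runner(VERIFY_CMD))
```

Calling standby_is_promoted() with the default runner executes the docker exec command; anything other than t or f raises, so a failed query is never mistaken for a successful promotion.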

FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

# Valkey Sentinel failover
python scripts/ha_drill.py \
  --scenario redis_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 20

# Postgres manual promotion
python scripts/ha_drill.py \
  --scenario primary_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 60

# API replica loss
python scripts/ha_drill.py \
  --scenario api_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 5

The drill:

  1. Refuses to run against anything other than loopback or RFC 1918 addresses (safety guard)
  2. Confirms baseline health via /api/v1/health/ready
  3. SIGKILLs the target container
  4. Polls /api/v1/health/ready until recovery or timeout
  5. Writes report.json, report.md, and health-timeline.csv into a timestamped subdirectory of the report directory (e.g. drills/redis_kill-20260605T143000Z/)

| Scenario | Target RTO |
| --- | --- |
| redis_kill (Sentinel failover) | ≤ 20 seconds |
| primary_kill (Postgres, manual promotion) | ≤ 60 seconds |
| api_kill (LB drain) | ≤ 5 seconds |
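The safety guard in step 1 of the drill can be approximated with the standard library. This is a sketch under assumptions, not the code in scripts/ha_drill.py: is_safe_target is a hypothetical name, and the sketch rejects hostnames (including localhost) instead of resolving them, which the real script may handle differently.

```python
import ipaddress
from urllib.parse import urlsplit

# Loopback (IPv4 and IPv6) plus the three RFC 1918 private ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "::1/128",
                 "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_safe_target(url: str) -> bool:
    """Allow only literal loopback or RFC 1918 addresses as drill targets."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # assumption: hostnames are rejected rather than resolved
    return any(address in network for network in ALLOWED_NETWORKS)
```

The explicit network list is used instead of ipaddress's is_private attribute, which covers more ranges than RFC 1918.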

A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.
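Measuring RTO against a budget comes down to polling a readiness probe and timing how long it takes to turn healthy again. The sketch below shows that loop in isolation; it is not the drill's own code, and measure_rto and within_budget are hypothetical names. The probe is passed in as a callable (in practice it would wrap an HTTP GET against /api/v1/health/ready), and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_rto(
    is_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until `is_ready()` is true; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def within_budget(rto_s: Optional[float], budget_s: float) -> bool:
    """A drill passes only if recovery happened and took no longer than the budget."""
    return rto_s is not None and rto_s <= budget_s
```

The polling interval bounds the measurement resolution: with a 0.5 s interval, a reported RTO can overstate the true recovery time by up to roughly half a second, which matters for the 5-second api_kill budget.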

The script exits with one of two codes:

| Exit code | Meaning |
| --- | --- |
| 0 | PASS - RTO within budget |
| 1 | Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error |
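Because the result is reported through the exit code, the three scenarios are straightforward to chain in a release gate. The wrapper below is a hypothetical sketch: it only uses the flags shown in the commands on this page, and the function names and default values are illustrative.

```python
import subprocess
import sys
from typing import Dict, List

# Scenario -> RTO budget in seconds, matching the documented targets.
BUDGETS: Dict[str, int] = {"redis_kill": 20, "primary_kill": 60, "api_kill": 5}


def drill_command(scenario: str, budget_s: int,
                  lb_url: str = "http://127.0.0.1:18080",
                  report_dir: str = "drills/") -> List[str]:
    """Build the argv for one drill run."""
    return [sys.executable, "scripts/ha_drill.py",
            "--scenario", scenario,
            "--lb-url", lb_url,
            "--report-dir", report_dir,
            "--rto-budget", str(budget_s)]


def run_all(budgets: Dict[str, int] = BUDGETS) -> Dict[str, int]:
    """Run each scenario in turn and collect its exit code."""
    return {name: subprocess.run(drill_command(name, budget)).returncode
            for name, budget in budgets.items()}


def all_passed(exit_codes: Dict[str, int]) -> bool:
    """The gate is green only if every scenario exited with 0."""
    return all(code == 0 for code in exit_codes.values())
```

A CI job could call run_all() and fail the pipeline when all_passed() returns False. Note that primary_kill involves a manual promotion, so in an unattended run it would need a separate automation step or be excluded from the loop.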

The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.

| Covered | Not covered |
| --- | --- |
| Application-layer failover behavior | Physical host separation / anti-affinity |
| Valkey Sentinel automatic promotion | Network partition simulation |
| API LB drain and RTO measurement | RPO / data-loss measurement |
| Single-host container topology | Multi-host topology |

For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).

Run the drills:

  • Before every release candidate - drill results are part of the release-gate evidence package
  • After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
  • Before enterprise procurement reviews - include the most recent report.md as evidence

Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.
