High Availability

The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability: Valkey (the cache/broker) and PostgreSQL (the primary database). It also adds a second API replica behind an nginx load balancer. Failover behavior differs per service, so know which failovers are automatic before an incident.

| Service | HA mechanism | Failover type | Typical RTO |
| --- | --- | --- | --- |
| Valkey (redis service) | Master + replica + 3 Sentinels | Automatic - Sentinel elects a new master when the primary is unreachable | ~5-15 seconds |
| PostgreSQL | Streaming standby | Manual - operator must promote the standby | ~1-2 minutes (promotion) + config update |
| API | Second replica + nginx round-robin LB | Automatic - nginx LB drains the dead upstream | ~5 seconds |

Layer the HA overlay on top of the Max tier with a second compose file:

docker compose --env-file .env.max \
  -f docker-compose.yml \
  -f docker-compose.ha.yml \
  up -d

The HA overlay adds:

  • freesdn-postgres-standby - streaming replica of the primary, bootstrapped via pg_basebackup
  • freesdn-redis-sentinel-1/2/3 - three Sentinel nodes watching the Valkey master (quorum = 2, down-after-milliseconds = 5000)
  • freesdn-redis-replica - a Valkey replica of the master
  • freesdn-api-2 - a second API container running the same image and env
  • freesdn-ha-lb - nginx round-robin load balancer across the two API replicas

Add these env vars to .env.max:

POSTGRES_REPL_USER=repl
POSTGRES_REPL_PASSWORD=<strong-random-value>

Valkey Sentinel monitors the master. When the master has been unreachable for down-after-milliseconds (5000 ms, i.e. 5 seconds, as configured in the overlay), a quorum of Sentinels (2 of 3) agrees that it is down and promotes the replica to master. The application's redis_client.py factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.

Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
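
The follow-the-master behavior can be illustrated with a small, stdlib-only sketch. This is not the contents of redis_client.py: MasterFollower, resolve_master, and connect are hypothetical names, and the resolver stands in for a Sentinel lookup of the current master address.

```python
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]
Sender = Callable[[str], str]


class MasterFollower:
    """Hypothetical wrapper that re-resolves the master after a connection error."""

    def __init__(self, resolve_master: Callable[[], Address],
                 connect: Callable[[Address], Sender]) -> None:
        self._resolve_master = resolve_master  # stands in for a Sentinel lookup
        self._connect = connect                # opens a connection to an address
        self._send: Optional[Sender] = None
        self.address: Optional[Address] = None

    def _reconnect(self) -> None:
        # Ask the resolver again instead of reusing a cached address.
        self.address = self._resolve_master()
        self._send = self._connect(self.address)

    def execute(self, command: str, retries: int = 1) -> str:
        for attempt in range(retries + 1):
            try:
                if self._send is None:
                    self._reconnect()
                return self._send(command)
            except ConnectionError:
                # Drop the dead connection; the next attempt re-resolves.
                self._send = None
                if attempt == retries:
                    raise
        raise AssertionError("unreachable")
```

The key design choice is that the address is never cached across a failure: every reconnect goes back to the resolver, which is what lets a client follow a promotion without a restart.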

The PostgreSQL standby is a hot streaming replica that continuously receives WAL from the primary. Failover is manual. When the primary fails:

  1. Promote the standby:

    docker exec freesdn-postgres-standby pg_ctl promote
  2. Verify promotion - the result must be f (not in recovery):

    docker exec freesdn-postgres-standby \
      bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'
  3. Update DB_HOST in .env.max to point at the promoted standby, then recreate the application layer so the containers pick up the new value (docker compose restart does not apply changed environment values):

    docker compose --env-file .env.max \
      -f docker-compose.yml -f docker-compose.ha.yml \
      up -d --no-deps --force-recreate api api-2 worker worker-io scheduler
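
Steps 1 and 2 lend themselves to scripting. The sketch below is a hypothetical helper, not a shipped script: promote_and_verify, parse_in_recovery, and run are illustrative names, and the only details taken from this page are the container name and the two commands.

```python
import subprocess
from typing import Callable, List

STANDBY = "freesdn-postgres-standby"  # container name from the steps above

PROMOTE_CMD = ["docker", "exec", STANDBY, "pg_ctl", "promote"]
VERIFY_CMD = [
    "docker", "exec", STANDBY, "bash", "-c",
    'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"',
]


def parse_in_recovery(psql_output: str) -> bool:
    """Return True if psql printed t (still a standby), False if it printed f."""
    for line in psql_output.splitlines():
        value = line.strip()
        if value in ("t", "f"):
            return value == "t"
    raise ValueError("no boolean row found in psql output")


def run(cmd: List[str]) -> str:
    """Run a command, raise on a non-zero exit status, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def promote_and_verify(runner: Callable[[List[str]], str] = run) -> None:
    """Promote the standby, then fail loudly if it is still in recovery."""
    runner(PROMOTE_CMD)
    if parse_in_recovery(runner(VERIFY_CMD)):
        raise RuntimeError("standby is still in recovery; promotion did not complete")
```

Calling promote_and_verify() with no arguments runs the real docker commands. Passing a fake runner lets the parsing logic be exercised without touching a container. Step 3 is intentionally left manual in this sketch.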

FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

# Valkey Sentinel failover
python scripts/ha_drill.py \
  --scenario redis_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 20

# Postgres manual promotion
python scripts/ha_drill.py \
  --scenario primary_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 60

# API replica loss
python scripts/ha_drill.py \
  --scenario api_kill \
  --lb-url http://127.0.0.1:18080 \
  --report-dir drills/ \
  --rto-budget 5

The drill:

  1. Refuses to run against anything other than loopback or RFC1918 addresses (safety guard)
  2. Confirms baseline health via /api/v1/health/ready
  3. SIGKILLs the target container
  4. Polls /api/v1/health/ready until recovery or timeout
  5. Writes report.json, report.md, and health-timeline.csv into a timestamped subdirectory of the report directory (e.g. drills/redis_kill-20260605T143000Z/)
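
The safety guard in step 1 can be approximated as follows. This is a minimal sketch of the idea, not the check that ha_drill.py actually performs: is_safe_target is a hypothetical name, and rejecting every non-literal hostname is an assumption made here to avoid DNS lookups.

```python
import ipaddress
from urllib.parse import urlsplit

# RFC 1918 private ranges, listed explicitly because ip.is_private covers more than these.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_safe_target(url: str) -> bool:
    """Return True only for URLs whose host is a loopback or RFC 1918 IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are rejected in this sketch; accepting them would need DNS resolution.
        return False
    if ip.is_loopback:
        return True
    return ip.version == 4 and any(ip in net for net in RFC1918)
```

With this guard, http://127.0.0.1:18080 and http://192.168.1.10:18080 are accepted, while a public address such as http://8.8.8.8 is refused before any container is killed.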

| Scenario | Target RTO |
| --- | --- |
| redis_kill (Sentinel failover) | ≤ 20 seconds |
| primary_kill (Postgres, manual promotion) | ≤ 60 seconds |
| api_kill (LB drain) | ≤ 5 seconds |

A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.

| Exit code | Meaning |
| --- | --- |
| 0 | PASS - RTO within budget |
| 1 | Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error |
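
How a polled health check turns into an RTO and then into an exit code can be sketched as below. The names measure_rto and exit_code are hypothetical and the real script's internals may differ. is_ready stands for any callable that probes /api/v1/health/ready and returns True once the service reports ready.

```python
import time
from typing import Callable, Optional


def measure_rto(is_ready: Callable[[], bool], timeout: float, interval: float = 0.5,
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll is_ready until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while clock() - start <= timeout:
        if is_ready():
            return clock() - start
        sleep(interval)
    return None


def exit_code(rto: Optional[float], budget: float) -> int:
    """0 = PASS (recovered within budget); 1 = any failure, including a timeout."""
    return 0 if rto is not None and rto <= budget else 1
```

The clock and sleep parameters are injectable so the loop can be tested without waiting in real time. A monotonic clock is used so that wall-clock adjustments during a drill do not distort the measurement.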

The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.

| Covered | Not covered |
| --- | --- |
| Application-layer failover behavior | Physical host separation / anti-affinity |
| Valkey Sentinel automatic promotion | Network partition simulation |
| API LB drain and RTO measurement | RPO / data-loss measurement |
| Single-host container topology | Multi-host topology |

For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).

Run the drills:

  • Before every release candidate - drill results are part of the release-gate evidence package
  • After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
  • Before enterprise procurement reviews - include the most recent report.md as evidence

Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.