High Availability
The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability: Valkey (the cache/broker) and PostgreSQL (the primary database). Failover behavior is different for each - know which is automatic before an incident.
Summary
| Service | HA mechanism | Failover type | Typical RTO |
|---|---|---|---|
| Valkey (redis service) | Master + replica + 3 Sentinels | Automatic - Sentinel elects a new master when the primary is unreachable | ~5-15 seconds |
| PostgreSQL | Streaming standby | Manual - operator must promote the standby | ~1-2 minutes (promotion) + config update |
| API | Second replica + nginx round-robin LB | Automatic - nginx LB drains the dead upstream | ~5 seconds |
Bringing up the HA stack
Layer the HA overlay on top of the Max tier with a second compose file:

    docker compose --env-file .env.max \
      -f docker-compose.yml \
      -f docker-compose.ha.yml \
      up -d

The HA overlay adds:

- freesdn-postgres-standby - streaming replica of the primary, bootstrapped via pg_basebackup
- freesdn-redis-sentinel-1/2/3 - three Sentinel nodes watching the Valkey master (quorum = 2, down-after-milliseconds = 5000)
- freesdn-redis-replica - a Valkey replica of the master
- freesdn-api-2 - a second API container running the same image and env
- freesdn-ha-lb - nginx round-robin load balancer across the two API replicas
Add these env vars to .env.max:

    POSTGRES_REPL_USER=repl
    POSTGRES_REPL_PASSWORD=<strong-random-value>

Valkey: automatic failover
Valkey Sentinel monitors the master. When the master has been unreachable for down-after-milliseconds (5000 ms, i.e. 5 seconds, in the HA overlay) and a quorum of Sentinels (2 of 3) agrees that it is down, Sentinel fails over by promoting the replica to master. The application's redis_client.py factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.
Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
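The re-resolution behavior can be illustrated with redis-py's Sentinel support. This is a minimal sketch, not the contents of FreeSDN's redis_client.py: the master name `mymaster`, the comma-separated spec format, and the helper names are assumptions made for illustration. The Sentinel container names come from the overlay list above.

```python
def parse_sentinels(spec: str, default_port: int = 26379) -> list[tuple[str, int]]:
    """Turn 'host1:26379,host2' into [('host1', 26379), ('host2', 26379)]."""
    nodes = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        nodes.append((host, int(port) if port else default_port))
    return nodes


def make_master_client(spec: str, master_name: str = "mymaster"):
    """Return a client that asks the Sentinels for the current master.

    `master_name` is a hypothetical value; use the name configured in
    the Sentinel `monitor` directive of your deployment.
    """
    # Imported lazily so parse_sentinels() works without redis-py installed.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(parse_sentinels(spec), socket_timeout=0.5)
    # master_for() returns a client backed by a Sentinel-aware connection
    # pool: when a connection is re-established, the pool queries the
    # Sentinels again for the master address instead of reusing the old one.
    return sentinel.master_for(master_name, socket_timeout=0.5)


# Example spec built from the overlay's container names:
SENTINEL_SPEC = (
    "freesdn-redis-sentinel-1:26379,"
    "freesdn-redis-sentinel-2:26379,"
    "freesdn-redis-sentinel-3:26379"
)
```

Calling `make_master_client(SENTINEL_SPEC)` needs redis-py installed and the Sentinels reachable. Commands issued during the failover window can still raise connection errors, so callers would typically retry for a few seconds.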
PostgreSQL: manual standby promotion
The standby is a hot streaming replica that receives WAL continuously from the primary. When the primary fails:

1. Promote the standby:

       docker exec freesdn-postgres-standby pg_ctl promote

2. Verify promotion - the result must be f (not in recovery):

       docker exec freesdn-postgres-standby \
         bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'

3. Update DB_HOST in .env.max to point at the promoted standby, then restart the application layer:

       docker compose --env-file .env.max \
         -f docker-compose.yml -f docker-compose.ha.yml \
         restart api api-2 worker worker-io scheduler

   Note: docker compose restart reuses the existing containers and does not pick up environment changes. If the services receive DB_HOST through the compose environment, run up -d in place of restart so they are recreated with the new value.
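The promote-and-verify steps can be wrapped in a small helper so that an operator under pressure does not misread the psql output. The following is a sketch under assumptions: the function names are hypothetical, and it adds psql's `-tA` flags (tuples only, unaligned) so the query prints a bare `t` or `f`. The container name and commands are the ones from the procedure above.

```python
import subprocess

STANDBY = "freesdn-postgres-standby"  # container name from the HA overlay

PROMOTE_CMD = ["docker", "exec", STANDBY, "pg_ctl", "promote"]

# -t (tuples only) and -A (unaligned) make psql print just "t" or "f".
VERIFY_CMD = [
    "docker", "exec", STANDBY, "bash", "-c",
    'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -tA '
    '-c "SELECT pg_is_in_recovery();"',
]


def is_promoted(psql_output: str) -> bool:
    """True when pg_is_in_recovery() returned 'f' (no longer a standby)."""
    return psql_output.strip() == "f"


def promote_and_verify(run=subprocess.run) -> bool:
    """Promote the standby, then confirm that it has left recovery.

    `run` is injectable so the logic can be exercised without Docker.
    """
    run(PROMOTE_CMD, check=True)
    result = run(VERIFY_CMD, check=True, capture_output=True, text=True)
    return is_promoted(result.stdout)
```

A `True` return only confirms the promotion. Updating DB_HOST and restarting the application layer remain manual steps.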
The HA failover drill
FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

    # Valkey Sentinel failover
    python scripts/ha_drill.py \
      --scenario redis_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 20

    # Postgres manual promotion
    python scripts/ha_drill.py \
      --scenario primary_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 60

    # API replica loss
    python scripts/ha_drill.py \
      --scenario api_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 5

The drill:

- Refuses to run against anything other than loopback or RFC1918 addresses (safety guard)
- Confirms baseline health via /api/v1/health/ready
- SIGKILLs the target container
- Polls /api/v1/health/ready until recovery or timeout
- Writes report.json, report.md, and health-timeline.csv into a timestamped subdirectory of the report directory (e.g. drills/redis_kill-20260605T143000Z/)
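To make the safety guard and the polling step concrete, here is a minimal sketch of both. It is not the code of scripts/ha_drill.py: the function names, the 0.5-second poll interval, and the choice to reject hostnames rather than resolve them are assumptions made for illustration.

```python
import http.client
import ipaddress
import time
import urllib.request
from urllib.parse import urlparse

RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_safe_target(url: str) -> bool:
    """Allow only loopback or RFC1918 IP literals."""
    host = urlparse(url).hostname
    try:
        addr = ipaddress.ip_address(host or "")
    except ValueError:
        return False  # simplification: hostnames are rejected, not resolved
    return addr.is_loopback or any(addr in net for net in RFC1918)


def probe_ready(base_url: str, timeout: float = 1.0) -> bool:
    """One readiness probe; any error or non-200 status counts as unhealthy."""
    try:
        url = f"{base_url}/api/v1/health/ready"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def measure_rto(probe, timeout_s: float, interval_s: float = 0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Poll until probe() succeeds; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A caller would check `is_safe_target(lb_url)` first, kill the target container, and then run `measure_rto(lambda: probe_ready(lb_url), timeout_s=...)`. The measured value is only as precise as the poll interval, which matters for the 5-second api_kill budget.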
RTO budgets
| Scenario | Target RTO |
|---|---|
| redis_kill (Sentinel failover) | ≤ 20 seconds |
| primary_kill (Postgres, manual promotion) | ≤ 60 seconds |
| api_kill (LB drain) | ≤ 5 seconds |
A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.
Drill exit codes
| Code | Meaning |
|---|---|
| 0 | PASS - RTO within budget |
| 1 | Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error |
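Because the drill reports pass or fail only through its exit code, a release-gate job can run all three scenarios and fail if any of them returns non-zero. The wrapper below is a hypothetical sketch: `drill_command` and `run_all` are illustrative names, and only the flags and budgets documented on this page are used.

```python
import subprocess
import sys

# Budgets in seconds, taken from the RTO budgets table.
RTO_BUDGETS = {"redis_kill": 20, "primary_kill": 60, "api_kill": 5}


def drill_command(scenario: str, lb_url: str = "http://127.0.0.1:18080",
                  report_dir: str = "drills/") -> list[str]:
    """Build the ha_drill.py invocation for one scenario."""
    return [
        sys.executable, "scripts/ha_drill.py",
        "--scenario", scenario,
        "--lb-url", lb_url,
        "--report-dir", report_dir,
        "--rto-budget", str(RTO_BUDGETS[scenario]),
    ]


def run_all(run=subprocess.run) -> int:
    """Run every scenario in turn; return 0 only if all drills exited 0."""
    failed = []
    for scenario in RTO_BUDGETS:
        code = run(drill_command(scenario)).returncode
        print(f"{scenario}: {'PASS' if code == 0 else 'FAIL'} (exit {code})")
        if code != 0:
            failed.append(scenario)
    return 1 if failed else 0
```

Exit code 1 covers budget overruns, failed preconditions, and unexpected errors alike, so the wrapper cannot tell them apart. Read the generated report.md to find the cause of a failure.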
Drill scope and known limitations
The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.
| Covered | Not covered |
|---|---|
| Application-layer failover behavior | Physical host separation / anti-affinity |
| Valkey Sentinel automatic promotion | Network partition simulation |
| API LB drain and RTO measurement | RPO / data-loss measurement |
| Single-host container topology | Multi-host topology |
For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).
When to run drills
- Before every release candidate - drill results are part of the release-gate evidence package
- After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
- Before enterprise procurement reviews - include the most recent report.md as evidence
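Finding the most recent report.md for a scenario can be scripted from the timestamped directory names. The sketch below assumes every drill output directory follows the `<scenario>-<YYYYMMDDTHHMMSSZ>` pattern of the example drills/redis_kill-20260605T143000Z/. The function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def latest_report(report_dir: str, scenario: str):
    """Return the newest <scenario>-<timestamp>/report.md, or None if absent."""
    newest = None
    for sub in Path(report_dir).glob(f"{scenario}-*"):
        stamp = sub.name[len(scenario) + 1:]
        try:
            when = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
        except ValueError:
            continue  # directory name does not carry a drill timestamp
        report = sub / "report.md"
        if report.is_file() and (newest is None or when > newest[0]):
            newest = (when, report)
    return newest[1] if newest else None
```

For example, `latest_report("drills/", "redis_kill")` returns the path to attach to a release-gate or procurement evidence package, or `None` when no drill has been run for that scenario.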
Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.
All product names, logos, and brands are property of their respective owners. FreeSDN is an independent project and is not affiliated with or endorsed by the vendors it integrates with. See Trademarks.