High Availability
The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability: Valkey (the cache/broker) and PostgreSQL (the primary database). Failover behavior is different for each - know which is automatic before an incident.
Summary
| Service | HA mechanism | Failover type | Typical RTO |
|---|---|---|---|
| Valkey (`redis` service) | Master + replica + 3 Sentinels | Automatic - Sentinel elects a new master when the primary is unreachable | ~5-15 seconds |
| PostgreSQL | Streaming standby | Manual - operator must promote the standby | ~1-2 minutes (promotion) + config update |
| API | Second replica + nginx round-robin LB | Automatic - nginx LB drains the dead upstream | ~5 seconds |
Bringing up the HA stack
Layer the HA overlay on top of the Max tier with a second compose file:

    docker compose --env-file .env.max \
      -f docker-compose.yml \
      -f docker-compose.ha.yml \
      up -d

The HA overlay adds:
- `freesdn-postgres-standby` - streaming replica of the primary, bootstrapped via `pg_basebackup`
- `freesdn-redis-sentinel-1/2/3` - three Sentinel nodes watching the Valkey master (quorum = 2, `down-after-milliseconds = 5000`)
- `freesdn-redis-replica` - a Valkey replica of the master
- `freesdn-api-2` - a second API container running the same image and env
- `freesdn-ha-lb` - nginx round-robin load balancer across the two API replicas
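Once the stack is up, a quick check that every overlay container is running can catch a failed bootstrap early. The sketch below is one possible way to do that, not something FreeSDN ships. It assumes the names in the list above are the literal Docker container names (with `freesdn-redis-sentinel-1/2/3` expanding to three names) and that the `docker` CLI is on `PATH`; the helper names are illustrative.

```python
import subprocess

# Assumed container names, taken from the overlay list above.
EXPECTED_HA_CONTAINERS = {
    "freesdn-postgres-standby",
    "freesdn-redis-sentinel-1",
    "freesdn-redis-sentinel-2",
    "freesdn-redis-sentinel-3",
    "freesdn-redis-replica",
    "freesdn-api-2",
    "freesdn-ha-lb",
}


def missing_containers(docker_ps_names: str) -> set:
    """Return expected HA containers absent from `docker ps` name output."""
    running = {line.strip() for line in docker_ps_names.splitlines() if line.strip()}
    return EXPECTED_HA_CONTAINERS - running


def running_container_names() -> str:
    """List the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `missing_containers(running_container_names())` returns an empty set when all seven containers are running.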
Add these env vars to .env.max:

    POSTGRES_REPL_USER=repl
    POSTGRES_REPL_PASSWORD=<strong-random-value>

Valkey: automatic failover
Valkey Sentinel monitors the master. When the master is unreachable for `down-after-milliseconds` (5 seconds, as configured in the overlay), a quorum of Sentinels (2 of 3) elects a new master and promotes the replica. The application’s `redis_client.py` factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.
Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
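The re-resolve-on-reconnect behavior can be illustrated with a small sketch. This is not the contents of `redis_client.py`. It assumes a hypothetical `discover_master` callable that returns the current master address (with redis-py, `Sentinel.discover_master(service_name)` plays that role) and a hypothetical `operation` callable that runs one command against that address. The built-in `ConnectionError` stands in for whatever connection exception the real client raises.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")
Address = Tuple[str, int]


def run_with_failover(
    discover_master: Callable[[], Address],  # hypothetical: asks Sentinel for the master
    operation: Callable[[Address], T],       # hypothetical: runs one command against it
    attempts: int = 5,
    delay_s: float = 1.0,
) -> T:
    """Run `operation`, re-resolving the master after each connection error."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # Resolve on every attempt so a newly promoted master is picked up.
            return operation(discover_master())
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_s)
    raise ConnectionError(f"no reachable master after {attempts} attempts") from last_error
```

Resolving the address inside the retry loop, rather than once at startup, is what lets a client follow a promotion without a restart.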
PostgreSQL: manual standby promotion
The standby is a hot streaming replica that receives WAL continuously from the primary. When the primary fails:
1. Promote the standby:

       docker exec freesdn-postgres-standby pg_ctl promote

2. Verify promotion - the result must be `f` (not in recovery):

       docker exec freesdn-postgres-standby \
         bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'

3. Update `DB_HOST` in `.env.max` to point at the promoted standby, then restart the application layer:

       docker compose --env-file .env.max \
         -f docker-compose.yml -f docker-compose.ha.yml \
         restart api api-2 worker worker-io scheduler
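Steps 1 and 2 can be scripted so that the verification result is checked by code rather than by eye. The following is a minimal sketch under stated assumptions, not a shipped tool: it adds psql's `-tA` flags (tuples only, unaligned) to the step 2 query so the output is a bare `t` or `f`, and the function names are illustrative. Step 3 is left manual.

```python
import subprocess

STANDBY = "freesdn-postgres-standby"  # container name from the overlay

# Same query as step 2, with -tA added (an addition in this sketch)
# so psql prints only "t" or "f".
VERIFY_CMD = [
    "docker", "exec", STANDBY, "bash", "-c",
    'psql -tA -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"',
]


def is_promoted(psql_output: str) -> bool:
    """True when pg_is_in_recovery() returned 'f', i.e. the node is now a primary."""
    value = psql_output.strip()
    if value not in ("t", "f"):
        raise ValueError(f"unexpected psql output: {psql_output!r}")
    return value == "f"


def promote_and_verify() -> bool:
    """Run steps 1 and 2; the DB_HOST update and restart remain manual."""
    subprocess.run(["docker", "exec", STANDBY, "pg_ctl", "promote"], check=True)
    result = subprocess.run(VERIFY_CMD, capture_output=True, text=True, check=True)
    return is_promoted(result.stdout)
```

Raising on anything other than `t` or `f` keeps an error message or empty output from being mistaken for a successful promotion.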
The HA failover drill
FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

    # Valkey Sentinel failover
    python scripts/ha_drill.py \
      --scenario redis_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 20

    # Postgres manual promotion
    python scripts/ha_drill.py \
      --scenario primary_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 60

    # API replica loss
    python scripts/ha_drill.py \
      --scenario api_kill \
      --lb-url http://127.0.0.1:18080 \
      --report-dir drills/ \
      --rto-budget 5

The drill:
- Refuses to run against anything other than loopback or RFC1918 addresses (safety guard)
- Confirms baseline health via `/api/v1/health/ready`
- SIGKILLs the target container
- Polls `/api/v1/health/ready` until recovery or timeout
- Writes `report.json`, `report.md`, and `health-timeline.csv` into a timestamped subdirectory of the report directory (e.g. `drills/redis_kill-20260605T143000Z/`)
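The safety guard in the first step can be sketched with Python's standard `ipaddress` module. This illustrates the idea and is not the code in `scripts/ha_drill.py`: the function name is hypothetical, and rejecting hostnames outright (instead of resolving them) is an assumption of this sketch.

```python
import ipaddress
from urllib.parse import urlsplit

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_safe_target(lb_url: str) -> bool:
    """True if the URL's host is a loopback or RFC1918 IP literal."""
    host = urlsplit(lb_url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Hostnames are rejected here rather than resolved;
        # the real guard may behave differently.
        return False
    if addr.is_loopback:
        return True
    return any(addr.version == 4 and addr in net for net in RFC1918_NETWORKS)
```

The explicit network list is used instead of `addr.is_private` because `is_private` also accepts ranges outside RFC1918, such as link-local addresses.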
RTO budgets
| Scenario | Target RTO |
|---|---|
| `redis_kill` (Sentinel failover) | ≤ 20 seconds |
| `primary_kill` (Postgres, manual promotion) | ≤ 60 seconds |
| `api_kill` (LB drain) | ≤ 5 seconds |
A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.
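To make the budget comparison concrete, here is a sketch of how an RTO could be derived from a series of readiness probes and checked against the budgets above. The column layout of `health-timeline.csv` is not described here, so the sketch works on in-memory `(seconds since kill, healthy)` tuples. Its definition of RTO (time of the first healthy probe after the last unhealthy one) is an assumption and may differ from what the drill script computes.

```python
from typing import Optional, Sequence, Tuple

# Budgets in seconds, from the table above.
RTO_BUDGETS_S = {"redis_kill": 20.0, "primary_kill": 60.0, "api_kill": 5.0}

Sample = Tuple[float, bool]  # (seconds since the kill, readiness probe succeeded)


def measured_rto(samples: Sequence[Sample]) -> Optional[float]:
    """Seconds until the first healthy probe after the last unhealthy one.

    Returns 0.0 if no probe failed and None if the service never recovered.
    """
    rto = 0.0
    recovered = True
    for elapsed, healthy in samples:
        if not healthy:
            recovered = False
        elif not recovered:
            # First healthy probe after an outage: record the recovery time.
            rto = elapsed
            recovered = True
    return rto if recovered else None


def within_budget(scenario: str, samples: Sequence[Sample]) -> bool:
    """True if the service recovered within the scenario's RTO budget."""
    rto = measured_rto(samples)
    return rto is not None and rto <= RTO_BUDGETS_S[scenario]
```

With probes that fail from 1 s to 9 s and succeed again at 14 s, `measured_rto` returns 14.0, which is within the `redis_kill` budget but would exceed the `api_kill` budget.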
Drill exit codes
| Code | Meaning |
|---|---|
| `0` | PASS - RTO within budget |
| `1` | Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error |
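Because the drill exits 0 only on PASS, a wrapper script can gate on the exit code alone. The sketch below shows one way to do that, using only the flags and budgets documented on this page. The wrapper and its function names are hypothetical, and whether the stack must be restored between scenarios is outside its scope.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Scenario names and budgets (seconds) as documented above.
SCENARIOS = [("redis_kill", 20), ("primary_kill", 60), ("api_kill", 5)]


def drill_command(scenario: str, budget_s: int,
                  lb_url: str = "http://127.0.0.1:18080",
                  report_dir: str = "drills/") -> List[str]:
    """Build the ha_drill.py invocation using the flags shown earlier."""
    return [
        sys.executable, "scripts/ha_drill.py",
        "--scenario", scenario,
        "--lb-url", lb_url,
        "--report-dir", report_dir,
        "--rto-budget", str(budget_s),
    ]


def failed_scenarios(run: Callable[[Sequence[str]], int]) -> List[str]:
    """Run every scenario and return the names whose exit code was not 0."""
    return [name for name, budget in SCENARIOS
            if run(drill_command(name, budget)) != 0]


def run_subprocess(cmd: Sequence[str]) -> int:
    """Execute one drill and return its exit code."""
    return subprocess.run(cmd).returncode
```

A gate script would then end with something like `sys.exit(1 if failed_scenarios(run_subprocess) else 0)`. Since exit code 1 covers every kind of failure, the written `report.md` is still needed to tell a missed budget from a failed precondition.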
Drill scope and known limitations
The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.
| Covered | Not covered |
|---|---|
| Application-layer failover behavior | Physical host separation / anti-affinity |
| Valkey Sentinel automatic promotion | Network partition simulation |
| API LB drain and RTO measurement | RPO / data-loss measurement |
| Single-host container topology | Multi-host topology |
For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).
When to run drills
- Before every release candidate - drill results are part of the release-gate evidence package
- After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
- Before enterprise procurement reviews - include the most recent `report.md` as evidence
Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.