High Availability

The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability, Valkey (the cache/broker) and PostgreSQL (the primary database), plus a second API replica behind a load balancer. Failover is automatic for Valkey and the API but manual for PostgreSQL - know which is which before an incident.

Summary

Service	HA mechanism	Failover type	Typical RTO
Valkey (`redis` service)	Master + replica + 3 Sentinels	Automatic - Sentinel elects a new master when the primary is unreachable	~5-15 seconds
PostgreSQL	Streaming standby	Manual - operator must promote the standby	~1-2 minutes (promotion) + config update
API	Second replica + nginx round-robin LB	Automatic - nginx LB drains the dead upstream	~5 seconds

Bringing up the HA stack

Layer the HA overlay on top of the Max tier with a second compose file:

docker compose --env-file .env.max \
  -f docker-compose.yml \
  -f docker-compose.ha.yml \
  up -d

The HA overlay adds:

freesdn-postgres-standby - streaming replica of the primary, bootstrapped via pg_basebackup
freesdn-redis-sentinel-1/2/3 - three Sentinel nodes watching the Valkey master (quorum = 2, down-after-milliseconds = 5000)
freesdn-redis-replica - a Valkey replica of the master
freesdn-api-2 - a second API container running the same image and env
freesdn-ha-lb - nginx round-robin load balancer across the two API replicas
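A quick check that every overlay container is actually running can catch a partial start. The following is a minimal sketch, assuming the names listed above are the Docker container names and that docker is on the PATH. The helper names (missing_containers, running_container_names) are illustrative and not part of FreeSDN.

```python
import subprocess

# Container names taken from the overlay list above.
EXPECTED_HA_CONTAINERS = {
    "freesdn-postgres-standby",
    "freesdn-redis-sentinel-1",
    "freesdn-redis-sentinel-2",
    "freesdn-redis-sentinel-3",
    "freesdn-redis-replica",
    "freesdn-api-2",
    "freesdn-ha-lb",
}


def missing_containers(docker_ps_names, expected=EXPECTED_HA_CONTAINERS):
    """Return the expected container names that are absent from the given
    `docker ps` name listing (one name per line), sorted alphabetically."""
    running = {line.strip() for line in docker_ps_names.splitlines() if line.strip()}
    return sorted(set(expected) - running)


def running_container_names():
    """Ask Docker for the names of all running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    # Hypothetical entry point: exits non-zero when any HA container is missing.
    missing = missing_containers(running_container_names())
    if missing:
        raise SystemExit("HA containers not running: " + ", ".join(missing))
    print("All HA overlay containers are running.")
```

Calling main() after docker compose up -d gives a pass/fail answer. It only confirms that the containers exist and are running, not that replication or Sentinel monitoring is healthy.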

Add these env vars to .env.max:

POSTGRES_REPL_USER=repl
POSTGRES_REPL_PASSWORD=<strong-random-value>
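One way to produce the strong random value is Python's standard secrets module. This is a minimal sketch; the 32-byte length and the helper names are assumptions, not FreeSDN requirements.

```python
import secrets


def generate_repl_password(num_bytes=32):
    """Return a URL-safe random string built from num_bytes of OS entropy.

    The URL-safe alphabet (letters, digits, '-' and '_') avoids characters
    that would need quoting in an env file.
    """
    return secrets.token_urlsafe(num_bytes)


def env_line(password):
    """Format the value as a line suitable for .env.max."""
    return f"POSTGRES_REPL_PASSWORD={password}"
```

For example, print(env_line(generate_repl_password())) emits a line that can be pasted into .env.max.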

Valkey: automatic failover

Valkey Sentinel monitors the master. When the master is unreachable for down-after-milliseconds (5000 ms, i.e. 5 seconds, as configured in this overlay), a quorum of Sentinels (2 of 3) agrees that it is down, and the Sentinels promote the replica to master. The application’s redis_client.py factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.

Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
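The re-resolve-on-reconnect behavior can be illustrated with redis-py's Sentinel support, which also speaks the Valkey protocol. This is a sketch of the general technique, not the contents of redis_client.py. The master name "mymaster" and the comma-separated host format are assumptions, and 26379 is the standard Sentinel port.

```python
def parse_sentinel_hosts(spec, default_port=26379):
    """Parse 'host1:port,host2' into [(host, port), ...].

    Entries without an explicit port fall back to default_port.
    """
    hosts = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else default_port))
    return hosts


def connect_to_master(spec, master_name="mymaster"):
    """Return a client that locates the current master through Sentinel.

    master_name is hypothetical; it must match the name the Sentinels
    were configured to monitor.
    """
    # Imported lazily so the parsing helper works without redis-py installed.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(parse_sentinel_hosts(spec), socket_timeout=0.5)
    # The returned client's connection pool asks Sentinel for the master
    # address whenever it opens a new connection, so after a failover the
    # next reconnect lands on the promoted node.
    return sentinel.master_for(master_name, socket_timeout=0.5)
```

A call such as connect_to_master("freesdn-redis-sentinel-1,freesdn-redis-sentinel-2,freesdn-redis-sentinel-3") would use the three Sentinel containers from the overlay. Commands that are in flight during the failover window can still fail and need retry handling in the caller.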

PostgreSQL: manual standby promotion

The standby is a hot streaming replica that receives WAL continuously from the primary. When the primary fails:

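Promote the standby (pg_ctl refuses to run as root, so if the container's default user is root, pass -u with the database OS user, typically postgres, to docker exec):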

docker exec freesdn-postgres-standby pg_ctl promote

Verify promotion - the result must be f (not in recovery):

docker exec freesdn-postgres-standby \
  bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'

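Update DB_HOST in .env.max to point at the promoted standby, then recreate the application containers so they pick up the new value. docker compose restart reuses the existing containers and does not apply environment changes, and --no-deps keeps Compose from trying to start the failed primary:

docker compose --env-file .env.max \
  -f docker-compose.yml -f docker-compose.ha.yml \
  up -d --no-deps api api-2 worker worker-io scheduler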
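The verification step above can be scripted so that a runbook or monitoring job gets a boolean instead of a psql table. This is a minimal sketch, assuming the same container name and environment variables as the commands above. The function names are illustrative, and the psql flags -t (tuples only) and -A (unaligned output) reduce the output to a bare t or f.

```python
import subprocess

STANDBY = "freesdn-postgres-standby"


def recovery_query_command(container=STANDBY):
    """Build a docker exec command that prints only 't' or 'f'."""
    psql = (
        'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" '
        '-tAc "SELECT pg_is_in_recovery();"'
    )
    return ["docker", "exec", container, "bash", "-c", psql]


def is_promoted(psql_output):
    """Return True when pg_is_in_recovery() reported 'f' (not in recovery).

    Raises ValueError on anything other than 't' or 'f' so that an error
    message is never mistaken for a successful promotion.
    """
    value = psql_output.strip()
    if value not in ("t", "f"):
        raise ValueError(f"unexpected psql output: {psql_output!r}")
    return value == "f"


def standby_is_promoted(container=STANDBY):
    """Run the query inside the container and interpret the result."""
    result = subprocess.run(
        recovery_query_command(container),
        capture_output=True, text=True, check=True,
    )
    return is_promoted(result.stdout)
```

Only point the application at the standby once standby_is_promoted() returns True. A result of False means the node is still replaying WAL as a standby and will reject writes.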

The HA failover drill

FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

# Valkey Sentinel failover
python scripts/ha_drill.py \
    --scenario redis_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 20

# Postgres manual promotion
python scripts/ha_drill.py \
    --scenario primary_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 60

# API replica loss
python scripts/ha_drill.py \
    --scenario api_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 5

The drill:

Refuses to run against anything other than loopback or RFC 1918 addresses (safety guard)
Confirms baseline health via /api/v1/health/ready
SIGKILLs the target container
Polls /api/v1/health/ready until recovery or timeout
Writes report.json, report.md, and health-timeline.csv into a timestamped subdirectory of the report directory (e.g. drills/redis_kill-20260605T143000Z/)
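The safety guard and the polling loop in the steps above can be sketched as follows. This illustrates the technique and is not the code in scripts/ha_drill.py. Treating "localhost" as safe, rejecting all other hostnames, and interpreting HTTP 200 as "ready" are assumptions of the sketch.

```python
import ipaddress
import time
import urllib.error
import urllib.request
from urllib.parse import urlparse

RFC1918_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_safe_target(url):
    """Allow only loopback or RFC 1918 IP literals (plus 'localhost')."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # other hostnames are rejected rather than resolved
    if addr.is_loopback:
        return True
    return addr.version == 4 and any(addr in net for net in RFC1918_NETWORKS)


def http_ready(base_url, path="/api/v1/health/ready"):
    """Return True when the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def measure_rto(probe, timeout_s, interval_s=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds at the first success, or None if timeout_s
    passes without one. clock and sleep are injectable for testing.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A caller would check is_safe_target(lb_url) first, kill the target container, and then run measure_rto(lambda: http_ready(lb_url), timeout_s=...). The measured value has a resolution no finer than interval_s.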

RTO budgets

Scenario	Target RTO
`redis_kill` (Sentinel failover)	≤ 20 seconds
`primary_kill` (Postgres, manual promotion)	≤ 60 seconds
`api_kill` (LB drain)	≤ 5 seconds

A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.

Drill exit codes

Code	Meaning
`0`	PASS - RTO within budget
`1`	Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error
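Because the drill reports its result through the exit code, a release gate can run all three scenarios and fail if any of them returns non-zero. The following is a minimal sketch that uses the flags and budgets documented above. The wrapper function names are hypothetical, and the script is assumed to be invoked from the repository root.

```python
import subprocess
import sys

# Budgets in seconds, from the RTO budgets table.
BUDGETS = {"redis_kill": 20, "primary_kill": 60, "api_kill": 5}


def drill_command(scenario, lb_url="http://127.0.0.1:18080", report_dir="drills/"):
    """Build the ha_drill.py command line for one scenario."""
    return [
        sys.executable, "scripts/ha_drill.py",
        "--scenario", scenario,
        "--lb-url", lb_url,
        "--report-dir", report_dir,
        "--rto-budget", str(BUDGETS[scenario]),
    ]


def release_gate(exit_codes):
    """Return 0 only if every scenario exited 0, otherwise 1."""
    return 0 if all(code == 0 for code in exit_codes.values()) else 1


def run_all_drills():
    """Run each scenario in turn and collect the exit codes."""
    codes = {}
    for scenario in BUDGETS:
        codes[scenario] = subprocess.run(drill_command(scenario)).returncode
    return codes
```

In CI, sys.exit(release_gate(run_all_drills())) fails the pipeline on any drill failure. Since exit code 1 covers several causes, the report.md for the failing scenario is where to look for the reason.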

Drill scope and known limitations

The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.

Covered	Not covered
Application-layer failover behavior	Physical host separation / anti-affinity
Valkey Sentinel automatic promotion	Network partition simulation
API LB drain and RTO measurement	RPO / data-loss measurement
Single-host container topology	Multi-host topology

For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).

When to run drills

Before every release candidate - drill results are part of the release-gate evidence package
After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
Before enterprise procurement reviews - include the most recent report.md as evidence

Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.