High Availability

The Max-tier HA overlay adds redundancy for the two stateful services most critical to availability: Valkey (the cache/broker) and PostgreSQL (the primary database). It also adds a second API replica behind an nginx load balancer. Failover is automatic for Valkey and the API but manual for PostgreSQL - know which is which before an incident.

Summary

Service	HA mechanism	Failover type	Typical RTO
Valkey (`redis` service)	Master + replica + 3 Sentinels	Automatic - Sentinel elects a new master when the primary is unreachable	~5-15 seconds
PostgreSQL	Streaming standby	Manual - operator must promote the standby	~1-2 minutes (promotion) + config update
API	Second replica + nginx round-robin LB	Automatic - nginx LB drains the dead upstream	~5 seconds

Bringing up the HA stack

Layer the HA overlay on top of the Max tier with a second compose file:

docker compose --env-file .env.max \
  -f docker-compose.yml \
  -f docker-compose.ha.yml \
  up -d

The HA overlay adds:

freesdn-postgres-standby - streaming replica of the primary, bootstrapped via pg_basebackup
freesdn-redis-sentinel-1/2/3 - three Sentinel nodes watching the Valkey master (quorum = 2, down-after-milliseconds = 5000)
freesdn-redis-replica - a Valkey replica of the master
freesdn-api-2 - a second API container running the same image and env
freesdn-ha-lb - nginx round-robin load balancer across the two API replicas

Add these env vars to .env.max:

POSTGRES_REPL_USER=repl
POSTGRES_REPL_PASSWORD=<strong-random-value>
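One way to produce the strong random value is Python's standard `secrets` module. This is a minimal sketch; the helper name `make_repl_password` is hypothetical and not part of FreeSDN.

```python
import secrets


def make_repl_password(nbytes: int = 32) -> str:
    """Return a random string built from nbytes of OS-level entropy."""
    # token_urlsafe emits only A-Z, a-z, 0-9, '-' and '_', so the value
    # needs no quoting or escaping inside an env file.
    return secrets.token_urlsafe(nbytes)


# Print a line that can be pasted into .env.max
print(f"POSTGRES_REPL_PASSWORD={make_repl_password()}")
```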

Valkey: automatic failover

Valkey Sentinel monitors the master. When the master has been unreachable for down-after-milliseconds (5000 ms, i.e. 5 seconds, as set by the HA overlay), the Sentinels must agree that it is down (quorum = 2 of 3); they then elect a leader, which promotes the replica to master. The application’s redis_client.py factory re-resolves the master via Sentinel on reconnect, so the application follows the new master without a restart.

Observed in a live drill: master paused → Sentinel promotes in ~9 s → API follows in ~5 s = ~14 s total recovery time.
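The "re-resolve on reconnect" behavior can be illustrated without any Redis library. The sketch below is not FreeSDN's redis_client.py: `resolve_master` stands in for a Sentinel lookup, `connect` for opening a connection, and every name here is hypothetical.

```python
from typing import Callable, Optional, Tuple, TypeVar

Address = Tuple[str, int]
T = TypeVar("T")


class MasterFollowingClient:
    """Asks for the current master again whenever a connection turns out to be dead."""

    def __init__(self, resolve_master: Callable[[], Address],
                 connect: Callable[[Address], object]) -> None:
        self._resolve_master = resolve_master  # hypothetical: queries Sentinel
        self._connect = connect                # hypothetical: opens a connection
        self._conn: Optional[object] = None

    def _connection(self) -> object:
        if self._conn is None:
            # Resolve at (re)connect time, never once at startup, so a
            # promoted replica is picked up without restarting the process.
            self._conn = self._connect(self._resolve_master())
        return self._conn

    def execute(self, command: Callable[[object], T], retries: int = 1) -> T:
        """Run command(conn); on ConnectionError drop the connection and retry."""
        if retries < 0:
            raise ValueError("retries must be >= 0")
        for attempt in range(retries + 1):
            try:
                return command(self._connection())
            except ConnectionError:
                self._conn = None  # stale: the next attempt re-resolves the master
                if attempt == retries:
                    raise
        raise AssertionError("unreachable")
```

The design point is that the master address is never cached across reconnects, which is why the ~5 s "API follows" phase ends as soon as a reconnect reaches the newly promoted master.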

PostgreSQL: manual standby promotion

The standby is a hot streaming replica that receives WAL continuously from the primary. When the primary fails:

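Promote the standby. pg_ctl refuses to run as root, so run it as the postgres OS user (this assumes the official postgres image, where that user exists):

docker exec -u postgres freesdn-postgres-standby pg_ctl promote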

Verify promotion - the result must be f (not in recovery):

docker exec freesdn-postgres-standby \
  bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT pg_is_in_recovery();"'

Update DB_HOST in .env.max to point at the promoted standby, then restart the application layer:

docker compose --env-file .env.max \
  -f docker-compose.yml -f docker-compose.ha.yml \
  restart api api-2 worker worker-io scheduler
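The verification step can be scripted so that the application restart only happens once the standby has really left recovery. The sketch below wraps the same psql query through `subprocess`; the `-tA` flags (tuples only, unaligned) are an addition so that psql prints a bare `t` or `f`, and the function names are hypothetical.

```python
import subprocess
import time
from typing import Callable

STANDBY = "freesdn-postgres-standby"  # container name from the HA overlay


def parse_in_recovery(psql_output: str) -> bool:
    """Map psql's 't'/'f' output to a bool; True means still a standby."""
    value = psql_output.strip()
    if value not in ("t", "f"):
        raise ValueError(f"unexpected psql output: {value!r}")
    return value == "t"


def query_in_recovery() -> bool:
    """Run SELECT pg_is_in_recovery() inside the standby container."""
    sql = "SELECT pg_is_in_recovery();"
    cmd = ["docker", "exec", STANDBY, "bash", "-c",
           f'psql -tA -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "{sql}"']
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_in_recovery(out)


def wait_until_promoted(check: Callable[[], bool] = query_in_recovery,
                        timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until the server reports it is out of recovery.

    Returns True once promoted, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_promoted()` right after the promote command and proceeding to the DB_HOST update only when it returns True keeps the manual procedure from racing the promotion.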

The HA failover drill

FreeSDN ships scripts/ha_drill.py - run it before relying on HA in production. The script injects failures and measures RTO against a configurable budget.

# Valkey Sentinel failover
python scripts/ha_drill.py \
    --scenario redis_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 20

# Postgres manual promotion
python scripts/ha_drill.py \
    --scenario primary_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 60

# API replica loss
python scripts/ha_drill.py \
    --scenario api_kill \
    --lb-url http://127.0.0.1:18080 \
    --report-dir drills/ \
    --rto-budget 5

The drill:

Refuses to run against anything other than loopback or RFC 1918 addresses (safety guard)
Confirms baseline health via /api/v1/health/ready
SIGKILLs the target container
Polls /api/v1/health/ready until recovery or timeout
Writes report.json, report.md, and health-timeline.csv into a timestamped subdirectory of the report directory (e.g. drills/redis_kill-20260605T143000Z/)
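The guard and the polling loop above can be sketched with the standard library alone. This is an illustration of the technique, not the code in scripts/ha_drill.py: the function names are hypothetical, "ready" is assumed to mean an HTTP 200 response, and hostnames other than `localhost` are rejected because the sketch does not resolve DNS.

```python
import http.client
import ipaddress
import time
import urllib.request
from typing import Callable, Optional
from urllib.parse import urlsplit

# The three private IPv4 ranges defined by RFC 1918.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_safe_target(url: str) -> bool:
    """True only when the URL's host is loopback or an RFC 1918 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal
    if addr.is_loopback:
        return True
    return addr.version == 4 and any(addr in net for net in RFC1918)


def http_ready(base_url: str) -> bool:
    """Probe the readiness endpoint; any network or HTTP error counts as not ready."""
    url = base_url.rstrip("/") + "/api/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200  # assumption: 200 means ready
    except (OSError, http.client.HTTPException):
        return False


def measure_rto(is_ready: Callable[[], bool], timeout: float,
                interval: float = 0.5,
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Seconds until is_ready() first returns True, or None on timeout."""
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A caller would check `is_safe_target(lb_url)` first, inject the failure, and then call `measure_rto(lambda: http_ready(lb_url), timeout=...)`. The explicit network list is used instead of `is_private` because `is_private` also accepts ranges outside RFC 1918, such as link-local addresses.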

RTO budgets

Scenario	Target RTO
`redis_kill` (Sentinel failover)	≤ 20 seconds
`primary_kill` (Postgres, manual promotion)	≤ 60 seconds
`api_kill` (LB drain)	≤ 5 seconds

A drill that exceeds its budget is a real finding - file an issue against the HA topology or the relevant retry configuration.

Drill exit codes

Code	Meaning
`0`	PASS - RTO within budget
`1`	Any failure - RTO exceeded budget, precondition failed (unsafe target or baseline unhealthy), or unexpected error
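Because the exit code is a plain pass/fail signal, the three scenarios can be chained in a release-gate script. The sketch below uses only the flags and budgets shown on this page; the wrapper functions are hypothetical, and it assumes the stack has been restored to a healthy baseline before each scenario starts.

```python
import subprocess
import sys
from typing import Callable, Dict, List

# Scenario name -> RTO budget in seconds, from the RTO budgets table.
BUDGETS: Dict[str, int] = {"redis_kill": 20, "primary_kill": 60, "api_kill": 5}


def drill_command(scenario: str, lb_url: str = "http://127.0.0.1:18080",
                  report_dir: str = "drills/") -> List[str]:
    """Build the ha_drill.py command line for one scenario."""
    return [sys.executable, "scripts/ha_drill.py",
            "--scenario", scenario,
            "--lb-url", lb_url,
            "--report-dir", report_dir,
            "--rto-budget", str(BUDGETS[scenario])]


def run_all(run: Callable[[List[str]], int] = subprocess.call) -> Dict[str, int]:
    """Run every scenario in turn and return {scenario: exit code}."""
    return {name: run(drill_command(name)) for name in BUDGETS}


def overall_exit_code(codes: Dict[str, int]) -> int:
    """0 only if every drill passed, otherwise 1 (same convention as the drill)."""
    return 0 if all(code == 0 for code in codes.values()) else 1


def main() -> None:
    codes = run_all()
    for name, code in codes.items():
        print(f"{name}: {'PASS' if code == 0 else 'FAIL'}")
    sys.exit(overall_exit_code(codes))
```

Calling `main()` from a CI job makes the job fail when any single scenario fails, while the per-scenario lines show which one to investigate.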

Drill scope and known limitations

The shipped drill runs all containers on a single host. It is a proof-of-mechanism tool, not a full production HA exercise.

Covered	Not covered
Application-layer failover behavior	Physical host separation / anti-affinity
Valkey Sentinel automatic promotion	Network partition simulation
API LB drain and RTO measurement	RPO / data-loss measurement
Single-host container topology	Multi-host topology

For a production-grade exercise, run against a multi-host staging cluster with anti-affinity rules, a real load balancer, and network partition injection (e.g. via Toxiproxy).

When to run drills

Before every release candidate - drill results are part of the release-gate evidence package
After any change to Postgres config, Valkey config, the event bus, the WebSocket pubsub layer, or the LB config
Before enterprise procurement reviews - include the most recent report.md as evidence

Next steps: Backups and Restore - the cold disaster-recovery procedure when failover is not enough.

All product names, logos, and brands are property of their respective owners. FreeSDN is an independent project and is not affiliated with or endorsed by the vendors it integrates with. See Trademarks.