SLA Management

SLA Management lets you codify your uptime and availability expectations as named policies, evaluate actual device and network metrics against those expectations continuously, and act on any deviation before it becomes a user-visible problem.

Policies attach to a scope - an organization, site, site group, device group, SSID, camera, or NVR - and carry a set of metric thresholds. Every five minutes Celery evaluates every active policy for your organization and raises a breach when a threshold is violated. You acknowledge, track, and resolve breaches from the same interface.

How evaluation works

The Celery beat task sla-evaluate-all runs every five minutes on the metrics queue (rate-limited to 2 per minute). It calls SLAMonitoringService.evaluate_all_policies for your organization, which:

Loads every active policy and its thresholds dict (metric_name → threshold_value).
For each non-null threshold, computes the percentage deviation of the current actual value from the threshold.
Calls _is_threshold_violated and, on a violation, persists an SLABreach row and publishes sla.breach.created on the event bus.
On recovery, publishes sla.breach.resolved and updates the breach status automatically.
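
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual SLAMonitoringService code: the function names, the suffix-based comparison rule (a `_min` key is violated when the actual falls below the threshold, a `_max` key when it rises above it), and the deviation formula (absolute gap relative to the threshold) are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def is_threshold_violated(metric: str, actual: float, threshold: float) -> bool:
    """Assumed rule: *_min metrics must stay at or above the threshold,
    *_max metrics must stay at or below it."""
    if metric.endswith("_min"):
        return actual < threshold
    if metric.endswith("_max"):
        return actual > threshold
    raise ValueError(f"unrecognized threshold key: {metric}")


def deviation_percent(actual: float, threshold: float) -> float:
    """Assumed formula: absolute gap as a percentage of the threshold."""
    if threshold == 0:
        return 0.0 if actual == 0 else float("inf")
    return abs(actual - threshold) / abs(threshold) * 100.0


def evaluate_policy(
    thresholds: Dict[str, Optional[float]],
    actuals: Dict[str, float],
) -> List[dict]:
    """Return one record per violated threshold.

    `actuals` is keyed by the same names as `thresholds` (an assumption
    for this sketch). Null thresholds and missing actuals are skipped.
    """
    violations = []
    for metric, threshold in thresholds.items():
        if threshold is None or metric not in actuals:
            continue
        actual = actuals[metric]
        if is_threshold_violated(metric, actual, threshold):
            violations.append({
                "metric": metric,
                "actual": actual,
                "threshold": threshold,
                "deviation_percent": deviation_percent(actual, threshold),
            })
    return violations
```

In this sketch each returned record corresponds to one breach that would be persisted and announced on the event bus; persistence and publishing are left out because their interfaces are not described here.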

You can also trigger evaluation manually at any time (see Endpoints).

Breach severity is either warning (deviation ≤ 20%) or critical (deviation > 20%). Event priority depends on the event type: sla.breach.created is always published at HIGH priority; sla.breach.acknowledged follows the breach severity (critical → CRITICAL, warning → NORMAL, which is also the default fallback); sla.breach.resolved is published at NORMAL priority.
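
The severity and priority rules can be written down compactly. The sketch below is only an illustration of the mapping described in this section; the function names are hypothetical and the priorities are modeled as plain strings rather than whatever type the event bus actually uses.

```python
def breach_severity(deviation_percent: float) -> str:
    """warning up to and including 20% deviation, critical above it."""
    return "warning" if deviation_percent <= 20.0 else "critical"


def event_priority(event_type: str, severity: str = "warning") -> str:
    """Map an SLA event type (and, for acknowledgements, severity) to a priority."""
    if event_type == "sla.breach.created":
        return "HIGH"  # always HIGH, regardless of severity
    if event_type == "sla.breach.acknowledged":
        # critical maps to CRITICAL; anything else falls back to NORMAL
        return "CRITICAL" if severity == "critical" else "NORMAL"
    if event_type == "sla.breach.resolved":
        return "NORMAL"
    raise ValueError(f"not an SLA breach event: {event_type}")
```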

Policy scopes

Every policy targets exactly one scope. When scope is organization, scope_id must be omitted. For all other scopes except ssid, scope_id is the UUID of the target entity and is required. The ssid scope is an exception - scope_id may be omitted (the SSID name is stored in the separate scope_name field instead).

Scope	`scope_id` required	Typical use
`organization`	No	Org-wide baseline thresholds
`site`	Yes - site UUID	Per-location SLA
`site_group`	Yes - site group UUID	Regional / campus grouping
`device_group`	Yes - device group UUID	Per-fleet (e.g. all APs)
`ssid`	Optional	Wi-Fi availability per SSID name
`camera`	Yes - camera UUID	Per-camera uptime
`nvr`	Yes - NVR UUID	Per-NVR availability
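
A client can check the scope rules in the table before sending a request. The following is a minimal sketch under stated assumptions: `validate_scope` is a hypothetical helper, not part of the platform, and it assumes that a `scope_id` supplied for the `ssid` scope is also expected to be a UUID.

```python
import uuid
from typing import Optional

SCOPES = {
    "organization", "site", "site_group",
    "device_group", "ssid", "camera", "nvr",
}


def validate_scope(scope: str, scope_id: Optional[str]) -> None:
    """Raise ValueError if the scope / scope_id combination breaks the rules."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    if scope == "organization":
        if scope_id is not None:
            raise ValueError("scope_id must be omitted for the organization scope")
        return
    if scope_id is None:
        if scope == "ssid":
            return  # optional: the SSID name travels in scope_name instead
        raise ValueError(f"scope_id is required for the {scope} scope")
    # Assumption: any scope_id that is present must parse as a UUID.
    uuid.UUID(scope_id)  # raises ValueError on malformed input
```

Server-side ownership checks still apply; a helper like this only catches malformed requests early.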

Creating a policy

Navigate to Enterprise → SLA in the UI and click New Policy, or use the API directly.

Required body fields:

Field	Type	Notes
`name`	string	Human-readable label
`scope`	string	One of the scope values above
`scope_id`	UUID or omit	Must be omitted for `organization`; optional for `ssid`; required for all other scopes
`thresholds`	object	`{ "uptime_percent_min": 99.9, "latency_ms_max": 50 }` - strictly typed; valid keys: `uptime_percent_min`, `latency_ms_max`, `packet_loss_percent_max`, `health_score_min`, `client_satisfaction_min`, `error_rate_max`

Optional fields:

Field	Type	Notes
`description`	string	Free-text explanation
`status`	string	`active` (default), `disabled`, or `draft`. Setting `disabled` suspends evaluation without deleting the policy; it is PATCH only and is not accepted on POST

Example request:

POST /api/v1/sla/policies
Content-Type: application/json
Authorization: Bearer <token>

{
  "name": "Core Network Uptime",
  "scope": "site",
  "scope_id": "a1b2c3d4-...",
  "thresholds": {
    "uptime_percent_min": 99.5,
    "packet_loss_percent_max": 1.0,
    "latency_ms_max": 100
  }
}
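
The same request can be assembled from a script. The sketch below uses only the Python standard library and builds the request without sending it. `build_create_policy_request` and the `base_url` argument are hypothetical, and the client-side check of threshold keys is an assumption that mirrors the list of valid keys above.

```python
import json
import urllib.request
from typing import Dict, Optional

VALID_THRESHOLD_KEYS = {
    "uptime_percent_min", "latency_ms_max", "packet_loss_percent_max",
    "health_score_min", "client_satisfaction_min", "error_rate_max",
}


def build_create_policy_request(
    base_url: str,
    token: str,
    name: str,
    scope: str,
    thresholds: Dict[str, float],
    scope_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/sla/policies request."""
    unknown = set(thresholds) - VALID_THRESHOLD_KEYS
    if unknown:
        raise ValueError(f"unknown threshold keys: {sorted(unknown)}")
    body = {"name": name, "scope": scope, "thresholds": thresholds}
    if scope_id is not None:
        body["scope_id"] = scope_id  # left out entirely when not supplied
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/sla/policies",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Pass the returned object to `urllib.request.urlopen` to send it. Building the request separately keeps the payload easy to inspect or log first.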

Endpoints

All paths are under the prefix /api/v1/sla. Each endpoint requires either config:read or config:write, as listed in the Permission column of the tables below - see the RBAC reference for the role-to-permission mapping.

Policy management

Method	Path	Purpose	Permission
GET	`/api/v1/sla/summary`	Org-wide compliance summary (pass `site_id` to scope)	`config:read`
GET	`/api/v1/sla/policies`	List policies - filter by `site_id`, `scope`, `status`; `limit≤200`	`config:read`
POST	`/api/v1/sla/policies`	Create a policy	`config:write`
GET	`/api/v1/sla/policies/{policy_id}`	Retrieve one policy	`config:read`
PATCH	`/api/v1/sla/policies/{policy_id}`	Update (scope re-verified on change)	`config:write`
DELETE	`/api/v1/sla/policies/{policy_id}`	Delete a policy	`config:write`

Breach management

Method	Path	Purpose	Permission
GET	`/api/v1/sla/breaches`	List breaches - filter by `site_id`, `policy_id`, `status`; `limit≤200`	`config:read`
POST	`/api/v1/sla/breaches/{breach_id}/acknowledge`	Acknowledge with an optional note	`config:write`
POST	`/api/v1/sla/evaluate`	Manually trigger evaluation for your org	`config:write`

Tracking and acknowledging breaches

When a policy is violated, the platform:

Persists an SLABreach row with the policy, scope, metric, actual value, threshold, and severity.
Publishes sla.breach.created on the event bus - any connected alert rule or notification provider will fire if configured to listen for this event type.
Updates the breach status to resolved and publishes sla.breach.resolved automatically once the metric recovers.

Acknowledge a breach to record that a human has reviewed it:

POST /api/v1/sla/breaches/{breach_id}/acknowledge
Content-Type: application/json
Authorization: Bearer <token>

{
  "notes": "Investigating upstream ISP packet loss."
}

The acknowledgement publishes sla.breach.acknowledged on the event bus.

Filtering the breach list:

GET /api/v1/sla/breaches?status=active&site_id=<uuid>&limit=50

Valid status values: active, acknowledged, resolved.
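
When building filter URLs in code, it helps to check the documented limits before the request leaves the client. This is a small sketch with a hypothetical helper name; the lower bound of 1 on `limit` is an assumption, since only the upper bound of 200 is stated.

```python
from typing import Optional
from urllib.parse import urlencode

VALID_STATUSES = {"active", "acknowledged", "resolved"}
MAX_LIMIT = 200


def breach_list_path(
    status: Optional[str] = None,
    site_id: Optional[str] = None,
    policy_id: Optional[str] = None,
    limit: int = 50,
) -> str:
    """Return the path and query string for GET /api/v1/sla/breaches."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"invalid status: {status}")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    params = {}
    if status is not None:
        params["status"] = status
    if site_id is not None:
        params["site_id"] = site_id
    if policy_id is not None:
        params["policy_id"] = policy_id
    params["limit"] = limit
    return "/api/v1/sla/breaches?" + urlencode(params)
```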

Compliance summary

GET /api/v1/sla/summary returns an org-wide view, or a single-site view when called with ?site_id=<uuid>, showing:

Total active policies
Policies currently in breach
Breach counts by severity
Recent breach history

Use this endpoint to power a management dashboard or feed into a periodic report.

Reports and schedules

The report engine at /api/v1/sla/reports generates on-demand SLA compliance reports in PDF or CSV format.

Generating a report

POST /api/v1/sla/reports/generate
Content-Type: application/json
Authorization: Bearer <token>

{
  "period_start": "2026-05-01T00:00:00Z",
  "period_end":   "2026-06-01T00:00:00Z",
  "policy_ids":   ["<uuid>", "<uuid>"],
  "format":       "pdf",
  "title":        "May 2026 SLA Report"
}

Constraints enforced by the server:

period_start must be before period_end
Period length must be 366 days or less
format must be pdf or csv
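
The three constraints can be checked on the client before submitting a request. The sketch below is illustrative only: the helper names are hypothetical and the server's own validation remains authoritative.

```python
from datetime import datetime, timedelta

MAX_PERIOD = timedelta(days=366)
VALID_FORMATS = {"pdf", "csv"}


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-01T00:00:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def validate_report_request(period_start: str, period_end: str, fmt: str) -> None:
    """Raise ValueError if the request would break a documented constraint."""
    start, end = parse_utc(period_start), parse_utc(period_end)
    if start >= end:
        raise ValueError("period_start must be before period_end")
    if end - start > MAX_PERIOD:
        raise ValueError("period length must be 366 days or less")
    if fmt not in VALID_FORMATS:
        raise ValueError("format must be pdf or csv")
```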

Once generated, download the file with:

GET /api/v1/sla/reports/{report_id}/download

If the file has not yet been rendered, the endpoint returns the report data as inline JSON instead. The download path is path-traversal guarded - the resolved file path must fall inside REPORTS_BASE_DIR or the request is rejected with 403.
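
A path-traversal guard of this kind is commonly written by resolving the candidate path and checking that it still sits under the base directory. The sketch below shows that general technique with `pathlib`. It is not the platform's implementation: `resolve_report_path` is a hypothetical name, and raising `PermissionError` stands in for the 403 response.

```python
from pathlib import Path


def resolve_report_path(base_dir: str, relative_path: str) -> Path:
    """Resolve a stored report path and refuse anything outside base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check
    candidate = (base / relative_path).resolve()
    try:
        candidate.relative_to(base)  # ValueError if candidate is outside base
    except ValueError:
        raise PermissionError("resolved path escapes the reports directory") from None
    return candidate
```

Resolving before comparing matters: a plain string prefix check on the unresolved path would let `..` segments or symlinks slip through.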

Report endpoints

Method	Path	Purpose	Permission
POST	`/api/v1/sla/reports/generate`	Generate on demand	`config:read`
GET	`/api/v1/sla/reports`	List generated reports (`limit≤200`)	`config:read`
GET	`/api/v1/sla/reports/{report_id}/download`	Download file or inline JSON	`config:read`

Schedule endpoints

Method	Path	Purpose	Permission
GET	`/api/v1/sla/report-schedules`	List schedules	`config:read`
POST	`/api/v1/sla/report-schedules`	Create a schedule	`config:write`
PUT	`/api/v1/sla/report-schedules/{schedule_id}`	Update a schedule	`config:write`
DELETE	`/api/v1/sla/report-schedules/{schedule_id}`	Delete a schedule	`config:write`

Permissions reference

Permission	Who can hold it	What it unlocks
`config:read`	viewer and above (site-scoped by grant)	Read policies, breaches, and the summary; generate, list, and download reports; list report schedules
`config:write`	site_admin and above	Create/edit/delete policies; acknowledge breaches; trigger evaluation; create, update, and delete report schedules

Role-to-permission mapping is defined in the Roles and permissions reference.

Connecting SLA to other enterprise features

Alert Rules - create a rule that fires on sla.breach.created to route breach notifications to any configured notification channel.
Health Dashboard - the SLAComplianceCard component on the health dashboard (/health) pulls from GET /api/v1/sla/summary and shows a high-level compliance rate alongside device health scores.
Event Correlation - SLA breach events appear in the correlation engine’s event stream; you can write a correlation rule that groups multiple simultaneous breaches into a single incident.
Notifications - configure an SMTP, Slack webhook, or Teams webhook provider under Enterprise → Notification Providers, then attach it to an alert rule targeting SLA breach events.

Security notes

Every scope_id supplied at create or update time is verified to be owned by your organization before the operation is committed. Cross-tenant probes receive a 404.
All policy and breach reads are org-scoped - you cannot read another organization’s SLA data.
The manual /evaluate endpoint derives the target organization from the authenticated user’s JWT. It cannot be directed at another organization’s policies.
Multi-tenant isolation is enforced at the application layer. There is no PostgreSQL row-level security - isolation is the responsibility of the service and endpoint code, which has been independently reviewed.

Next steps

Alert Rules - wire breach events to notifications and auto-resolve workflows
Health Dashboard (/health) - view SLA compliance alongside device health scores
Event Correlation - group breach events into incidents for coordinated response
Notification Providers - configure SMTP, Slack, Teams, and webhook delivery