Observability
Rakkr exposes both live metrics and investigation-grade event trails. Prometheus handles fast operational signals, Grafana gives operators one board to scan, and central health/audit events keep the story attached to recordings, nodes, jobs, and settings.
The operator promise
When something degrades, Rakkr should answer three questions quickly:
- What changed?
- Which node, recording, or job is affected?
- Is there enough evidence to recover without guessing?
Signal map
| Surface | Artifact |
|---|---|
| Controller metrics | GET /metrics (see the metrics reference) |
| Prometheus alerts | docs/observability/rakkr-alerts.yml |
| Prometheus + Mimir | docs/observability/prometheus-mimir.example.yml |
| Grafana dashboard | docs/observability/grafana-dashboard.example.json |
| Agent local log | Rotating JSONL health log or SQLite health-event store on recorder nodes |
| Controller events | Central health and audit event tables |
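
To spot-check the controller metrics surface without a full Prometheus setup, the sketch below parses the Prometheus text exposition format that a GET /metrics endpoint typically returns. It is a minimal illustration, not Rakkr tooling: the metric names in SAMPLE, the `fetch_metrics` helper, and any URL passed to it are hypothetical, and the parser ignores optional sample timestamps.

```python
import ssl
import urllib.request
from typing import Dict

# Hypothetical sample payload; real metric names are in the metrics reference.
SAMPLE = """\
# HELP rakkr_example_active_recordings Hypothetical gauge for illustration.
# TYPE rakkr_example_active_recordings gauge
rakkr_example_active_recordings 3
rakkr_example_upload_failures_total{node="rec-01"} 2
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus label set) to its sample value.

    Simplified: skips comment and blank lines and assumes no trailing
    timestamp after the value.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str) -> Dict[str, float]:
    """Fetch and parse metrics over TLS (url is an assumption, e.g. https://<controller>/metrics)."""
    context = ssl.create_default_context()  # verifies the server certificate
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


parsed = parse_metrics(SAMPLE)
```

Keeping the label set inside the dictionary key is a shortcut for quick inspection; a real client library would model labels separately.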
Operator path
- Scrape the controller GET /metrics endpoint with TLS enabled.
- Load docs/observability/rakkr-alerts.yml into Prometheus.
- Send long-term metrics to Mimir with docs/observability/prometheus-mimir.example.yml.
- Import docs/observability/grafana-dashboard.example.json into Grafana.
- Use central health/audit events for incident context (the Health and Audit pages in the console).
- Fall back to the rotating JSONL health log or SQLite health-event store on recorder nodes when a node is isolated.
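
For the isolated-node fallback in the last step, a small script can filter the JSONL health log directly on the recorder. The sketch below assumes one JSON object per line with a `severity` field; that field name, the severity levels, and the sample events are hypothetical and should be adjusted to the actual log schema.

```python
import json
from typing import Iterable, Iterator

# Hypothetical severity levels, ordered from least to most urgent.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def iter_events(lines: Iterable[str], min_severity: str = "warning") -> Iterator[dict]:
    """Yield parsed events at or above min_severity.

    Malformed lines are skipped, since the final line of a log that is
    still being written may be truncated.
    """
    threshold = SEVERITY_ORDER[min_severity]
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # partial or corrupt line
        if not isinstance(event, dict):
            continue
        # Unknown or missing severities rank below every threshold.
        level = SEVERITY_ORDER.get(str(event.get("severity")), -1)
        if level >= threshold:
            yield event


# Hypothetical log content, including a truncated trailing line.
SAMPLE_LINES = [
    '{"severity": "info", "kind": "heartbeat"}',
    '{"severity": "warning", "kind": "clipping"}',
    '{"severity": "critical", "kind": "device_fault"}',
    '{"severity": "warn',
]

urgent = list(iter_events(SAMPLE_LINES))
```

In practice the `lines` argument could be an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at whichever rotated log file is under investigation.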
What to watch
| Category | Examples |
|---|---|
| Recorder health | Node liveness, xrun/device faults, clipping, flatline, low signal, channel correlation |
| Recording flow | Active recordings, cached recordings, watchdog alerts, upload failures |
| Controller health | API availability, audit totals, health-event totals, queue state |
| Capacity and storage | Recording duration, cache bytes, upload queue pressure |
The metric names behind these are in the metrics reference; the watchdog rules that raise alerts are in the health watchdog guide.
Checked artifacts
These example configs are validated as part of mise run check:
| Check | Command |
|---|---|
| Alert rules | mise run ops:check-alerts |
| Prometheus/Mimir config | mise run ops:check-prometheus |
| Grafana dashboard | mise run ops:check-grafana |
| Runbook links | mise run ops:check-observability-docs |
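
When only the observability artifacts have changed, the four checks above can be run individually and summarized. The sketch below is one possible wrapper, assuming `mise` is on the PATH; the `run_checks` and `failed_tasks` helpers are hypothetical names, and the runner is injectable so the logic can be exercised without invoking `mise`.

```python
import subprocess
from typing import Callable, Dict, List

# Task names taken from the table above.
CHECK_TASKS = [
    "ops:check-alerts",
    "ops:check-prometheus",
    "ops:check-grafana",
    "ops:check-observability-docs",
]


def run_task(task: str) -> int:
    """Run one mise task and return its exit code."""
    return subprocess.run(["mise", "run", task]).returncode


def run_checks(runner: Callable[[str], int] = run_task) -> Dict[str, int]:
    """Run every check, even after a failure, and collect exit codes."""
    return {task: runner(task) for task in CHECK_TASKS}


def failed_tasks(results: Dict[str, int]) -> List[str]:
    """Return the tasks whose exit code was non-zero."""
    return [task for task, code in results.items() if code != 0]
```

Running all four before reporting, rather than stopping at the first failure, gives a complete picture in one pass.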