Observability

Rakkr exposes both live metrics and investigation-grade event trails. Prometheus handles fast operational signals, Grafana gives operators one board to scan, and central health/audit events keep the story attached to recordings, nodes, jobs, and settings.

The operator promise

When something degrades, Rakkr should answer three questions quickly:

  1. What changed?
  2. Which node, recording, or job is affected?
  3. Is there enough evidence to recover without guessing?

Signal map

| Surface | Artifact |
| --- | --- |
| Controller metrics | GET /metrics (see the metrics reference) |
| Prometheus alerts | docs/observability/rakkr-alerts.yml |
| Prometheus + Mimir | docs/observability/prometheus-mimir.example.yml |
| Grafana dashboard | docs/observability/grafana-dashboard.example.json |
| Agent local log | Rotating JSONL health log or SQLite health-event store on recorder nodes |
| Controller events | Central health and audit event tables |
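
Before wiring up Prometheus, it can help to confirm that the controller metrics surface responds over TLS and returns parseable samples. The following is a minimal sketch, not part of Rakkr: the controller URL, the CA file path, and the metric names used with it are placeholders, and the parser handles only the common `name{labels} value [timestamp]` lines of the Prometheus text exposition format.

```python
import ssl
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {series: float}.

    The series key keeps the label set, e.g. 'name{node="a"}'.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split at the closing brace.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            series, _, tail = line.partition(" ")
        fields = tail.split()
        if not fields:
            continue
        # The first field is the value; an optional timestamp may follow.
        samples[series] = float(fields[0])
    return samples


def fetch_metrics(url, ca_file=None, timeout=5.0):
    """Fetch the metrics page over HTTPS with certificate verification on."""
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical controller address; replace with your own.
    body = fetch_metrics("https://controller.example.internal/metrics")
    for series, value in sorted(parse_metrics(body).items()):
        print(series, value)
```

If the fetch raises a certificate error, fix the trust chain rather than disabling verification, so the check exercises the same TLS setup that Prometheus will use when scraping.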

Operator path

  1. Scrape the controller GET /metrics endpoint with TLS enabled.
  2. Load docs/observability/rakkr-alerts.yml into Prometheus.
  3. Send long-term metrics to Mimir with docs/observability/prometheus-mimir.example.yml.
  4. Import docs/observability/grafana-dashboard.example.json into Grafana.
  5. Use central health/audit events for incident context (the Health and Audit pages in the console).
  6. Fall back to the rotating JSONL health log or SQLite health-event store on recorder nodes when a node is isolated.
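
For the last step, reading the rotating JSONL health log on an isolated recorder node might look like the sketch below. Everything specific in it is an assumption for illustration: the log directory, the `health*.jsonl` file pattern, and the `recording_id` field are hypothetical and should be replaced with whatever the agent actually writes.

```python
import json
from pathlib import Path


def read_health_events(log_dir, pattern="health*.jsonl"):
    """Yield parsed events from each rotated JSONL file, oldest file first.

    The file pattern is a placeholder, not a documented Rakkr name.
    """
    files = sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    for path in files:
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    # Skip lines that are not valid JSON, such as a
                    # truncated final line after an unclean shutdown.
                    continue
                if isinstance(event, dict):
                    yield event


def events_for_recording(events, recording_id):
    """Keep only events tagged with the given (hypothetical) recording_id."""
    return [e for e in events if e.get("recording_id") == recording_id]


if __name__ == "__main__":
    # Hypothetical log location and recording ID on a recorder node.
    all_events = read_health_events("/var/log/rakkr")
    for event in events_for_recording(all_events, "example-recording"):
        print(event)
```

Skipping unparseable lines instead of failing keeps the rest of the trail usable, which matters most when the node went down mid-write.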

What to watch

| Category | Examples |
| --- | --- |
| Recorder health | Node liveness, xrun/device faults, clipping, flatline, low signal, channel correlation |
| Recording flow | Active recordings, cached recordings, watchdog alerts, upload failures |
| Controller health | API availability, audit totals, health-event totals, queue state |
| Capacity and storage | Recording duration, cache bytes, upload queue pressure |

The metric names behind these are in the metrics reference; the watchdog rules that raise alerts are in the health watchdog guide.

Checked artifacts

These example configs are validated as part of mise run check:

| Check | Command |
| --- | --- |
| Alert rules | mise run ops:check-alerts |
| Prometheus/Mimir config | mise run ops:check-prometheus |
| Grafana dashboard | mise run ops:check-grafana |
| Runbook links | mise run ops:check-observability-docs |