Health watchdog

The watchdog exists to catch bad recordings while they are still happening, not afterward. It combines on-node quality scoring with controller-side rules that open, repeat, and auto-resolve health events.

Signals it watches

The recorder agent scores live audio, and the watchdog reasons over those scores to detect:

  • No meaningful signal during a scheduled window;
  • Input too quiet (low signal);
  • Digital flatline / stuck samples;
  • Clipping;
  • Excessive noise, hum, or static (scored as likelihoods);
  • Device disconnects and audio-backend xruns;
  • Encoder / file-writer failure;
  • Recording file not growing;
  • Channel mapping / correlation issues;
  • Controller upload failures.

Meter frames carry RMS/peak dBFS, clipping ratio, speech vs noise, estimated SNR, first-pass intelligibility, hum/static/broadband-noise scores, and same/inverted channel correlation — the raw material for these rules.
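As an illustration only, a meter frame could be modeled as a small record like the one below. Every field name, value range, and default limit here is an assumption made for the sketch, not the agent's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeterFrame:
    """One meter frame. Field names are hypothetical, not the agent's schema."""
    rms_dbfs: float
    peak_dbfs: float
    clipping_ratio: float        # fraction of clipped samples, assumed 0.0 to 1.0
    speech_score: float          # speech vs noise, assumed 0.0 to 1.0
    snr_db: float                # estimated signal-to-noise ratio
    intelligibility: float       # first-pass intelligibility, assumed 0.0 to 1.0
    hum_score: float
    static_score: float
    broadband_noise_score: float
    channel_correlation: float   # assumed +1.0 for identical, -1.0 for inverted


def suspicious_mapping(frame: MeterFrame, limit: float = 0.98) -> bool:
    """Flag channels that look like near-identical or near-inverted copies."""
    return abs(frame.channel_correlation) >= limit


def is_clipping(frame: MeterFrame, max_ratio: float = 0.01) -> bool:
    """Flag a frame whose clipped-sample fraction exceeds an assumed limit."""
    return frame.clipping_ratio > max_ratio
```

A rule then only needs to read the fields it cares about, which is why one frame can feed several independent checks.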

The default scheduled voice rule

The flagship rule is scheduled low-signal:

During the scheduled recording window, after a grace period, alert if the signal does not exceed a configurable dBFS threshold for enough cumulative time.

This is intentionally not simple silence detection and not a preflight check — it watches the actual recording as it runs, so a room that goes silent mid-session (a dead mic, a pulled cable) is caught while you can still react.
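The rule can be sketched in a few lines. This is a minimal illustration, assuming frames arrive as (offset, duration, level) tuples measured from the start of the scheduled window; the policy field names and default values are hypothetical, not the shipped configuration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LowSignalPolicy:
    """Hypothetical policy fields; the defaults are illustrative only."""
    threshold_dbfs: float = -45.0   # signal must exceed this to count as present
    grace_s: float = 60.0           # ignore the first part of the window
    min_low_s: float = 120.0        # cumulative low time that triggers an alert


def low_signal_alert(
    frames: Iterable[Tuple[float, float, float]],
    policy: LowSignalPolicy,
) -> bool:
    """Return True once enough cumulative low-signal time has accumulated.

    Each frame is (offset_s, duration_s, level_dbfs), with offset_s measured
    from the start of the scheduled recording window.
    """
    low_s = 0.0
    for offset_s, duration_s, level_dbfs in frames:
        if offset_s < policy.grace_s:
            continue  # still inside the grace period
        if level_dbfs <= policy.threshold_dbfs:
            low_s += duration_s  # cumulative, so brief loud moments do not reset it
            if low_s >= policy.min_low_s:
                return True
    return False
```

Counting cumulative rather than consecutive time is the design choice that separates this from plain silence detection: an occasional cough or door slam does not clear the alert condition for a microphone that is otherwise dead.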

Watchdog policies

Thresholds are configuration, not code. Watchdog policies (in Settings) tune when alerts open and auto-resolve for sustained:

  • low signal,
  • clipping,
  • digital flatline,
  • high channel correlation (suspicious mapping),
  • high broadband-noise / noise / hum / static likelihood,
  • and loud non-speech audio (for speech-required policies).

You can calibrate a policy from a node’s recent meter history — the watchdog calibration route recommends thresholds and can apply them, with RBAC-mirrored controls in the Settings UI.
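How the calibration route derives its recommendations is not specified here. One plausible approach, shown purely as an assumption, is to take a low percentile of the RMS levels seen in recent known-good meter history and subtract a safety margin. The function name, the percentile, and the margin are all hypothetical.

```python
import math
from typing import Sequence


def recommend_low_signal_threshold(
    rms_history_dbfs: Sequence[float],
    percentile: float = 10.0,
    margin_db: float = 6.0,
) -> float:
    """Suggest a low-signal threshold from recent healthy RMS levels.

    Uses the nearest-rank percentile, then subtracts a margin so that
    normal quiet passages stay above the threshold.
    """
    if not rms_history_dbfs:
        raise ValueError("meter history is empty")
    ordered = sorted(rms_history_dbfs)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] - margin_db
```

For example, a history whose quietest tenth sits at -40 dBFS yields a recommended threshold of -46 dBFS with the default margin.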

Health-event lifecycle

The controller’s watchdog runner (default every 30s) turns sustained problems into health events and resolves them on recovery. Events also come directly from the agent (capture/upload failures, device faults, disk/CPU pressure) and from controller runners (stale-heartbeat nodes, failed uploads). Each event has a type, a severity (info/warning/critical), and a status, and is attached to a node, recording, and/or schedule.
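The open, repeat, and auto-resolve behaviour can be pictured as a small per-rule state machine that is advanced once per runner tick. The sketch below is an assumption about the shape of that logic: the class names, the status strings, and the tick counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthEvent:
    type: str
    severity: str              # "info", "warning", or "critical"
    node: str
    status: str = "open"       # hypothetical statuses: "open", "resolved"
    occurrences: int = 1       # bumped each tick the problem repeats


class RuleTracker:
    """Tracks one rule on one node across watchdog ticks (e.g. every 30s)."""

    def __init__(self, event_type: str, severity: str, node: str,
                 open_after: int = 2, resolve_after: int = 2) -> None:
        self.event_type, self.severity, self.node = event_type, severity, node
        self.open_after, self.resolve_after = open_after, resolve_after
        self.bad_ticks = 0
        self.good_ticks = 0
        self.event: Optional[HealthEvent] = None

    def tick(self, condition_bad: bool) -> Optional[HealthEvent]:
        if condition_bad:
            self.bad_ticks += 1
            self.good_ticks = 0
            if self.event is not None and self.event.status == "open":
                self.event.occurrences += 1          # repeat
            elif self.bad_ticks >= self.open_after:
                self.event = HealthEvent(            # open
                    self.event_type, self.severity, self.node)
        else:
            self.good_ticks += 1
            self.bad_ticks = 0
            if (self.event is not None and self.event.status == "open"
                    and self.good_ticks >= self.resolve_after):
                self.event.status = "resolved"       # auto-resolve
        return self.event
```

Requiring several consecutive bad ticks before opening, and several good ticks before resolving, keeps a single noisy sample from flapping an event open and closed.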

Operators work events on the Health page (health:read):

  • Search/filter by status, severity, type, node, schedule, recording, and opened/resolved date ranges, with active chips.
  • Acknowledge, suppress (mute), resolve, and reopen — individually or in bulk — all requiring health:acknowledge.
  • Export scoped/selected events as CSV.

Per-recording and per-schedule quality timelines show event-specific evidence (signal, speech, correlation, clipping, flatline, anomaly, upload-failure) laid out across the recording’s duration.

On-node evidence

Even when a node is isolated from the controller, the agent keeps a local health log — rotating JSONL by default, or a SQLite store — so an investigation can reconstruct what happened. Once connectivity returns, events sync to the controller. See the recorder agent for the event families and the CLI reference for the threshold and log-rotation knobs.
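A rotating JSONL log of this kind can be approximated with Python's standard library alone. The sketch below is not the agent's implementation; the record fields (including the `synced` flag), the size limit, and the backup count are assumptions chosen for illustration.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler


def make_health_log(path: str, max_bytes: int = 1_000_000,
                    backups: int = 5) -> logging.Logger:
    """Create a logger that writes one JSON object per line and rotates by size."""
    logger = logging.getLogger(f"health-log:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep health records out of the root logger
    if not logger.handlers:
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def record_event(logger: logging.Logger, event_type: str, severity: str,
                 **evidence: object) -> None:
    """Append one health record; 'synced' marks it as not yet uploaded."""
    record = {"ts": time.time(), "type": event_type,
              "severity": severity, "synced": False, **evidence}
    logger.info(json.dumps(record, sort_keys=True))
```

Because each line is a complete JSON object, a later sync step can read the file line by line and upload only the records that are still marked as unsynced.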

Metrics

Watchdog and audio-quality data is exported on /metrics (rakkr_input_*, rakkr_recording_watchdog_alerts_total, rakkr_device_xruns_total, …) for Prometheus alerting. See Metrics and Observability.

The checked contract — including which signals are validated and which long-duration real-room validations remain — is the HEALTH_WATCHDOG_BASELINE.