Health watchdog

The watchdog exists to catch bad recordings while they are still happening, not afterward. It combines on-node quality scoring with controller-side rules that open, repeat, and auto-resolve health events.

Signals it watches

The recorder agent scores live audio and the watchdog reasons over it to detect:

No meaningful signal during a scheduled window;
Input too quiet (low signal);
Digital flatline / stuck samples;
Clipping;
Excessive noise, hum, or static likelihood;
Device disconnects and audio-backend xruns;
Encoder / file-writer failure;
Recording file not growing;
Channel mapping / correlation issues;
Controller upload failures.

Meter frames carry the raw material for these rules: RMS and peak dBFS, clipping ratio, speech-versus-noise scores, estimated SNR, first-pass intelligibility, hum/static/broadband-noise scores, and same/inverted channel correlation.
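As a rough illustration of three of these fields, the sketch below computes RMS dBFS, peak dBFS, and clipping ratio for one block of float samples normalised to [-1.0, 1.0]. The function names, the clip_level parameter, and the -120 dB floor are assumptions made for this example; the agent's actual frame format and scoring are not shown here.

```python
import math

FLOOR_DBFS = -120.0  # assumed floor for silent blocks (avoids log10(0))


def to_dbfs(amplitude: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    if amplitude <= 0.0:
        return FLOOR_DBFS
    return max(FLOOR_DBFS, 20.0 * math.log10(amplitude))


def meter_block(samples: list[float], clip_level: float = 0.999) -> dict[str, float]:
    """Summarise one block of samples normalised to [-1.0, 1.0].

    clip_level is a hypothetical knob: samples at or above it count as clipped.
    """
    if not samples:
        return {"rms_dbfs": FLOOR_DBFS, "peak_dbfs": FLOOR_DBFS, "clipping_ratio": 0.0}
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return {
        "rms_dbfs": to_dbfs(rms),
        "peak_dbfs": to_dbfs(peak),
        "clipping_ratio": clipped / len(samples),
    }
```

A block that alternates between full scale and silence reports a peak at 0 dBFS, an RMS roughly 3 dB below it, and half of its samples as clipped.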

The default scheduled voice rule

The flagship rule is scheduled low-signal:

During the scheduled recording window, after a grace period, alert if the signal does not exceed a configurable dBFS threshold for enough cumulative time.

This is intentionally neither simple silence detection nor a preflight check: it watches the actual recording as it runs, so a room that goes silent mid-session (a dead mic, a pulled cable) is caught while you can still react.
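One way to read the rule is sketched below. It assumes that quiet time accumulates across the whole window and is not reset by a brief loud frame, and that the level being compared is the frame's peak dBFS. The class name, field names, and default values are hypothetical and are not the shipped configuration.

```python
from dataclasses import dataclass


@dataclass
class LowSignalRule:
    threshold_dbfs: float = -45.0  # assumed default; configurable in practice
    grace_s: float = 60.0          # ignore frames this early in the window
    quiet_limit_s: float = 120.0   # cumulative quiet time that triggers an alert
    quiet_s: float = 0.0           # running total of quiet time seen so far

    def observe(self, elapsed_s: float, frame_s: float, peak_dbfs: float) -> bool:
        """Feed one meter frame and report whether the alert condition holds.

        elapsed_s is the time since the scheduled window started, measured at
        the end of the frame; frame_s is the frame's duration in seconds.
        """
        if elapsed_s <= self.grace_s:
            return False  # still inside the grace period
        if peak_dbfs <= self.threshold_dbfs:
            self.quiet_s += frame_s  # the signal did not exceed the threshold
        return self.quiet_s >= self.quiet_limit_s
```

Because the total is cumulative, a single cough in an otherwise dead room does not restart the clock. A rule that reset on every loud frame would be closer to plain silence detection.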

Watchdog policies

Thresholds are configuration, not code. Watchdog policies (in Settings) tune when alerts open and auto-resolve for sustained:

low signal,
clipping,
digital flatline,
high channel correlation (suspicious mapping),
high broadband-noise / noise / hum / static likelihood,
and loud non-speech audio (for speech-required policies).

You can calibrate a policy from a node’s recent meter history: the watchdog calibration route recommends thresholds and can apply them, and the Settings UI offers the same controls under the same RBAC permissions.
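The calibration method itself is not described here. As one plausible approach, a low-signal threshold could be placed a fixed margin below a low percentile of the node's recent RMS levels. The function below is a hypothetical sketch of that idea; the percentile, the margin, and the function name are all assumptions.

```python
import math


def recommend_low_signal_threshold(
    rms_history_dbfs: list[float],
    percentile: float = 10.0,
    margin_db: float = 6.0,
) -> float:
    """Suggest a low-signal threshold from recent RMS readings (in dBFS).

    Takes the nearest-rank percentile of the history and subtracts a safety
    margin, so normal quiet passages stay above the threshold.
    """
    if not rms_history_dbfs:
        raise ValueError("need at least one meter reading to calibrate")
    ordered = sorted(rms_history_dbfs)
    # Nearest-rank percentile: the smallest value with at least `percentile`
    # percent of the readings at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1] - margin_db
```

A recommendation like this is only as good as the history behind it. If the recent window contains a dead microphone, the suggested threshold will be too low.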

Health-event lifecycle

The controller’s watchdog runner (default every 30s) turns sustained problems into health events and resolves them on recovery. Events also come directly from the agent (capture/upload failures, device faults, disk/CPU pressure) and from controller runners (stale-heartbeat nodes, failed uploads). Each event has a type, a severity (info/warning/critical), and a status, and is attached to a node, recording, and/or schedule.
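The open, repeat, and auto-resolve behaviour can be pictured as a small state machine that is advanced once per runner pass. The sketch below is illustrative only: the class, the tick counts, and the returned action names are assumptions, not the controller's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventTracker:
    open_after: int = 3     # consecutive bad passes before opening (assumed)
    resolve_after: int = 2  # consecutive good passes before auto-resolve (assumed)
    status: str = "clear"   # "clear" or "open"
    bad_ticks: int = 0
    good_ticks: int = 0
    repeats: int = 0

    def tick(self, problem: bool) -> Optional[str]:
        """Process one runner pass; return "opened", "repeated", "resolved", or None."""
        if problem:
            self.bad_ticks += 1
            self.good_ticks = 0
            if self.status == "open":
                self.repeats += 1  # problem persists on an already open event
                return "repeated"
            if self.bad_ticks >= self.open_after:
                self.status = "open"  # sustained long enough to open an event
                return "opened"
            return None
        self.good_ticks += 1
        self.bad_ticks = 0
        if self.status == "open" and self.good_ticks >= self.resolve_after:
            self.status = "clear"  # recovered for long enough: auto-resolve
            self.repeats = 0
            return "resolved"
        return None
```

Requiring several consecutive passes on both edges keeps a single noisy frame from opening an event and a single good frame from closing one.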

Operators work events on the Health page (health:read):

Search/filter by status, severity, type, node, schedule, recording, and opened/resolved date ranges, with active chips.
Acknowledge, suppress (mute), resolve, and reopen — individually or in bulk — all requiring health:acknowledge.
Export scoped/selected events as CSV.

Per-recording and per-schedule quality timelines show event-specific evidence (signal, speech, correlation, clipping, flatline, anomaly, upload-failure) laid out across the recording’s duration.

On-node evidence

Even when a node is isolated from the controller, the agent keeps a local health log — rotating JSONL by default, or a SQLite store — so an investigation can reconstruct what happened. Once connectivity returns, events sync to the controller. See the recorder agent for the event families and the CLI reference for the threshold and log-rotation knobs.
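For readers unfamiliar with the pattern, a rotating JSONL log can be built from Python's standard logging module, as in the minimal sketch below. The field names (ts, type, severity, synced), the size limits, and the helper names are assumptions for illustration; the agent's real schema and rotation settings are covered in the references named above.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler


def make_health_log(path: str, max_bytes: int = 1_000_000, backups: int = 5) -> logging.Logger:
    """Return a logger that appends one JSON object per line and rotates by size."""
    logger = logging.getLogger(f"health.{path}")
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep health records out of the root logger
    return logger


def log_event(logger: logging.Logger, event_type: str, severity: str, **fields: object) -> None:
    """Append one health event as a single JSON line (hypothetical schema)."""
    record = {
        "ts": time.time(),
        "type": event_type,
        "severity": severity,
        "synced": False,  # would flip once the controller has the event
        **fields,
    }
    logger.info(json.dumps(record, sort_keys=True))
```

Keeping each event on its own line means a partially written final line, for example after a power loss, can be skipped without losing the rest of the file.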

Metrics

Watchdog and audio-quality data is exported on /metrics (rakkr_input_*, rakkr_recording_watchdog_alerts_total, rakkr_device_xruns_total, …) for Prometheus alerting. See Metrics and Observability.

The checked contract, including which signals are validated and which long-duration real-room validations remain, is documented in HEALTH_WATCHDOG_BASELINE.