halpid SIGSEGV crashloop at startup against MCU firmware 3.1.0

## Summary

`halpid` 5.1.0 segfaults during early initialization when the RP2040 MCU is running firmware 3.1.0. systemd restarts it (`Restart=on-failure`, `RestartSec=10`) and one of the retries eventually survives — recovery is possible but takes several minutes and is non-deterministic. While crashlooping, `halpi flash` fails with "Failed to connect to daemon", blocking the canonical upgrade path.

## Environment

- Daemon: `halpid 5.1.0-2` (`/usr/bin/halpid`, deb from `apt.hatlabs.fi` `trixie-stable`).
- MCU firmware: `3.1.0` (reads of `REG_HARDWARE_VERSION` 0x03 and `REG_DEVICE_ID` 0x25 return all-zeros / "N/A" → those registers post-date 3.1.0).
- Host: HALPI2 with CM5 running HaLOS (image built ~mid-May 2026; firmware package `halpi2-firmware 3.3.1-1` installed but not flashed — `postinst` auto-flash can't run at image-bake time because there is no live `halpid` socket; tracked separately).
- Hostname: `halpi.local` (one specific device; not yet confirmed across multiple units).

## Observed

- LEDs yellow (firmware in `OperationalSolo` because the watchdog is never being fed).
- `halpi status` / `halpi flash` → `Error: Failed to connect to daemon`.
- `journalctl -u halpid` shows a ~10s restart loop. Every attempt ends in `code=killed, status=11/SEGV` within the same second as startup, around the point where `Starting state machine` is logged. The crash point varies: sometimes before `LED socket listening` is logged, sometimes after, sometimes after the state machine logs `Starting power management state machine`.
- After ~6 minutes (≈30 retries) one attempt survives into `Initializing watchdog` and `State transition: Start -> Ok` and the daemon stays up.

### Sample journal (captured during the incident)

```
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Opened I2C device
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting HTTP server
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting LED socket
May 27 11:25:17 halpi systemd[1]: halpid.service: Main process exited, code=killed, status=11/SEGV
May 27 11:25:17 halpi systemd[1]: halpid.service: Failed with result 'signal'.
May 27 11:25:27 halpi systemd[1]: halpid.service: Scheduled restart job, restart counter is at 1.
...
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Opened I2C device
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting LED socket
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting state machine
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::led_socket: LED socket listening on /run/halpid/led.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting HTTP server
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::app: HTTP server listening on /run/halpid/halpid.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: Initializing watchdog
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: State transition: Start -> Ok
```
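To see how the crash point varies across a saved journal, the following sketch reports the last `halpid[...]` message logged before each `status=11/SEGV` exit. The function name `last_stage_before_segv` is hypothetical, and it assumes the line layout shown in the sample above (`halpid[PID]: message`); it is a triage aid, not part of `halpid`.

```rust
/// For every `status=11/SEGV` line, return the last `halpid[PID]` message
/// seen before it (assumes the journal line format from the sample above).
fn last_stage_before_segv(journal: &str) -> Vec<String> {
    let mut last: Option<String> = None;
    let mut stages = Vec::new();
    for line in journal.lines() {
        if line.contains("status=11/SEGV") {
            // One crash: record what the daemon last managed to log.
            stages.push(
                last.take()
                    .unwrap_or_else(|| "<no halpid output>".to_string()),
            );
        } else if let Some(idx) = line.find("halpid[") {
            // Daemon line: keep the text after "halpid[PID]: ".
            if let Some(pos) = line[idx..].find("]: ") {
                last = Some(line[idx + pos + 3..].to_string());
            }
        }
    }
    stages
}
```

Feeding it the output of `journalctl -u halpid` gives one entry per crash, which makes it easy to tally which init stage each attempt reached.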

## Suspected cause

A SIGSEGV in a program written in safe Rust almost always points at `unsafe` code or FFI; here the likely candidate is the `i2cdev` ioctl surface. The crash window is the early concurrent init in `halpid/src/main.rs`: three tokio tasks (HTTP server, LED socket, state machine) are spawned back-to-back and all share `Arc<Mutex<HalpiDevice>>`. The Mutex serialises I2C transactions, but the device file descriptor and ioctl marshalling go through FFI. Plausible mechanisms:

1. A read against a register that doesn't exist in fw 3.1.0 (e.g. `REG_HARDWARE_VERSION` 0x03, `REG_DEVICE_ID` 0x25, both newer than 3.1.0) returns a NAK or garbage, and the `i2cdev` `transfer()` ioctl path ends up with a buffer state the new daemon doesn't anticipate.
2. A race in `i2cdev` itself triggered by concurrent first-use across three tasks competing for the same fd.

Note: `halpid` has no firmware-version gate; a `get_hardware_version()` failure is swallowed with `unwrap_or_else(|_| Version::from_bytes([0, 0, 0, 0]))` in `halpid/src/main.rs`. So the crash is not an intentional guard tripping on old firmware; it's a real bug.
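One way to stop the zero sentinel from hiding what happened is to keep the three outcomes apart: a real version, an all-zero read (as seen on fw 3.1.0), and a failed transaction. The sketch below is illustrative only. `VersionProbe` and `classify_version_read` are hypothetical names, the local `Version` type is a stand-in for the daemon's own, and it assumes that an all-zero read means the register is absent rather than a genuine 0.0.0.0 version.

```rust
/// Stand-in for the daemon's version type (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Version([u8; 4]);

#[derive(Debug, PartialEq, Eq)]
enum VersionProbe {
    /// The register returned a plausible version.
    Known(Version),
    /// The register answered with all zeros, as observed on firmware 3.1.0.
    RegisterAbsent,
    /// The I2C transaction itself failed.
    ReadFailed(String),
}

/// Classify a raw 4-byte register read instead of collapsing every
/// failure into a 0.0.0.0 sentinel.
fn classify_version_read(raw: Result<[u8; 4], String>) -> VersionProbe {
    match raw {
        // Assumption: all zeros means "register not implemented".
        Ok([0, 0, 0, 0]) => VersionProbe::RegisterAbsent,
        Ok(bytes) => VersionProbe::Known(Version(bytes)),
        Err(e) => VersionProbe::ReadFailed(e),
    }
}
```

With the outcomes separated, startup code could log "N/A" for `RegisterAbsent`, skip registers known to post-date that firmware, and treat `ReadFailed` as a reason to exit cleanly.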

## Repro

1. Flash MCU with firmware 3.1.0 (or use a unit that's never been upgraded since summer 2025).
2. Install `halpid 5.1.0-2`.
3. Start the service. Observe SEGV crashloop.

## Workaround

1. Wait for systemd to land on a lucky restart (`Restart=on-failure`, `RestartSec=10`, no `StartLimitBurst` → it will retry indefinitely).
2. Once `halpid` is up, immediately run `sudo halpi flash /usr/share/halpi2-firmware/halpi2-rs-firmware_3.3.1.bin && sudo reboot`. With fw 3.3.1 the crashloop does not reproduce.

## Investigation pointers

- `halpid/src/main.rs` — early concurrent task spawn (server / led_socket / state_machine).
- `halpid/src/i2c/device.rs` — `read_bytes`/`write_bytes` over `i2cdev`.
- `halpid/src/state_machine/machine.rs` — first iteration touches I2C immediately after spawn.
- Consider: serialise first-time I2C reads on a single thread before spawning the concurrent tasks; or probe the device once at startup and bail out cleanly on `Err` rather than relying on three concurrent first-use attempts.
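The last pointer (probe once, then spawn) can be sketched roughly as follows. This is a minimal model under stated assumptions, not the daemon's code: `I2cBus`, `FakeBus`, and `probe_then_spawn` are hypothetical names, plain `std::thread` stands in for the tokio tasks, and the fake bus cannot reproduce the segfault itself. It only shows the ordering: one first-use read on the calling thread, a clean `Err` if that fails, and sharing behind `Arc<Mutex<_>>` only afterwards.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical minimal bus abstraction standing in for the real device.
trait I2cBus {
    fn read_reg(&mut self, reg: u8) -> Result<u8, String>;
}

/// Fake bus for the sketch: echoes the register number, or fails.
struct FakeBus {
    healthy: bool,
}

impl I2cBus for FakeBus {
    fn read_reg(&mut self, reg: u8) -> Result<u8, String> {
        if self.healthy {
            Ok(reg)
        } else {
            Err("NAK".to_string())
        }
    }
}

/// Probe the bus once on the calling thread, then share it with three
/// workers. Returns each worker's first read, in spawn order.
fn probe_then_spawn<B: I2cBus + Send + 'static>(
    mut bus: B,
    probe_reg: u8,
) -> Result<Vec<u8>, String> {
    // First use happens here, before any concurrency exists.
    bus.read_reg(probe_reg)
        .map_err(|e| format!("startup probe failed: {e}"))?;

    // Only a bus that answered the probe is shared between workers.
    let shared = Arc::new(Mutex::new(bus));
    let handles: Vec<_> = (0..3u8)
        .map(|i| {
            let dev = Arc::clone(&shared);
            thread::spawn(move || -> Result<u8, String> {
                let mut guard = dev
                    .lock()
                    .map_err(|_| "device mutex poisoned".to_string())?;
                let value = guard.read_reg(i);
                value
            })
        })
        .collect();

    let mut reads = Vec::new();
    for handle in handles {
        reads.push(handle.join().map_err(|_| "worker panicked".to_string())??);
    }
    Ok(reads)
}
```

If the crash still reproduces with the first read moved ahead of the spawns, that would point away from a concurrent first-use race and towards the register-read hypothesis.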

## Severity

Medium-high. The workaround (flash latest fw → reboot) recovers a stuck unit, but the failure mode is invisible to users without journal access and the device looks bricked (yellow LEDs, daemon down, watchdog off). Likely to bite any unit being upgraded across the 3.1.0 → 3.3.x boundary, including HaLOS image installs onto pre-3.3 hardware.
