Summary
halpid 5.1.0 segfaults during early initialization when the RP2040 MCU is running firmware 3.1.0. systemd restarts it (Restart=on-failure, RestartSec=10) and one of the retries eventually survives — recovery is possible but takes several minutes and is non-deterministic. While crashlooping, halpi flash fails with "Failed to connect to daemon", blocking the canonical upgrade path.
Environment
- Daemon: halpid 5.1.0-2 (/usr/bin/halpid, deb from apt.hatlabs.fi trixie-stable).
- MCU firmware: 3.1.0 (reads of REG_HARDWARE_VERSION 0x03 and REG_DEVICE_ID 0x25 return all-zeros / "N/A" → those registers post-date 3.1.0).
- Host: HALPI2 with CM5 running HaLOS (image built ~mid-May 2026; firmware package halpi2-firmware 3.3.1-1 installed but not flashed, because postinst auto-flash can't run at image-bake time when there is no live halpid socket; tracked separately).
- Hostname: halpi.local (one specific device; not yet confirmed across multiple units).
Observed
- LEDs yellow (firmware in OperationalSolo because the watchdog is never being fed).
- halpi status / halpi flash → Error: Failed to connect to daemon.
- journalctl -u halpid shows ~10s restart loop, every attempt ends in code=killed, status=11/SEGV shortly after Starting state machine. Crash lands inconsistently: sometimes before LED socket listening is logged, sometimes after, sometimes after the state machine logs Starting power management state machine.
- After ~6 minutes (≈30 retries) one attempt survives into Initializing watchdog and State transition: Start -> Ok and the daemon stays up.
Sample journal (captured during the incident)
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Opened I2C device
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting HTTP server
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting LED socket
May 27 11:25:17 halpi systemd[1]: halpid.service: Main process exited, code=killed, status=11/SEGV
May 27 11:25:17 halpi systemd[1]: halpid.service: Failed with result 'signal'.
May 27 11:25:27 halpi systemd[1]: halpid.service: Scheduled restart job, restart counter is at 1.
...
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Opened I2C device
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting LED socket
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting state machine
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::led_socket: LED socket listening on /run/halpid/led.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting HTTP server
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::app: HTTP server listening on /run/halpid/halpid.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: Initializing watchdog
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: State transition: Start -> Ok
Suspected cause
SIGSEGV in safe Rust almost always points at unsafe code or FFI; here the likely candidate is the i2cdev ioctl surface. The crash window is the early concurrent init in halpid/src/main.rs: three tokio tasks (HTTP server, LED socket, state machine) are spawned back-to-back and all share Arc<Mutex<HalpiDevice>>. The Mutex serialises I2C transactions, but the device file descriptor and ioctl marshalling are FFI. Plausible mechanisms:
- A read against a register that doesn't exist in fw 3.1.0 (e.g. REG_HARDWARE_VERSION 0x03, REG_DEVICE_ID 0x25, both newer than 3.1.0) returns NAK/garbage timing that causes the i2cdev transfer() ioctl to land on a memory layout the new daemon doesn't anticipate.
- A race in i2cdev itself triggered by concurrent first-use across three tasks competing for the same fd.
Note: halpid has no firmware-version gate; get_hardware_version() failure is swallowed with unwrap_or_else(|_| Version::from_bytes([0, 0, 0, 0])) in halpid/src/main.rs. So the crash is not a guard; it's a real bug.
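One possible shape for the missing gate, as a minimal sketch rather than halpid's actual code: keep a failed version read distinguishable from a real version, and pick a startup mode from it instead of collapsing everything to 0.0.0.0. The Version layout, the StartupMode enum, startup_mode() and the 3.3.1 threshold are all hypothetical names and assumptions made for illustration. The threshold is only the version this report found not to reproduce the crash.

```rust
use std::fmt;

/// Hypothetical version type (four raw bytes, most significant first).
/// halpid's real `Version` may be laid out differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version([u8; 4]);

impl fmt::Display for Version {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let [a, b, c, d] = self.0;
        write!(f, "{}.{}.{}.{}", a, b, c, d)
    }
}

/// Assumed threshold: fw 3.3.1 is the version reported not to crashloop.
const MIN_FULL_FIRMWARE: Version = Version([3, 3, 1, 0]);

#[derive(Debug, PartialEq, Eq)]
enum StartupMode {
    /// Firmware is new enough: use every register.
    Full,
    /// Firmware is old or unreadable: avoid newer registers,
    /// but keep the daemon up so flashing stays possible.
    Compat,
}

/// Decide how to start based on the outcome of the firmware version read.
/// A read error is logged and handled explicitly instead of being
/// replaced by an all-zero version.
fn startup_mode<E: fmt::Debug>(firmware: Result<Version, E>) -> StartupMode {
    match firmware {
        Ok(v) if v >= MIN_FULL_FIRMWARE => StartupMode::Full,
        Ok(v) => {
            eprintln!("firmware {} is older than {}; compat mode", v, MIN_FULL_FIRMWARE);
            StartupMode::Compat
        }
        Err(e) => {
            eprintln!("firmware version read failed: {:?}; compat mode", e);
            StartupMode::Compat
        }
    }
}
```

Refusing to start on old firmware would be the wrong gate here, since halpi flash needs a running daemon; a reduced mode that still allows flashing seems the safer choice.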
Repro
- Flash MCU with firmware 3.1.0 (or use a unit that's never been upgraded since summer 2025).
- Install halpid 5.1.0-2.
- Start the service. Observe SEGV crashloop.
Workaround
- Wait for systemd to land on a lucky restart (Restart=on-failure, RestartSec=10, no StartLimitBurst → it will retry indefinitely).
- Once halpid is up, immediately run sudo halpi flash /usr/share/halpi2-firmware/halpi2-rs-firmware_3.3.1.bin && sudo reboot. With fw 3.3.1 the crashloop does not reproduce.
Investigation pointers
- halpid/src/main.rs: early concurrent task spawn (server / led_socket / state_machine).
- halpid/src/i2c/device.rs: read_bytes/write_bytes over i2cdev.
- halpid/src/state_machine/machine.rs: first iteration touches I2C immediately after spawn.
- Consider: serialise first-time I2C reads on a single thread before spawning the concurrent tasks; or probe the device once at startup and bail out cleanly on Err rather than relying on three concurrent first-use attempts.
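The "probe once, then spawn" idea from the last pointer could look roughly like the sketch below. It is not halpid's code: RegisterBus, Probe, probe_device(), start() and FakeBus are hypothetical names, std threads stand in for the tokio tasks so the example needs no external crates, and the fake bus replaces the real i2cdev handle. Only the register address 0x03 comes from this report.

```rust
use std::io;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical abstraction over the I2C device; not halpid's real API.
trait RegisterBus: Send {
    fn read_bytes(&mut self, reg: u8, buf: &mut [u8]) -> io::Result<()>;
}

/// Register address as quoted in this report.
const REG_HARDWARE_VERSION: u8 = 0x03;

#[derive(Debug, PartialEq, Eq)]
enum Probe {
    Version([u8; 4]),
    /// The register answered with all zeros, as seen on fw 3.1.0.
    Unsupported,
}

/// First I2C access happens here, on the calling thread, before any worker exists.
fn probe_device<B: RegisterBus>(bus: &mut B) -> io::Result<Probe> {
    let mut buf = [0u8; 4];
    bus.read_bytes(REG_HARDWARE_VERSION, &mut buf)?;
    Ok(if buf == [0u8; 4] { Probe::Unsupported } else { Probe::Version(buf) })
}

/// Probe once; on Err return immediately without spawning anything.
/// Only after a successful probe is the device shared with the workers.
fn start<B: RegisterBus + 'static>(mut bus: B, workers: usize) -> io::Result<Probe> {
    let probe = probe_device(&mut bus)?;
    let shared = Arc::new(Mutex::new(bus));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let dev = Arc::clone(&shared);
            thread::spawn(move || {
                let mut buf = [0u8; 4];
                // The mutex serialises every access made by the workers.
                let mut guard = dev.lock().unwrap();
                guard.read_bytes(REG_HARDWARE_VERSION, &mut buf)
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked")?;
    }
    Ok(probe)
}

/// In-memory fake used only to exercise the sketch; `None` simulates a bus error.
struct FakeBus {
    reply: Option<[u8; 4]>,
}

impl RegisterBus for FakeBus {
    fn read_bytes(&mut self, _reg: u8, buf: &mut [u8]) -> io::Result<()> {
        match self.reply {
            Some(bytes) => {
                let n = buf.len().min(bytes.len());
                buf[..n].copy_from_slice(&bytes[..n]);
                Ok(())
            }
            None => Err(io::Error::new(io::ErrorKind::Other, "simulated NAK")),
        }
    }
}
```

Calling start(FakeBus { reply: None }, 3) returns Err before any worker is spawned, which is the clean bail-out suggested above. If the crash still reproduces with the first access serialised this way, that would point away from the concurrent first-use theory and toward the missing-register theory.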
Severity
Medium-high. The workaround (flash latest fw → reboot) recovers a stuck unit, but the failure mode is invisible to users without journal access and the device looks bricked (yellow LEDs, daemon down, watchdog off). Likely to bite any unit being upgraded across the 3.1.0 → 3.3.x boundary, including HaLOS image installs onto pre-3.3 hardware.