halpid SIGSEGV crashloop at startup against MCU firmware 3.1.0 #101

@mairas

Summary

halpid 5.1.0 segfaults during early initialization when the RP2040 MCU is running firmware 3.1.0. systemd restarts it (Restart=on-failure, RestartSec=10) and one of the retries eventually survives — recovery is possible but takes several minutes and is non-deterministic. While crashlooping, halpi flash fails with "Failed to connect to daemon", blocking the canonical upgrade path.

Environment

  • Daemon: halpid 5.1.0-2 (/usr/bin/halpid, deb from apt.hatlabs.fi trixie-stable).
  • MCU firmware: 3.1.0 (reads of REG_HARDWARE_VERSION 0x03 and REG_DEVICE_ID 0x25 return all-zeros / "N/A" → those registers post-date 3.1.0).
  • Host: HALPI2 with CM5 running HaLOS (image built ~mid-May 2026; firmware package halpi2-firmware 3.3.1-1 installed but not flashed — postinst auto-flash can't run at image-bake time because there is no live halpid socket; tracked separately).
  • Hostname: halpi.local (one specific device; not yet confirmed across multiple units).

Observed

  • LEDs yellow (firmware in OperationalSolo because the watchdog is never being fed).
  • halpi status / halpi flash → Error: Failed to connect to daemon.
  • journalctl -u halpid shows a ~10 s restart loop; every attempt ends in code=killed, status=11/SEGV shortly after Starting state machine. The crash point varies: sometimes before LED socket listening is logged, sometimes after, sometimes after the state machine logs Starting power management state machine.
  • After ~6 minutes (≈30 retries) one attempt survives into Initializing watchdog and State transition: Start -> Ok and the daemon stays up.

Sample journal (captured during the incident)

May 27 11:25:17 halpi halpid[1174]: INFO halpid: Opened I2C device
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting HTTP server
May 27 11:25:17 halpi halpid[1174]: INFO halpid: Starting LED socket
May 27 11:25:17 halpi systemd[1]: halpid.service: Main process exited, code=killed, status=11/SEGV
May 27 11:25:17 halpi systemd[1]: halpid.service: Failed with result 'signal'.
May 27 11:25:27 halpi systemd[1]: halpid.service: Scheduled restart job, restart counter is at 1.
...
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Opened I2C device
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Hardware version: N/A, LED count: 5
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting LED socket
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting state machine
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::led_socket: LED socket listening on /run/halpid/led.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid: Starting HTTP server
May 27 11:31:06 halpi halpid[10304]: INFO halpid::server::app: HTTP server listening on /run/halpid/halpid.sock
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: Initializing watchdog
May 27 11:31:06 halpi halpid[10304]: INFO halpid::state_machine::machine: State transition: Start -> Ok

Suspected cause

SIGSEGV in safe Rust almost always means FFI — here the i2cdev ioctl surface. The crash window is the early concurrent init in halpid/src/main.rs: three tokio tasks (HTTP server, LED socket, state machine) are spawned back-to-back and all share Arc<Mutex<HalpiDevice>>. The Mutex serialises I2C transactions, but the device file descriptor and ioctl marshalling are FFI. Plausible mechanisms:

  1. A read against a register that doesn't exist in fw 3.1.0 (e.g. REG_HARDWARE_VERSION 0x03, REG_DEVICE_ID 0x25 — both newer than 3.1.0) returns NAK/garbage timing that causes the i2cdev transfer() ioctl to land on a memory layout the new daemon doesn't anticipate.
  2. A race in i2cdev itself triggered by concurrent first-use across three tasks competing for the same fd.

Note: halpid has no firmware-version gate; a get_hardware_version() failure is swallowed with unwrap_or_else(|_| Version::from_bytes([0, 0, 0, 0])) in halpid/src/main.rs. So the crash is not an intentional guard against old firmware; it's a real bug.
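For illustration, the missing gate could look roughly like the sketch below. It is not halpid's code: the `Version` byte layout, the `MIN_FIRMWARE` threshold, and the `check_firmware` / `FirmwareCheck` names are all assumptions made for this example. The point is only that a failed read stays visible as its own outcome instead of collapsing into an all-zero version.

```rust
/// Firmware version as four bytes.
/// ASSUMPTION: layout is [major, minor, patch, extra]; the issue only
/// shows that `Version::from_bytes` takes four bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version([u8; 4]);

impl Version {
    fn from_bytes(bytes: [u8; 4]) -> Self {
        Version(bytes)
    }
}

/// HYPOTHETICAL minimum supported firmware. The issue only establishes
/// that 3.1.0 crashes and 3.3.1 does not.
const MIN_FIRMWARE: Version = Version([3, 3, 0, 0]);

/// Outcome of the startup check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum FirmwareCheck {
    Supported(Version),
    TooOld { found: Version, minimum: Version },
    /// The read itself failed; kept distinct from "version 0.0.0".
    Unreadable(String),
}

/// Classify the result of a firmware-version read without hiding errors.
fn check_firmware(read: Result<Version, String>) -> FirmwareCheck {
    match read {
        // Derived ordering compares the byte arrays lexicographically,
        // i.e. major first, then minor, then patch.
        Ok(found) if found >= MIN_FIRMWARE => FirmwareCheck::Supported(found),
        Ok(found) => FirmwareCheck::TooOld { found, minimum: MIN_FIRMWARE },
        Err(reason) => FirmwareCheck::Unreadable(reason),
    }
}
```

With something like this in place, the daemon could log a clear "firmware too old, please flash" message for the TooOld and Unreadable cases while still keeping the socket up so that halpi flash remains usable.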

Repro

  1. Flash MCU with firmware 3.1.0 (or use a unit that's never been upgraded since summer 2025).
  2. Install halpid 5.1.0-2.
  3. Start the service. Observe SEGV crashloop.

Workaround

  1. Wait for systemd to land on a lucky restart (Restart=on-failure, RestartSec=10, no StartLimitBurst → it will retry indefinitely).
  2. Once halpid is up, immediately run sudo halpi flash /usr/share/halpi2-firmware/halpi2-rs-firmware_3.3.1.bin && sudo reboot. With fw 3.3.1 the crashloop does not reproduce.
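
To avoid watching the journal by hand, step 1 could be automated with a small poller such as the sketch below. It only assumes the HTTP socket path seen in the sample journal (/run/halpid/halpid.sock); the `wait_for_socket` helper is hypothetical and not part of halpid or the halpi CLI. A successful connect shows that the daemon got as far as listening, not that it will stay up, so it is worth confirming with halpi status before flashing.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

/// Try to connect to a Unix socket up to `attempts` times, sleeping
/// `delay` between tries. Returns true as soon as one connect succeeds.
/// (Hypothetical helper, for illustration only.)
fn wait_for_socket(path: &Path, attempts: u32, delay: Duration) -> bool {
    for attempt in 0..attempts {
        if UnixStream::connect(path).is_ok() {
            return true;
        }
        // No need to sleep after the final failed attempt.
        if attempt + 1 < attempts {
            sleep(delay);
        }
    }
    false
}
```

For example, wait_for_socket(Path::new("/run/halpid/halpid.sock"), 60, Duration::from_secs(10)) would cover roughly ten minutes of restart attempts at the RestartSec=10 cadence.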

Investigation pointers

  • halpid/src/main.rs — early concurrent task spawn (server / led_socket / state_machine).
  • halpid/src/i2c/device.rs: read_bytes/write_bytes over i2cdev.
  • halpid/src/state_machine/machine.rs — first iteration touches I2C immediately after spawn.
  • Consider: serialise first-time I2C reads on a single thread before spawning the concurrent tasks; or probe the device once at startup and bail out cleanly on Err rather than relying on three concurrent first-use attempts.
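
A minimal sketch of the "probe once, then spawn" ordering from the last pointer follows. It uses std threads and a simulated bus so that it compiles on its own. The real daemon uses tokio tasks and Arc<Mutex<HalpiDevice>>, and the `RegisterBus` trait, `FakeBus`, `probe`, and `start_daemon` names are invented for this example. It also models a missing register as an Err, whereas the report above saw all-zero reads, so treat it as an outline of the startup shape rather than a fix.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

const REG_HARDWARE_VERSION: u8 = 0x03;
const REG_DEVICE_ID: u8 = 0x25;

/// Hypothetical stand-in for the daemon's I2C device wrapper.
trait RegisterBus: Send {
    fn read_byte(&mut self, reg: u8) -> Result<u8, String>;
}

/// Simulated MCU: only registers listed in `known` answer.
struct FakeBus {
    known: Vec<u8>,
}

impl RegisterBus for FakeBus {
    fn read_byte(&mut self, reg: u8) -> Result<u8, String> {
        if self.known.contains(&reg) {
            Ok(reg)
        } else {
            Err(format!("no answer from register {reg:#04x}"))
        }
    }
}

/// Single-threaded probe: read every required register once and stop
/// at the first failure, before any concurrency exists.
fn probe<B: RegisterBus>(bus: &mut B, required: &[u8]) -> Result<(), String> {
    for &reg in required {
        bus.read_byte(reg)?;
    }
    Ok(())
}

/// Probe first; only on success share the bus with three workers.
/// Returns how many workers completed their first read.
fn start_daemon<B: RegisterBus + 'static>(mut bus: B, required: &[u8]) -> Result<usize, String> {
    probe(&mut bus, required)?; // clean Err instead of a crashloop
    let shared = Arc::new(Mutex::new(bus));
    let names = ["http_server", "led_socket", "state_machine"];
    let workers: Vec<_> = names
        .iter()
        .map(|&name| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || -> Result<(), String> {
                let mut bus = shared
                    .lock()
                    .map_err(|_| format!("{name}: bus mutex poisoned"))?;
                bus.read_byte(REG_HARDWARE_VERSION).map(|_| ())
            })
        })
        .collect();
    let mut succeeded = 0;
    for worker in workers {
        if let Ok(Ok(())) = worker.join() {
            succeeded += 1;
        }
    }
    Ok(succeeded)
}
```

The design choice here is that the first I2C traffic happens on one thread with an explicit error path, so an old or unresponsive MCU produces a logged error rather than three tasks racing on first use.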

Severity

Medium-high. The workaround (flash latest fw → reboot) recovers a stuck unit, but the failure mode is invisible to users without journal access, and the device looks bricked (yellow LEDs, daemon down, watchdog off). It is likely to bite any unit being upgraded across the 3.1.0 → 3.3.x boundary, including HaLOS image installs onto units still running pre-3.3 firmware.
