On Madrid, I have 191 instances,
190 of them are running IO,
2 CPUs, 8G memory.
They all have a single 4GiB (512b block size) disk and are doing IO to ~2GiB of that disk.
I reboot a sled, and instances move around and restart, but there are a few instances that
just can't get through Live Repair and have all three downstairs back.
Once such instance the upstairs connects to the failed downstairs.
It begins Live Repair, but gets a failure before it can complete the repair:
2026-04-14 03:13:18.494Z INFO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: Repair for extent 5 s:2 d:[ClientId(1)]
= downstairs
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:18.494Z INFO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: 1813552 final dependency list [[JobId(1813551)]]
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:18.999Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
More flush check failed messages, then:
8798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:32.839Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.126Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: timeout 1/3
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: downstairs disconnected
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z ERRO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: job id 1813552 saw error RepairStreamError("repair File { fd: 76, read: true, write: true }: stream error: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } }")
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: client task is sending Done(Finished)
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z ERRO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: DS Reports error RepairStreamError("repair File { fd: 76, read: true, write: true }: stream error: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } }") on job 1813552, DownstairsIO { work: ExtentLiveRepair { dependencies: [JobId(1813551)], extent: ExtentId(5), source_downstairs: ClientId(2), source_repair_address: [fdd5:cfdb:cca5:102::6]:23004, repair_downstairs: [ClientId(1)] }, state: ClientData([Done, InProgress, Done]), res: None, acked: false, replay: false, data: None, io_limits: ClientMap(ClientData([None, None, None])) } EC
= downstairs
client = 1
session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
The downstairs reports:
03:13:13.123Z INFO crucible: accepted connection from [fdd5:cfdb:cca5:103::1:42a]:64975
task = main
03:13:13.123Z INFO crucible: connection ([fdd5:cfdb:cca5:103::1:42a]:64975): tasks spawned
id = 277
03:13:13.123Z INFO crucible: Connection request from c6c4571c-d3b8-4dcd-9f28-fe8e067fb010 with version 13
03:13:13.123Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } connected, version 13
03:13:13.126Z INFO crucible: UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab-4307-bc9
b-2288b2676933, generation: 17 } is now active (read-write)
03:13:13.184Z INFO crucible: Current flush_numbers [0..12]: [35411, 38793, 38890, 38793, 27421, 38824, 38400, 38548, 37731, 37074, 3278
8, 32786]
03:13:13.190Z WARN crucible: 1812880 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:13.241Z WARN crucible: 1812908 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:13.264Z WARN crucible: 1812914 job Flush waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:14.614Z WARN crucible: 1813077 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:16.785Z WARN crucible: 1813342 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.014Z WARN crucible: 1813530 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.058Z WARN crucible: 1813554 job ELiveReopen waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.480Z WARN crucible: 1813562 job Flush waiting on 1 deps
role = work
upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.495Z INFO crucible: Created copy dir "/data/regions/659a224c-a175-4859-8275-baafbe548590/00/000/005.copy"
03:13:18.518Z INFO crucible: eid:5 Found repair files: ["005"]
03:13:33.520Z WARN crucible: handle_frame returns error proc_frame
Caused by:
Repair has failed, exiting task; disconnecting
03:13:33.522Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } (ConnectionId(277)) removed, 2 jobs left
03:13:33.522Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } (ConnectionId(277)) was previously active, clearing
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z INFO crucible: got cancelled token; recv_task stopping
id = 277
task = recv
On Madrid, I have 191 instances,
190 of them are running IO,
2 CPUs, 8G memory.
They all have a single 4GiB (512b block size) disk and are doing IO to ~2GiB of that disk.
I reboot a sled, and instances move around and restart, but there are a few instances that
just can't get through Live Repair and have all three downstairs back.
Once such instance the upstairs connects to the failed downstairs.
It begins Live Repair, but gets a failure before it can complete the repair:
More flush check failed messages, then:
The downstairs reports: