Skip to content

A downstairs cant complete a Live Repair. #1921

@leftwo

Description

@leftwo

On Madrid, I have 191 instances,
190 of them are running IO,
2 CPUs, 8G memory.
They all have a single 4GiB (512b block size) disk and are doing IO to ~2GiB of that disk.

I reboot a sled, and instances move around and restart, but there are a few instances that
just can't get through Live Repair and have all three downstairs back.

Once such instance the upstairs connects to the failed downstairs.
It begins Live Repair, but gets a failure before it can complete the repair:

2026-04-14 03:13:18.494Z INFO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: Repair for extent 5 s:2 d:[ClientId(1)]
     = downstairs
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:18.494Z INFO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3:  1813552 final dependency list [[JobId(1813551)]]
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:18.999Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933

More flush check failed messages, then:

8798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:32.839Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: flush check fired despite having jobs; resetting it
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.126Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: timeout 1/3
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: downstairs disconnected
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z ERRO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: job id 1813552 saw error RepairStreamError("repair File { fd: 76, read: true, write: true }: stream error: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } }")
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z WARN propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: client task is sending Done(Finished)
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933
2026-04-14 03:13:33.521Z ERRO propolis-server/12822 (vm_state_driver) on oxz_propolis-server_5518798c-5f53-4816-b00f-f61c4507c6e3: DS Reports error RepairStreamError("repair File { fd: 76, read: true, write: true }: stream error: reqwest::Error { kind: Decode, source: reqwest::Error { kind: Body, source: TimedOut } }") on job 1813552, DownstairsIO { work: ExtentLiveRepair { dependencies: [JobId(1813551)], extent: ExtentId(5), source_downstairs: ClientId(2), source_repair_address: [fdd5:cfdb:cca5:102::6]:23004, repair_downstairs: [ClientId(1)] }, state: ClientData([Done, InProgress, Done]), res: None, acked: false, replay: false, data: None, io_limits: ClientMap(ClientData([None, None, None])) } EC
     = downstairs
    client = 1
    session_id = 750e20f9-bfab-4307-bc9b-2288b2676933

The downstairs reports:

03:13:13.123Z INFO crucible: accepted connection from [fdd5:cfdb:cca5:103::1:42a]:64975                                                
    task = main                                                                                                                        
03:13:13.123Z INFO crucible: connection ([fdd5:cfdb:cca5:103::1:42a]:64975): tasks spawned                                             
    id = 277                                                                                                                           
03:13:13.123Z INFO crucible: Connection request from c6c4571c-d3b8-4dcd-9f28-fe8e067fb010 with version 13                              
03:13:13.123Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } connected, version 13                                                                        
03:13:13.126Z INFO crucible: UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab-4307-bc9
b-2288b2676933, generation: 17 } is now active (read-write)
03:13:13.184Z INFO crucible: Current flush_numbers [0..12]: [35411, 38793, 38890, 38793, 27421, 38824, 38400, 38548, 37731, 37074, 3278
8, 32786]
03:13:13.190Z WARN crucible: 1812880 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:13.241Z WARN crucible: 1812908 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:13.264Z WARN crucible: 1812914 job Flush waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:14.614Z WARN crucible: 1813077 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:16.785Z WARN crucible: 1813342 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.014Z WARN crucible: 1813530 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.058Z WARN crucible: 1813554 job ELiveReopen waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.480Z WARN crucible: 1813562 job Flush waiting on 1 deps
    role = work
    upstairs_id = c6c4571c-d3b8-4dcd-9f28-fe8e067fb010
03:13:18.495Z INFO crucible: Created copy dir "/data/regions/659a224c-a175-4859-8275-baafbe548590/00/000/005.copy"
03:13:18.518Z INFO crucible: eid:5 Found repair files: ["005"]
03:13:33.520Z WARN crucible: handle_frame returns error proc_frame
     
    Caused by:
        Repair has failed, exiting task; disconnecting
03:13:33.522Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } (ConnectionId(277)) removed, 2 jobs left
03:13:33.522Z INFO crucible: upstairs UpstairsConnection { upstairs_id: c6c4571c-d3b8-4dcd-9f28-fe8e067fb010, session_id: 750e20f9-bfab
-4307-bc9b-2288b2676933, generation: 17 } (ConnectionId(277)) was previously active, clearing
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z WARN crucible: got message for disconnected id ConnectionId(277); ignoring
03:13:33.522Z INFO crucible: got cancelled token; recv_task stopping
    id = 277
    task = recv

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions