[Bug]: dstackai/efa image fails to provision with Runpod #2729

@Bihan

Description

Steps to reproduce

  1. Create a task using the following configuration:
type: task
name: test-efa-runpod

image: dstackai/efa

commands:
  - export PATH=/opt/conda/envs/workflow/bin:$PATH
  - pip install torch

resources:
  gpu: L4:1
  2. Submit the task.

Actual behaviour

A task using the dstackai/efa image on Runpod stays in the provisioning state for over 18 minutes and ultimately fails with JOB_FAILED (FAILED_TO_START_DUE_TO_NO_CAPACITY).

The Runpod logs.txt shows:

2025-06-03T07:54:18Z create container dstackai/efa  
2025-06-03T07:54:20Z latest Pulling from dstackai/efa  
2025-06-03T07:54:20Z 7a2c55901189 Already exists  
...  
2025-06-03T07:54:20Z 0306448bfdf8 Pulling fs layer  
...  
2025-06-03T07:54:20Z 0306448bfdf8 Waiting  
...  
2025-06-03T07:54:21Z c7de18f1cb15 Downloading [=>           ]   4.86MB/231.1MB  
...  
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [=================================================> ]  1.151GB/1.175GB  
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [=================================================> ]  1.169GB/1.175GB  
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [==================================================>]  1.175GB/1.175GB  
2025-06-03T07:55:18Z failed to pull image: failed to register layer: Container ID 78971 cannot be mapped to a host ID  
2025-06-03T07:55:19Z create container dstackai/efa  
2025-06-03T07:55:22Z latest Pulling from dstackai/efa  
2025-06-03T07:55:22Z 7a2c55901189 Already exists  
...  
<repeats again>

This shows that the image pull is retried repeatedly and fails. The detailed logs are attached.

logs (1).txt
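
The retry pattern in the log excerpt can be summarized programmatically. The following is a minimal sketch, not part of dstack or Runpod: `summarize_pull_attempts` is a hypothetical helper, and it assumes each log line has the form `<UTC timestamp>Z <message>`, that an attempt begins at a `create container` line, and that a failure is reported by a `failed to pull image` line, as in the excerpt. `SAMPLE` is a shortened copy of the lines quoted in this report.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt: "<ISO timestamp>Z <message>".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\s+(.*?)\s*$")


def summarize_pull_attempts(log_text):
    """Group log lines into pull attempts and record when each one failed."""
    attempts = []
    current = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip elided lines such as "..."
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        msg = match.group(2)
        if msg.startswith("create container"):
            # A new attempt starts each time the container is (re)created.
            current = {"started": ts, "failed": None, "error": None}
            attempts.append(current)
        elif msg.startswith("failed to pull image") and current is not None:
            current["failed"] = ts
            current["error"] = msg
    return attempts


# Shortened copy of the excerpt from this report.
SAMPLE = """\
2025-06-03T07:54:18Z create container dstackai/efa
2025-06-03T07:54:20Z latest Pulling from dstackai/efa
2025-06-03T07:55:18Z failed to pull image: failed to register layer: Container ID 78971 cannot be mapped to a host ID
2025-06-03T07:55:19Z create container dstackai/efa
2025-06-03T07:55:22Z latest Pulling from dstackai/efa
"""

attempts = summarize_pull_attempts(SAMPLE)
for number, attempt in enumerate(attempts, start=1):
    if attempt["failed"] is None:
        print(f"attempt {number}: no failure recorded in this excerpt")
    else:
        seconds = (attempt["failed"] - attempt["started"]).total_seconds()
        print(f"attempt {number}: failed after {seconds:.0f}s: {attempt['error']}")
```

Running the same function over the full attached log would give the number of attempts made during the provisioning window and show whether every attempt fails with the same error.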

Expected behaviour

Provisioning (including image pull) should complete within 5–6 minutes, as it does when running the same configuration on GCP.

dstack version

master branch at commit 1e16fe1

Server logs

Additional information

Runpod POD ID: jmp45ktl3cfukn
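
A possible line of investigation, not confirmed in this report: Docker emits "Container ID ... cannot be mapped to a host ID" when a layer contains a file whose UID or GID lies outside the ID range mapped on the host, for example under user-namespace remapping. The sketch below looks for such files. It assumes that it is run against a filesystem where the image's ownership is preserved (for example inside a container started from the image on a backend where the pull succeeds, such as GCP), and that the failing host maps only IDs 0 through 65535. Both the limit and the helper name `find_unmappable_owners` are assumptions for illustration.

```python
import os


def find_unmappable_owners(root, max_id=65535):
    """Return (path, uid, gid) for entries owned by an ID above max_id.

    max_id is an assumed upper bound of the host's ID mapping; adjust it
    to match the actual mapping on the failing host.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are checked themselves, not their targets.
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if st.st_uid > max_id or st.st_gid > max_id:
                hits.append((path, st.st_uid, st.st_gid))
    return hits


if __name__ == "__main__":
    # Example: scan the conda environment referenced in the task configuration.
    for path, uid, gid in find_unmappable_owners("/opt/conda"):
        print(f"{path}: uid={uid} gid={gid}")
```

If the scan reports an entry owned by ID 78971, that would be consistent with the error in the Runpod log. An empty result would suggest the cause lies elsewhere.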

Labels: bug, stale