Description
Steps to reproduce
- Create a task using the following configuration:
  type: task
  name: test-efa-runpod
  image: dstackai/efa
  commands:
    - export PATH=/opt/conda/envs/workflow/bin:$PATH
    - pip install torch
  resources:
    gpu: L4:1
- Submit the task.
Actual behaviour
A task using the dstackai/efa image on Runpod is stuck in the provisioning state for over 18 minutes and ultimately fails with JOB_FAILED (FAILED_TO_START_DUE_TO_NO_CAPACITY).
Examining the Runpod logs.txt, we find:
2025-06-03T07:54:18Z create container dstackai/efa
2025-06-03T07:54:20Z latest Pulling from dstackai/efa
2025-06-03T07:54:20Z 7a2c55901189 Already exists
...
2025-06-03T07:54:20Z 0306448bfdf8 Pulling fs layer
...
2025-06-03T07:54:20Z 0306448bfdf8 Waiting
...
2025-06-03T07:54:21Z c7de18f1cb15 Downloading [=> ] 4.86MB/231.1MB
...
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [=================================================> ] 1.151GB/1.175GB
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [=================================================> ] 1.169GB/1.175GB
2025-06-03T07:55:17Z 6b7ae4f9eaa6 Extracting [==================================================>] 1.175GB/1.175GB
2025-06-03T07:55:18Z failed to pull image: failed to register layer: Container ID 78971 cannot be mapped to a host ID
2025-06-03T07:55:19Z create container dstackai/efa
2025-06-03T07:55:22Z latest Pulling from dstackai/efa
2025-06-03T07:55:22Z 7a2c55901189 Already exists
...
<repeats again>
This shows repeated image pull attempts, each failing with "failed to register layer". The detailed logs are attached.
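To check how often the pull fails and how long each attempt takes, the attached log can be summarized with a small script. This is a minimal sketch, assuming every relevant line has the form `<timestamp>Z <message>` as in the excerpt above and that each attempt starts with a `create container` line. The function name `summarize_pull_attempts` is hypothetical and not part of dstack or Runpod.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt: "<ISO timestamp>Z <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\s+(.*)$")


def summarize_pull_attempts(log_text):
    """Group log lines into pull attempts and record how each one ended."""
    attempts = []
    current = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip "..." and other lines without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        message = match.group(2)
        if message.startswith("create container"):
            # A new attempt begins; error/seconds stay None until it fails.
            current = {"started": ts, "error": None, "seconds": None}
            attempts.append(current)
        elif message.startswith("failed to pull image") and current is not None:
            current["error"] = message
            current["seconds"] = (ts - current["started"]).total_seconds()
    return attempts


# Lines copied from the excerpt above (progress lines omitted).
sample = """\
2025-06-03T07:54:18Z create container dstackai/efa
2025-06-03T07:54:20Z latest Pulling from dstackai/efa
2025-06-03T07:55:18Z failed to pull image: failed to register layer: Container ID 78971 cannot be mapped to a host ID
2025-06-03T07:55:19Z create container dstackai/efa
2025-06-03T07:55:22Z latest Pulling from dstackai/efa
"""

for attempt in summarize_pull_attempts(sample):
    print(attempt["started"].isoformat(), attempt["seconds"], attempt["error"])
```

Running it against the full logs.txt (for example, `summarize_pull_attempts(open("logs.txt").read())`) would show whether every attempt ends with the same "cannot be mapped to a host ID" error and roughly how long each retry takes.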
Expected behaviour
Provisioning (including image pull) should complete within 5–6 minutes, as it does when running the same configuration on GCP.
dstack version
master branch at commit 1e16fe1
Server logs
Additional information
Runpod POD ID: jmp45ktl3cfukn
