[Bug]: Lambda instance becomes unreachable after dstack server restart #2669

@jvstme

Steps to reproduce

  1. Create an instance using the lambda backend and wait until it becomes idle or busy.
  2. Restart the dstack server.

Actual behaviour

The instance becomes unreachable and never recovers. If it was running a job, the job is terminated. dstack-shim no longer runs on the instance.

The first shim health check attempt fails with this error:

    DEBUG    dstack._internal.server.background.tasks.process_instances:747 Check instance cloud-0 status. shim health: Can't request shim: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

All other health checks fail with this error:

    DEBUG    dstack._internal.server.background.tasks.process_instances:747 Check instance cloud-0 status. shim health: Can't request shim: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
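For triage, it may help to tell the two failures apart outside the server. Below is a minimal standard-library sketch, not dstack code: `probe_health`, `start_silent_server`, the `/healthcheck` path and the returned labels are all hypothetical names chosen for illustration, and the host and port of the shim endpoint (or of a tunnel to it) are assumed to be known to the caller. It sends one GET request and classifies the result. Note that `http.client.RemoteDisconnected` is a subclass of `ConnectionResetError`, so it has to be caught first.

```python
import http.client
import socket
import threading


def probe_health(host: str, port: int, path: str = "/healthcheck", timeout: float = 5.0) -> str:
    """Send one GET request and return a short label describing the outcome.

    The path is a placeholder (assumption); pass the real health endpoint.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return "ok" if response.status == 200 else f"http-{response.status}"
    except http.client.RemoteDisconnected:
        # Peer accepted the connection and closed it without sending a response.
        # Must precede ConnectionResetError, which is its base class.
        return "closed-without-response"
    except ConnectionResetError:
        # Peer sent a TCP reset (errno 104 on Linux).
        return "reset-by-peer"
    except ConnectionRefusedError:
        # Nothing is listening on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    finally:
        conn.close()


def start_silent_server() -> tuple[socket.socket, int]:
    """Local stand-in for a broken peer: read one request, then close silently."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)

    def serve_once() -> None:
        client, _ = server.accept()
        with client:
            data = b""
            # Drain the request headers so that closing sends a clean FIN, not a reset.
            while b"\r\n\r\n" not in data:
                chunk = client.recv(4096)
                if not chunk:
                    break
                data += chunk
        # Leaving the with block closes the connection without replying.

    threading.Thread(target=serve_once, daemon=True).start()
    return server, server.getsockname()[1]


if __name__ == "__main__":
    server, port = start_silent_server()
    print(probe_health("127.0.0.1", port))  # closed-without-response
    server.close()
```

A `closed-without-response` result followed by `reset-by-peer` on later attempts would match the sequence in the logs above. One possible reading, which is an assumption and not something confirmed here, is that the connection path to the instance still accepts connections while nothing behind it answers.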

Expected behaviour

The instance remains idle or busy.

dstack version

master

Labels

bug, major
