Add healthcheck Agent Lifecycle Hook that runs between jobs #1111

Open
@dbaggerman

buildkite-agent has a disconnect-after-idle-timeout setting which causes it to shut down after an idle period.

It would be nice to have an option to disconnect and shut down the agent when a healthcheck fails. We run an elastic, autoscaled pool of agents, and shutting down an agent that has a problem would allow the autoscaling group to replace it.

Currently, when an agent's disk fills up, for example, every job allocated to it fails, but the agent keeps accepting jobs regardless. Recovery often requires manual intervention to kill the instance so the pool can replace it. If the agent stopped accepting jobs and shut down when available disk space was low, it could be replaced with a clean agent without manual intervention.
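
To make the request concrete, here is a rough sketch of the kind of disk-space check such a healthcheck hook could run between jobs. This is only an illustration: the hook itself does not exist yet, and the function names, the 10% threshold, and the "non-zero result means unhealthy" convention are all assumptions rather than anything buildkite-agent defines today.

```python
import shutil

# Hypothetical threshold: treat the agent as unhealthy below 10% free disk.
MIN_FREE_FRACTION = 0.10


def has_enough_space(free_bytes: int, total_bytes: int,
                     min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Return True when the free share of the disk is at or above the threshold."""
    if total_bytes <= 0:
        # A zero-sized or unreadable filesystem is treated as unhealthy.
        return False
    return free_bytes / total_bytes >= min_free_fraction


def check_disk(path: str = "/",
               min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Check the filesystem that contains `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return has_enough_space(usage.free, usage.total, min_free_fraction)


def main() -> int:
    # Assumed convention: 0 means healthy, 1 means the (hypothetical) hook
    # should make the agent stop accepting jobs and shut down.
    return 0 if check_disk() else 1
```

A script wrapping this would end with `sys.exit(main())`. The idea is that the agent would run it after each job and, on a non-zero result, disconnect so the autoscaling group can replace the instance.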

Labels: agent health (relating to whether the agent is or should pull additional work), hook
