q5sys commented Jan 11, 2026

This lets a user fix job issues that ai-toolkit cannot stop or kill.

If the system crashes or there's a power outage, ai-toolkit seems unable to recover: it thinks jobs are still active and processing, and it gets stuck because it cannot stop them so that training can continue. This is a simple external Python script to manage jobs in the database, since the application itself cannot do so.

Usage Example:

[q5@apollo ai-toolkit]$ python ./jobs.py --help
Usage:
  python jobs.py                 → list all jobs
  python jobs.py --hung          → list only stuck/running jobs
  python jobs.py --stop <uuid>   → mark a job as stopping
  python jobs.py --kill <uuid>   → forcibly complete a job
  python jobs.py --delete <uuid> → remove a job from the database
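
For readers wondering what the flags above boil down to, here is a minimal sketch of the underlying database operations. It is not the code from this PR. It assumes a SQLite database with a hypothetical `Job` table (`id`, `name`, `status`) and hypothetical status values such as `running`, `stopping`, and `completed`; the real ai-toolkit schema and status names may differ.

```python
import sqlite3

# Hypothetical status values treated as "hung"; the real application
# may use different names.
HUNG_STATUSES = ("running", "stopping")


def list_jobs(conn, hung_only=False):
    """Return (id, name, status) rows, optionally only stuck/running jobs."""
    sql = "SELECT id, name, status FROM Job"
    params = ()
    if hung_only:
        placeholders = ", ".join("?" for _ in HUNG_STATUSES)
        sql += f" WHERE status IN ({placeholders})"
        params = HUNG_STATUSES
    return conn.execute(sql + " ORDER BY id", params).fetchall()


def set_status(conn, job_id, status):
    """Set a job's status; return True if a job with that id existed."""
    cur = conn.execute(
        "UPDATE Job SET status = ? WHERE id = ?", (status, job_id)
    )
    conn.commit()
    return cur.rowcount == 1


def delete_job(conn, job_id):
    """Remove a job row; return True if a job with that id existed."""
    cur = conn.execute("DELETE FROM Job WHERE id = ?", (job_id,))
    conn.commit()
    return cur.rowcount == 1


def demo():
    """Build an in-memory database with the assumed (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Job (id TEXT PRIMARY KEY, name TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO Job VALUES (?, ?, ?)",
        [("a1", "lora-test", "running"), ("b2", "finished-run", "completed")],
    )
    conn.commit()
    return conn
```

Under these assumptions, `--hung` would correspond to `list_jobs(conn, hung_only=True)`, `--stop <uuid>` to `set_status(conn, uuid, "stopping")`, `--kill <uuid>` to `set_status(conn, uuid, "completed")`, and `--delete <uuid>` to `delete_job(conn, uuid)`. Back up the database file before trying anything like this against a real installation.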