Description
The Problem
Providers mostly don't report the errors they encounter back up to the core system. This means that if something fails, it can cause an indefinite "hang". In my case, the hang leaves the trigger marked as started when it hasn't actually been started. In the docker provider, for example, errors are caught and simply ignored (example).
I'm not familiar with the codebase, so please let me know if there are any mistakes. I spent some time following the code through the providers and found a few examples like the following, where a request to deploy the trigger is effectively just sent but never followed up on.
I'm guessing that the provider needs some way to return an error code, which core can then "bubble up" by changing the deployment status. I didn't want to attempt any changes without creating an issue first, as I'm missing a lot of context. I'm also not sure how PRs such as #1470 interact with this issue.
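To make the suggestion concrete, here is a minimal sketch of the idea. It is not based on the actual codebase: every name in it (`ProviderResult`, `Run`, `createTask`, `applyProviderResult`, the status values) is hypothetical, and it is synchronous only to keep the example short, whereas a real provider call would presumably be async.

```typescript
// Hypothetical types: none of these names come from the real codebase.
type RunStatus = "PENDING" | "EXECUTING" | "FAILED";

type ProviderResult =
  | { ok: true }
  | { ok: false; error: string };

interface Run {
  id: string;
  status: RunStatus;
  error?: string;
}

// Provider side: report the failure instead of catching and ignoring it.
// `startContainer` stands in for whatever actually launches the task.
function createTask(startContainer: () => void): ProviderResult {
  try {
    startContainer();
    return { ok: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Core side: only mark the run as started once the provider confirms it,
// and record the error otherwise so the run does not stay "running".
function applyProviderResult(run: Run, result: ProviderResult): Run {
  if (result.ok) {
    return { ...run, status: "EXECUTING" };
  }
  return { ...run, status: "FAILED", error: result.error };
}

// Usage: a provider that cannot pull the image yields a FAILED run.
const pending: Run = { id: "run_1", status: "PENDING" };
const outcome = applyProviderResult(
  pending,
  createTask(() => {
    throw new Error("image pull failed");
  })
);
console.log(outcome.status, outcome.error);
```

The point is only that the provider returns something core can act on. Whether that is a result object, a thrown error, or a status message over whatever channel the two already use is a design decision I can't judge without more context.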
Example Reproduction
I originally reported this in #1476, but moved it here when I realised my issue was a symptom of the wider problem described above. I encountered this while setting up authentication for my self-hosted docker registry. Trigger would try to deploy a task, and the docker provider would fail to run it because it couldn't download the image. This would cause trigger to hang indefinitely, as it was unaware that the docker provider had failed to run the task.
During my testing last night I added a scheduled task that runs every 20 minutes. I then forgot about it and was messing around with some other things in trigger. After some sleep, I came back to it and noticed a long list of "running" scheduled tasks. Upon further investigation, I found that before going to sleep I had made an incomplete deployment, which left a docker image missing from the registry. This led to the same situation: trigger tries to deploy the image and fails, but trigger has no idea. The result was the long list of running tasks, the longest of which had been hanging for ~14 hours.