Description
opened on May 2, 2020
addresses #1053
addresses #1470
An issue with the current build environment is that we often assume everyone can write a perfect Dockerfile from scratch without any mistakes. In the real world, writing a complex Dockerfile involves a lot of trial and error. Users get errors, need to understand what is causing them, and react accordingly.
In the legacy builder, one way to deal with this was to use `--rm=false`, or to look up the image ID of the last image layer from the build output and start a `docker run` session with it to understand what went wrong. BuildKit does not create intermediate images, nor does it make the containers it runs visible to `docker run` (both for very good reasons). Therefore this is even more complicated now, and usually requires the user to set `--target` to do a partial build and then debug its output.
To improve this, we shouldn't try to bring back `--rm=false`: it makes all builds significantly slower and makes it impossible to manage storage for the build cache. Instead, we could provide a better solution with a new `--debugger` flag.
Passing `--debugger` to a build will, should that build fail, take the user into a debugger shell similar to the interactive `docker run` experience. There the user can see the error and use control commands to debug the actual cause.
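A session with the proposed flag might look roughly like the sketch below. Note that the flag, the error message, and the debugger prompt are all part of this proposal, not an existing interface:

```
$ docker build --debugger .
...
ERROR: process "/bin/sh -c make" did not complete successfully: exit code 2
entering debug shell for failed step; use control commands to inspect or reset state
/app #
```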
If the error happened on a `RUN` command (`ExecOp` in LLB), the user can use the shell to rerun the command and keep tweaking it. This happens in an environment identical to the one where the `ExecOp` runs; for example, this means access to secrets, SSH, cache mounts, etc. They can also inspect the environment variables and files in the system that might be causing the issue. Using control commands, a user can switch between the broken state left behind by the failed command and the initial base state for that command. So if they try many possible fixes but end up in a bad state, they can simply restore the initial state and start again.
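The "switch between broken state and initial base state" behaviour could be sketched as below. In the real implementation these would be snapshotter references managed by BuildKit; here a simple path→content map stands in for a filesystem state, and all type and method names are hypothetical:

```go
package main

import "fmt"

// snapshot models a filesystem state as a path→content map.
// This is a stand-in for a real snapshotter reference.
type snapshot map[string]string

func clone(s snapshot) snapshot {
	c := snapshot{}
	for k, v := range s {
		c[k] = v
	}
	return c
}

// debugSession keeps both states the user can switch between.
type debugSession struct {
	initial snapshot // state before the failed RUN command
	current snapshot // state left behind by the failed command
}

// reset discards the broken state and restores the initial one.
func (d *debugSession) reset() { d.current = clone(d.initial) }

func main() {
	base := snapshot{"/etc/profile": "ok"}
	sess := &debugSession{initial: clone(base), current: clone(base)}

	// The failed command (and the user's experiments) mutate current.
	sess.current["/tmp/broken"] = "partial output"

	sess.reset() // back to the pre-command state
	_, exists := sess.current["/tmp/broken"]
	fmt.Println("leftover file exists after reset:", exists) // false
}
```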
If the error happened on a copy (or another file operation like `rm`), they can run `ls` and similar tools to find out why the file path is incorrect or not working.
For implementation, this depends on #749 for support to run processes on build mounts directly without going through the solver. We would first modify `Executor` and `ExecOp` so that, instead of releasing the mounts after an error, they return them together with the error. I believe the typed-errors support from #1454 can be reused for this. The mounts should be returned up to the client `Solve` method, which can then decide to call `llb.Exec` with them. If the mounts are left unhandled, they are released with the gateway API release.
Once debugging has completed and the user has made changes to the source files, it is easy to trigger a restart of the build with exactly the same settings. This is also useful if you think you might be hitting a temporary error. If the retry doesn't fix it, the user is brought back to the debugger.
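The restart behaviour is essentially a retry loop around the solve: rebuild with identical settings, and on each failure drop back into the debugger. A minimal sketch, with `build` standing in for re-running the solve (both function names are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// build stands in for re-running the solve with identical settings.
// Here it simulates a temporary error that clears on the second attempt.
func build(attempt int) error {
	if attempt < 2 {
		return errors.New("temporary registry error")
	}
	return nil
}

// debugLoop retries the build; after each failure the user would land
// back in the debugger and could choose to retry or give up.
func debugLoop(maxRetries int) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = build(attempt); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v (dropping into debugger)\n", attempt, err)
	}
	return err
}

func main() {
	if err := debugLoop(3); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("build succeeded")
}
```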
It might make sense to introduce the concept of a "debugger image" that is used as the basis of the debugging environment. This would avoid hardcoding logic in an opinionated area.
Later this could be extended with a step-based debugger, and source-mapping support could be used to make source-code changes directly in the editor or to track dependencies in the build graph.