Whenever you’re building Docker images, say, you want to bake your Java/Node/Python application into one, you’ll be confronted with the following two questions:
-
How can I make the
docker build
command run as fast as possible? -
How can I make sure that the resulting Docker image is as small as possible?
You will want to continue reading for answers to these questions.
Take a look at the following Dockerfile
.
FROM eclipse-temurin:17-jdk
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
By running docker build -t myapp .
on this Dockerfile, you will get (one) Docker image, which will be based on a Java 17 (Eclipse-Temurin) image, as well as contain and run our Java application (the app.jar file).
What might not immediately be obvious, is that every single line from your Docker line, will result in the creation of one Docker image layer - every image consists of several such layers.
You can confirm this by running e.g.:
docker image history myapp
Which will return the image layers on new lines:
IMAGE CREATED CREATED BY SIZE COMMENT
3ca5a60826f0 8 minutes ago ENTRYPOINT ["java" "-jar" "/app.jar"] 0B buildkit.dockerfile.v0
<missing> 8 minutes ago COPY build/libs/*.jar app.jar # buildkit 19.7MB buildkit.dockerfile.v0
<missing> 8 minutes ago ARG JAR_FILE=build/libs/*.jar 0B buildkit.dockerfile.v0
... (other layers from the base image left out)
There is a layer for our ENTRYPOINT
line, one for COPY
and one for ARG
.
The layer containing our app.jar
file (COPY
) is roughly 20MB large, with 0B metadata layers for the ENTRYPOINT
and ARG
lines.
Now, what do we do with this information?
Imagine you want to install a package through your package manager, and for that, you want to run apt update
, which updates the package manager’s index.
FROM eclipse-temurin:17-jdk
RUN apt update -y
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
Let’s have a look at the resulting layers (docker image history myapp
) and focus on the very last line (RUN /bin/sh -c…
):
IMAGE CREATED CREATED BY SIZE COMMENT
c14a18a04751 8 seconds ago ENTRYPOINT ["java" "-jar" "/app.jar"] 0B buildkit.dockerfile.v0
<missing> 8 seconds ago COPY build/libs/*.jar app.jar # buildkit 19.7MB buildkit.dockerfile.v0
<missing> 8 seconds ago ARG JAR_FILE=build/libs/*.jar 0B buildkit.dockerfile.v0
<missing> 8 seconds ago RUN /bin/sh -c apt update -y # buildkit 45.7MB buildkit.dockerfile.v0
Wooha! Running apt-update
has added a new layer with a whooping 45.7MB to our resulting Docker image. Now every time you push or pull your image, you’ll need to transfer those additional megabytes.
Let’s continue with the example above and add a couple more run commands, to install the latest mysql package.
FROM eclipse-temurin:17-jdk
RUN apt update -y
RUN apt install mysql -y
RUN rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
In addition, we’re removing the apt index cache (the 45.7MB from above) with the rm -rf /var/lib/apt/lists/*
command. Let’s see what our image history now looks like:
59f82a5b4c5a 6 seconds ago ENTRYPOINT ["java" "-jar" "/app.jar"] 0B buildkit.dockerfile.v0
<missing> 6 seconds ago COPY build/libs/*.jar app.jar # buildkit 19.7MB buildkit.dockerfile.v0
<missing> 6 seconds ago ARG JAR_FILE=build/libs/*.jar 0B buildkit.dockerfile.v0
<missing> 6 seconds ago RUN /bin/sh -c rm -rf /var/lib/apt/lists/* #… 0B buildkit.dockerfile.v0
<missing> 7 seconds ago RUN /bin/sh -c apt install -y mysql-server #… 605MB buildkit.dockerfile.v0
<missing> 8 minutes ago RUN /bin/sh -c apt update -y # buildkit 45.7MB buildkit.dockerfile.v0
Waah, what’s that? Even though we deleted the apt cache files, the 45.7MB layer is still there (in addition to the 605MB MySQL layer, btw).
That’s because layers are strictly additive / immutable. You can surely delete those files from your current layer, but the older/previous layers will still contain them.
How can you get around this? A simple workaround would be to run all three RUN
commands on a single line (== a single resulting layer)
FROM eclipse-temurin:17-jdk
RUN apt update -y && \
apt install -y mysql-server && \
rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
Let’s look at the image’s history now:
IMAGE CREATED CREATED BY SIZE COMMENT
4b8c0f7f895a 14 seconds ago ENTRYPOINT ["java" "-jar" "/app.jar"] 0B buildkit.dockerfile.v0
<missing> 14 seconds ago COPY build/libs/*.jar app.jar # buildkit 19.7MB buildkit.dockerfile.v0
<missing> 14 seconds ago ARG JAR_FILE=build/libs/*.jar 0B buildkit.dockerfile.v0
<missing> 14 seconds ago RUN /bin/sh -c apt update -y && apt ins… 605MB buildkit.dockerfile.v0
Ha! We at least saved the 45.7MB for now. What else is wrong with this, though?
You ideally want your builds to be reproducible (who would have thought). By running apt update
and then installing whatever latest package there is in the repo, you effectively break that reproducibility, because package versions might change between builds.
The gist:
-
Install only specific versions of whatever you are trying to install
-
Avoid (package-manager-of-your-choice)'ing in your Dockerfiles for your application in the first place - instead, build a new base image and use that in your Dockerfile’s
FROM
. This will also be a lot faster!
You’ll want to make sure to put layers that change a lot towards the bottom of your Dockerfile
, whereas more stable layers should be ordered on top.
Why? Because when building images, you’ll need to rebuild every layer starting from the layer(s) that changed between builds.
A practical example: Imagine that you want to package an index.html
file into your image, which changes a lot, i.e. more often than anything else.
FROM eclipse-temurin:17-jdk
COPY index.html index.html
RUN apt update -y && \
apt install -y mysql-server && \
rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
You can see the COPY index.html index.html
line added almost at the top of the Dockerfile
. Now, every time the index.html file changes, you’ll need to rebuild all subsequent layers, i.e. the _RUN apt-update, ARG & COPY app.jar
layers - a huge time sink. On my machine, all of the above takes roughly 17 seconds to finish.
If, however, you re-order the statement towards the bottom, Docker can re-use all previous layers, as they haven’t changed.
FROM eclipse-temurin:17-jdk
RUN apt update -y && \
apt install -y mysql-server && \
rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
COPY index.html index.html
ENTRYPOINT ["java","-jar","/app.jar"]
Now a new docker build
only takes, 0.5 seconds (on my machine), much much better!
Here are the golden layering rules:
-
Files that rarely change or are time/network-intensive (e.g. installing new software) → Top
-
Files that change often (e.g. source code) → Very Low
-
ENV, CMD, etc → Bottom
Docker doesn’t always rebuild all image layers, whenever you run docker build
. There is a specific set of rules,on when and how Docker will cache your layers and you can read about them in the official documentation.
The gist is, whenever you run Docker build, Docker will:
-
Either check the commands in the Dockerfile for changes (e.g. did you change
RUN blah
toRUN doh
). -
Did any of the involved files (or rather their checksums), in the case of
ADD
orCOPY
, change?
When you run docker build -t <tag> .
, the .
, your current directory, will actually be your so-called build context
. Meaning all the files inside your current directory will be tar’ed up and sent to your local or remote Docker daemon to perform the build.
If you want to make sure that some directories never make it to your build daemon, thus keeping things snappy and small, you can create a .dockerignore
file, which has a similar syntax to .gitignore
.
In general, you should put any files/directories that are not relevant to your build here (e.g. your .git folder
), which is especially important when using commands like COPY . /somewhere
, because then your entire project will end up in the resulting image.
An npm example: You might want to run e.g. npm install
during build time and let it download its dependencies, instead of (slowly) copying your node_modules
folder in, so that would also make a good candidate for the dockerignore file. However, if you do that, here’s another trick you’d want to know about: directory caching.
Say you run npm install
, pip install
gradlew build
etc. to build your image. This will lead to dependencies being downloaded and a new image layer being created. Now, if that image layer has to be rebuilt, all dependencies will be re-downloaded on the next build, because there won’t be a .npm
, .cache
or .gradle
folder available with the already downloaded dependencies.
But you can change that! Let’s take pip
as an example and change the following line:
FROM ...
RUN pip install -r requirements.txt
CMD ...
to:
RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
This will tell Docker to mount a caching layer/folder (/root/.cache
) into the container during build time - in this case, the folder that pip caches its dependencies in, for the root user. The trick is: this folder will not end up in the resulting image, but/and will be available to pip in all subsequent builds - and you’ll get a nice speed up!
The same goes for NPM, Gradle, or any other package manager out there. Just make sure to specify the correct target folder.
This article should have given you a good grasp of Docker image fundamentals. If you have any questions or other comments, please post them in the comment section below.
If you’d like to see this article as a video instead, have a look here:
mb_youtube::JcGwgNMZc_E[]