We have a fairly standard Jenkins setup, a primary node and a whole bunch of secondaries that most of the job steps get farmed out to.
These secondary nodes sit doing nothing overnight, so we thought it would be a good idea to scale them down in the evening and back up in the morning, and use the money we've saved to run more secondaries during the day when they're in demand.
We run various jobs on our Jenkins infrastructure, but the most common job builds and tests our monolith on every push to a PR and merge into master.
This basically involves doing a git checkout of the branch (into existing workspaces on the primary node, so we don't have to do a full initial checkout), building a Docker image that contains all of the dependencies and the version of the codebase at the time the job was run, and then running the tests inside the resulting container (on any of our secondary nodes so we can run lots of jobs in parallel; files are transferred between nodes using Jenkins stashes).
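Collapsed onto a single machine for illustration, one run of that job does roughly the following (the image tag and test entrypoint here are placeholders, not our actual pipeline code):

# Rough outline of one job run, names are placeholders
git checkout "$GIT_COMMIT"                       # into an existing workspace on the primary node
docker build -t monolith:"$GIT_COMMIT" .         # relies on the image cache to stay fast
docker run --rm monolith:"$GIT_COMMIT" ./run_tests.sh   # runs on whichever secondary picked up the job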
Building this Docker image from scratch takes around 10 minutes, and we rely on the Docker image cache on these nodes to reduce our build times by placing the step which normally invalidates the cache (copying in the updated codebase) at the end of the Dockerfile.
The Dockerfile looks something like:
FROM ubuntu:14.04
# Things like apt-get installs, these take a decent amount of time
<some bash commands>

# These live in our main codebase alongside the code and basically never change
<copy in some config files>

# There's a fair few of these, they take a while but rarely change
<copy some requirements.txt files>
<some pip installs>

# This is where we expect the cache to be invalidated, and copying in the files doesn't take long
<copy in the version of the codebase that's being built>
So if we want to autoscale our secondary nodes we need to make sure that they have the image cache before jobs start running.
Attempt #1 was to simply pull down a copy of the image from our registry in the CloudFormation userdata when the node comes up.
This failed because Docker won't use images pulled from a registry as part of the build cache.
This is considered a feature — https://github.com/moby/moby/issues/20316
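For context, the userdata addition for attempt #1 was essentially a single pull, along these lines (registry and image names are placeholders):

#!/bin/bash
# CloudFormation userdata sketch for attempt #1, registry/image names are placeholders
docker pull registry.example.com/monolith:latest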
Attempt #2 was to replace pulling the Docker image with cloning our repository into /tmp and building the Docker image from that version of the codebase when the node first comes up. This would give us a copy of the Docker image built from a recent version of the codebase, along with the build cache we needed to speed up subsequent builds.
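The userdata for attempt #2 looked roughly like this sketch (the repository URL, path and tag are placeholders, not our real ones):

#!/bin/bash
# Userdata sketch for attempt #2: clone a recent copy of the codebase and build once,
# so the layer cache is already warm when the first Jenkins job lands on this node.
# Repo URL, path and tag are placeholders.
git clone git@github.com:example/monolith.git /tmp/monolith
cd /tmp/monolith
docker build -t monolith:warm-cache .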
This failed: during the first run of the Jenkins job the cache was invalidated at the first COPY step, when we copy over the config files, even though their contents hadn't changed.
This was slightly confusing, because I thought that Docker only invalidated the cache at a COPY line if the file had changed, and it hadn't. Some digging around revealed that Docker also considers the mtime of a file when deciding whether it has changed or not:
https://github.com/moby/moby/issues/4351#issuecomment-76222745
So what was happening was that the git checkout of the codebase in the Jenkins job created files with different mtimes to those from our checkout of the codebase into /tmp at node creation time, so Docker considered the config files we copy in early on to have changed and invalidated the cache earlier than we wanted.
We didn't see this issue before because we do the git checkout part of the job on our primary node, into workspaces which are never deleted, so after the initial checkout these files are never touched and Docker never considers the early COPY steps to have been invalidated.
We can replicate this behaviour with a simple example:
We have a project with two files, a Dockerfile and a text file:
vagrant@vagrant-ubuntu-trusty-64:~$ ls
Dockerfile  test
The Dockerfile just copies in our file
vagrant@vagrant-ubuntu-trusty-64:~$ cat Dockerfile
from ubuntu:16.04
COPY test /srv/test
And let's take a note of the mtime of our test file for later on:
vagrant@vagrant-ubuntu-trusty-64:~$ stat test
  File: ‘test’
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d Inode: 140134     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2017-12-31 18:03:33.995881878 +0000
Modify: 2017-12-31 18:03:33.995881878 +0000
Change: 2017-12-31 18:03:33.995881878 +0000
 Birth: -
Finally, if we build it:
vagrant@vagrant-ubuntu-trusty-64:~$ sudo docker build .
Sending build context to Docker daemon 13.82 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
16.04: Pulling from ubuntu
Step 1 : COPY test /srv/test
 ---> 3be8d8c094bf
We don’t have that step cached and so Docker just runs it.
If we modify test
vagrant@vagrant-ubuntu-trusty-64:~$ echo 'hello' > test
vagrant@vagrant-ubuntu-trusty-64:~$ sudo docker build .
Sending build context to Docker daemon 14.34 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
 ---> c1ea3b5d13dd
Step 1 : COPY test /srv/test
 ---> 75bb69e528c3
Removing intermediate container 6771d1e1f191
Successfully built 75bb69e528c3
Docker picks up the change and invalidates the cache layer; nothing surprising here.
Now what if test came from a git repo?
I bundled these two files into a git repo, pushed them up, then pulled the repo down into a separate folder.
vagrant@vagrant-ubuntu-trusty-64:~$ mkdir git
vagrant@vagrant-ubuntu-trusty-64:~$ cd git/
vagrant@vagrant-ubuntu-trusty-64:~/git$ git pull git@github.com:AaronKalair/test.git
And does Docker use the cache we created with the earlier docker build for the file?
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ sudo docker build .
Sending build context to Docker daemon 47.62 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
 ---> c1ea3b5d13dd
Step 1 : COPY test /srv/test
 ---> f4fe0a287838
Removing intermediate container 20c77a3dee81
Successfully built f4fe0a287838
Nope!
Are the files identical?
The one from the git clone
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ md5sum test
a10edbbb8f28f8e98ee6b649ea2556f4  test
The original
vagrant@vagrant-ubuntu-trusty-64:~$ md5sum test
a10edbbb8f28f8e98ee6b649ea2556f4  test
Yep, they're identical.
What about the mtime?
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test
  File: ‘test’
  Size: 7          Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d Inode: 262341     Links: 1
Access: (0775/-rwxrwxr-x)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2017-12-31 19:37:05.348272796 +0000
Modify: 2017-12-31 19:37:05.348272796 +0000
Change: 2017-12-31 19:37:05.348272796 +0000
 Birth: -
Nope, it's changed (it was 18:03 originally); the git pull sets the modified, access and change times to the time of the clone.
What about existing files? Does a git pull affect those?
If we add another file to the project and push it up …
vagrant@vagrant-ubuntu-trusty-64:~$ touch test_test
vagrant@vagrant-ubuntu-trusty-64:~$ git add test_test
vagrant@vagrant-ubuntu-trusty-64:~$ git commit
vagrant@vagrant-ubuntu-trusty-64:~$ git push origin master
and then pull it back down into the other git checkout …
vagrant@vagrant-ubuntu-trusty-64:~$ cd git/test/
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ git pull
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test
  File: ‘test’
  Size: 7          Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d Inode: 262341     Links: 1
Access: (0775/-rwxrwxr-x)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2017-12-31 19:40:05.032278718 +0000
Modify: 2017-12-31 19:37:05.348272796 +0000
Change: 2017-12-31 19:37:05.348272796 +0000
 Birth: -
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test_test
  File: ‘test_test’
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d Inode: 262359     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2017-12-31 20:36:22.916378925 +0000
Modify: 2017-12-31 20:36:22.916378925 +0000
Change: 2017-12-31 20:36:22.916378925 +0000
 Birth: -
The existing test file's mtime is untouched, and the new file appears with all of its timestamps set to the time of the git pull.
OK, so that explains the behaviour we're seeing and why we've never had this problem before, but we still can't autoscale our secondary Jenkins nodes without fixing the caching issue.
So attempt #3, which turned out to be successful, was to create a base image that we use in the FROM line at the start of the main Dockerfile, which has already run the expensive steps, and have that pulled down to the nodes when they spin up.
This effectively replicates having the expensive steps in the build cache.
So we have a nightly build from a Dockerfile that looks like:
# NIGHTLY IMAGE
FROM ubuntu:14.04
<some bash commands>
<copy in some config files>
<copy some requirements.txt files>
<some pip installs>
With our Jenkins job now using a Dockerfile that looks like:
FROM NIGHTLY_IMAGE
<copy in some config files>
<copy some requirements.txt files>
<some pip installs>
<copy in the version of the codebase that's being built>
Now all the expensive commands that we wanted cached (the initial bash commands and pip installs) are in the nightly image, which we pull to an instance when it's created and which Jenkins uses in the FROM line.
The repeated pip installs in the Dockerfile Jenkins uses for the job ensure that if the requirements change before the next nightly build, those changes are reflected in the built image.
If the nightly image build used versions of the files with different mtimes, it doesn't matter, as a pip install with the requirements already satisfied only takes a few seconds.
The nightly image is built by a Jenkins job every night at 2am, and cron jobs on the secondaries, as well as the autoscaling operations, ensure they have the latest nightly image every day.
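The distribution side is just a pull of the latest nightly tag. A minimal sketch of the cron entry on a secondary node, assuming a placeholder registry, tag and schedule, looks like:

# /etc/cron.d/pull-nightly sketch on a secondary node, registry/tag/time are placeholders
0 4 * * * root docker pull registry.example.com/monolith-nightly:latest

Nodes created by the autoscaling do the same pull in their userdata before they start accepting jobs.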
And there we have it, we now spend less money and have more Jenkins capacity during the day when we need it!
Follow me on Twitter @AaronKalair