This tool is designed to store files produced during a CI build to an S3 bucket, so that they can be used by subsequent builds. Although it is tailored specifically for stack, it is by no means limited to Haskell or stack users.
cache-s3 is not simply a wrapper around a bunch of other tools; it is written entirely in Haskell, which comes with the great benefit of being cross-platform. Executables for common operating systems can be downloaded from the GitHub releases page.
CI providers already have some form of caching capability in place, so a natural question comes to mind: why pay AWS for storage on S3 when we already get caching for free from Travis, AppVeyor, CircleCI, etc.? Here are the limitations of CI providers that are addressed by cache-s3:
- stack awareness. None of the providers have support for stack, which can only be worked around with complicated scripts that figure out which paths need caching and move copious amounts of files around so that they can be saved and restored properly.
- Cache size limit. Some providers limit the amount of data that can be retained between builds, while S3 is limited only by the cash in your pockets.
- Cache sharing. Most providers do not let you use the cache from builds created on another branch.
- Access to cache. For providers like Travis that do allow reading the cache created for master, it can be read even by forked repositories during pull requests, which has the potential of leaking sensitive data. With cache-s3 you have full access control by means of S3 bucket IAM policies. Despite this, I would advise not to store any private data in the cache; there are better tools out there for managing sensitive information.
There are, of course, a couple of downsides to this approach:
- Usage ain't free, gotta pay Amazon for S3.
- Saving and restoring cache will likely be slightly slower than CI provider's native solution, since data has to move over the Internet.
There is an implicit assumption in this document that the user knows how to configure communication with AWS from the command line (credentials, roles, accounts, regions, etc.), same as with aws-cli for example. There are plenty of guides online on how to get this set up.
In order for the tool to work, an S3 bucket must already be set up on AWS. I would recommend setting up a dedicated S3 bucket to be used exclusively for caching CI data, thus promoting data isolation. The bucket should also be configured to expire older files; this way cache stored for ephemeral branches will be discarded, avoiding unnecessary storage costs. Creating a separate user that has full access only to that bucket is also a must. The easiest way to get all of this done is with the help of terraform:
Most of the boilerplate has been taken care of by the reusable terraform module ci-cache-s3. Although not strictly required, I would recommend setting up a keybase.io account; a regular PGP key will also do just fine. All that is necessary is creating a main.tf file with this content:
module "ci-cache" {
source = "github.com/fpco/fpco-terraform-aws//tf-modules/ci-cache-s3"
prefix = "my-cache-" # <-- make sure to set this to a custom value.
pgp_key = "keybase:user_name" # <-- or a base64 encoded PGP public key
}
output "bucket_name" {
value = "${module.ci-cache.bucket_name}"
}
output "access_key" {
value = "${module.ci-cache.access_key}"
}
output "secret_key" {
value = "${module.ci-cache.secret_key}"
}
Then simply running the commands below will set up the S3 bucket and an IAM user with permissions to access it:
$ terraform init
$ terraform plan
$ terraform apply
After you apply terraform it will deploy all of the resources and print out the bucket name, the access_key and an encrypted version of the secret_key. In order to get the clear text version of it, run:
terraform output secret_key | base64 --decode | keybase pgp decrypt
You can inspect the ci-cache-s3/variables.tf file for a few extra variables that can be customized in the module.
It is recommended to also set up remote state for terraform, so it can be shared with all of your co-workers, but that's a totally separate discussion.
Read more on terraform if you'd like to avoid manual work in getting everything set up: terraform.io
For every released version of cache-s3 there will be an executable uploaded to GitHub for Windows, Linux and Mac OS.
The Linux binary is built on Ubuntu, but might work on other distributions. In order for it to work, though, gmp might need to be installed, which can be quickly done on Ubuntu with apt-get install libgmp-dev.
Here are some examples of how to get cache-s3 into your CI environment:
- Linux and Mac
CACHE_S3_VERSION="v0.1.5"
OS_NAME=linux # can be set by CI, eg `TRAVIS_OS_NAME`
curl -f -L https://github.com/fpco/cache-s3/releases/download/$CACHE_S3_VERSION/cache-s3-$CACHE_S3_VERSION-$OS_NAME-x86_64.tar.gz | tar xz -C ~/.local/bin 'cache-s3'
- On Windows in PowerShell
$env:CACHE_S3_VERSION="v0.1.5"
$env:OS_NAME="windows"
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest https://github.com/fpco/cache-s3/releases/download/$env:CACHE_S3_VERSION/cache-s3-$env:CACHE_S3_VERSION-windows-x86_64.zip -OutFile cache-s3.zip
Expand-Archive cache-s3.zip -Destination .
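For the Linux and Mac example above, make sure the destination directory exists and is on the PATH before extracting. A minimal sketch, assuming ~/.local/bin as the install location used in the curl command:
$ mkdir -p ~/.local/bin
$ export PATH="$HOME/.local/bin:$PATH"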
Every invocation of cache-s3 requires the S3 bucket name and AWS credentials to be present in the environment. Run cache-s3 --help to get more details on that. The most common way of supplying arguments to tools on CI is through environment variables. Here is the list of variables that are understood by the tool:
- S3_BUCKET - where to store the cache (-b, --bucket cli argument)
- AWS_ACCESS_KEY_ID - access key
- AWS_SECRET_ACCESS_KEY - secret key
- AWS_REGION - region where the bucket is (-r, --region cli argument)
- GIT_DIR - used only for inferring current git branch (--git-dir cli argument)
- GIT_BRANCH - used for namespacing the cache (--git-branch cli argument)
Stack specific ones:
- STACK_ROOT - global stack folder (--stack-root cli argument)
- STACK_YAML - path to project configuration file (--stack-yaml cli argument)
- STACK_WORK - used to rename the .stack-work directory (--stack-work cli argument)
Further examples will assume all of the AWS related variables are set.
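For example, on Linux or Mac these could simply be exported before invoking cache-s3. The values below are placeholders; real secrets should come from your CI provider's encrypted settings:
$ export AWS_ACCESS_KEY_ID="<access key>"
$ export AWS_SECRET_ACCESS_KEY="<secret key>"
$ export AWS_REGION="us-east-1"
$ export S3_BUCKET="my-ci-cache-bucket"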
Important: If the same bucket is being used for many projects, make sure to set the --prefix argument in order to place each of them in their own namespace and avoid cache clashes.
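For instance, two projects sharing the same bucket could be kept apart like this (project names and cached paths here are just an illustration):
$ cache-s3 -b shared-ci-cache --prefix project-a save -p ~/.npm
$ cache-s3 -b shared-ci-cache --prefix project-b save -p ~/.npm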
At the end of the CI build supply all of the relative or absolute paths to directories and/or individual files as arguments. Directories will be traversed and cached recursively:
$ cache-s3 save -p ~/.npm -p ~/.cabal
At the beginning of the build all of the files can be restored from cache simply by running:
$ cache-s3 restore --base-branch=master
Specifying the base branch will let files be restored from another branch, like master in the example above, if the current branch doesn't yet have a cache of its own.
Files and directories are restored to the exact same places on the file system they were in before. Attributes, permissions and modification times are preserved. Symlinks are not followed, so they are cached and restored as symlinks. On Windows they are ignored completely!
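Putting the two commands together, a typical CI job might look roughly like this (the cached paths are placeholders):
$ cache-s3 restore --base-branch=master
# ... run the actual build here ...
$ cache-s3 save -p ~/.npm -p ~/.cabal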
For those that do not know, stack is a comprehensive tool used for developing Haskell programs. In order to avoid rebuilding any of the stack projects every time, we need to cache these two locations:
- Global stack root directory, usually located in ~/.stack. This is used for keeping all sorts of files that can be reused by all projects developed by a single user.
- Folder with programs, such as the GHC compiler, which is usually nested inside the stack global directory, but can sometimes reside in a separate folder, for example on Windows.
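If you are unsure where these directories live on a particular machine, stack itself should be able to report them (flag names below assume a reasonably recent stack version):
$ stack path --stack-root
$ stack path --programs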
Below is the command that can be used to cache the mentioned locations, but make sure you call it from within your project directory, or at least supply --stack-yaml or --resolver for the project. This way cache-s3 will invalidate the cache if you later decide to change the Stackage resolver for your project.
$ cache-s3 save stack
Saving stack artifacts for a particular project is done in a separate step, and this is so by design. Global stack folders rarely change, namely only when there is a change to project dependencies or a different resolver is being used. Local .stack-work folder(s), on the other hand, do change frequently while the project is under active development. Here is how to cache your project:
$ cache-s3 save stack work
This will cache your .stack-work directory, or all of them, if your project consists of many packages.
Restoring the stack cache is just as easy as the regular one:
$ cache-s3 restore stack
and
$ cache-s3 restore stack work
Setups that keep updating the same cache for an extended period of time can run into a problem of long save/restore times due to the constantly increasing size of the cache. There are two non-mutually exclusive solutions to this issue:
- Clear out cache that is older than a specified lifespan with the --max-age arg:
$ cache-s3 -c -b ci-cache-bucket --prefix test restore --max-age="5m 30s"
[Info ]: <cache-s3/test/master.cache> - Refusing to restore, cache is too old: 2 hours, 35 minutes, 21 seconds
[Info ]: <cache-s3/test/master.cache> - Clear cache request was successfully submitted.
$ cache-s3 -c -b ci-cache-bucket --prefix test restore --max-age="5m 30s"
[Info ]: <cache-s3/test/master.cache> - No previously stored cache was found.
- Prevent large cache from being saved, restored, or both with --max-size. This also clears out the cache upon a failed restore attempt:
$ cache-s3 -c -b ci-cache-bucket --prefix test save -p src --max-size 10kb
[Info ]: Caching: /home/lehins/fpco/cache-s3/src
[Info ]: <cache-s3/test/master.cache> - Refusing to save, cache is too big: 13.6 KiB
$ cache-s3 -c -b ci-cache-bucket --prefix test save -p src --max-size 1MiB
[Info ]: Caching: /home/lehins/fpco/cache-s3/src
[Info ]: <cache-s3/test/master.cache> - Data change detected, caching 13.6 KiB with sha256: X7Abafwff4DETyWrKP6x2RpWK6o0gh5xfpwOU4++m2A=
[Info ]: Progress: 10%, speed: 57.2 KiB/s
[Info ]: <cache-s3/test/master.cache> - Finished uploading. Files are cached on S3.
$ cache-s3 -c -b ci-cache-bucket --prefix test restore --max-size 13KiB
[Info ]: <cache-s3/test/master.cache> - Refusing to restore, cache is too big: 13.6 KiB
[Info ]: <cache-s3/test/master.cache> - Clear cache request was successfully submitted.
$ cache-s3 -c -b ci-cache-bucket --prefix test restore --max-size 13KiB
[Info ]: <cache-s3/test/master.cache> - No previously stored cache was found.
If there is some other reason to remove the cache for a particular build, all that is necessary is to run cache-s3 clear, cache-s3 clear stack or cache-s3 clear stack work with the same arguments that the corresponding restore command would be called with. Alternatively, the file can be manually removed from the S3 bucket.
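For example, mirroring the restore invocations from the examples above, the cache could be cleared with something along these lines:
$ cache-s3 -c -b ci-cache-bucket --prefix test clear
$ cache-s3 -c -b ci-cache-bucket --prefix test clear stack work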
- Data will not be uploaded to S3 if it has not changed. By change here I mean not only the content of files, but also attributes of files and folders, such as modification time, changes in permissions or ownership. So even a touch of one of the files being cached will trigger an upload to S3.
- Consistency of cache is verified when it is being restored.
- The compression algorithm is customizable. For now, only gzip is available on Windows; gzip and lz4 on others.
- The default hashing algorithm, SHA256, can also be overridden.
- Despite the fact that files on S3 have the .cache extension, they are simple .tar.gz archives and can be manually inspected for content, if some CI build failure debugging is necessary.
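For example, a cached object could be pulled down and listed with standard tools. The bucket name and object key below just mirror the log messages above and are only an illustration:
$ aws s3 cp s3://ci-cache-bucket/cache-s3/test/master.cache ./master.cache
$ tar -tzf master.cache | head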