How to set up and use cloud dev environment

Adam Kiezun edited this page Jun 7, 2016 · 2 revisions

Note: these steps were originally described on a Broad-internal website: https://broadinstitute.atlassian.net/wiki/display/DO/GATK+Dev+in+a+Docker+In+a+Cloud# They were verified to work in April 2016.

Preliminary steps

  1. Install VirtualBox on your machine (just in case we're launching locally, too). https://www.virtualbox.org/wiki/Downloads
  2. Have a Google Compute Engine account (for GATK developers it is broad-dsde-dev)
  3. Install gcloud (https://cloud.google.com/sdk/) and initialize it with "gcloud init".
  4. Install Docker and all of the Docker tools (https://www.docker.com/products/docker-toolbox)
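
Before moving on, it can help to confirm that the tools from the list above are actually on your PATH. The following is a minimal sketch, not part of the original instructions: missing_tools is a hypothetical helper name, and VBoxManage is assumed to be how the VirtualBox command-line tool appears on your system.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools from the preliminary steps (VBoxManage is VirtualBox's command-line front end).
missing="$(missing_tools VBoxManage gcloud docker docker-machine)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All prerequisite tools found."
fi
```

If anything is reported as missing, install it (or open a new shell so PATH changes take effect) before continuing.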

Step-by-step guide

1. Create a Virtual Host (only needed the first time you launch a VM, not each time you start a Docker container). Note: this takes ~5-10 minutes.

    docker-machine create -d google --google-project <googleproject> --google-zone <google zone> --google-disk-size <in GB up to 2 TB> <your-machine-nickname-that-you'll-remember>
    example:
    docker-machine create -d google --google-project broad-dsde-dev --google-zone us-central1-c --google-disk-size 300 gatktest

Other important options:

  • IMPORTANT: if you don't specify a disk size, you will be able to do almost nothing (the default disk is tiny)!
  • --google-machine-type=<google machine type like f1-micro> (see the WDL used in Cromwell to determine which type is used for each task)
  • --google-disk-type=<pd-ssd or pd-standard>
  • --google-disk-size=<in GB up to 2 TB>
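
The options above can be combined into a single create command. The sketch below only prints the command so you can review it before running it yourself. gce_create_cmd is a hypothetical helper, and the machine type n1-standard-4 and disk type pd-ssd in the example call are illustrative assumptions, not recommendations from the original page.

```shell
# Hypothetical helper: print (do not run) a docker-machine create command.
# Arguments: project, zone, disk size in GB, machine type, disk type, machine name.
gce_create_cmd() {
    # Refuse to build a command without a numeric disk size,
    # since the default disk is too small to be useful.
    case "$3" in
        ''|*[!0-9]*)
            echo "disk size must be a whole number of GB" >&2
            return 1
            ;;
    esac
    printf 'docker-machine create -d google'
    printf ' --google-project %s' "$1"
    printf ' --google-zone %s' "$2"
    printf ' --google-disk-size %s' "$3"
    printf ' --google-machine-type %s' "$4"
    printf ' --google-disk-type %s' "$5"
    printf ' %s\n' "$6"
}

# Example values: project, zone, size, and name come from the example above;
# the machine type and disk type are assumptions.
gce_create_cmd broad-dsde-dev us-central1-c 300 n1-standard-4 pd-ssd gatktest
```

Copy the printed line into your shell once it looks right.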

2. Make your build-machine in VirtualBox:

    docker-machine create -d virtualbox --virtualbox-disk-size "20000" gatkbuild

3. Point your machine at this new Virtual Host (assuming gatkbuild is the name from above). Once you run this eval command, the shell you're in is set up to communicate with that VM.

    > bash
    bash-3.2$ eval "$(docker-machine env gatkbuild)"

4. Log in to DockerHub using your DockerHub username.

    docker login -u <username>

5. Build. This is the "hard part". Take a look at the Dockerfile below (https://github.com/broadinstitute/docker-gatk/blob/master/Dockerfile). Put this in a file called Dockerfile.

#Use a jdk, not a JRE because we want to compile things
FROM java:8-jdk
MAINTAINER DSDE <dsde@broadinstitute.org>
ENV TERM=xterm-256color
  
# Install python2, python3, and R, etc
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y python && \
    apt-get install -y python3-pip && \
    apt-get install -y r-base wget curl unzip gcc python-dev python-setuptools emacs git less lynx hdfview
 
#Make sure we're using crcmod in gsutil
RUN easy_install -U pip
RUN pip install -U crcmod
 
# Install GIT LFS
RUN wget https://github.com/github/git-lfs/releases/download/v1.2.0/git-lfs-linux-386-1.2.0.tar.gz \
    && tar -zxvf git-lfs-linux-386-1.2.0.tar.gz \
    && cd git-lfs-1.2.0 && ./install.sh && cd ..
 
RUN wget https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.zip \
    && unzip google-cloud-sdk.zip \
    && rm google-cloud-sdk.zip
 
RUN google-cloud-sdk/install.sh --usage-reporting=true --path-update=true --bash-completion=true --rc-path=/.bashrc --disable-installation-options
VOLUME ["/root/.config"]
ENV PATH /google-cloud-sdk/bin:$PATH
 
RUN yes | gcloud components update
RUN yes | gcloud components update preview
 
WORKDIR /usr/gitc
 
# Install ggplot2
RUN echo 'install.packages(c("ggplot2"), repos="http://cran.us.r-project.org", dependencies=TRUE)' > /tmp/packages.R && Rscript /tmp/packages.R
 
# Copy picard, gatk, and verifybamid to the /usr/gitc directory. Assumes the binaries and jars are in the dir you're in right now.
COPY . .

a. Copy all required files (the ones mentioned in the LABEL lines above) into the same directory on your local machine as this Dockerfile. If you're building the GATK yourself, put your version in the version line and the JAR in the same directory as this Dockerfile. You might want to write your own scripts to grab these files.

b. Now build! Be patient, the build can take a few minutes, especially the first time.

    docker build -t broadinstitute/gatk:<your_version_number> .

6. Now push your Docker image to the hub so you can use it from the Google machine from step 1. The first push will take a few minutes. A deeper dive is here: https://docs.docker.com/mac/step_six/

    docker push broadinstitute/gatk:<your_version_number>

7. Before running it on a Google machine, change your environment so that docker knows which machine to talk to (note the machine name "gatktest": the machine we created in step 1, not step 2). Rather than switching "evals" all the time, you can keep a separate shell window open for each machine, each with its own "eval".

    bash-3.2$ eval "$(docker-machine env gatktest)"
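
With several shell windows open, it is easy to lose track of which machine a given shell targets. A minimal sketch of a guard follows. It assumes that "docker-machine env" exports DOCKER_MACHINE_NAME alongside the other Docker variables, and on_machine is a hypothetical helper name.

```shell
# Hypothetical guard: succeed only if this shell targets the expected machine.
on_machine() {
    [ "${DOCKER_MACHINE_NAME:-}" = "$1" ]
}

if on_machine gatktest; then
    echo "This shell talks to gatktest."
else
    echo "This shell targets '${DOCKER_MACHINE_NAME:-no docker-machine}', not gatktest."
fi
```

Running a check like this before "docker run" or "docker push" helps avoid starting a container on the local VirtualBox machine by accident.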

8. Now run it. This launches you into a shell in a container, created from the image you just built, running on the VM from step 1. The "-v /scratch:/scratch" option mounts the VM's /scratch directory into the container, so its contents survive when containers exit (but are lost when the whole VM is deleted). This is where you should put your files and write your output. This step requires that you made a "big disk" in step 1.

    docker run -it --privileged -v /scratch:/scratch --rm broadinstitute/gatk:<your_version_number> bash

9. To grab files from a bucket (like a BAM) and put them on the machine once you're logged in via "docker run", use gsutil. You'll have to ask around for the location of the files used for testing, or make your own buckets for testing. You can just copy them to the machine. Note: this takes quite long, > 30 mins.

    gsutil -m cp gs://hellbender/test/resources/benchmark/*.* /scratch/
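
Because the copy is slow, it may be worth checking free space in /scratch before starting it. The following is a rough sketch using only standard tools. free_gb is a hypothetical helper, and how much space you need depends on the files you copy.

```shell
# Hypothetical helper: print the free space, in whole GB, on the filesystem
# that holds the given directory.
free_gb() {
    # df -P prints one POSIX-format line per filesystem;
    # with -k, column 4 is the available space in 1K blocks.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

target=/scratch
if [ -d "$target" ]; then
    echo "Free space under $target: $(free_gb "$target") GB"
else
    echo "$target does not exist; start the container with the /scratch volume first."
fi
```

If the number is small, the VM was probably created without a disk size option in step 1.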

10. If you want to develop here (rather than just run), you can now clone the gatk repo and build it. VERY important: everything not in /scratch will be gone when you exit the container. Everything in /scratch will be gone when the VM is deleted.

    git clone https://github.com/broadinstitute/gatk.git
    cd gatk
    git lfs pull
    ./gradlew clean installDist