LOCAL_INSTALL.md

Prerequisites

  • Hardware: At least 16 GB RAM, 4 CPUs, and 50 GB of disk space

  • Software: Docker with the Docker Compose plugin (all commands below use docker compose)

  • OS settings for elasticsearch:

    echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf

    sudo sysctl -w vm.max_map_count=262144

    This prevents the elasticsearch startup error: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
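
    To verify the setting is active:

    sysctl vm.max_map_count   # should print: vm.max_map_count = 262144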

Starting seqr

The steps below describe how to create a new empty seqr instance with a single Admin user account.

SEQR_DIR=$(pwd)

wget https://raw.githubusercontent.com/ccmbioinfo/seqr-cfi/master/docker-compose.yml

docker compose up -d seqr   # start the seqr container in the background, along with the components it depends on (postgres, redis, elasticsearch). This may take 10+ minutes.
docker compose logs -f seqr  # (optional) continuously print seqr logs to see when it is done starting up or if there are any errors. Type Ctrl-C to exit from the logs.

docker compose exec seqr python manage.py createsuperuser  # create a seqr Admin user

open http://localhost     # open the seqr landing page in your browser. Log in to seqr using the email and password from the previous step
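
To confirm everything came up, you can check the service status and query elasticsearch directly. This is a quick sanity check, not part of the required steps; the elasticsearch service name below is assumed from the default docker-compose.yml:

docker compose ps    # all services should be listed as running/healthy
docker compose exec elasticsearch curl -s http://localhost:9200/_cluster/health   # expect "status":"green" or "yellow"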

Updating seqr

Updating your local installation of seqr involves pulling the latest version of the seqr docker image and then recreating the container.

# run this from the directory containing your docker-compose.yml file
docker compose pull
docker compose up -d seqr

docker compose logs -f seqr  # (optional) continuously print seqr logs to see when it is done starting up or if there are any errors. Type Ctrl-C to exit from the logs.

To update reference data in seqr, such as OMIM, HPO, etc., run the following:

docker compose exec seqr ./manage.py update_all_reference_data --use-cached-omim --skip-gencode

Additionally, the pipeline-runner container has a script to download reference data for the specified genome build. To download Ensembl reference data for GRCh37 and GRCh38, run the following:

docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 37
docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 38

Note: These scripts take a long time to run. It is recommended to run them in the background using tmux or screen.
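
For example, to run the GRCh38 download inside a detachable tmux session:

tmux new -s refdata                  # start a named session
docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 38
# detach with Ctrl-B then D; reattach later with: tmux attach -t refdata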

Annotating and loading VCF callsets

Option #1: annotate on a Google Dataproc cluster, then load into an on-prem seqr instance

Google Dataproc makes it easy to start a Spark cluster that can be used to parallelize annotation across many machines. The steps below describe how to annotate a callset and then load it into your on-prem elasticsearch instance.

  1. authenticate into your Google Cloud account.

    gcloud auth application-default login
  2. upload your .vcf.gz callset to a google bucket

    GS_BUCKET=gs://your-bucket       # your google bucket
    GS_FILE_PATH=data/GRCh38         # the desired file path; good to include the build version and/or sample type in the directory structure
    FILENAME=your-callset.vcf.gz     # the local file you want to load
    
    gsutil cp $FILENAME $GS_BUCKET/$GS_FILE_PATH/
  3. start a pipeline-runner container which has the necessary tools and environment for starting and submitting jobs to a Dataproc cluster.

    docker compose up -d pipeline-runner            # start the pipeline-runner container
  4. if you haven't already, upload reference data to your own google bucket. This should be done once per build version and does not need to be repeated for subsequent loading jobs. This is expected to take a while.

    BUILD_VERSION=38                 # can be 37 or 38
    
    docker compose exec pipeline-runner copy_reference_data_to_gs.sh $BUILD_VERSION $GS_BUCKET
    

    Periodically, you may want to update the reference data in order to get the latest versions of these annotations. To do so, run the following commands. All subsequently loaded data will then have the updated annotations, but previously loaded projects must be re-loaded to pick up the updated annotations.

    GS_BUCKET=gs://your-bucket       # your google bucket
    BUILD_VERSION=38                 # can be 37 or 38
    
    # Update clinvar
    gsutil rm -r "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/clinvar/clinvar.GRCh${BUILD_VERSION}.ht" "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    
    # Update all other reference data
    gsutil rm -r "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
    gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/all_reference_data/combined_reference_data_grch${BUILD_VERSION}.ht" "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
  5. run the loading command in the pipeline-runner container, adjusting the arguments as needed:

    BUILD_VERSION=38                 # can be 37 or 38
    SAMPLE_TYPE=WES                  # can be WES or WGS
    INDEX_NAME=your-dataset-name     # the desired index name to output. Will be used later to link the data to the corresponding seqr project
    
    INPUT_FILE_PATH=/${GS_FILE_PATH}/${FILENAME}
    
    docker compose exec pipeline-runner load_data_dataproc.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $GS_BUCKET $INPUT_FILE_PATH
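
    Once the job completes, you can check that the new index reached your local elasticsearch. The elasticsearch service name below is assumed from the default docker-compose.yml:

    docker compose exec elasticsearch curl -s "http://localhost:9200/_cat/indices?v"   # $INDEX_NAME should appear in the output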
    

Option #2: annotate and load on-prem

Annotating a callset with VEP and reference data can be very slow (as slow as a few variants per second per CPU), so although it is possible to run the pipeline on a single machine, using multiple machines is recommended.

The steps below describe how to annotate a callset and then load it into your on-prem elasticsearch instance.

  1. create a directory for your VCF files. Docker Compose will mount this directory into the pipeline-runner container.

    FILE_PATH=GRCh38                 # the desired subdirectory; good to include the build version and/or sample type in the directory structure
    FILENAME=your-callset.vcf.gz     # the local file you want to load. VCFs should be bgzipped

    mkdir -p ./data/input_vcfs/$FILE_PATH
    cp $FILENAME ./data/input_vcfs/$FILE_PATH/
  2. start a pipeline-runner container

    docker compose up -d pipeline-runner            # start the pipeline-runner container
  3. authenticate into your Google Cloud account. This is required for hail to access buckets hosted on Google Cloud Storage.

    docker compose exec pipeline-runner gcloud auth application-default login
  4. if you haven't already, download VEP and other reference data to the docker image's mounted directories. This should be done once per build version and does not need to be repeated for subsequent loading jobs. This is expected to take a while.

    BUILD_VERSION=38                 # can be 37 or 38
    
    docker compose exec pipeline-runner download_reference_data.sh $BUILD_VERSION
    

    Periodically, you may want to update the reference data in order to get the latest versions of these annotations. To do so, run the following commands. All subsequently loaded data will then have the updated annotations, but previously loaded projects must be re-loaded to pick up the updated annotations.

    BUILD_VERSION=38                 # can be 37 or 38
    
    # Update clinvar
    docker compose exec pipeline-runner rm -rf "/seqr-reference-data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    docker compose exec pipeline-runner gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/clinvar/clinvar.GRCh${BUILD_VERSION}.ht" "/seqr-reference-data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    
    # Update all other reference data
    docker compose exec pipeline-runner rm -rf "/seqr-reference-data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
    docker compose exec pipeline-runner gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/all_reference_data/combined_reference_data_grch${BUILD_VERSION}.ht" "/seqr-reference-data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
  5. run the loading command in the pipeline-runner container, adjusting the arguments as needed:

    BUILD_VERSION=38                 # can be 37 or 38
    SAMPLE_TYPE=WES                  # can be WES or WGS
    INDEX_NAME=your-dataset-name     # the desired index name to output. Will be used later to link the data to the corresponding seqr project
    
    INPUT_FILE_PATH=${FILE_PATH}/${FILENAME}
    
    docker compose exec pipeline-runner load_data.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $INPUT_FILE_PATH
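
    On a single machine this can run for a long time. As a sketch, you can run the load non-interactively in the background with nohup instead of tmux (the -T flag disables TTY allocation so exec works without a terminal):

    nohup docker compose exec -T pipeline-runner load_data.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $INPUT_FILE_PATH > load_data.log 2>&1 &
    tail -f load_data.log            # follow progress; Ctrl-C stops tailing, not the job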
    

Adding a loaded dataset to a seqr project

After the dataset is loaded into elasticsearch, it can be added to your seqr project with these steps:

  1. Go to the project page
  2. Click on Edit Datasets
  3. Enter the elasticsearch index name (the $INDEX_NAME argument you provided at loading time), and submit the form.

Enable read viewing in the browser (optional)

To make .bam/.cram files viewable in the browser through igv.js, see the ReadViz Setup Instructions.

Loading RNASeq datasets

Currently, seqr has a preliminary integration for RNA data, which relies on publicly available pipelines run outside of the seqr platform. After these pipelines are run, the output must be annotated with metadata from seqr to ensure samples are properly associated with the correct seqr families. The data can then be added to seqr from the "Data Management" > "Rna Seq" page, where you provide the file path for the data and the data type. The file path can be either a gs:// path to a google bucket or a local file in any of the volumes specified in the docker-compose file. The following data types are supported:

Gene Expression

seqr accepts normalized expression TPMs from STAR or RNA-SeQC. TSV files should have the following columns:

  • sample_id
  • project
  • gene_id
  • TPM
  • tissue
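
A minimal illustrative file with one hypothetical data row (columns are tab-separated; the sample, project, and gene values below are placeholders, and the filename is arbitrary):

printf 'sample_id\tproject\tgene_id\tTPM\ttissue\n' > expression_tpm.tsv
printf 'SAMPLE-1\tmy-project\tENSG00000186092\t1.23\tmuscle\n' >> expression_tpm.tsv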

Expression Outliers

seqr accepts gene expression outliers from OUTRIDER. TSV files should have the following columns:

  • sampleID
  • geneID
  • pValue
  • padjust
  • zScore
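
Likewise, a hypothetical outlier file (tab-separated; all values are placeholders):

printf 'sampleID\tgeneID\tpValue\tpadjust\tzScore\n' > outliers.tsv
printf 'SAMPLE-1\tENSG00000186092\t1.5e-06\t0.004\t-4.2\n' >> outliers.tsv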

IGV

Splice junction (.junctions.bed.gz) and coverage (.bigWig) files can be visualized in seqr using IGV. See the ReadViz Setup Instructions for details on adding this data; the process is identical for all IGV tracks.