LOCAL_INSTALL.md

Prerequisites

  • Hardware: At least 16 GB RAM, 4 CPUs, and 50 GB of disk space

  • Software: Docker with the Docker Compose plugin (all commands below use docker compose)

  • OS settings for elasticsearch:

    echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf

    sudo sysctl -w vm.max_map_count=262144

    This prevents the elasticsearch startup error: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
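
    To verify the setting is active:

    sysctl vm.max_map_count   # should print: vm.max_map_count = 262144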

Starting seqr

The steps below describe how to create a new empty seqr instance with a single Admin user account.

SEQR_DIR=$(pwd)

wget https://raw.githubusercontent.com/ccmbioinfo/seqr-cfi/master/docker-compose.yml

docker compose up -d seqr   # start the seqr container in the background, along with the components it depends on (postgres, redis, elasticsearch). This may take 10+ minutes.
docker compose logs -f seqr  # (optional) continuously print seqr logs to see when it is done starting up or if there are any errors. Type Ctrl-C to exit from the logs.

docker compose exec seqr python manage.py createsuperuser  # create a seqr Admin user

open http://localhost     # open the seqr landing page in your browser. Log in to seqr using the email and password from the previous step
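
To confirm everything came up, you can check the service status and query elasticsearch directly. This is a quick sanity check, not part of the required steps; the elasticsearch service name below is assumed from the default docker-compose.yml:

docker compose ps    # all services should be listed as running/healthy
docker compose exec elasticsearch curl -s http://localhost:9200/_cluster/health   # expect "status":"green" or "yellow"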

Updating seqr

Updating your local installation of seqr involves pulling the latest version of the seqr docker image and then recreating the container.

# run this from the directory containing your docker-compose.yml file
docker compose pull
docker compose up -d seqr

docker compose logs -f seqr  # (optional) continuously print seqr logs to see when it is done starting up or if there are any errors. Type Ctrl-C to exit from the logs.

To update reference data in seqr, such as OMIM, HPO, etc., run the following:

docker compose exec seqr ./manage.py update_all_reference_data --use-cached-omim --skip-gencode

Additionally, the pipeline-runner container has a script to download reference data for the specified genome build. To download Ensembl reference data for GRCh37 and GRCh38, run the following:

docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 37
docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 38

Note: These scripts take a long time to run. It is recommended to run them in the background using tmux or screen.
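
For example, to run the GRCh38 download inside a detachable tmux session:

tmux new -s refdata                  # start a named session
docker compose exec pipeline-runner /usr/local/bin/download_reference_data.sh 38
# detach with Ctrl-B then D; reattach later with: tmux attach -t refdata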

Annotating and loading VCF callsets

Option #1: annotate on a Google Dataproc cluster, then load into an on-prem seqr instance

Google Dataproc makes it easy to start a Spark cluster that can be used to parallelize annotation across many machines. The steps below describe how to annotate a callset and then load it into your on-prem elasticsearch instance.

  1. authenticate into your Google Cloud account.

    gcloud auth application-default login
  2. upload your .vcf.gz callset to a google bucket

    GS_BUCKET=gs://your-bucket       # your google bucket
    GS_FILE_PATH=data/GRCh38         # the desired file path; good to include the build version and/or sample type in the directory structure
    FILENAME=your-callset.vcf.gz     # the local file you want to load
    
    gsutil cp $FILENAME $GS_BUCKET/$GS_FILE_PATH/
  3. start a pipeline-runner container which has the necessary tools and environment for starting and submitting jobs to a Dataproc cluster.

    docker compose up -d pipeline-runner            # start the pipeline-runner container
  4. if you haven't already, upload reference data to your own google bucket. This should be done once per build version and does not need to be repeated for subsequent loading jobs. This is expected to take a while.

    BUILD_VERSION=38                 # can be 37 or 38
    
    docker compose exec pipeline-runner copy_reference_data_to_gs.sh $BUILD_VERSION $GS_BUCKET
    

    Periodically, you may want to update the reference data in order to get the latest versions of these annotations. To do so, run the following commands. All subsequently loaded data will then have the updated annotations, but previously loaded projects must be re-loaded to pick up the updated annotations.

    GS_BUCKET=gs://your-bucket       # your google bucket
    BUILD_VERSION=38                 # can be 37 or 38
    
    # Update clinvar
    gsutil rm -r "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/clinvar/clinvar.GRCh${BUILD_VERSION}.ht" "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    
    # Update all other reference data
    gsutil rm -r "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
    gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/all_reference_data/combined_reference_data_grch${BUILD_VERSION}.ht" "${GS_BUCKET}/reference_data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
  5. run the loading command in the pipeline-runner container, adjusting the arguments as needed:

    BUILD_VERSION=38                 # can be 37 or 38
    SAMPLE_TYPE=WES                  # can be WES or WGS
    INDEX_NAME=your-dataset-name     # the desired index name to output. Will be used later to link the data to the corresponding seqr project
    
    INPUT_FILE_PATH=/${GS_FILE_PATH}/${FILENAME}
    
    docker compose exec pipeline-runner load_data_dataproc.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $GS_BUCKET $INPUT_FILE_PATH
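
    Once the job completes, you can check that the new index reached your local elasticsearch. The elasticsearch service name below is assumed from the default docker-compose.yml:

    docker compose exec elasticsearch curl -s "http://localhost:9200/_cat/indices?v"   # $INDEX_NAME should appear in the output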
    

Option #2: annotate and load on-prem

Annotating a callset with VEP and reference data can be very slow (as slow as a few variants per second per CPU), so although it is possible to run the pipeline on a single machine, using multiple machines is recommended.

The steps below describe how to annotate a callset and then load it into your on-prem elasticsearch instance.

  1. create a directory for your VCF files. Docker Compose will mount this directory into the pipeline-runner container.

    FILE_PATH=GRCh38                 # the desired subdirectory; good to include the build version and/or sample type in the directory structure
    FILENAME=your-callset.vcf.gz     # the local file you want to load. VCFs should be bgzipped

    mkdir -p ./data/input_vcfs/$FILE_PATH
    cp $FILENAME ./data/input_vcfs/$FILE_PATH/
  2. start a pipeline-runner container

    docker compose up -d pipeline-runner            # start the pipeline-runner container
  3. authenticate into your Google Cloud account. This is required for hail to access buckets hosted on Google Cloud Storage.

    docker compose exec pipeline-runner gcloud auth application-default login
  4. if you haven't already, download VEP and other reference data to the docker image's mounted directories. This should be done once per build version and does not need to be repeated for subsequent loading jobs. This is expected to take a while.

    BUILD_VERSION=38                 # can be 37 or 38
    
    docker compose exec pipeline-runner download_reference_data.sh $BUILD_VERSION
    

    Periodically, you may want to update the reference data in order to get the latest versions of these annotations. To do so, run the following commands. All subsequently loaded data will then have the updated annotations, but previously loaded projects must be re-loaded to pick up the updated annotations.

    BUILD_VERSION=38                 # can be 37 or 38
    
    # Update clinvar
    docker compose exec pipeline-runner rm -rf "/seqr-reference-data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    docker compose exec pipeline-runner gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/clinvar/clinvar.GRCh${BUILD_VERSION}.ht" "/seqr-reference-data/GRCh${BUILD_VERSION}/clinvar.GRCh${BUILD_VERSION}.ht"
    
    # Update all other reference data
    docker compose exec pipeline-runner rm -rf "/seqr-reference-data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
    docker compose exec pipeline-runner gsutil rsync -r "gs://seqr-reference-data/GRCh${BUILD_VERSION}/all_reference_data/combined_reference_data_grch${BUILD_VERSION}.ht" "/seqr-reference-data/GRCh${BUILD_VERSION}/combined_reference_data_grch${BUILD_VERSION}.ht"
  5. run the loading command in the pipeline-runner container, adjusting the arguments as needed:

    BUILD_VERSION=38                 # can be 37 or 38
    SAMPLE_TYPE=WES                  # can be WES or WGS
    INDEX_NAME=your-dataset-name     # the desired index name to output. Will be used later to link the data to the corresponding seqr project
    
    INPUT_FILE_PATH=${FILE_PATH}/${FILENAME}
    
    docker compose exec pipeline-runner load_data.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $INPUT_FILE_PATH
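
    On a single machine this can run for a long time. As a sketch, you can run the load non-interactively in the background with nohup instead of tmux (the -T flag disables TTY allocation so exec works without a terminal):

    nohup docker compose exec -T pipeline-runner load_data.sh $BUILD_VERSION $SAMPLE_TYPE $INDEX_NAME $INPUT_FILE_PATH > load_data.log 2>&1 &
    tail -f load_data.log            # follow progress; Ctrl-C stops tailing, not the job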
    

Adding a loaded dataset to a seqr project

After the dataset is loaded into elasticsearch, it can be added to your seqr project with these steps:

  1. Go to the project page
  2. Click on Edit Datasets
  3. Enter the elasticsearch index name (the $INDEX_NAME argument you provided at loading time), and submit the form.

Enable read viewing in the browser (optional)

To make .bam/.cram files viewable in the browser through igv.js, see the ReadViz Setup Instructions.

Loading RNASeq datasets

Currently, seqr has a preliminary integration for RNA data, which relies on publicly available pipelines run outside of the seqr platform. After these pipelines are run, the output must be annotated with metadata from seqr to ensure samples are properly associated with the correct seqr families. The data can then be added to seqr from the "Data Management" > "Rna Seq" page, where you provide the file path for the data and the data type. The file path can be either a gs:// path to a google bucket or a local file in any of the volumes specified in the docker-compose file. The following data types are supported:

Gene Expression

seqr accepts normalized expression TPMs from STAR or RNA-SeQC. TSV files should have the following columns:

  • sample_id
  • project
  • gene_id
  • TPM
  • tissue
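
A minimal illustrative file with one hypothetical data row (columns are tab-separated; the sample, project, and gene values below are placeholders, and the filename is arbitrary):

printf 'sample_id\tproject\tgene_id\tTPM\ttissue\n' > expression_tpm.tsv
printf 'SAMPLE-1\tmy-project\tENSG00000186092\t1.23\tmuscle\n' >> expression_tpm.tsv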

Expression Outliers

seqr accepts gene expression outliers from OUTRIDER. TSV files should have the following columns:

  • sampleID
  • geneID
  • pValue
  • padjust
  • zScore
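
Likewise, a hypothetical outlier file (tab-separated; all values are placeholders):

printf 'sampleID\tgeneID\tpValue\tpadjust\tzScore\n' > outliers.tsv
printf 'SAMPLE-1\tENSG00000186092\t1.5e-06\t0.004\t-4.2\n' >> outliers.tsv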

IGV

Splice junction (.junctions.bed.gz) and coverage (.bigWig) files can be visualized in seqr using IGV. See the ReadViz Setup Instructions for details on adding this data; the process is identical for all IGV tracks.