Getting started

STEP 0: Python

An installation of Python (version 3.6) is required to run PCGR. Check that Python is installed by typing python --version in your terminal window. In addition, a Python library for parsing configuration files encoded with TOML is needed. To install, simply run the following command:

pip install toml
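
Both prerequisites can also be verified from Python itself. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is made up for illustration and is not part of PCGR:

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 6)):
    """Return (python_ok, toml_ok) for the interpreter running this script."""
    # sys.version_info compares element-wise against a plain tuple
    python_ok = sys.version_info >= min_version
    # find_spec returns None when the module cannot be located
    toml_ok = importlib.util.find_spec("toml") is not None
    return python_ok, toml_ok


if __name__ == "__main__":
    python_ok, toml_ok = check_prerequisites()
    print("Python >= 3.6:", python_ok)
    print("toml importable:", toml_ok)
```

If the second value is `False`, the `pip install toml` command above most likely targeted a different Python installation than the one running the script.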

STEP 1: Installation of Docker

  1. Install the Docker engine on your preferred platform
    • installing Docker on Linux
    • installing Docker on Mac OS
    • NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with PCGR (an example being mounting of data volumes)
  2. Test that Docker is running, e.g. by typing docker ps or docker images in the terminal window
  3. Adjust the computing resources dedicated to Docker
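
The Docker test in item 2 can be scripted if it needs to run as part of a larger setup check. This is a rough sketch, assuming the `docker` executable is on the `PATH`; the helper name `docker_is_running` is hypothetical:

```python
import shutil
import subprocess


def docker_is_running(timeout=30):
    """Return True if the docker CLI exists and `docker ps` exits cleanly."""
    if shutil.which("docker") is None:
        return False  # docker executable not found on PATH
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A non-zero exit code usually means the daemon is not reachable
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker available:", docker_is_running())
```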

STEP 2: Download PCGR and data bundle

Development version

  1. Clone the PCGR GitHub repository: git clone https://github.com/sigven/pcgr.git
  2. Download and unpack the latest data bundles in the PCGR directory
  3. Pull the PCGR Docker image (dev) from DockerHub (approx. 5.2 GB):
    • docker pull sigven/pcgr:dev (PCGR annotation engine)

Latest release

  1. Download and unpack the latest software release (0.8.1)
  2. Download and unpack the assembly-specific data bundle in the PCGR directory
  3. Pull the PCGR Docker image (0.8.1) from DockerHub (approx. 5.2 GB):
    • docker pull sigven/pcgr:0.8.1 (PCGR annotation engine)

STEP 3: Input preprocessing

The PCGR workflow accepts two types of input files:

  • An unannotated, single-sample VCF file (>= v4.2) with called somatic variants (SNVs/InDels)
  • A copy number segment file

PCGR can be run with either or both of the two input files present.

  • We strongly recommend that the input VCF is compressed and indexed using bgzip and tabix
  • If the input VCF contains multi-allelic sites, these will be subject to decomposition
  • Variants used for reporting should be designated as ‘PASS’ in the VCF FILTER column
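
Before running PCGR, it can be useful to see how many records in a VCF are actually marked ‘PASS’. The sketch below tallies the FILTER column (the seventh tab-separated field of a VCF record). The function name `count_filter_values` and the toy records are made up for illustration:

```python
def count_filter_values(vcf_lines):
    """Tally the FILTER column values across the data lines of a VCF."""
    counts = {}
    for line in vcf_lines:
        # Skip blank lines, meta-information (##) and the header line (#CHROM)
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            raise ValueError("VCF record has fewer than 8 columns: " + line)
        filter_value = fields[6]  # FILTER is the 7th column
        counts[filter_value] = counts.get(filter_value, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy records for illustration only, not real calls
    toy_vcf = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t100\t.\tA\tG\t50\tPASS\t.\n",
        "1\t200\t.\tC\tT\t10\tLowQual\t.\n",
    ]
    print(count_filter_values(toy_vcf))
```

Because bgzip output is gzip-compatible, a compressed file can be passed in as `gzip.open(path, "rt")`.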

The tab-separated values file with copy number aberrations MUST contain the following four columns:

  • Chromosome
  • Start
  • End
  • Segment_Mean

Here, Chromosome, Start, and End denote the chromosomal segment, and Segment_Mean denotes the log2 ratio for that segment, which is a common output of somatic copy number alteration callers. Note that coordinates must be one-based (i.e. the first base of a chromosome is at position 1, not 0). The initial part of a copy number segment file that is formatted correctly according to PCGR’s requirements is shown below:

Chromosome Start   End Segment_Mean
1 3218329 3550598 0.0024
1 3552451 4593614 0.1995
1 4593663 6433129 -1.0277
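
A segment file can be sanity-checked against these requirements before it is passed to PCGR. The following sketch is not PCGR's own validation; the function name `read_cna_segments` is hypothetical, and the checks cover only what is stated above (four required columns, tab-separated values, one-based coordinates):

```python
import io

REQUIRED_COLUMNS = ["Chromosome", "Start", "End", "Segment_Mean"]


def read_cna_segments(handle):
    """Parse a tab-separated CNA segment file and check basic requirements."""
    header = handle.readline().rstrip("\n").split("\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("Missing required column(s): " + ", ".join(missing))
    segments = []
    for line_number, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # tolerate blank lines
        values = line.rstrip("\n").split("\t")
        if len(values) != len(header):
            raise ValueError("Line %d: unexpected number of columns" % line_number)
        row = dict(zip(header, values))
        start, end = int(row["Start"]), int(row["End"])
        # One-based coordinates: the smallest valid position is 1
        if start < 1 or end < start:
            raise ValueError(
                "Line %d: invalid one-based segment %d-%d" % (line_number, start, end)
            )
        segments.append((row["Chromosome"], start, end, float(row["Segment_Mean"])))
    return segments


if __name__ == "__main__":
    example = (
        "Chromosome\tStart\tEnd\tSegment_Mean\n"
        "1\t3218329\t3550598\t0.0024\n"
        "1\t3552451\t4593614\t0.1995\n"
        "1\t4593663\t6433129\t-1.0277\n"
    )
    for segment in read_cna_segments(io.StringIO(example)):
        print(segment)
```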

STEP 4: Configure PCGR

The PCGR software bundle comes with default configuration files per tumor type (in the conf/ folder), to be used as a starting point for running the PCGR workflow. The configuration file, formatted using TOML, enables the user to configure a number of options related to the following:

  • Sequencing depth/allelic support (definition of tags + thresholds)
  • MSI prediction
  • Mutational signatures analysis
  • Mutational burden analysis (e.g. target size)
  • VCF to MAF conversion
  • Tumor-only analysis options (i.e. exclusion of germline variants/enrichment for somatic calls)
  • VEP/vcfanno options
  • Log-ratio thresholds for gains/losses in CNA analysis

See the PCGR documentation for more details about the exact usage of the configuration options.
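
To illustrate how log-ratio thresholds for gains and losses act on the Segment_Mean values from STEP 3, here is a small sketch. The key names `logr_gain` and `logr_homdel` and the values 0.8 and -0.8 are illustrative placeholders, not necessarily those used in the shipped configuration files, and the classification logic is an assumption about how such thresholds are typically applied:

```python
# Hypothetical values standing in for the CNA section of a parsed TOML file
cna_config = {"logr_gain": 0.8, "logr_homdel": -0.8}


def classify_segment(segment_mean, config):
    """Label a segment as 'gain', 'loss' or 'neutral' from its log2 ratio."""
    if segment_mean >= config["logr_gain"]:
        return "gain"
    if segment_mean <= config["logr_homdel"]:
        return "loss"
    return "neutral"


if __name__ == "__main__":
    # Segment_Mean values taken from the example segment file in STEP 3
    for value in (0.0024, 0.1995, -1.0277):
        print(value, classify_segment(value, cna_config))
```

With the `toml` library from STEP 0, `toml.load(path)` returns nested dictionaries, so a section of the real configuration file could be passed in place of `cna_config` once the actual key names are known.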

STEP 5: Run example

A tumor sample report is generated by calling the Python script pcgr.py, which takes the following arguments and options:

usage: pcgr.py [options] <PCGR_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY> <CONFIG_FILE> <SAMPLE_ID>

Personal Cancer Genome Reporter (PCGR) workflow for clinical interpretation of
somatic nucleotide variants and copy number aberration segments

positional arguments:
pcgr_dir              PCGR base directory with accompanying data directory,
                e.g. ~/pcgr-0.8.1
output_dir            Output directory
{grch37,grch38}       Genome assembly build: grch37 or grch38
configuration_file    PCGR configuration file (TOML format, in conf/ folder)
sample_id             Tumor sample/cancer genome identifier - prefix for
                output files

optional arguments:
-h, --help            show this help message and exit
--input_vcf INPUT_VCF
                VCF input file with somatic query variants
                (SNVs/InDels). (default: None)
--input_cna INPUT_CNA
                Somatic copy number alteration segments (tab-separated
                values) (default: None)
--input_cna_plot INPUT_CNA_PLOT
                Somatic copy number alteration plot (default: None)
--pon_vcf PON_VCF     VCF file with germline calls from Panel of Normals
                (PON) - i.e. blacklist variants (default: None)
--tumor_purity TUMOR_PURITY
                Estimated tumor purity (between 0 and 1) (default:
                None)
--tumor_ploidy TUMOR_PLOIDY
                Estimated tumor ploidy (default: None)
--force_overwrite     By default, the script will fail with an error if any
                output file already exists. You can force the
                overwrite of existing result files by using this flag
                (default: False)
--version             show program's version number and exit
--basic               Run functional variant annotation on VCF through
                VEP/vcfanno, omit other analyses (i.e. CNA, MSI,
                report generation etc. (STEP 4) (default: False)
--no_vcf_validate     Skip validation of input VCF with Ensembl's vcf-
                validator (default: False)
--docker-uid DOCKER_USER_ID
                Docker user ID. Default is the host system user ID. If
                you are experiencing permission errors, try setting
                this up to root (`--docker-uid root`) (default: None)
--no-docker           Run the PCGR workflow in a non-Docker mode (see
                install_no_docker/ folder for instructions (default:
                False)
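
When PCGR is launched from another script, the argument order in the usage line above is easy to get wrong. The sketch below assembles the command as a list suitable for `subprocess.run`. The helper `build_pcgr_command` is hypothetical and covers only a few of the options listed; the check that at least one input file is given reflects the statement in STEP 3, not the actual behaviour of pcgr.py:

```python
import os


def build_pcgr_command(pcgr_dir, output_dir, genome_assembly, config_file,
                       sample_id, input_vcf=None, input_cna=None,
                       force_overwrite=False):
    """Assemble a pcgr.py command line as a list of arguments."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError("genome_assembly must be 'grch37' or 'grch38'")
    if input_vcf is None and input_cna is None:
        raise ValueError("provide input_vcf, input_cna, or both")
    command = ["python", "pcgr.py"]
    # expanduser is needed because no shell expands '~' in an argument list
    if input_vcf is not None:
        command += ["--input_vcf", os.path.expanduser(input_vcf)]
    if input_cna is not None:
        command += ["--input_cna", os.path.expanduser(input_cna)]
    if force_overwrite:
        command.append("--force_overwrite")
    # Positional arguments, in the order given by the usage line
    command += [
        os.path.expanduser(pcgr_dir),
        os.path.expanduser(output_dir),
        genome_assembly,
        os.path.expanduser(config_file),
        sample_id,
    ]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)` from the directory that contains pcgr.py.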

The examples folder contains input files from two tumor samples sequenced within TCGA (GRCh37 only). It also contains PCGR configuration files customized for these samples. A report for a colorectal tumor case can be generated by running the following command in your terminal window:

python pcgr.py --input_vcf ~/pcgr-0.8.1/examples/tumor_sample.COAD.vcf.gz --input_cna ~/pcgr-0.8.1/examples/tumor_sample.COAD.cna.tsv ~/pcgr-0.8.1 ~/pcgr-0.8.1/examples grch37 ~/pcgr-0.8.1/conf/Colorectal_Cancer_NOS.toml tumor_sample.COAD

This command will run the Docker-based PCGR workflow and produce the following output files in the examples folder:

  1. tumor_sample.COAD.pcgr_acmg.grch37.html - An interactive HTML report for clinical interpretation
  2. tumor_sample.COAD.pcgr_acmg.grch37.pass.vcf.gz (.tbi) - Bgzipped VCF file with a rich set of annotations for precision oncology
  3. tumor_sample.COAD.pcgr_acmg.grch37.pass.tsv.gz - Compressed vcf2tsv-converted file with a rich set of annotations for precision oncology
  4. tumor_sample.COAD.pcgr_acmg.grch37.snvs_indels.tiers.tsv - Tab-separated values file with variants organized according to tiers of functional relevance
  5. tumor_sample.COAD.pcgr_acmg.grch37.json.gz - Compressed JSON dump of HTML report content
  6. tumor_sample.COAD.pcgr_acmg.grch37.cna_segments.tsv.gz - Compressed tab-separated values file with annotations of gene transcripts that overlap with somatic copy number aberrations
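
The file names listed above follow the pattern `<sample_id>.pcgr_acmg.<genome_assembly>.<suffix>`, so a finished run can be checked programmatically. This is a minimal sketch with hypothetical helper names. It assumes the suffixes shown in the list, and that every file is expected; in practice some outputs presumably depend on which inputs were supplied:

```python
import os

# Suffixes taken from the output list; ".tbi" is the tabix index of the VCF
OUTPUT_SUFFIXES = [
    "html",
    "pass.vcf.gz",
    "pass.vcf.gz.tbi",
    "pass.tsv.gz",
    "snvs_indels.tiers.tsv",
    "json.gz",
    "cna_segments.tsv.gz",
]


def expected_output_files(sample_id, genome_assembly):
    """Return the output file names expected for a sample and assembly."""
    prefix = "{}.pcgr_acmg.{}".format(sample_id, genome_assembly)
    return ["{}.{}".format(prefix, suffix) for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_dir, sample_id, genome_assembly):
    """Return the expected output files that are absent from output_dir."""
    return [
        name
        for name in expected_output_files(sample_id, genome_assembly)
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


if __name__ == "__main__":
    for name in expected_output_files("tumor_sample.COAD", "grch37"):
        print(name)
```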