
Input & Usage

Jakub Vasicek edited this page Jul 8, 2024 · 13 revisions

As with every Snakemake pipeline, you will need to create a configuration file. This can be done using the simple GUI at https://progenno.github.io/ProHap/. Most of the parameters have a default value and do not have to be specified explicitly unless a specific behavior of ProHap or ProVar is desired. Any required values that are missing will be marked with the MISSING keyword in the configuration file content.

Requirements

Install Snakemake following the "Installation via Conda/Mamba" guide in the Snakemake documentation.

Note that ProHap has been developed and tested on Ubuntu. If there are any problems running it on other platforms (such as Mac or Windows), please report an issue and we will try to resolve it.

Using ProHap with the 1000 Genomes Project data set (the default) requires about 1 TB of disk space!

Usage

Follow these steps to use the ProHap / ProVar pipeline:

  1. Clone the ProHap repository, and navigate to the corresponding directory: git clone https://github.com/ProGenNo/ProHap.git; cd ProHap/;
  2. Use https://progenno.github.io/ProHap/ to specify the configuration. Copy the configuration text at the bottom of the page.
  3. Create a file called config.yaml in the root directory of ProHap (next to the Snakefile). Paste the configuration text into this file. If you wish to name the configuration file differently (e.g., to keep track of different versions), specify the file name on line 1 of the Snakefile.
  4. Activate the Conda environment to run Snakemake: conda activate snakemake
  5. Test Snakemake with a dry-run: snakemake -c3 -n -q
  6. Run the Snakemake pipeline to create your protein database, specifying the number of available CPU cores in the --cores parameter. E.g., when using 30 cores, run snakemake --cores 30 -p --use-conda
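
Before the dry-run in step 5, it can help to confirm that no required value is still marked MISSING in config.yaml. The following is a minimal sketch, not part of ProHap: the helper name find_missing_values is hypothetical, and it only assumes that the keyword MISSING appears literally in the pasted configuration text.

```python
from pathlib import Path


def find_missing_values(config_text):
    """Return (line number, line) pairs for lines that contain the MISSING keyword."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if "MISSING" in line:
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # config.yaml is the default file name from step 3; adjust if you renamed it.
    config_path = Path("config.yaml")
    if config_path.exists():
        for number, line in find_missing_values(config_path.read_text()):
            print(f"line {number}: {line}")
```

If the script prints nothing, no line in the file contains the keyword.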

Below is the description of the expected input file format and all the configuration parameters:

Input files

ProVar

Provide a VCF file (variant call format, tab-separated) with variants on all chromosomes, as per the example below. It is possible to provide multiple VCF files to be combined -- this has to be specified in the configuration (see below).

Example VCF with minimum required information:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
5       1415176 var_1   G       A       .       .       AF=0.18
16      176467  var_99  G       C       .       .       .
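
The layout above can be checked before running ProVar. This is a minimal sketch under stated assumptions: the function name check_minimal_vcf is hypothetical, and the checks (eight tab-separated columns, integer POS) reflect only the example shown here, not the full set of rules ProVar may apply.

```python
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_minimal_vcf(lines):
    """Return a list of problems found in the given VCF lines (empty if none)."""
    problems = []
    header = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#"):
            header = fields
            if header[:8] != REQUIRED_COLUMNS:
                problems.append(f"line {number}: unexpected header columns")
            continue
        if header is None:
            problems.append(f"line {number}: data before the #CHROM header")
        if len(fields) < 8:
            problems.append(f"line {number}: expected at least 8 tab-separated fields")
        elif not fields[1].isdigit():
            problems.append(f"line {number}: POS is not an integer")
    return problems
```

A typical use is `check_minimal_vcf(open("variants.vcf"))`, where variants.vcf is a placeholder file name. A common cause of errors is a file separated by spaces instead of tabs, which this check reports as too few fields.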

Alternatively, you can collect your variants into a CSV file and use the script in src/csv_to_vcf.py to convert it into a VCF file before running ProVar.

ProHap

Provide one phased genotype VCF file per chromosome and an associated sample metadata file (tab-separated) as per the example below, or refer to the 1000 Genomes Project data set.

sample_chr5.vcf:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4
5       1415176 var_1   G       A       .       .       AF=0.18 GT      0|0     1|0     0|1     1|1  

sample_chr16.vcf:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4
16      176467  var_99  G       C       .       .       AF=0.01 GT      0|0     0|0     0|1     0|1  

Sample metadata file with the minimal required information (if you do not wish to run population analysis, specify ALL in the Superpopulation code and Population code fields):

Sample name     Sex     Population code Superpopulation code
SAMPLE1         female  GBR             EUR
SAMPLE2         male    GBR             EUR
SAMPLE3         male    ASW             AFR
SAMPLE4         female  ASW             AFR
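
Since every sample column in the VCF files needs a row in the metadata file, a quick cross-check can catch mismatched names early. The sketch below is illustrative only: the function names are hypothetical, and it assumes the metadata file is tab-separated with a column called "Sample name", exactly as in the example above.

```python
import csv
import io


def vcf_sample_names(header_line):
    """Sample columns follow the nine fixed columns (#CHROM through FORMAT)."""
    return header_line.rstrip("\n").split("\t")[9:]


def metadata_sample_names(metadata_text):
    """Read the 'Sample name' column from tab-separated metadata text."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    return [row["Sample name"] for row in reader]


def samples_without_metadata(header_line, metadata_text):
    """List VCF samples that have no row in the metadata file."""
    known = set(metadata_sample_names(metadata_text))
    return [name for name in vcf_sample_names(header_line) if name not in known]
```

An empty list means every sample in the VCF header has a matching metadata row.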

1: General parameters

These parameters apply to both ProHap and ProVar:

  • Ensembl release: The number of the Ensembl version to be used (e.g. 110).
  • Used transcripts: Choose the set of transcripts to be used in the in silico translation. There are four options:
    • Default: ProHap and ProVar will use all the transcripts that have an associated canonical protein sequence in Ensembl. The list of transcripts will be created automatically from the Ensembl reference proteome.
    • MANE Select: ProHap and ProVar will use only transcripts that are labeled as MANE Select. This is only available for Ensembl version 108 and higher. For genes that do not include a MANE Select transcript in Ensembl, "Ensembl Canonical" transcripts will be selected.
    • Select by biotype: Specify the desired biotypes (as per Gencode) to be included. The list of transcripts will be created automatically from the Ensembl GTF annotation file.
    • Provide transcript IDs manually: Give the path to the CSV file with all the desired transcripts to be included, see the example here.
  • Contaminants FASTA file: Give the full or relative path (including the filename) to the FASTA file of contaminant sequences. These will then be added to the final FASTA, and tagged as contaminants. The default contaminant database is created by the cRAP project, provided in this repository.
  • Final FASTA file: Give the full or relative path (including the filename) to the resulting FASTA file.
  • Simplify FASTA headers: Switch - turn on to extract the descriptions in the FASTA headers to a separate file. These descriptions are used to annotate peptides with matching proteins, genes, and alleles. The simplified FASTA file will contain only the artificial protein identifier and the name of the associated gene (e.g., >prot_123ab GN=ABCC8). This option is recommended for compatibility with search engines and other tools.
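
When post-processing a simplified FASTA file, the header can be split into the protein identifier and the gene name. The sketch below is modelled only on the single example header given above (>prot_123ab GN=ABCC8); the function name and the regular expression are assumptions, and real headers may contain variations this pattern does not cover.

```python
import re

# Pattern based on the one documented example: ">prot_123ab GN=ABCC8".
SIMPLIFIED_HEADER = re.compile(r"^>(?P<protein_id>\S+)\s+GN=(?P<gene>\S+)\s*$")


def parse_simplified_header(header):
    """Return (protein_id, gene) for a simplified header, or None if it does not match."""
    match = SIMPLIFIED_HEADER.match(header)
    if match is None:
        return None
    return match.group("protein_id"), match.group("gene")
```

Headers that do not follow the simplified layout return None, so they can be handled separately.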

2: ProHap parameters

ProHap is the main component of this pipeline, creating protein haplotype sequences from data sets of phased genotypes. ProHap accepts the following parameters, as available in the configuration GUI:

  • Use ProHap: Switch - turn on if you want to use ProHap in the pipeline and add protein haplotypes to the final database.
  • Data source: There are two ways to provide the phased genotype data to ProHap: a URL to an online resource, or a path to a directory containing the VCF files locally. One VCF file is expected per chromosome.
    • URL of the data set: Provide a full URL path to the online directory containing all the files to be downloaded. E.g., for the 1000 Genomes Project data set, the URL is http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/. It is expected that the VCF files are compressed with gzip, and therefore should end with .vcf.gz.
    • Path to the directory containing phased VCF files: Provide the absolute or relative path to the directory containing all the VCF files. In this case, the VCF files are not expected to be compressed.
    • Name of the VCF files: Give the file name uncompressed, i.e., without the .gz for online resources. For both cases, the files are expected to have exactly the same name, differing only by the chromosome number. Replace the chromosome number with {chr}. E.g., if your files are named phased_genotype_chr1.vcf, phased_genotype_chr2.vcf, ..., specify phased_genotype_chr{chr}.vcf.
  • MAF threshold: The minor allele frequency threshold to filter out rare variants before computing the protein haplotypes. Provide a number between 0 and 1 (0.01 by default for 1% MAF).
  • MAF field name: Name of the AF column in the VCF file ("AF" by default). Change if you want to use the frequency in a specific population within 1000 Genomes (e.g., AFR_AF for frequencies within the African superpopulation), or according to your own file.
  • Threshold haplotypes: If you want to filter out rare combinations of variants after the protein haplotypes are generated, you can decide to threshold these either by the frequency or by the actual number of observations of the haplotype. In both cases, we mean the frequency or number of the unique protein haplotype sequence, not the full haplotype genome-wide.
  • Threshold value: The actual lower threshold value. Specify a number between 0 and 1 if thresholding by frequency, otherwise specify the least number of occurrences.
  • Require annotation of the start codon in transcripts: Skip transcripts where the canonical location of the start codon is not provided in Ensembl. If not skipped, ProHap will give the 3-frame translation for all haplotypes in these transcripts, unless the canonical location of the stop codon is available.
  • Ignore variation in UTR regions: By default, ProHap will not include variants mapping to the untranslated regions (UTRs) of transcripts in the haplotypes. If you want to see the linkage between these UTR variants and variants in the protein-coding regions, disable this option. Note that even with this option enabled, the translations of the UTR regions will not be included in the final concatenated FASTA file. However, they will be included in the translations of haplotype cDNAs.
  • Skip haplotypes where the start codon is lost: If the start codon in a transcript is lost due to a variant, the haplotype will not be considered discoverable and will be removed from the results.
  • Path to the haplotype FASTA file: Path to the result file containing all the translations of haplotype cDNAs, before removing UTRs and merging with other sources (e.g., with ProVar, canonical proteins, and contaminants) into the final FASTA.
  • Path to the haplotype metadata table: Path to the result file containing the description of all the haplotype sequences in the FASTA file as above.
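
The {chr} placeholder in the "Name of the VCF files" parameter can be illustrated as follows. This is a sketch of the naming convention only, not ProHap's own code: the function names are hypothetical, and the chromosome list (1 to 22 plus X) is an assumption that should be adapted to your data set.

```python
# Assumption: autosomes 1-22 plus X; adapt to the chromosomes your data set covers.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X"]


def expand_vcf_names(pattern, chromosomes=CHROMOSOMES):
    """Replace the {chr} placeholder with each chromosome name."""
    if "{chr}" not in pattern:
        raise ValueError("pattern must contain the {chr} placeholder")
    return [pattern.replace("{chr}", str(c)) for c in chromosomes]


def remote_file_name(local_name):
    """Online resources are expected to be gzip-compressed, so append .gz."""
    return local_name + ".gz"


if __name__ == "__main__":
    for name in expand_vcf_names("phased_genotype_chr{chr}.vcf"):
        print(name, "->", remote_file_name(name))
```

The configured name stays uncompressed in both cases; the .gz suffix applies only to the files fetched from a URL.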

3: ProVar parameters

These parameters apply to ProVar, which uses local variant call format (VCF) files to create databases of protein variant sequences.

  • Use ProVar: Switch - turn on if you want to use ProVar in the pipeline and add individual variants to the final database.
  • VCF files: Add as many variation sources as you like. You need to provide the following information for each data set:
    • Dataset name: A name for the data set (please avoid using white space in the name).
    • VCF file path: A path to the VCF file. ProVar expects one VCF file containing all the included variants on all chromosomes.
    • MAF threshold: Lower threshold for the minor allele frequency (MAF). Specify 0 to skip thresholding, or when the MAF information is not provided.
  • Require annotation of the start codon in transcripts: Switch - turn on to skip transcripts where the canonical location of the start codon is not provided in Ensembl. If not skipped, ProVar will give the 3-frame translation for these transcripts, unless the canonical location of the stop codon is available.
  • Path to the variant FASTA file: Path to the result file containing all the translations of variant cDNAs, before removing UTRs and merging with other sources (e.g., with ProHap, canonical proteins, and contaminants) into the final FASTA.
  • Path to the variant metadata table: Path to the result file containing the description of all the variant sequences in the FASTA file as above.
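
To make the MAF threshold concrete, the sketch below shows one way a frequency stored in the VCF INFO column could be compared against a lower threshold, with 0 meaning that thresholding is skipped. It is not ProVar's or ProHap's implementation: the function names are hypothetical, and the handling of missing frequencies and of multi-allelic records is an assumption marked in the comments.

```python
def info_value(info, key):
    """Return the value stored under `key` in a VCF INFO string, or None if absent."""
    if info == ".":
        return None
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def passes_maf_threshold(info, threshold, field="AF"):
    """Decide whether a variant passes the lower frequency threshold."""
    if threshold == 0:
        return True  # thresholding is skipped
    value = info_value(info, field)
    if value is None:
        # Assumption: a variant without frequency information fails a non-zero threshold.
        return False
    # Assumption: for multi-allelic records, only the first listed frequency is used.
    return float(value.split(",")[0]) >= threshold
```

For example, `passes_maf_threshold("AF=0.18", 0.01)` keeps the first variant from the example VCF, while the second variant (INFO given as ".") passes only when the threshold is 0. The field argument mirrors the idea of a configurable frequency field such as AFR_AF.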
