Output files

Jakub Vasicek edited this page Apr 14, 2025 · 13 revisions

Concatenated FASTA file

The main result of the pipeline is the concatenated FASTA file, consisting of the ProHap and/or ProVar output, reference sequences from Ensembl, and contaminant sequences. Wherever the canonical start codon is annotated, translations of the untranslated regions (UTRs) of transcripts are excluded from this file.

The resulting file has the following format:

>tag|accession|<positions_within_protein> <protein_starts> <matching_proteins> <reading_frames>
PROTEINSEQUENCE

The header of the protein entry is formatted as >tag|accession|description. The accession field is used as the identifier of the entry when annotating the peptide-spectrum matches (PSMs). The description field is then used by the annotation pipeline to align the peptide with transcripts, genes, and variant coordinates. Please note that there can be multiple protein entries sharing the same sequence - therefore, the description field may contain information about multiple proteins. The tag can be used to quickly distinguish between contaminants, canonical, haplotype, and variant sequences.

Optionally, the concatenated FASTA can be simplified, with the tag and the information contained in the description fields extracted to a tab-separated file. The simplified FASTA file will then contain only the protein accession and the name of the associated gene. Contaminant sequences are additionally marked as contaminants. This option is recommended for compatibility with search engines and other tools.

The simplified FASTA file has the following format:

>accession GN=<gene_name>
PROTEINSEQUENCE
>accession CONTAMINANT GN=<contaminant protein name (e.g., CAH1_HUMAN)>
PROTEINSEQUENCE
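
As an illustration of the simplified format above, the header can be split into its parts as in the following minimal sketch. The function name `parse_simplified_header` and the accessions in the usage example are hypothetical; the sketch assumes that the accession contains no spaces and that `GN=` is the last field in the header.

```python
def parse_simplified_header(header: str) -> dict:
    """Split a simplified FASTA header into accession, contaminant flag, and name.

    Assumes the layout shown above: '>accession [CONTAMINANT] GN=<name>'.
    """
    body = header[1:] if header.startswith(">") else header
    # Everything after ' GN=' is the gene name (or contaminant protein name).
    left, sep, name = body.strip().partition(" GN=")
    tokens = left.split()
    return {
        "accession": tokens[0],
        "is_contaminant": "CONTAMINANT" in tokens[1:],
        "name": name if sep else None,
    }


# Hypothetical accessions, used for illustration only.
regular = parse_simplified_header(">prot_1 GN=BRCA2")
contaminant = parse_simplified_header(">cont_1 CONTAMINANT GN=CAH1_HUMAN")
```
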

Possible tag values are:

  • generic_cont: At least one of the matching sequences is a contaminant.
  • generic_ensref: No matching contaminant, at least one of the matching sequences is an Ensembl canonical protein.
  • generic_var: No matching contaminant or canonical protein, at least one of the matching sequences is a variant protein (obtained by ProVar).
  • generic_enshap: No matching contaminant, canonical or variant protein, all of the matching sequences are non-canonical protein haplotypes (obtained by ProHap).
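
The tag values above follow a fixed precedence: contaminant first, then Ensembl canonical, then variant, then haplotype. The sketch below only illustrates this rule and is not taken from the ProHap code; `TAG_PRIORITY` and `choose_tag` are hypothetical names, and the input is assumed to be the list of per-sequence tags of all entries sharing one protein sequence.

```python
# Tags ordered from highest to lowest precedence, as listed above.
TAG_PRIORITY = ["generic_cont", "generic_ensref", "generic_var", "generic_enshap"]


def choose_tag(source_tags):
    """Return the tag of the merged entry given the tags of all matching sequences."""
    present = set(source_tags)
    for tag in TAG_PRIORITY:
        if tag in present:
            return tag
    raise ValueError(f"no known tag among {sorted(present)}")


# A haplotype sequence identical to a canonical protein is tagged as canonical.
merged = choose_tag(["generic_enshap", "generic_ensref"])
```
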

The fields included in the description of the FASTA elements are the following:

  • positions_within_protein: positions of this sub-sequence within the whole protein sequences, delimited by semicolons.
  • protein_starts: positions of the first residue (usually M) within the whole cDNA translation, indexed from 0.
  • matching_proteins: IDs of the whole protein sequences matching this sub-sequence. Variant and haplotype IDs can be mapped to the metadata table provided.
  • reading_frames: Reading frames in which the matching sequences are translated, if known. These are denoted by the offset from the first nucleotide of the cDNA (i.e., 0, 1, or 2).
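
A header of the concatenated FASTA could be taken apart along the lines of the sketch below. It assumes exactly the layout given above: three `|`-separated parts, with the description holding four whitespace-separated fields in the order shown. The exact encoding of each field value is not specified on this page, so the values are kept as raw strings (any label prefix present in the real files would be kept as part of the value). `FastaHeader`, `parse_header`, and the example header are hypothetical.

```python
from typing import NamedTuple

# Field order as given in the header format above.
FIELD_NAMES = (
    "positions_within_protein",
    "protein_starts",
    "matching_proteins",
    "reading_frames",
)


class FastaHeader(NamedTuple):
    tag: str
    accession: str
    description: dict


def parse_header(header: str) -> FastaHeader:
    """Split '>tag|accession|description' and name the description fields."""
    body = header[1:] if header.startswith(">") else header
    tag, accession, description = body.strip().split("|", 2)
    values = description.split()
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return FastaHeader(tag, accession, dict(zip(FIELD_NAMES, values)))


# Made-up header, for illustration only.
parsed = parse_header(">generic_enshap|prot_1|0;12 0;0 haplo_chr1_4;haplo_chr1_7 0;0")
protein_ids = parsed.description["matching_proteins"].split(";")
```
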

cDNA translations FASTA

FASTA file containing the original translations of variant / haplotype cDNA sequences prior to any optimization and merging with canonical proteins and contaminants. Note that when variation in UTRs is ignored (the default configuration of ProHap), UTR sequences are not included here. Otherwise, they are kept in this file.

The file is formatted as below:

>tag|accession|<matching_proteins> <protein_starts> <reading_frames>
PROTEINSEQUENCE

Since several haplotypes can produce the same protein sequence, the matching_proteins field contains the haplotype / variant IDs as found in the metadata table files (see below), separated by semicolon.
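
To map the IDs in matching_proteins back to the metadata, one option is to index the table by its ID column and look up each ID, as in the sketch below. The helper names and the two-column example table are hypothetical (the real tables have more columns, and the transcript IDs here are made up). For a ProVar table, the ID column would be `variantID` instead of `HaplotypeID`.

```python
import csv
import io


def index_by_id(tsv_text: str, id_column: str = "HaplotypeID") -> dict:
    """Index the rows of a tab-separated metadata table by their ID column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row for row in reader}


def resolve_ids(matching_proteins: str, index: dict) -> list:
    """Return the metadata rows for a semicolon-separated list of IDs.

    IDs that are not in the table (e.g., canonical proteins) are skipped.
    """
    return [index[i] for i in matching_proteins.split(";") if i in index]


# Made-up excerpt of a haplotype table, for illustration only.
table = (
    "HaplotypeID\tTranscriptID\n"
    "haplo_chr1_4\tENST00000000001\n"
    "haplo_chr1_7\tENST00000000002\n"
)
rows = resolve_ids("haplo_chr1_4;haplo_chr1_7", index_by_id(table))
```
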

cDNA sequence FASTA (optional)

Optionally, the cDNA sequences before translation can be written to a separate file. These are the cDNA sequences prior to any optimization and merging with canonical proteins and contaminants. Note that when variation in UTRs is ignored (the default configuration of ProHap), UTR sequences are not included here. Otherwise, they are kept in this file.

When using ProHap to investigate combinations of variants in the cDNA sequences, we recommend disabling the option to ignore variation in UTRs. This will produce the complete unique cDNA haplotype sequences encoded by the provided genotypes, while the translations of UTRs will still be removed from the concatenated FASTA.

The file is formatted as below:

>sequence_ID start:<number>
AGCTCGGCCGCCGGGACCCAGGGCATGGATGGAGCCCCGAGGCGGTGGGAG...

The sequence_ID field will contain the ID of the corresponding haplotype / variant as found in the metadata table files (see below). We generally expect each haplotype or variant to encode a unique cDNA sequence. However, should two haplotypes / variants encode the same sequence, the IDs will be separated by semicolon. The start field denotes the position of the canonical start codon within the cDNA sequence, indexed from 0.
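
Because the start field is indexed from 0, it can be used directly as a slice position to separate the 5' UTR from the rest of the cDNA. The sketch below assumes that the start value is always an integer (how a missing start codon annotation is written is not covered here); the function names and the example entry are hypothetical.

```python
def parse_cdna_header(header: str):
    """Return (list of sequence IDs, start position) from '>sequence_ID start:<number>'."""
    body = header[1:] if header.startswith(">") else header
    ids_part, start_part = body.strip().rsplit(" ", 1)
    key, _, value = start_part.partition(":")
    if key != "start":
        raise ValueError(f"unexpected field: {start_part!r}")
    # Identical cDNA sequences share one entry, with IDs separated by semicolons.
    return ids_part.split(";"), int(value)


def split_at_start(cdna: str, start: int):
    """Split a cDNA into the 5' UTR and the part beginning at the start codon."""
    return cdna[:start], cdna[start:]


# Made-up entry, for illustration only.
ids, start = parse_cdna_header(">haplo_chr1_4;haplo_chr1_7 start:3")
utr, coding = split_at_start("GGCATGGAT", start)
```
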

Metadata tables

ProHap and ProVar produce a tab-separated file with information about the corresponding haplotype / variant sequences:

ProHap output

  1. Haplotype table file provided in a tab-separated text-file format. The columns given by ProHap are:
  • chromosome
  • TranscriptID: Identifier of the transcript in Ensembl format (ENSTxxx)
  • transcript_biotype: Biotype of the matching transcript in Ensembl.
  • HaplotypeID: ID of the haplotype sequence, matching to the ID in the FASTA entry description.
  • VCF_IDs: IDs of the matching lines in the VCF file if provided
  • DNA_changes: List of changes in the format position:REF>ALT, mapped to the DNA coordinates within the chromosome
  • allele_frequencies: List of allele frequencies of the variants included in the haplotype
  • cDNA_changes: List of changes in the format position:REF>ALT, mapped to the coordinates within the cDNA sequence of this transcript
  • all_protein_changes: List of amino acid changes in the format position:REF>position:ALT, mapped to the coordinates within the protein sequence. The start codon is at position 0, so if a change happens in the 5' untranslated region (UTR), its coordinates within the protein are negative.
  • variant_types: Consequence type of variant (e.g., SAV, inframe-indel, synonymous, ...) for every variant on the protein level
  • protein_changes: List of amino acid changes in the protein excluding synonymous variants.
  • reading_frame: Canonical reading frame for this transcript, if known.
  • protein_prefix_length: Number of codons in the 5' UTR
  • start_missing: Boolean - is the canonical annotation of the start codon missing for this transcript?
  • start_lost: Boolean - does one of the variants cause a loss of the start codon?
  • splice_sites_affected: List of splice sites affected by a variant, if any. (Splice site 0 happens between exon 1 and 2)
  • occurrence_count: Number of occurrences of this haplotype within the participants of the 1000 Genomes project (or within the cohort provided in the phased genotype VCF)
  • frequency: Frequency of this haplotype within the participants of the 1000 Genomes project (or within the cohort provided in the phased genotype VCF)
  • frequency_population: Frequency of this haplotype among populations (assignment of individuals to populations given as input)
  • frequency_superpopulation: Frequency of this haplotype among superpopulations (assignment of individuals to superpopulations given as input)
  2. Tab-separated file listing the samples in which each of the protein haplotype sequences has been predicted. For example, if the file contains the following:
HaplotypeID     samples
haplo_chr1_4    HG02572:2;HG02717:1

The haplotype sequence haplo_chr1_4 is encoded by the second copy of the respective gene in individual HG02572, and the first copy in individual HG02717.
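
The samples field from the example above can be split into (sample, gene copy) pairs as in this short sketch. The function name `parse_samples` is hypothetical, and the sketch assumes every item has the `sample:copy` form shown in the example.

```python
def parse_samples(field: str) -> list:
    """Turn 'HG02572:2;HG02717:1' into [('HG02572', 2), ('HG02717', 1)]."""
    pairs = []
    for item in field.split(";"):
        sample, _, copy = item.partition(":")
        pairs.append((sample, int(copy)))
    return pairs


pairs = parse_samples("HG02572:2;HG02717:1")
# Individuals carrying the haplotype, regardless of which gene copy encodes it.
carriers = sorted({sample for sample, _ in pairs})
```
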

ProVar output

Metadata file provided in a tab-separated text-file format. The columns given by ProVar are:

  • chromosome
  • TranscriptID: Identifier of the transcript in Ensembl format (ENSTxxx)
  • transcript_biotype: Biotype of the matching transcript in Ensembl.
  • variantID: ID of the variant sequence (unique per transcript x allele), matching to the ID in the FASTA entry description.
  • vcfID: ID of the matching line in the VCF file, if provided
  • DNA_change: Change in the format position:REF>ALT, mapped to the DNA coordinates within the chromosome
  • cDNA_change: Change in the format position:REF>ALT, mapped to the coordinates within the cDNA of this transcript
  • protein_change: Amino acid change in the format position:REF>ALT, mapped to the coordinates within the protein sequence. The start codon is at position 0, so if the change happens in the 5' untranslated region (UTR), its coordinates within the protein are negative.
  • reading_frame: Canonical reading frame for this transcript, if known.
  • protein_prefix_length: Number of codons in the 5' UTR
  • start_missing: Boolean - is the canonical annotation of the start codon missing for this transcript?
  • start_lost: Boolean - does the variant cause a loss of the start codon?
  • splice_site_affected: Which splice site, if any, is affected by the variant. (Splice site 0 happens between exon 1 and 2)
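
The position:REF>ALT format used in the DNA_change, cDNA_change, and protein_change columns can be parsed with a small regular expression, as sketched below. The names `Change` and `parse_change` and the example values are hypothetical. The sketch covers single changes in the ProVar format only; the position:REF>position:ALT format of all_protein_changes in the ProHap table would need a different pattern. A negative protein position indicates a change in the 5' UTR, as described above.

```python
import re
from typing import NamedTuple

# Optional minus sign, because protein positions in the 5' UTR are negative.
CHANGE_RE = re.compile(r"^(-?\d+):([^>]*)>(.*)$")


class Change(NamedTuple):
    position: int
    ref: str
    alt: str


def parse_change(text: str) -> Change:
    """Parse a single change written as position:REF>ALT."""
    match = CHANGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in position:REF>ALT format: {text!r}")
    return Change(int(match.group(1)), match.group(2), match.group(3))


# Made-up values, for illustration only.
dna_change = parse_change("1203:A>G")
utr_change = parse_change("-4:R>H")
in_five_prime_utr = utr_change.position < 0
```
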
