Description
openedon Jun 11, 2024
Description of the bug
The code parsing vep related options can't deal with the --refseq/--merged
options in the vep command.
If setting:
vep_cache_version: "111"
vep_genome: "GRCh37"
vep_cache: "path/to/vep-cache/"
vep_custom_args: "--refseq --offline --everything"
vep_species: "homo_sapiens_refseq"
then the error is:
-------------------- EXCEPTION --------------------
MSG: Should not use homo_sapiens_refseq as --species.
Try using flags --refseq or --merged with --species homo_sapiens
the command.sh
:
vep \
-i sample.freebayes.vcf.gz \
-o sample.freebayes_VEP.ann.vcf.gz \
--stats_file sample.freebayes_VEP.ann.summary.html \
--vcf --refseq --offline --hgvsg --allele_number --everything --format vcf \
--compress_output bgzip \
--fasta human_g1k_v37_decoy.fasta \
--assembly GRCh37 \
--species homo_sapiens_refseq \
--cache \
--cache_version 111 \
--dir_cache ${PWD}/ensembl-vep \
--fork 6
Notice that --species homo_sapiens_refseq
is incorrect.
The correct way is to use --species homo_sapiens
along with --refseq
(latter is present because we include it in vep_custom_args
.
The --dir_cache ${PWD}/ensembl-vep
(corresponds to vep_cache
) is organized as:
${PWD}/ensembl-vep
├── homo_sapiens_refseq
│ ├── 111_GRCh37
│ └── 111_GRCh38
...
which is the default structure if downloaded via ensembl-vep (these files are downloaded via docker myself). For homo_sapiens, there are two more versions, the default homo_sapien
and the other one homo_sapiens_merged
. If download them all, then the folder would look like:
${PWD}/ensembl-vep
├── homo_sapiens
│ ├── 111_GRCh37
│ └── 111_GRCh38
├── homo_sapiens_merged
│ ├── 111_GRCh37
│ └── 111_GRCh38
├── homo_sapiens_refseq
│ ├── 111_GRCh37
│ └── 111_GRCh38
VEP's annotation folder selection logic:
- By default vep will use
${dir_cache}/${species}/${cache_version}_${assembly}
- If
--refseq
is provided, it will use${dir_cache}/${species}_refseq/${cache_version}_${assembly}
- If
--merged
is provided, it will use${dir_cache}/${species}_merged/${cache_version}_${assembly}
Below are mappings between vep command options and sarek input params:
$dir_cache: $vep_cache
$species: $vep_species
$cache_version: $vep_cache_version
$assembly: $vep_genome
However, I'm not sure if this is the case for other species other than homo_sapiens.
For homo_sapiens, the correct logic to parse might be:
- check if
vep_custom_args
contains--refseq/--merged
- if both not exist, then validate annotation folder existence:
${vep_cache}/${vep_species}/${vep_cache_version}_${vep_genome}
- if any of them exists, should validate annotation folder existence:
${vep_cache}/${vep_species}_refseq/${vep_cache_version}_${vep_genome}
or${vep_cache}/${vep_species}_merged/${vep_cache_version}_${vep_genome}
; note that$vep_species
should still behomo_sapiens
(because we need--species homo_sapiens
in the vep command)
A more consistent way might be to provide another sarek input param to deal with vep's --refseq/--merged
option
Command used and terminal output
No response
Relevant files
No response
System information
No response