Skip to content

ensembl-vep refseq/merged option broken #1559

Closed

Description

Description of the bug

The code parsing vep related options can't deal with the --refseq/--merged options in the vep command.

If setting:

vep_cache_version: "111"
vep_genome: "GRCh37"
vep_cache: "path/to/vep-cache/"
vep_custom_args: "--refseq --offline --everything"
vep_species: "homo_sapiens_refseq"

then the error is:

  -------------------- EXCEPTION --------------------
  MSG: Should not use homo_sapiens_refseq as --species.
  Try using flags --refseq or --merged with --species homo_sapiens

the command.sh:

vep \
      -i sample.freebayes.vcf.gz \
      -o sample.freebayes_VEP.ann.vcf.gz \
      --stats_file sample.freebayes_VEP.ann.summary.html \
      --vcf --refseq --offline --hgvsg --allele_number --everything --format vcf \
      --compress_output bgzip \
      --fasta human_g1k_v37_decoy.fasta \
      --assembly GRCh37 \
      --species homo_sapiens_refseq \
      --cache \
      --cache_version 111 \
      --dir_cache ${PWD}/ensembl-vep \
      --fork 6

Notice that --species homo_sapiens_refseq is incorrect.
The correct way is to use --species homo_sapiens along with --refseq (latter is present because we include it in vep_custom_args.

The --dir_cache ${PWD}/ensembl-vep (corresponds to vep_cache) is organized as:

${PWD}/ensembl-vep
├── homo_sapiens_refseq
│   ├── 111_GRCh37
│   └── 111_GRCh38
...

which is the default structure if downloaded via ensembl-vep (these files are downloaded via docker myself). For homo_sapiens, there are two more versions, the default homo_sapien and the other one homo_sapiens_merged. If download them all, then the folder would look like:

${PWD}/ensembl-vep
├── homo_sapiens
│   ├── 111_GRCh37
│   └── 111_GRCh38
├── homo_sapiens_merged
│   ├── 111_GRCh37
│   └── 111_GRCh38
├── homo_sapiens_refseq
│   ├── 111_GRCh37
│   └── 111_GRCh38

VEP's annotation folder selection logic:

  • By default vep will use ${dir_cache}/${species}/${cache_version}_${assembly}
  • If --refseq is provided, it will use ${dir_cache}/${species}_refseq/${cache_version}_${assembly}
  • If --merged is provided, it will use ${dir_cache}/${species}_merged/${cache_version}_${assembly}

Below are mappings between vep command options and sarek input params:

  • $dir_cache: $vep_cache
  • $species: $vep_species
  • $cache_version: $vep_cache_version
  • $assembly: $vep_genome

However, I'm not sure if this is the case for other species other than homo_sapiens.
For homo_sapiens, the correct logic to parse might be:

  • check if vep_custom_args contains --refseq/--merged
  • if both not exist, then validate annotation folder existence: ${vep_cache}/${vep_species}/${vep_cache_version}_${vep_genome}
  • if any of them exists, should validate annotation folder existence: ${vep_cache}/${vep_species}_refseq/${vep_cache_version}_${vep_genome} or ${vep_cache}/${vep_species}_merged/${vep_cache_version}_${vep_genome} ; note that $vep_species should still be homo_sapiens (because we need --species homo_sapiens in the vep command)

A more consistent way might be to provide another sarek input param to deal with vep's --refseq/--merged option

Command used and terminal output

No response

Relevant files

No response

System information

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions