https://github.com/dpryan79/ChromosomeMappings is an outstanding repo of contig name mappings across builds and naming conventions.
In the simplest case, UCSC/Gencode naming calls the first human chromosome `chr1`, while Ensembl calls it `1`. However, it is not enough simply to slice off (or add) `chr`, because `chrM` == `MT`, and there are numerous unlocalized and unplaced contigs. In addition, UCSC and Gencode are identical only with respect to the primary chromosomes; they use different names for alt/unloc/unplaced contigs.
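To make the difference concrete, here is a minimal sketch comparing prefix stripping with a table lookup. It assumes a two-column, tab-separated mapping (source name, then target name), which I believe is the layout of the files in the ChromosomeMappings repo; the function names and the example rows are made up for illustration.

```python
def naive_ucsc_to_ensembl(name):
    """Strip a leading 'chr'; this is the prefix-only approach."""
    return name[3:] if name.startswith("chr") else name


def parse_mapping(lines):
    """Build a dict from two-column, tab-separated lines (source<TAB>target)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and rows that have no target name.
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        mapping[fields[0]] = fields[1]
    return mapping


# Illustrative rows only; a real table would be read from a mapping file.
example_rows = ["chr1\t1", "chrX\tX", "chrM\tMT", "chrUn_gl000220\tGL000220.1"]
ucsc_to_ensembl = parse_mapping(example_rows)

# Print each contig with its naive and its table-based translation.
for contig in ("chr1", "chrM", "chrUn_gl000220"):
    print(contig, naive_ucsc_to_ensembl(contig), ucsc_to_ensembl.get(contig))
```

The two approaches agree on `chr1` but diverge on `chrM` and on the unplaced contig, which is exactly where prefix stripping breaks down.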
CrossMap takes the naive approach of renaming based on the `chr` prefix, which is of course a huge help to users who face the very real problem of mismatched contig names, but it is an incomplete solution.
Here, I propose two possible remappings:
- Remap the contig name from the BED or VCF file immediately, before it hits the chain file. This would be useful if, for instance, you had a VCF from gnomAD with Ensembl contigs but your chain file expected UCSC/Gencode-style contigs.
- Remap the contig name coming out of the liftover. Incidentally, this also means we could create "identity" chain files that perform zero coordinate translation, so that the tool becomes essentially a contig-renaming tool (an overcomplicated one, admittedly, but I don't know of another good tool that does this).
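A rough sketch of both ideas, not CrossMap code: `rename_bed_line` and `identity_chain` are hypothetical helpers, and the mapping dict is assumed to have been loaded from a two-column table. The chain records follow the UCSC chain layout as I understand it (a header line, one ungapped block the length of the contig, then a blank line), with the first name as the source and the second as the destination. Whether CrossMap accepts such a file as-is is untested.

```python
def rename_bed_line(line, mapping):
    """Rename column 1 of a BED line; return None if the contig is unmapped."""
    if line.startswith(("#", "track", "browser")):
        return line  # pass header lines through unchanged
    fields = line.rstrip("\n").split("\t")
    new_name = mapping.get(fields[0])
    if new_name is None:
        return None  # caller decides whether to drop or report the record
    fields[0] = new_name
    return "\t".join(fields) + "\n"


def identity_chain(contig_sizes, mapping):
    """Emit one full-length, ungapped chain per contig, changing only the name."""
    records = []
    for chain_id, (name, size) in enumerate(contig_sizes.items(), start=1):
        new_name = mapping.get(name, name)  # keep the old name if unmapped
        # The score field is arbitrary here; the contig size is used as a placeholder.
        header = (
            f"chain {size} {name} {size} + 0 {size} "
            f"{new_name} {size} + 0 {size} {chain_id}"
        )
        records.append(f"{header}\n{size}\n")
    # Chains are separated by a blank line.
    return "\n".join(records)


ensembl_to_ucsc = {"1": "chr1", "MT": "chrM"}  # illustrative rows only
print(rename_bed_line("MT\t100\t200\n", ensembl_to_ucsc), end="")
print(identity_chain({"MT": 16569}, ensembl_to_ucsc), end="")
```

Returning `None` for unmapped contigs leaves the policy decision (drop, warn, or write to an "unmapped" file) to the caller, much like a failed liftover record.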