Skip to content

Mapping Ensembl seqlevels to UCSC for Mmul_10/rheMac10 #79

@hpages

Description

@hpages

Hi Hervé,

Happy New Year!

I’m working with Macaque single cell ATAC-seq data recently, and the package Signac requires the annotation formatted to UCSC style. However, I encountered the problem when I ran:

> ah= AnnotationHub()
> query(ah, c("EnsDb","Macaca", "mulatta")) 
> annotation <- ah[["AH95772"]]
> seqlevelsStyle(annotation) <- "UCSC"
Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") :
 No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.

Could you please let me know if there’s a way to fix this?

Thanks!

Best,

Wen

There's a lot going on here.

The main issue is that this mapping is not properly supported in GenomeInfoDb at the moment.

For example:

library(AnnotationHub)
ah <- AnnotationHub()
ensdb <- ah[["AH95772"]]
x <- seqinfo(ensdb) ## Seqinfo object
> x
Seqinfo object with 329 sequences (1 circular) from Mmul_10 genome:
  seqnames       seqlengths isCircular  genome
  20               77137495      FALSE Mmul_10
  10               99517758      FALSE Mmul_10
  13              108737130      FALSE Mmul_10
  3               185288947      FALSE Mmul_10
  1               223616942      FALSE Mmul_10
  ...                   ...        ...     ...
  QNVO02000404.1      82950      FALSE Mmul_10
  QNVO02000958.1      23808      FALSE Mmul_10
  QNVO02002331.1       8826      FALSE Mmul_10
  QNVO02001857.1      53877      FALSE Mmul_10
  MT                  16564       TRUE Mmul_10

The seqlevelsStyle() getter gets it wrong (the seqnames are the Ensembl seqnames, not the NCBI ones):

> seqlevelsStyle(x)
[1] "NCBI"

And the seqlevelsStyle() setter does a very poor job:

> seqlevelsStyle(x) <- "UCSC"
Warning message:
In (function (seqlevels, genome, new_style)  :
  cannot switch some Mmul_10's seqlevels from NCBI to UCSC style

> x
Seqinfo object with 329 sequences (1 circular) from 2 genomes (Mmul_10, rheMac10):
  seqnames       seqlengths isCircular   genome
  20               77137495      FALSE  Mmul_10
  10               99517758      FALSE  Mmul_10
  13              108737130      FALSE  Mmul_10
  3               185288947      FALSE  Mmul_10
  1               223616942      FALSE  Mmul_10
  ...                   ...        ...      ...
  QNVO02000404.1      82950      FALSE  Mmul_10
  QNVO02000958.1      23808      FALSE  Mmul_10
  QNVO02002331.1       8826      FALSE  Mmul_10
  QNVO02001857.1      53877      FALSE  Mmul_10
  chrM                16564       TRUE rheMac10

> table(genome(x))

 Mmul_10 rheMac10 
     328        1 

Here's how low-level utilities getChromInfoFromEnsembl() and getChromInfoFromUCSC() can be used to do the job. The code is specifically taylored towards Mmul_10/rheMac10 so lacks generality, but it's a start:

## To use on a Seqinfo object only.
switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC <- function(x)
{
    stopifnot(is(x, "Seqinfo"))
    if (!all(genome(x) %in% "Mmul_10"))
        stop(wmsg("'genome(x)' is not set to Mmul_10"))
    x_seqlevels <- seqlevels(x)
    ens_chrominfo  <- getChromInfoFromEnsembl("Mmul_10")
    ens_seqlevels <- ens_chrominfo[ , "name"]
    m0 <- match(x_seqlevels, ens_seqlevels)
    if (any(isNA(m0)))
        stop(wmsg("not all the seqlevels in 'x' belong to Ensembl Mmul_10"))
    
    ucsc_chrominfo <- getChromInfoFromUCSC("rheMac10", map.NCBI=TRUE)
    ucsc_seqlevels <- ucsc_chrominfo[ , "chrom"]

    ## Map 'ens_seqlevels' to 'ucsc_seqlevels'.
    pseudo_ucsc_seqlevels <- paste0("chr",  ens_seqlevels)
    pseudo_ucsc_seqlevels[pseudo_ucsc_seqlevels == "chrMT"] <- "chrM"
    m1 <- match(pseudo_ucsc_seqlevels, ucsc_seqlevels)
    m2 <- match(ens_seqlevels, ucsc_chrominfo[ , "NCBI.GenBankAccn"])
    idx <- which(is.na(m1))
    m1[idx] <- m2[idx]
    idx <- which(is.na(m2))
    m2[idx] <- m1[idx]

    ## Sanity checks.
    stopifnot(identical(m1, m2))  
    stopifnot(identical(ens_chrominfo[ , "length"],
                        ucsc_chrominfo[m1 , "size"]))

    ## Map 'x_seqlevels' to 'ucsc_seqlevels'.
    seqlevels(x) <- ucsc_seqlevels[m1[m0]]
    genome(x) <- "rheMac10"
    x
}

Then:

x <- seqinfo(ensdb)
x2 <- switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC(x)
x2
# Seqinfo object with 329 sequences (1 circular) from rheMac10 genome:
#   seqnames                   seqlengths isCircular   genome
#   chr20                        77137495      FALSE rheMac10
#   chr10                        99517758      FALSE rheMac10
#   chr13                       108737130      FALSE rheMac10
#   chr3                        185288947      FALSE rheMac10
#   chr1                        223616942      FALSE rheMac10
#   ...                               ...        ...      ...
#   chr6_NW_021160188v1_random      82950      FALSE rheMac10
#   chrUn_NW_021160819v1            23808      FALSE rheMac10
#   chrUn_NW_021162185v1             8826      FALSE rheMac10
#   chrUn_NW_021161713v1            53877      FALSE rheMac10
#   chrM                            16564       TRUE rheMac10

Another problem is that EnsDb objects don't support the seqinfo() setter so x2 cannot be put back on ensdb:

> seqinfo(ensdb) <- x2
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'seqinfo<-' for signature '"EnsDb"'

Note that this is a limitation of EnsDb objects (implemented in the ensembldb package), so is kind of a separate issue.

Finally, about this error raised by the seqlevelsStyle() setter for EnsDb objects:

> seqlevelsStyle(ensdb) <- "UCSC"
Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") : 
  No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.

I guess the error message refers to the Accept-organism-for-GenomeInfoDb.Rmd vignette. However note that those mappings were introduced a long time ago and were never intended to handle scaffolds, only the "main chromosomes". That's because scaffolds are specific to a particular assembly version (e.g. they're not the same in Mmul_10/rheMac10 and in Mmul_8.0.1/rheMac8). So adding a mapping for Macaca mulatta wouldn't actually help here.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions