-
Notifications
You must be signed in to change notification settings - Fork 13
Description
Hi Hervé,
Happy New Year!
I’m working with Macaque single cell ATAC-seq data recently, and the package Signac requires the annotation formatted to UCSC style. However, I encountered the problem when I ran:
> ah= AnnotationHub() > query(ah, c("EnsDb","Macaca", "mulatta")) > annotation <- ah[["AH95772"]] > seqlevelsStyle(annotation) <- "UCSC" Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") : No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.
Could you please let me know if there’s a way to fix this?
Thanks!
Best,
Wen
There's a lot going on here.
The main issue is that this mapping is not properly supported in GenomeInfoDb at the moment.
For example:
library(AnnotationHub)
ah <- AnnotationHub()
ensdb <- ah[["AH95772"]]
x <- seqinfo(ensdb) ## Seqinfo object
> x
Seqinfo object with 329 sequences (1 circular) from Mmul_10 genome:
seqnames seqlengths isCircular genome
20 77137495 FALSE Mmul_10
10 99517758 FALSE Mmul_10
13 108737130 FALSE Mmul_10
3 185288947 FALSE Mmul_10
1 223616942 FALSE Mmul_10
... ... ... ...
QNVO02000404.1 82950 FALSE Mmul_10
QNVO02000958.1 23808 FALSE Mmul_10
QNVO02002331.1 8826 FALSE Mmul_10
QNVO02001857.1 53877 FALSE Mmul_10
MT 16564 TRUE Mmul_10
The seqlevelsStyle()
getter gets it wrong (the seqnames are the Ensembl seqnames, not the NCBI ones):
> seqlevelsStyle(x)
[1] "NCBI"
And the seqlevelsStyle()
setter does a very poor job:
> seqlevelsStyle(x) <- "UCSC"
Warning message:
In (function (seqlevels, genome, new_style) :
cannot switch some Mmul_10's seqlevels from NCBI to UCSC style
> x
Seqinfo object with 329 sequences (1 circular) from 2 genomes (Mmul_10, rheMac10):
seqnames seqlengths isCircular genome
20 77137495 FALSE Mmul_10
10 99517758 FALSE Mmul_10
13 108737130 FALSE Mmul_10
3 185288947 FALSE Mmul_10
1 223616942 FALSE Mmul_10
... ... ... ...
QNVO02000404.1 82950 FALSE Mmul_10
QNVO02000958.1 23808 FALSE Mmul_10
QNVO02002331.1 8826 FALSE Mmul_10
QNVO02001857.1 53877 FALSE Mmul_10
chrM 16564 TRUE rheMac10
> table(genome(x))
Mmul_10 rheMac10
328 1
Here's how low-level utilities getChromInfoFromEnsembl()
and getChromInfoFromUCSC()
can be used to do the job. The code is specifically taylored towards Mmul_10/rheMac10 so lacks generality, but it's a start:
## To use on a Seqinfo object only.
switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC <- function(x)
{
stopifnot(is(x, "Seqinfo"))
if (!all(genome(x) %in% "Mmul_10"))
stop(wmsg("'genome(x)' is not set to Mmul_10"))
x_seqlevels <- seqlevels(x)
ens_chrominfo <- getChromInfoFromEnsembl("Mmul_10")
ens_seqlevels <- ens_chrominfo[ , "name"]
m0 <- match(x_seqlevels, ens_seqlevels)
if (any(isNA(m0)))
stop(wmsg("not all the seqlevels in 'x' belong to Ensembl Mmul_10"))
ucsc_chrominfo <- getChromInfoFromUCSC("rheMac10", map.NCBI=TRUE)
ucsc_seqlevels <- ucsc_chrominfo[ , "chrom"]
## Map 'ens_seqlevels' to 'ucsc_seqlevels'.
pseudo_ucsc_seqlevels <- paste0("chr", ens_seqlevels)
pseudo_ucsc_seqlevels[pseudo_ucsc_seqlevels == "chrMT"] <- "chrM"
m1 <- match(pseudo_ucsc_seqlevels, ucsc_seqlevels)
m2 <- match(ens_seqlevels, ucsc_chrominfo[ , "NCBI.GenBankAccn"])
idx <- which(is.na(m1))
m1[idx] <- m2[idx]
idx <- which(is.na(m2))
m2[idx] <- m1[idx]
## Sanity checks.
stopifnot(identical(m1, m2))
stopifnot(identical(ens_chrominfo[ , "length"],
ucsc_chrominfo[m1 , "size"]))
## Map 'x_seqlevels' to 'ucsc_seqlevels'.
seqlevels(x) <- ucsc_seqlevels[m1[m0]]
genome(x) <- "rheMac10"
x
}
Then:
x <- seqinfo(ensdb)
x2 <- switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC(x)
x2
# Seqinfo object with 329 sequences (1 circular) from rheMac10 genome:
# seqnames seqlengths isCircular genome
# chr20 77137495 FALSE rheMac10
# chr10 99517758 FALSE rheMac10
# chr13 108737130 FALSE rheMac10
# chr3 185288947 FALSE rheMac10
# chr1 223616942 FALSE rheMac10
# ... ... ... ...
# chr6_NW_021160188v1_random 82950 FALSE rheMac10
# chrUn_NW_021160819v1 23808 FALSE rheMac10
# chrUn_NW_021162185v1 8826 FALSE rheMac10
# chrUn_NW_021161713v1 53877 FALSE rheMac10
# chrM 16564 TRUE rheMac10
Another problem is that EnsDb objects don't support the seqinfo()
setter so x2
cannot be put back on ensdb
:
> seqinfo(ensdb) <- x2
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'seqinfo<-' for signature '"EnsDb"'
Note that this is a limitation of EnsDb objects (implemented in the ensembldb package), so is kind of a separate issue.
Finally, about this error raised by the seqlevelsStyle()
setter for EnsDb objects:
> seqlevelsStyle(ensdb) <- "UCSC"
Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") :
No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.
I guess the error message refers to the Accept-organism-for-GenomeInfoDb.Rmd vignette. However note that those mappings were introduced a long time ago and were never intended to handle scaffolds, only the "main chromosomes". That's because scaffolds are specific to a particular assembly version (e.g. they're not the same in Mmul_10/rheMac10 and in Mmul_8.0.1/rheMac8). So adding a mapping for Macaca mulatta wouldn't actually help here.