Skip to content

Caching seqinfo  #26

@jeff-mandell

Description

@jeff-mandell

Hi, my package uses genomeInfoDb, and we use the seqlevelsStyle function to clean up user-inputted data and ensure consistent chromosome names (in our case, we go with NCBI style, which means stripping chr prefixes). I can see that what seems like a simple task gets complicated under the hood with the need to download the latest info from NCBI, Ensembl, and UCSC.

I found that .UCSC_cached_chrom_info and .NCBI_cached_chrom_info store the necessary information for seqlevelsStyle throughout a session, but an internet connection is initially necessary every new session. This causes a problem for offline users and users on networks that for whatever reason are blocking any of NCBI/UCSC/Ensembl traffic (yes, this is really happening). Since seqinfo is such a small amount of data, is there a plan to take advantage of R's support for caching user data to save this information and allow seqlevelsStyle to run offline? Or is there a safe workaround to supply the necessary seqinfo?

I did it this way, but I'm concerned this could cause problems with new GenomeInfoDb releases or if anything changes on the NCBI/UCSC/Ensembl server side.

# Get information for local caching
bsg = getBSgenome("hg19")
seqlevelsStyle(bsg) = "NCBI"
ucsc_info = GenomeInfoDb:::.add_ensembl_column(ucsc_info, "hg19")
ucsc_info = getFromNamespace(".UCSC_cached_chrom_info", "GenomeInfoDb")[["hg19"]]
ucsc_info = GenomeInfoDb:::.add_ensembl_column(ucsc_info, "hg19")
ncbi_info = getFromNamespace(".NCBI_cached_chrom_info", "GenomeInfoDb")[["GCF_000001405.25"]]
saveRDS(ncbi_info, "hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
saveRDS(ucsc_info, "hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")

# Later, in new (offline) R session
ucsc_info = readRDS("hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")
ncbi_info = readRDS("hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
assign('hg19', ucsc_info, envir = get(".UCSC_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))
assign('GCF_000001405.25', ncbi_info, envir = get(".NCBI_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))

# seqlevelsStyle now works offline

`

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions