genome in TSSlogo function #123

ferenckata · 2024-11-28T12:47:56Z

I have some troubles with TSSlogo. I load the annotation from txdb like this:

tx_annotation_obj <- loadDb(tx_annotation)
ce <- CAGEr::annotateCTSS(ce, tx_annotation_obj)

and run TSSlogo like this:

CAGEr::TSSlogo(
    CAGEr::CTSScoordinatesGR(ce) |> subset(annotation == "promoter"),
    upstream = 35)

The error I get:

Error in getBSgenome(unique(genome(x))) : 
   'genome' must be a BSgenome object, or the full name of an installed
   BSgenome data package, or a short string specifying a genome assembly that
   refers unambiguouslyto an installed BSgenome data package
 Calls: <Anonymous> -> <Anonymous> -> .TSSlogo -> getBSgenome

for checking:

> x <- CAGEr::CTSScoordinatesGR(ce) |> subset(annotation == "promoter")
> x
CTSS object with 1501 positions and 2 metadata columns:
         seqnames       pos strand | genes annotation
            <Rle> <integer>  <Rle> | <Rle>      <Rle>
     [1]     chrI     71726      + |         promoter
     [2]     chrI     71751      + |         promoter
     [3]     chrI     71759      + |         promoter
     [4]     chrI     71767      + |         promoter
     [5]     chrI    142134      + |         promoter
     ...      ...       ...    ... .   ...        ...
  [1497]   chrXVI    694898      - |         promoter
  [1498]   chrXVI    696877      - |         promoter
  [1499]   chrXVI    720487      - |         promoter
  [1500]   chrXVI    729545      - |         promoter
  [1501]   chrXVI    892095      - |         promoter
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths
  BSgenome name: unspecified
> genome(x)
   chrI   chrII  chrIII   chrIV   chrIX    chrV   chrVI  chrVII chrVIII    chrX 
     NA      NA      NA      NA      NA      NA      NA      NA      NA      NA 
  chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI 
     NA      NA      NA      NA      NA      NA

The text was updated successfully, but these errors were encountered:

ferenckata · 2024-11-28T13:22:30Z

solved by passing genome as an argument:

TSSlogo_local <- function(x, genome_name, upstream=10, downstream=10) {
  if (! requireNamespace("ggseqlogo"))
    stop("This function requires the ", dQuote("ggseqlogo"), " package; please install it.")
  # Extract sequences
  upstreamRanges <-
    promoters(x = x, upstream = upstream, downstream = downstream) |>
    suppressWarnings() # This warns about off-genome coordinates, but we will fix this below.
  
  # Discard ranges that would be trimmed 
  upstreamRanges_trimmed <- upstreamRanges |> trim()
  upstreamRanges <- upstreamRanges[width(upstreamRanges) == width(upstreamRanges_trimmed)]
  
  # Extract sequences
  bsgenome <- getBSgenome(genome_name)
  upstreamSeq <- getSeq(bsgenome, upstreamRanges)
  
  # Plot sequence logo
  letter_counts <- consensusMatrix(upstreamSeq)
  probs <- prop.table(letter_counts[1:4,], 2)
  gg <- ggseqlogo::ggseqlogo(probs)
  # Circumvent "Scale for x is already present." warning.
  gg$scales$scales[[2]]$breaks <- seq(    1       , upstream + downstream, by = 1)
  gg$scales$scales[[2]]$labels <- seq(1 - upstream,            downstream, by = 1)
  gg$scales$scales[[1]]$breaks <- seq(    1       , upstream + downstream, by = 1)
  gg$scales$scales[[1]]$labels <- seq(1 - upstream,            downstream, by = 1)
  gg
}

charles-plessy · 2024-11-28T13:59:40Z

Hi Katalin, yes if your CAGEexp object did not get a BSgenome name assigned at creation, the TSSlogo is not going to work. Maybe the error message can be improved.

ferenckata · 2024-11-28T14:40:40Z

great, thanks! I will try to fix that then

ferenckata · 2024-12-02T14:27:09Z

Hi again,
I managed to add bsgenome to metadata, but that did not solve the problem. I can see that it is not added as seqinfo to the object returned with CTSStagCountSE:

> metadata(myce)
$genomeName
[1] "BSgenome.Scerevisiae.UCSC.sacCer3"

> seqinfo(CTSStagCountSE(myce))
Seqinfo object with 16 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chrI           <NA>       <NA>   <NA>
  chrII          <NA>       <NA>   <NA>
  chrIII         <NA>       <NA>   <NA>
  chrIV          <NA>       <NA>   <NA>
  chrIX          <NA>       <NA>   <NA>
  ...             ...        ...    ...
  chrXII         <NA>       <NA>   <NA>
  chrXIII        <NA>       <NA>   <NA>
  chrXIV         <NA>       <NA>   <NA>
  chrXV          <NA>       <NA>   <NA>
  chrXVI         <NA>       <NA>   <NA>

> seqinfo(CTSStagCountSE(ce))
Seqinfo object with 26 sequences (1 circular) from danRer7 genome:
  seqnames seqlengths isCircular  genome
  chr1       60348388      FALSE danRer7
  chr2       60300536      FALSE danRer7
  chr3       63268876      FALSE danRer7
  chr4       62094675      FALSE danRer7
  chr5       75682077      FALSE danRer7
  ...             ...        ...     ...
  chr22      42261000      FALSE danRer7
  chr23      46386876      FALSE danRer7
  chr24      43947580      FALSE danRer7
  chr25      38499472      FALSE danRer7
  chrM          16596       TRUE danRer7

> metadata(ce)
$genomeName
[1] "BSgenome.Drerio.UCSC.danRer7"

I have been reading the source code for some time and cannot find the step where I should add this info. Can you maybe point me there?

charles-plessy · 2024-12-02T23:16:45Z

Hi Katalin,

I think that there is no good way to add BSgenome information after the object was created. Have you tried re-loading the data from scratch?

ferenckata · 2024-12-03T08:58:38Z

Hi Charles,
Yes, so the thing is it is loaded as 4.1 in the documentation. Damir made a code that works from bigwig and I am trying to fix that so bsgenome is properly added. Do you have a suggestion about adding it better maybe?


# bsgenome_name="BSgenome.Scerevisiae.UCSC.sacCer3",
# bigwig_paths=["path/to/file1.bw", "path/to/file2.bw"]

bigwigs = unlist(
    stringr::str_split(
      stringr::str_remove_all(
        bigwig_paths, "[\\[\\],]"),
      stringr::fixed(" ")))

  signals = lapply(
    bigwigs,
    function(x) rtracklayer::import(x))

  names(signals) = basename(bigwigs)

  signalsSplit = split(
    signals,
    grepl("str1", names(signals)))

  plus = lapply(signalsSplit$`TRUE`, function(x) {
    strand(x) = "+"
    return(x)
  })

  minus = lapply(signalsSplit$`FALSE`, function(x) {
    strand(x) = "-"
    return(x)
  })

  if (grepl( ".Signal.UniqueMultiple.str1.out.wig.bw", names(plus)[1], fixed = TRUE)){
    plus_sample_names = stringr::str_remove_all(
      names(plus),
      ".Signal.UniqueMultiple.str1.out.wig.bw")
    minus_sample_names = stringr::str_remove_all(
      names(minus),
      ".Signal.UniqueMultiple.str2.out.wig.bw" )
  } else if (grepl( ".Signal.Unique.str1.out.wig.bw", names(plus)[1], fixed = TRUE)) {
    plus_sample_names = stringr::str_remove_all(
      names(plus),
      ".Signal.Unique.str1.out.wig.bw")
    minus_sample_names = stringr::str_remove_all(
      names(minus),
      ".Signal.Unique.str2.out.wig.bw")
  }

  names(plus) <- plus_sample_names
  names(minus) <- minus_sample_names

  if (!all(plus_sample_names == minus_sample_names)) {
    if (setequal(plus_sample_names, minus_sample_names)) {
      minus = minus[match(plus_sample_names, minus_sample_names)]
    } else {
      stop("Error: Some basenames of minus- and plus-strand bigWigs are different! Are these bigWigs from different sets of samples? Exit.")
    }
  }

  merged = mapply(c, plus, minus)

  ctss.obj = purrr::reduce(
    merged,
    function(x, y) {
      dplyr::full_join(
        as.data.frame(x),
        as.data.frame(y),
        by = c("seqnames", "start", "strand")) %>%
      dplyr::select(
        c("seqnames", "start", "strand"),
        contains("score"))
    }) %>%
    dplyr::rename(
      chr = "seqnames",
      pos = "start") %>%
    dplyr::mutate(
      across(
        where(is.numeric),
        ~tidyr::replace_na(., 0)))

  names(ctss.obj) = c("chr", "pos", "strand", names(merged))

  ctss.obj %<>%
    dplyr::mutate(across(all_of(plus_sample_names), as.integer))
  ctss.obj$chr = as.character(ctss.obj$chr)
  ctss.obj$strand = as.character(ctss.obj$strand)
  cageexpobj = as(ctss.obj, "CAGEexp")

  colData(cageexpobj)$inputFilesType <- "CTSStable"
  colData(cageexpobj)$inputFiles <- bigwigs[grepl("str1", bigwigs)]

  metadata(cageexpobj)$genomeName = bsgenome_name

  rowRanges(cageexpobj@ExperimentList$tagCountMatrix) = as(
    rowRanges(cageexpobj@ExperimentList$tagCountMatrix),
    Class = "CTSS")

charles-plessy · 2024-12-03T09:28:00Z

Hi Katalin and Damir,

that sounds cool,

just to make sure, is it to create a CAGEexp object by loading bigWig files? If yes please have a look at loadFileIntoGPos and its import.* functions: this is I think the best entry point to add support for this format. Please close this issue and open another one if you go that way.

Where are the bigWig files from? Are they produced by a public pipeline? I see in your code that it is quite specific to some naming conventions…

Also, please be prepared that I will ask you to not add new dependencies on tidyverse packages if possible, as CAGEr's dependency count is already quite high...

Finally, you can have a look at the kind of studies I did in the benchmark folder, in case there are some steps you would like to optimise or if you hesitate about which class to use for receiving the data (in doubt I recommend DataFrame of Rle of integers).

Have a nice day,

Charles

PS: I am thinking about a way to allow the user to pass a FASTA file instead of a BSgenome object when it does not make sense to take all the effort to create such object. We can discuss it by email or via another issue.

ferenckata · 2024-12-03T09:41:38Z

Nice! thanks for the quick reply!

That is correct, it takes bigwigs and the end point is a CAGEexp object. It works all well actually for most steps: TSSlogo fails as noted above and I haven't added all steps yet (only until tagcluster qc).

I wanted to ask if you think this feature (not in its current form) could eventually go into CAGEr itself, but first I really need to get the pipeline in a decent shape, so I am not volunteering myself for that yet :) The pipeline is not public yet, but the input is from STAR's Nextflow module so that is fairly standard. It will allow the user to start from fasta instead of bsgenome, but we can coordinate about that too.

ferenckata closed this as completed Nov 28, 2024

ferenckata reopened this Nov 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genome in TSSlogo function #123

genome in TSSlogo function #123

ferenckata commented Nov 28, 2024 •

edited

Loading

ferenckata commented Nov 28, 2024

charles-plessy commented Nov 28, 2024

ferenckata commented Nov 28, 2024

ferenckata commented Dec 2, 2024

charles-plessy commented Dec 2, 2024

ferenckata commented Dec 3, 2024 •

edited

Loading

charles-plessy commented Dec 3, 2024

ferenckata commented Dec 3, 2024

genome in TSSlogo function #123

genome in TSSlogo function #123

Comments

ferenckata commented Nov 28, 2024 • edited Loading

ferenckata commented Nov 28, 2024

charles-plessy commented Nov 28, 2024

ferenckata commented Nov 28, 2024

ferenckata commented Dec 2, 2024

charles-plessy commented Dec 2, 2024

ferenckata commented Dec 3, 2024 • edited Loading

charles-plessy commented Dec 3, 2024

ferenckata commented Dec 3, 2024

ferenckata commented Nov 28, 2024 •

edited

Loading

ferenckata commented Dec 3, 2024 •

edited

Loading