Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genome in TSSlogo function #123

Open
ferenckata opened this issue Nov 28, 2024 · 8 comments
Open

genome in TSSlogo function #123

ferenckata opened this issue Nov 28, 2024 · 8 comments

Comments

@ferenckata
Copy link
Collaborator

ferenckata commented Nov 28, 2024

I have some troubles with TSSlogo. I load the annotation from txdb like this:

tx_annotation_obj <- loadDb(tx_annotation)
ce <- CAGEr::annotateCTSS(ce, tx_annotation_obj)

and run TSSlogo like this:

CAGEr::TSSlogo(
    CAGEr::CTSScoordinatesGR(ce) |> subset(annotation == "promoter"),
    upstream = 35)

The error I get:

Error in getBSgenome(unique(genome(x))) : 
   'genome' must be a BSgenome object, or the full name of an installed
   BSgenome data package, or a short string specifying a genome assembly that
   refers unambiguouslyto an installed BSgenome data package
 Calls: <Anonymous> -> <Anonymous> -> .TSSlogo -> getBSgenome

for checking:

> x <- CAGEr::CTSScoordinatesGR(ce) |> subset(annotation == "promoter")
> x
CTSS object with 1501 positions and 2 metadata columns:
         seqnames       pos strand | genes annotation
            <Rle> <integer>  <Rle> | <Rle>      <Rle>
     [1]     chrI     71726      + |         promoter
     [2]     chrI     71751      + |         promoter
     [3]     chrI     71759      + |         promoter
     [4]     chrI     71767      + |         promoter
     [5]     chrI    142134      + |         promoter
     ...      ...       ...    ... .   ...        ...
  [1497]   chrXVI    694898      - |         promoter
  [1498]   chrXVI    696877      - |         promoter
  [1499]   chrXVI    720487      - |         promoter
  [1500]   chrXVI    729545      - |         promoter
  [1501]   chrXVI    892095      - |         promoter
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths
  BSgenome name: unspecified
> genome(x)
   chrI   chrII  chrIII   chrIV   chrIX    chrV   chrVI  chrVII chrVIII    chrX 
     NA      NA      NA      NA      NA      NA      NA      NA      NA      NA 
  chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI 
     NA      NA      NA      NA      NA      NA
@ferenckata
Copy link
Collaborator Author

solved by passing genome as an argument:

TSSlogo_local <- function(x, genome_name, upstream=10, downstream=10) {
  if (! requireNamespace("ggseqlogo"))
    stop("This function requires the ", dQuote("ggseqlogo"), " package; please install it.")
  # Extract sequences
  upstreamRanges <-
    promoters(x = x, upstream = upstream, downstream = downstream) |>
    suppressWarnings() # This warns about off-genome coordinates, but we will fix this below.
  
  # Discard ranges that would be trimmed 
  upstreamRanges_trimmed <- upstreamRanges |> trim()
  upstreamRanges <- upstreamRanges[width(upstreamRanges) == width(upstreamRanges_trimmed)]
  
  # Extract sequences
  bsgenome <- getBSgenome(genome_name)
  upstreamSeq <- getSeq(bsgenome, upstreamRanges)
  
  # Plot sequence logo
  letter_counts <- consensusMatrix(upstreamSeq)
  probs <- prop.table(letter_counts[1:4,], 2)
  gg <- ggseqlogo::ggseqlogo(probs)
  # Circumvent "Scale for x is already present." warning.
  gg$scales$scales[[2]]$breaks <- seq(    1       , upstream + downstream, by = 1)
  gg$scales$scales[[2]]$labels <- seq(1 - upstream,            downstream, by = 1)
  gg$scales$scales[[1]]$breaks <- seq(    1       , upstream + downstream, by = 1)
  gg$scales$scales[[1]]$labels <- seq(1 - upstream,            downstream, by = 1)
  gg
}

@charles-plessy
Copy link
Owner

Hi Katalin, yes if your CAGEexp object did not get a BSgenome name assigned at creation, the TSSlogo is not going to work. Maybe the error message can be improved.

@ferenckata
Copy link
Collaborator Author

great, thanks! I will try to fix that then

@ferenckata
Copy link
Collaborator Author

Hi again,
I managed to add bsgenome to metadata, but that did not solve the problem. I can see that it is not added as seqinfo to the object returned with CTSStagCountSE:

> metadata(myce)
$genomeName
[1] "BSgenome.Scerevisiae.UCSC.sacCer3"

> seqinfo(CTSStagCountSE(myce))
Seqinfo object with 16 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chrI           <NA>       <NA>   <NA>
  chrII          <NA>       <NA>   <NA>
  chrIII         <NA>       <NA>   <NA>
  chrIV          <NA>       <NA>   <NA>
  chrIX          <NA>       <NA>   <NA>
  ...             ...        ...    ...
  chrXII         <NA>       <NA>   <NA>
  chrXIII        <NA>       <NA>   <NA>
  chrXIV         <NA>       <NA>   <NA>
  chrXV          <NA>       <NA>   <NA>
  chrXVI         <NA>       <NA>   <NA>

> seqinfo(CTSStagCountSE(ce))
Seqinfo object with 26 sequences (1 circular) from danRer7 genome:
  seqnames seqlengths isCircular  genome
  chr1       60348388      FALSE danRer7
  chr2       60300536      FALSE danRer7
  chr3       63268876      FALSE danRer7
  chr4       62094675      FALSE danRer7
  chr5       75682077      FALSE danRer7
  ...             ...        ...     ...
  chr22      42261000      FALSE danRer7
  chr23      46386876      FALSE danRer7
  chr24      43947580      FALSE danRer7
  chr25      38499472      FALSE danRer7
  chrM          16596       TRUE danRer7

> metadata(ce)
$genomeName
[1] "BSgenome.Drerio.UCSC.danRer7"

I have been reading the source code for some time and cannot find the step where I should add this info. Can you maybe point me there?

@charles-plessy
Copy link
Owner

Hi Katalin,

I think that there is no good way to add BSgenome information after the object was created. Have you tried re-loading the data from scratch?

@ferenckata
Copy link
Collaborator Author

ferenckata commented Dec 3, 2024

Hi Charles,
Yes, so the thing is it is loaded as 4.1 in the documentation. Damir made a code that works from bigwig and I am trying to fix that so bsgenome is properly added. Do you have a suggestion about adding it better maybe?


# bsgenome_name="BSgenome.Scerevisiae.UCSC.sacCer3",
# bigwig_paths=["path/to/file1.bw", "path/to/file2.bw"]

bigwigs = unlist(
    stringr::str_split(
      stringr::str_remove_all(
        bigwig_paths, "[\\[\\],]"),
      stringr::fixed(" ")))

  signals = lapply(
    bigwigs,
    function(x) rtracklayer::import(x))

  names(signals) = basename(bigwigs)

  signalsSplit = split(
    signals,
    grepl("str1", names(signals)))

  plus = lapply(signalsSplit$`TRUE`, function(x) {
    strand(x) = "+"
    return(x)
  })

  minus = lapply(signalsSplit$`FALSE`, function(x) {
    strand(x) = "-"
    return(x)
  })

  if (grepl( ".Signal.UniqueMultiple.str1.out.wig.bw", names(plus)[1], fixed = TRUE)){
    plus_sample_names = stringr::str_remove_all(
      names(plus),
      ".Signal.UniqueMultiple.str1.out.wig.bw")
    minus_sample_names = stringr::str_remove_all(
      names(minus),
      ".Signal.UniqueMultiple.str2.out.wig.bw" )
  } else if (grepl( ".Signal.Unique.str1.out.wig.bw", names(plus)[1], fixed = TRUE)) {
    plus_sample_names = stringr::str_remove_all(
      names(plus),
      ".Signal.Unique.str1.out.wig.bw")
    minus_sample_names = stringr::str_remove_all(
      names(minus),
      ".Signal.Unique.str2.out.wig.bw")
  }

  names(plus) <- plus_sample_names
  names(minus) <- minus_sample_names

  if (!all(plus_sample_names == minus_sample_names)) {
    if (setequal(plus_sample_names, minus_sample_names)) {
      minus = minus[match(plus_sample_names, minus_sample_names)]
    } else {
      stop("Error: Some basenames of minus- and plus-strand bigWigs are different! Are these bigWigs from different sets of samples? Exit.")
    }
  }

  merged = mapply(c, plus, minus)

  ctss.obj = purrr::reduce(
    merged,
    function(x, y) {
      dplyr::full_join(
        as.data.frame(x),
        as.data.frame(y),
        by = c("seqnames", "start", "strand")) %>%
      dplyr::select(
        c("seqnames", "start", "strand"),
        contains("score"))
    }) %>%
    dplyr::rename(
      chr = "seqnames",
      pos = "start") %>%
    dplyr::mutate(
      across(
        where(is.numeric),
        ~tidyr::replace_na(., 0)))

  names(ctss.obj) = c("chr", "pos", "strand", names(merged))

  ctss.obj %<>%
    dplyr::mutate(across(all_of(plus_sample_names), as.integer))
  ctss.obj$chr = as.character(ctss.obj$chr)
  ctss.obj$strand = as.character(ctss.obj$strand)
  cageexpobj = as(ctss.obj, "CAGEexp")

  colData(cageexpobj)$inputFilesType <- "CTSStable"
  colData(cageexpobj)$inputFiles <- bigwigs[grepl("str1", bigwigs)]

  metadata(cageexpobj)$genomeName = bsgenome_name

  rowRanges(cageexpobj@ExperimentList$tagCountMatrix) = as(
    rowRanges(cageexpobj@ExperimentList$tagCountMatrix),
    Class = "CTSS")

@charles-plessy
Copy link
Owner

Hi Katalin and Damir,

that sounds cool,

just to make sure, is it to create a CAGEexp object by loading bigWig files? If yes please have a look at loadFileIntoGPos and its import.* functions: this is I think the best entry point to add support for this format. Please close this issue and open another one if you go that way.

Where are the bigWig files from? Are they produced by a public pipeline? I see in your code that it is quite specific to some naming conventions…

Also, please be prepared that I will ask you to not add new dependencies on tidyverse packages if possible, as CAGEr's dependency count is already quite high...

Finally, you can have a look at the kind of studies I did in the benchmark folder, in case there are some steps you would like to optimise or if you hesitate about which class to use for receiving the data (in doubt I recommend DataFrame of Rle of integers).

Have a nice day,

Charles

PS: I am thinking about a way to allow the user to pass a FASTA file instead of a BSgenome object when it does not make sense to take all the effort to create such object. We can discuss it by email or via another issue.

@ferenckata
Copy link
Collaborator Author

Nice! thanks for the quick reply!

That is correct, it takes bigwigs and the end point is a CAGEexp object. It works all well actually for most steps: TSSlogo fails as noted above and I haven't added all steps yet (only until tagcluster qc).

I wanted to ask if you think this feature (not in its current form) could eventually go into CAGEr itself, but first I really need to get the pipeline in a decent shape, so I am not volunteering myself for that yet :) The pipeline is not public yet, but the input is from STAR's Nextflow module so that is fairly standard. It will allow the user to start from fasta instead of bsgenome, but we can coordinate about that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants