
SIG: Bioconductor Infrastructure for Base Modifications #35

@Shians

Introduction

I am a new PhD student at the Walter and Eliza Hall Institute in Melbourne, Australia. My project focuses on methods and tools for analysing DNA methylation in long reads from Oxford Nanopore sequencers. My formal background is in statistics, but I mainly work on developing software and have a keen interest in efficient, user-friendly computational methods and visualisation.

Expected attendees

Researchers who are interested in base modifications of all kinds. I am interested in DNA, but the structure we develop should equally support RNA modifications.

Should it be held during Developer Day

Probably

Description of the topic

(Will update this section after I do some more research and take suggestions)

I think there are a few things to keep in mind here:

  • Support for long reads. I don't think this is a big issue, as I'm not aware of GenomicRanges having any limitations on read length, but since I'm interested in Nanopore sequencing, it's vitally important to have this support.
  • Read-based tracking. With long reads I can potentially detect when two sites along a read have correlated or anti-correlated methylation patterns on the same molecule, so I want not only to keep track of this information but also to query it efficiently.
  • Support for RNA modifications. There are over a hundred of these. I think extended alphabets are sometimes used to represent DNA modifications, but that's likely not feasible here without creating FASTQ qual-string-like monstrosities.
  • Interoperability with genomic data structures. Down the line it's very likely that methylation and mRNA expression will be analysed together, so facilitating this kind of analysis is of great interest.
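
To make the read-based tracking point concrete, here is a minimal, language-agnostic sketch in Python. It is not an existing Bioconductor API: the long-format record layout (read ID, position, modified flag), the function names and the toy data are all assumptions made for illustration. It counts, over reads that cover both of two sites, how often the sites share the same modification state.

```python
from collections import defaultdict

# Hypothetical long-format calls: (read_id, position, modified?).
# The layout and the values are illustrative assumptions, not real data.
calls = [
    ("read1", 100, True), ("read1", 250, True),
    ("read2", 100, False), ("read2", 250, False),
    ("read3", 100, True), ("read3", 250, False),
    ("read4", 100, True),  # read4 does not cover position 250
]

def index_by_read(calls):
    """Group calls by read so within-molecule queries avoid a full scan per read."""
    by_read = defaultdict(dict)
    for read_id, pos, modified in calls:
        by_read[read_id][pos] = modified
    return by_read

def site_pair_table(by_read, pos_a, pos_b):
    """2x2 counts keyed by (state_a, state_b), over reads covering both sites."""
    table = {(a, b): 0 for a in (False, True) for b in (False, True)}
    for sites in by_read.values():
        if pos_a in sites and pos_b in sites:
            table[(sites[pos_a], sites[pos_b])] += 1
    return table

def concordance(table):
    """Fraction of shared reads where both sites have the same state."""
    total = sum(table.values())
    if total == 0:
        return None  # no read covers both sites
    return (table[(True, True)] + table[(False, False)]) / total

by_read = index_by_read(calls)
table = site_pair_table(by_read, 100, 250)
print(table, concordance(table))
```

The 2x2 table is the raw material for any correlation or association measure; concordance is just the simplest summary. Only reads spanning both sites contribute, which is why read identity has to be retained rather than collapsed into per-site totals.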

As far as I'm aware, there is no specialised, widely supported Bioconductor structure for storing base modification information that also makes common queries straightforward. The basic query would be the methylation proportions in a specific region; objects should carry metadata that defines groups over which this can be asked, and should report coverage at each locus. It would also be useful to query within-read methylation patterns, to inspect correlation between methylation sites within molecules. Compactness of representation is also going to be important, so sparse or on-disk representations would be useful to consider; features and query performance probably take second place to storage size.
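
As a rough illustration of the basic region query described above, here is a minimal Python sketch, again under stated assumptions: the record layout (chromosome, position, read ID, group, modified flag), the group labels "A" and "B", and the function name `region_summary` are hypothetical and do not correspond to an existing package. It reports coverage and modified proportion per group and locus within a closed interval.

```python
from collections import defaultdict

# Hypothetical per-call records: (chrom, position, read_id, group, modified?).
# Groups "A" and "B" stand in for whatever sample metadata the object carries.
calls = [
    ("chr1", 100, "r1", "A", True),
    ("chr1", 100, "r2", "A", False),
    ("chr1", 150, "r1", "A", True),
    ("chr1", 100, "r3", "B", False),
    ("chr1", 150, "r3", "B", False),
    ("chr1", 900, "r3", "B", True),  # falls outside the queried region
]

def region_summary(calls, chrom, start, end):
    """Per (group, position): coverage and modified proportion in [start, end]."""
    counts = defaultdict(lambda: [0, 0])  # [modified calls, total calls]
    for c, pos, _read_id, group, modified in calls:
        if c == chrom and start <= pos <= end:
            counts[(group, pos)][0] += int(modified)
            counts[(group, pos)][1] += 1
    return {
        key: {"coverage": total, "proportion": mod / total}
        for key, (mod, total) in counts.items()
    }

summary = region_summary(calls, "chr1", 50, 200)
for key in sorted(summary):
    print(key, summary[key])
```

This linear scan is only meant to pin down what the query returns. A real structure would presumably sit on an interval index and a sparse or on-disk store, in line with the compactness concern above.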

Desired outcome

I'd like to establish a set of queries of interest and a general abstract idea of what data structure(s) might be appropriate.
