Pan-organ poly(A) atlas reveals a post-transcriptional regulatory layer independent of transcription

License: MIT

Overview

This study constructs a comprehensive poly(A) tail atlas from deep, full-length nanopore sequencing of 18 mouse organs. Initial processing of the raw data, from FAST5 files to poly(A) tail length measurements, was performed with the FLEP-seq analysis pipeline. This repository contains the code for the subsequent downstream analysis and visualization presented in the manuscript, "Pan-organ poly(A) atlas reveals a post-transcriptional regulatory layer independent of transcription".

Data and Code Availability

  • Interactive Data Portal: You can query, visualize, and download the poly(A) tail length and gene expression data from our Mouse Poly(A) Tail Atlas website.
  • Raw Sequencing Data: The FLEP-seq2 data generated in this study have been deposited in the GSA (Genome Sequence Archive) database under accession number CRA028430.
  • Analysis Code: All code for the preprocessing pipeline and downstream analysis is hosted in this GitHub repository.

System Requirements

  • Python (version >= 3.8 is recommended)
    • numpy (>= 1.23)
    • pandas (>= 1.5.3)
    • scikit-learn (>= 1.2.2)
    • matplotlib (>= 3.7)
    • seaborn (>= 0.11.2)
    • pysam (>= 0.21.0)
    • scipy (>= 1.10.1)
    • gseapy (>= 1.0.5)
  • R (version 4.2.2)
    • WGCNA (>= 1.73)
    • tidyverse (>= 1.3.2)
    • dplyr (>= 1.1.0)
    • BiocManager
  • Command-line tools
    • IsoQuant (v3.6.0)
    • samtools (v1.3.1)
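
Before setting up the environments, it can help to see which command-line tools are already on your PATH. The following is a minimal sketch and not part of this repository: `check_tools` is a hypothetical helper, and the executable names (`samtools`, `isoquant.py`, `snakemake`, `conda`) are assumptions about how the tools are installed on your system. It only checks for presence, not for the versions listed above.

```shell
# Hypothetical helper (not part of this repository): report which
# executables are missing from PATH. Returns non-zero if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to your installation.
check_tools conda samtools isoquant.py snakemake \
    || echo "Install the missing tools before continuing."
```

Tools reported as missing may simply live in a conda environment that is not active yet, so rerun the check after activating the relevant environment.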

Installation Guide

This project requires separate Python and R environments. The recommended method for managing these is conda.

Step 1: Clone the Repository

First, clone this repository to your local machine and navigate into the directory.

git clone https://github.com/ZhaiLab-SUSTech/Mouse_polya_atlas.git
cd Mouse_polya_atlas

Step 2: Set Up the Python Environment

  1. Prepare the environment file: We provide a python_env.yml file to ensure all dependencies are correct. Make sure python_env.yml is present in the project directory before continuing.

  2. Create and activate the conda environment:

    # Create the environment
    conda env create -f python_env.yml
    
    # You can activate it when needed with:
    # conda activate data_prep_env
    

Step 3: Set Up the R Environment

This environment is for running the main WGCNA analysis.

  1. Create a base R environment using conda:

    conda create -n wgcna_env -c conda-forge r-base=4.2.2

  2. Activate the new R environment:

    conda activate wgcna_env

  3. Install required R packages:

    Rscript install_packages.R

Project Structure

The project is organized into a main Snakemake workflow, supplemented by individual scripts for debugging and a modular collection of downstream analyses.

  • Snakefile & config.yaml: The central Snakemake workflow and its configuration, orchestrating the entire preprocessing pipeline.
  • data/: Contains annotation files and defines the expected directory structure for input data. Note: The FASTA and GTF files in data/annotation/ are placeholders and must be replaced with the actual reference files before running the pipeline.
  • scripts/: Contains all executable scripts.
    • preprocessing_pipeline/: Scripts for the main data processing workflow. Each step includes a submit_*.sh script for manual execution and debugging on a cluster.
    • Downstream_analysis/: A collection of modular Python and R scripts used to generate the figures and statistics for the manuscript.
    • utils/: General utility scripts called by the main pipelines.
  • results/: Stores all output files generated by the analysis (this directory is not tracked by Git).
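
The layout described above can be verified with a short check before running anything. This is a sketch under stated assumptions, not a script shipped with the repository: `check_layout` is a hypothetical helper, and the list of paths is taken from the entries named in this section. results/ is left out because it is created by the analysis and is not tracked by Git.

```shell
# Hypothetical helper (not part of this repository): verify that the
# entries named in the Project Structure list exist under a given root.
check_layout() {
    local root="${1:-.}" status=0 entry
    for entry in Snakefile config.yaml data/annotation \
                 scripts/preprocessing_pipeline \
                 scripts/Downstream_analysis scripts/utils; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $entry"
            status=1
        fi
    done
    return "$status"
}

# Check the current directory; pass another path as the first argument.
check_layout . || echo "Run this from the repository root, or pass its path."
```

The check only confirms that the paths exist. It cannot tell whether the placeholder reference files in data/annotation/ have been replaced, so that step still needs to be done by hand.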

Analysis Pipeline

The analysis is organized into a primary preprocessing workflow managed by Snakemake, followed by a series of downstream analysis scripts.

  1. Preprocessing Workflow

The entire preprocessing pipeline, from raw BAM files to the final distance matrices, is defined in the Snakefile.
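
Because the workflow is defined in a Snakefile, a dry run is a cheap way to inspect the planned jobs before committing compute time. The wrapper below is a sketch, not the authors' documented invocation: `run_preprocessing` is a hypothetical helper, and it assumes that snakemake is installed in the active environment, that you are in the repository root, and that the Snakefile picks up config.yaml on its own. It uses only the standard Snakemake options `--cores`, `--dry-run`, and `--printshellcmds`.

```shell
# Hypothetical wrapper (not part of this repository) around the Snakemake CLI.
# Assumes it is called from the repository root, where the Snakefile lives,
# and that snakemake is available in the active environment.
run_preprocessing() {
    local mode="${1:-dry}" cores="${2:-1}"
    if [ ! -f Snakefile ]; then
        echo "No Snakefile in $(pwd); cd to the repository root first." >&2
        return 2
    fi
    if ! command -v snakemake >/dev/null 2>&1; then
        echo "snakemake not found on PATH; activate the right environment." >&2
        return 127
    fi
    if [ "$mode" = "dry" ]; then
        # --dry-run lists the planned jobs without executing anything.
        snakemake --cores "$cores" --dry-run --printshellcmds
    else
        snakemake --cores "$cores" --printshellcmds
    fi
}

# Example calls (uncomment to use); the core count 8 is an arbitrary value.
# run_preprocessing dry
# run_preprocessing run 8
```

Starting with the dry run makes missing inputs or unreplaced placeholder files show up as errors before any job is launched.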

  2. Downstream Analysis

The scripts located in scripts/Downstream_analysis/ are used to generate the figures and statistical results presented in the paper. These are designed to be run manually after the preprocessing workflow is complete. Please see the individual scripts for details on their inputs and outputs.

Note on Heatmaps: The pan-organ gene expression and poly(A) distribution heatmaps presented in the manuscript were generated using the tools available on our interactive Mouse Poly(A) Tail Atlas website. Therefore, the code for generating these specific figures is not included in this repository.

How to Cite

If you use our data or code in your research, please cite our paper:

Lei, H., Long, Y., Wu, S., Wang, X., Peng, Y., Liu, Z., Lu, W., Yi, S., Zou, M., Xia, Y., et al. (2025). Pan-organ poly(A) atlas reveals a post-transcriptional regulatory layer independent of transcription.
