mukherjeelab/CLIPs4U

CLIPs4U

CLIPs4U is a Snakemake workflow for analyzing PAR-CLIP data.

Overview

Introduction

Prerequisites

Dependencies

Preparing configuration file

Workflow

Usage

Output

Contributors

License

Introduction

CLIPs4U runs the whole analysis, starting from gzipped fastq files and ending with annotation tables, using a single command and a YAML configuration file. It is a user-friendly and highly customizable tool.

Prerequisites

The only things you need to know before running CLIPs4U are the location of your fastq.gz files and your adapter sequences. Optionally, you can also provide your own genome version, annotation, STAR index, etc. (see below).

Dependencies

Required libraries are stored in the env.yml file. All required dependencies can be installed using conda by executing the following command:

conda env create --name clips4u --file env.yml

where clips4u is the name you choose for the environment.

NOTE: creating the environment will take some time, because a few hundred packages must be downloaded and installed.

The environment then needs to be activated in order to run CLIPs4U:

conda activate clips4u
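After activating the environment, it can be useful to confirm that the main tools are actually on your PATH. The helper below is a minimal sketch, not part of CLIPs4U: the function name missing_tools is hypothetical, and the tool names in the example are taken from the workflow description and may not match the exact executable names installed by env.yml.

```shell
# Report which of the given command-line tools are missing from PATH.
# Prints one missing tool name per line; returns non-zero if any is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (executable names are assumptions; adjust as needed):
# missing_tools snakemake fastqc cutadapt seqtk STAR bowtie
```

If the function prints nothing and returns 0, every listed tool was found.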

If you prefer working with Docker, the CLIPs4U Docker image can be found here

Preparing configuration file

Details about the configuration file can be found in the config/README.md file.

Workflow

An example directed acyclic graph (DAG) for the workflow is shown below.

[Figure: example workflow DAG]

Steps:

  • downloading genome fasta and gtf files, preparing the genome 2bit file, the main_annotation.rds file and the index for the selected aligner (performed only once per selected genome; can be skipped if the user specifies paths to these files in config.yaml)
  • raw reads quality control using FastQC
  • adapter trimming (one or two rounds - second round OPTIONAL) using cutadapt
  • read collapsing and OPTIONAL UMI removal using seqtk and FASTX-Toolkit
  • OPTIONAL removing repetitive elements using STAR
  • genome alignment using bowtie or STAR
  • running PARalyzer to detect enriched clusters
  • calculation of mismatches statistics and conversion specificity for the PARalyzer output
  • annotation of clusters
  • creating bigWig files for clusters using deepTools for visualization in genome viewers
  • motif enrichment analysis using meme, dreme or streme
  • generating final report and plots
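To illustrate what the read-collapsing step means conceptually, the sketch below counts identical read sequences in a FASTQ stream. It is only an illustration under stated assumptions: the workflow itself uses seqtk and FASTX-Toolkit, the function name collapse_reads and the file name reads.fq.gz are hypothetical, and the input is assumed to use strict four-line FASTQ records.

```shell
# Collapse identical read sequences from an uncompressed FASTQ stream on stdin.
# Assumes four lines per record, with the sequence on the second line.
# Prints "<count><TAB><sequence>", most abundant sequence first.
collapse_reads() {
    awk 'NR % 4 == 2 { count[$0]++ }
         END { for (seq in count) printf "%d\t%s\n", count[seq], seq }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Example (hypothetical file name):
# gzip -cd reads.fq.gz | collapse_reads | head
```

Collapsing reduces PCR duplicates to a single representative sequence, which is why it precedes alignment in the step list above.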

Usage

First, install git-lfs to ensure that the test data are downloaded correctly. Depending on your system, git-lfs can be installed using the following commands:

Ubuntu/Debian

sudo apt-get update
sudo apt-get install git-lfs

Fedora

sudo dnf update
sudo dnf install git-lfs

CentOS7/RHEL7

sudo yum install epel-release
sudo yum install git-lfs

CentOS8/RHEL8

sudo dnf install epel-release
sudo dnf install git-lfs

openSUSE

sudo zypper refresh
sudo zypper install git-lfs

Arch Linux

sudo pacman -S git-lfs

Then initialize git-lfs using:

git lfs install

Second, clone the repository using:

gh repo clone mukherjeelab/CLIPs4U

or:

git clone https://github.com/mukherjeelab/CLIPs4U.git

or just download the zipped package and unpack it.

Ensure that the ZFP36.fq.gz file in your test_data directory is 242 MB in size. After navigating to your CLIPs4U directory and typing ls -lh test_data, you should see output similar to the following:

total 242M
-rw-r----- 1 marcin marcin 242M lip 11 20:26 ZFP36.fq.gz
-rw-r----- 1 marcin marcin 3,7K lip 12 12:46 ZFP36.yaml

If for some reason your output looks like the one below:

total 4,5K
-rw-r----- 1 sajekmar mukherjee  134 07-12 14:09 ZFP36.fq.gz
-rw-r----- 1 sajekmar mukherjee 3,7K 07-12 14:09 ZFP36.yaml

and ZFP36.fq.gz is only 134 B, download it manually from the repository and move it to /path/to/CLIPs4U/test_data.
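A file of only about a hundred bytes is most likely a Git LFS pointer, that is, a small text placeholder left behind when the real file was not fetched. This is an inference from the git-lfs setup steps, not something the repository states. The sketch below checks for the pointer signature; the function name is_lfs_pointer is hypothetical.

```shell
# Return 0 if the file looks like a Git LFS pointer rather than real data.
# Pointer files are small text files whose first line starts with
# "version https://git-lfs.github.com/spec/".
is_lfs_pointer() {
    first_line=$(head -n 1 "$1" 2>/dev/null | tr -d '\0')
    case "$first_line" in
        "version https://git-lfs.github.com/spec/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, run from the CLIPs4U directory:
# if is_lfs_pointer test_data/ZFP36.fq.gz; then git lfs pull; fi
```

Running git lfs pull inside the cloned repository is an alternative to the manual download, provided git-lfs was installed and initialized as described earlier.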

Third, create a directory for your project, e.g.:

mkdir my_parclip_dir

Fourth, prepare your config YAML file. It can be located anywhere, but if you put it in your project directory it will be detected automatically. Parameters not specified in your config file will be set to the default values from the default_config file.

Please note that parameters shared between all analyses can be put in (clips4u)/config/default_config.yaml. This file contains predefined default parameters, and you are free to change them.

Fifth, create and activate the conda environment as described above.

To run the analysis, pass the following flags to snakemake:

  • "--snakefile": the path to the clips4u Snakefile
  • "--directory": your project directory, if it is not the current working directory
  • "--configfile": your config file, if it is located outside your project directory or the directory contains multiple YAML files
  • "--threads": the maximum number of threads

Example dry-run command:

snakemake -n --snakefile clips4u/workflow/Snakefile --directory my_parclip_dir

Example test data run command:

snakemake --snakefile /path/to/CLIPs4U/workflow/Snakefile --configfile /path/to/CLIPs4U/test_data/ZFP36.yaml --threads 10

For further options and possibilities, please check the Snakemake documentation.

The test data contain a fastq.gz file from the ZFP36 PAR-CLIP described here, which can also be downloaded from Gene Expression Omnibus.
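A common pattern is to run a dry run first and start the real analysis only if the dry run succeeds. The wrapper below is a minimal sketch of that pattern, not part of CLIPs4U; the function name run_clips4u is hypothetical. It uses only the snakemake options -n, --snakefile and --configfile, and passes any further options through unchanged.

```shell
# Dry-run the workflow first; start the real run only if the dry run succeeds.
# Usage: run_clips4u SNAKEFILE CONFIGFILE [extra snakemake options...]
run_clips4u() {
    snakefile=$1
    configfile=$2
    shift 2
    if [ ! -f "$configfile" ]; then
        echo "config file not found: $configfile" >&2
        return 1
    fi
    # -n asks snakemake to list the planned jobs without executing them.
    snakemake -n --snakefile "$snakefile" --configfile "$configfile" "$@" || return 1
    snakemake --snakefile "$snakefile" --configfile "$configfile" "$@"
}

# Example with the test data (paths as in the commands above):
# run_clips4u /path/to/CLIPs4U/workflow/Snakefile /path/to/CLIPs4U/test_data/ZFP36.yaml
```

Checking for the config file up front gives a clearer error message than a failed workflow start would.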

Output

Every run generates multiple output files. The most important ones are:

  • TSV tables with cluster annotations in workdir/annot
  • plots from various steps of the analysis in PDF format in workdir/plots, including annotation, mismatch statistics, motif enrichment and metagene plots
  • the final report in HTML format, workdir/final_report.html, containing summary tables and plots from various steps of the analysis in a user-friendly format; the plots from the final report can also be found in PDF format in workdir/plots
  • bigWig files for visualizing the clusters in genome viewers in workdir/genome_viewer_files

NOTE: There are four bigWig (bw) files for every sample - filtered and unfiltered for the positive and negative strand. Unfiltered files contain all clusters; filtered files contain only clusters with conversion specificity > 0.6. This threshold was chosen based on our previous study.
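The same kind of threshold can be applied to a tab-separated table by hand. The sketch below is an illustration only: the function name filter_by_column is hypothetical, and the column name conversion_specificity in the example is an assumption, so check the header of your own annotation table before using it.

```shell
# Keep the header and the rows whose value in the named column exceeds a threshold.
# Usage: filter_by_column FILE.tsv COLUMN_NAME THRESHOLD
filter_by_column() {
    awk -F '\t' -v col="$2" -v min="$3" '
        NR == 1 {
            # Locate the column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($idx + 0) > (min + 0)
    ' "$1"
}

# Example (hypothetical file and column names):
# filter_by_column clusters.tsv conversion_specificity 0.6
```

The comparison is strict, so a cluster with a value of exactly 0.6 is dropped, matching the "> 0.6" rule in the note above.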

NOTE: For reproducibility, final_config.json will be created in your working directory. This file contains all default and computed parameters alongside user-defined parameters.

NOTE: Example output files for the test data can be found in test_out. Please compare the output of your test run with these files. Note that your run will produce more output files; test_out stores only the most important ones.
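One way to do this comparison is sketched below. It is a minimal, non-recursive helper with a hypothetical name (compare_outputs) that checks each regular file in a reference directory against the file of the same name in your output directory. The internal layout of test_out is not assumed here, so point it at matching directories yourself. Byte-wise comparison suits text tables; PDF or HTML files may differ between runs even when the results agree.

```shell
# List files from a reference directory that are missing from, or differ in,
# a result directory. Prints nothing and returns 0 when everything matches.
# Usage: compare_outputs REFERENCE_DIR OUTPUT_DIR
compare_outputs() {
    ref_dir=$1
    out_dir=$2
    status=0
    for ref in "$ref_dir"/*; do
        [ -f "$ref" ] || continue      # skip subdirectories
        name=$(basename "$ref")
        if [ ! -f "$out_dir/$name" ]; then
            echo "missing: $name"
            status=1
        elif ! cmp -s "$ref" "$out_dir/$name"; then
            echo "differs: $name"
            status=1
        fi
    done
    return "$status"
}

# Example (placeholder paths):
# compare_outputs /path/to/reference_dir /path/to/your_output_dir
```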

Contributors

  • Marcin Sajek
  • Neelanjan Mukherjee
  • Samantha Lisy
  • Yelena Prevalova
  • Manuel Ascano Jr
  • Tomasz Woźniak

License

MIT license
