Update here
- Set up directories structure and config file
- Launch Main pipeline
- Multiqc (marking duplicates and plot number of mapped reads) - You don't care !
- ChipQC
- DiffBinding and Peak annotation
- Set up Environment - You don't care !
You will the last version of this code here :
/home/jean-philippe.villemin/disciplussimplex/TEST_CHIPSEQ/
Each time you use this pipeline, you need to copy/paste it in a new directory with a config file.
Snakemake pipeline provided by Raoul Raffel @t IGH,Montpellier.
Directory aborescence need to be strictly the same.
One directory per condition. Several histone marks are inside these directories. A raw_data directory where you put fastq files.
Inside this directory, you will set up directories for input raw data and for each replicate of histone marks as followed :
You also have in this directory the script and its config file.
Raw fastq must be formated as follows file.fq.gz.
NB : Due to a bug in CTCF and K9AC files. fastq_illumina_filter doesn't work. See To bypass that, rename the file as file.filtered.fq.gz. Or you can remove the spaces by yourself and still do this filtering (sed 's/^$/N/'
newfile.fa) .
NB1 : If you received different files for one replicate, you need to merge them into one, with a command line like :
cat file1.gz file2.gz > Luco22.fq.gz
Then, you need to configure your yaml configuration file. See provided example.
You need to change filepaths,filenames accordingly to your data & environment.
The main pipeline do all alignments for fastq files. It produces also fastqc, wig files for visualisation and finally it will gives us the peaks calling files. Other stuffs are done but you don't care. (the fastqc produces MissingOutputException...but it works don't worry)
Dry Run ( can be useful to see the set of commands that will be done):
- n : is drymode
- p : p is for print (...)
- j : number of cores
snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k -n --verbose -p
Real Run (nohup make the script runs in background. Note the ampersand at the end to go back in the shell.):
nohup snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k --verbose -p &
Control with htop -u username and look into nohup.out file to see if job is finished. nohup.out is a file created with all the commands, errors that can occurs, it's a log file.
NB4 : You can relaunch script if you find something went wrong.It will execute only part that failed. Fastqc will be done again (whereas fastqc analysis has already be done with sucess) because of a bug I didn't manage to correct. computeMatrix_ref_point, computeMatrix_ref_point failed but you don't care because we don't use their output.
NB5 : Don't use the normalise bigwig create here. You can use bigwig for replicates normalised by RPKM but not taking account of the input.
Description of output : You should know them since 3 years.
This step is mandatory if you want to create next bigwig with mean signal without duplicates.
#MarkDuplicate
find /pathTo/condition -name *.sorted.bam | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.marked.bam} M=${0/.bam/.marked_dup_metrics.txt} VALIDATION_STRINGENCY=LENIENT' {} \;
find /pathTo/condition -name *.sorted.bam | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.removed.marked.bam} M=${0/.bam/.sorted.removed.marked_dup_metrics.txt} REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT' {} \;
This has been added to have only one track per mark. Here we are doing one track using mean of replicates removing the input signal signal to normalise.
The script is using a json configuration file. It's also plotting profile around TSS for differentialy expressed gene. So you need to provide bed files for TSS of upregulated genes and TSS of downregulated genes. You can use the old bed I set in the config. This is not the main purpose of the script but I didn't removed this part to it's still trying to plot stuff but your are only instered in the bigwig normalised.
So here you need to create a json file. See --config parameter to retrieve an example file.
Here we are just interested in creating the mean track, you so can provide fake bed files.
Example :
python3 /home/jean-philippe.villemin/code/CHIP-SEQ/src/deeptools_countsignal.py --config=/home/jean-philippe.villemin/code/configs/HEATMAP_HISTONE/tss_by_DGE_NODUP/POL2_T7.json
NB : Lot of things in json you don't care.
Only things you need to change are the path to bam files.( "bam" , bam-DupRemoved" and 'path_to_output' for Rep1,Rep2,Control)
The next step are based on ChipQC and DiffBind R packages. You need to do chipQC to do differential binding. Infos about parameters of chipqc Infos about parameters diffbind
First you need to create a samplesheet containing all path to your samples. see( ChipQC ,DiffBind,Doc )
You will find an example here :
/home/jean-philippe.villemin/disciplussimplex/TEST_CHIPSEQ/samplestest.csv /home/jean-philippe.villemin/CHIPSEQ_2017_1_ALL/samplesChipMarks.csv
You will use this file to create an R object saved on disk that will be used by the next step. This step is creating a first report with quality controls.
You can use consensus methodology where ChIPQC is done using consensus length around peak summit. With this approach, you will also have input signal plotted for interval you defined around peak summits with option -s.
Option -c is used to say you are using consenus methodology. (see docs) Option -n is used to say what is the name of the Robject saved.
Rscript chipQC.R --file /pathTO/samplesChipMarks.csv -n ALL_MarkedDup -s 500 -c
You can find it here : /home/jean-philippe.villemin/code/CHIP-SEQ/R/
NB : I loaded the library at the beginning. It call hg38. (if you work on mouse, you will have to change hard coded stuffs)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
NB : Index .bai must be created before. Very important.
Here you will test a variation of binding between condition using R object output (name ALL_MarkedDup in this example) from the step before.
-n is used to say the number of conditions.
-m is used to say which marks from your sample you want to analyze. -p is used to say percentage at least used to retrieve peaks in all you samples. minOverlap (peakset parameter in diffbind) only include peaks in at least this many peaksets in the main binding matrix.If minOverlap is between zero and one, (Don't change 0.99, that means peaks must be in all samples.)Peak will be included from at least this proportion of peaksets.
You will do that for each mark you want to analyse in your sampleshit.
Rscript chipDiffBind.R --file=/pathTO/ALL_MarkedDup -n 3 -m K27AC -p 0.99
NB : Don't use the .Rdata extension when you call the file.
Description of the 2 outputs :
No filter is applied on the fold or the p-value. You will find that in the 4th column of the .bed.
POL2.T1_vs_T7.bed:
chr10 73910795 73916597 2623_NearestLocation_inside_ENSG00000122861_proteincoding_PLAU_6.91_20.9030899869919_+_4519_900 0 .
chrom start end IDPeak_fromOverlappingOrNearest_insideFeature_Feature_genebiotype_symbol_Fold_log10(p-value)_strand_distancetoFeature_shortestDistance
Annotation comes from chipeakAnno
fromOverlappingOrNearest: nearest: indicates this feature's start (feature's end for features at minus strand) is closest to the peak start; insideFeature: upstream peak resides upstream of the feature; downstream: peak resides downstream of the feature; inside: peak resides inside the feature; overlapStart: peak overlaps with the start of the feature; overlapEnd: peak overlaps with the end of the feature; includeFeature: peak include the feature entirely distancetoFeature: distance to the nearest feature such as transcription start site. By default, the distance is calculated as the distance between the start of the binding site and the TSS that is the gene start for genes located on the forward strand and the gene end for genes located on the reverse stran shortestDistance: The shortest distance from either end of peak to either end the feature.
POL2.T1_vs_T7.diffPeaks.bed :
chr10 73910795 73916597 2623 0 .
2623 is just the peak ID
All the packages that are used are listed below.
First you need to check that the following tools are installed on server/computer.
Deeptools : Tool to handle deepsequencing files here
Macs2 : Peak Caller here
wigToBigWig : Include in KentTools suite here
It's advised to install Conda.
It will make things easier then for installing tools. (snakemake for example)
Conda : here
snakemake : Evil framework for pipeline in python here
WiggleTools : Optionnal , can be useful when you want to merge bigwig files and apply some basic numerical operations here
SPP : Need to be there but you don't care here
MaSC : Need to be there but you don't care here
R3.3.1 : A lot of packages are used ... It's adviced to read their docs.
ChipQC : Control Quality for peaks. Object created will be used then with ChIPQC for Differential binding here
DiffBind : Differential binding analysis. Which peaks are moving in my samples ? here
ChIPpeakAnno : Anotation of the peaks in a genomic context here
ChIPseeker : Used for a derived upset peaks graph counting how much peaks are annotated in one or other features. here
rtracklayer : Very usefull to export/import bed to Grange objects. here
ComplexHeatmap : Plot heatmap. here
Other packages are mandatory. biomaRt, GenomicFeatures,stringr, reshape , ggplot2, optparse , knitr ... Look inside R scripts to know what to install.