This pipeline can be run after pipeline_slamdunk_umis, but this is not required. First, the R scripts for the LASSO regression have to be run on their own. Once bed files have been generated in the appropriate directories, pipeline_motif_analysis can be run.
Afterwards, separate Python scripts outside the pipeline produce the list of linkers needed to build libraries. They are meant to be used in this order: RE_recombined_sites.py (optional, needs a specific environment) -> selectLinkers.py -> createSequencesToOrder.py.
A report of the pipeline can be built using build_report after finishing the full pipeline.
- Run STREME on highly stable and lowly stable sequences.
- Run HOMER (findMotifs.pl) on highly stable and lowly stable sequences (http://homer.ucsd.edu/homer/motif/fasta.html).
- Convert HOMER outputs to MEME motif format.
- Run Tomtom to merge motifs from STREME and HOMER together.
- Run Tomtom to remove lowstab motifs from the highstab motifs in the STREME/HOMER results.
- For FIRE, take the consensus sequences directly and remove redundant ones across the different k-mer analyses.
- Remove highstab sequences that are also present in lowstab sequences for FIRE.
- Merge all results from the 3 tools together; remove highstab sequences that match a miRNA seed target, a lowstab sequence, or a polyA signal (AUAAA/AAUAAA).
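The last filtering step can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: the function name `filter_highstab`, the input shapes, and the matching rules are assumptions. Here a highstab sequence is dropped if it is identical to a lowstab sequence, or if it contains a miRNA seed target or one of the two polyA signals as a substring.

```python
# Hypothetical sketch of the final highstab filter; names and matching
# rules are assumptions, not taken from the pipeline source.
POLYA_SIGNALS = ("AAUAAA", "AUAAA")


def to_rna(seq):
    """Upper-case a sequence and convert DNA letters to RNA (T -> U)."""
    return seq.upper().replace("T", "U")


def filter_highstab(highstab, lowstab_seqs, seed_targets):
    """Return the highstab motifs (name -> consensus) that survive filtering."""
    lowstab = {to_rna(s) for s in lowstab_seqs}
    seeds = [to_rna(s) for s in seed_targets]
    kept = {}
    for name, seq in highstab.items():
        rna = to_rna(seq)
        if rna in lowstab:
            continue  # identical to a lowstab sequence
        if any(signal in rna for signal in POLYA_SIGNALS):
            continue  # contains a polyA signal
        if any(seed in rna for seed in seeds):
            continue  # contains a miRNA seed target
        kept[name] = seq
    return kept
```

For example, calling `filter_highstab({"m1": "UUAUUUAUU", "m2": "CCAATAAAGG"}, [], [])` keeps only `m1`, because `m2` contains AAUAAA once T is converted to U.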
- STREME inputs
  - [X]_lowstab|highstab.bed: bed files consisting of the 3'UTR sequences of the transcripts with the highest or lowest half-lives or residuals.
- HOMER inputs
  - [X]_lowstab|highstab.bed: bed files consisting of the 3'UTR sequences of the transcripts with the highest or lowest half-lives or residuals.
  - backgroud.bed: bed file consisting of all 3'UTR sequences of the transcripts detected by pipeline_slamdunk_umis.
- FIRE inputs
  - fire.bed: bed file consisting of all 3'UTR sequences with a length >6 and <10000.
  - fire_[X].txt: table of transcript IDs and ranking values, where [X] can be any name indicating where your data (ranking values) originate from (halflife, ...).
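As an illustration of the length criterion for fire.bed, the sketch below keeps BED records whose length is strictly greater than 6 and strictly less than 10000. It assumes a tab-separated BED file with chrom, start and end in the first three columns, and takes the length as end minus start. The function name `filter_bed_by_length` is hypothetical, and the pipeline's R scripts may compute the length differently (for example from the extracted sequence).

```python
# Hypothetical sketch: keep BED records with 6 < length < 10000.
# Assumes tab-separated BED lines with chrom, start, end as the first columns.
def filter_bed_by_length(lines, min_len=6, max_len=10000):
    kept = []
    for line in lines:
        # Skip blank lines and the usual BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        length = int(fields[2]) - int(fields[1])
        if min_len < length < max_len:
            kept.append(line.rstrip("\n"))
    return kept
```

Both bounds are exclusive, following the ">6 and <10000" wording above.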
The pipeline configuration file pipeline.yml.
Outputs by directories
- [X]_[lowstab|highstab]_streme.dir
  - Outputs generated by STREME (source: https://meme-suite.org/meme/doc/streme.html):
    - streme.html: an HTML file that provides the results in an interactive, human-readable format
    - streme.txt: a text file containing the motifs discovered by STREME in MEME format
    - sequences.tsv: a TSV (tab-separated values) file that lists the true- and false-positive sequences identified by STREME for each motif
    - streme.xml: an XML file that provides the results in a format designed for machine processing
  - tomtom.self: final output directory, obtained by running Tomtom on the output file to get rid of redundant motifs (log of the run: streme.txt.tomtom.log)
- [X]_[lowstab|highstab]_homer.dir
  - Outputs generated by running the HOMER findMotifs.pl script (http://homer.ucsd.edu/homer/motif/fasta.html). NB: the randomization folder can be deleted once the pipeline has finished running HOMER. homerMotifs.all.motifs contains all motifs discovered by HOMER.
  - homerMotifs.all.motifs.meme: the homerMotifs.all.motifs file in MEME format.
  - homerMotifs.all.motifs.tomtom.self: output obtained by running Tomtom on the output file (homerMotifs.all.motifs) to get rid of redundant motifs.
- fire_[X].txt.[6-7-8]imer_FIRE: directories for each k-mer size (I have little control over the naming of these directories, which explains their odd names).
  - Outputs generated by FIRE: the /RNA directory contains the interesting outputs. FIRE generates many different outputs and its tutorial does not explain most of them; the important files are "...signif.motifs.rep" and "....signif.motifs" (https://tavazoielab.c2b2.columbia.edu/FIRE/tutorial.html).
  - ..._[highstab|lowstab].signif.motifs: list of motifs enriched in high or low stability transcripts, as discovered by FIRE.
- fire.dir
  - [X]_[highstab|lowstab].allkmer.signif.motifs: merged results from the different k-mer sizes.
  - [highstab|lowstab].allkmer.fireMotifs: merge of all FIRE results into either low or high stability motifs. Contains the name of each motif and its consensus sequence as given by FIRE in the file "....signif.motifs".
  - highstab_in_lowstab.list: list of motif sequences from highstab that are also present in lowstab (later filtered).
- final_motifs
  - [highstab|lowstab]_final_motifs.list: the most important output. Table with the name of each motif and its associated consensus sequence.
  - [highstab|lowstab]_final_motifs.list.log: log file giving:
    a. the number of motifs coming from STREME/HOMER or FIRE,
    b. the number of consensus sequences coming from these motifs,
    c. the number of motifs shared between highstab and lowstab for HOMER/STREME,
    d. the number of sequences shared between highstab and lowstab for FIRE.
    Once the STREME/HOMER and FIRE results have been merged:
    e. the number of sequences that contain miRNA seed targets,
    f. the number of sequences with one of the 2 most common polyA signals (AAUAAA or AUAAA),
    g. the final number of sequences, once unwanted ones have been removed (for lowstab the list is not filtered).
  - [highstab|lowstab]_final_motifs.matching.mirna.seeds: table of sequences matching miRNA seed targets. Structure: miRNA_name:miRNA_seed_target_sequence name_matching_motif motif_sequence
  - [X]_merge_homer_streme.meme: merged motifs from HOMER and STREME; similar ones have been clustered using Tomtom.
  - [X]_merge_homer_streme.motifs.tomtom: output obtained by running Tomtom on the merged motifs originating from HOMER and STREME. highstab_merge_homer_streme.meme.log is the Tomtom output log.
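The [highstab|lowstab]_final_motifs.matching.mirna.seeds table could be read with a few lines of Python. This is a sketch under assumptions: it treats the three columns as whitespace-separated, splits the first column at its last colon, and skips any line that does not have exactly three fields. The names `SeedMatch` and `parse_seed_matches` are hypothetical and not part of the pipeline.

```python
from typing import NamedTuple


class SeedMatch(NamedTuple):
    """One row of the matching.mirna.seeds table (hypothetical container)."""
    mirna: str
    seed_target: str
    motif_name: str
    motif_seq: str


def parse_seed_matches(lines):
    """Parse 'miRNA_name:seed_target motif_name motif_sequence' rows."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unexpected line (assumed to be ignorable)
        # Split at the last colon, in case the miRNA name itself has one.
        mirna, _, seed = fields[0].rpartition(":")
        records.append(SeedMatch(mirna, seed, fields[1], fields[2]))
    return records
```

Typical use would be `parse_seed_matches(open(path))`, where `path` points at one of the .matching.mirna.seeds files.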
The report render outputs pipeline_report.html and associated files in final_motifs.
On top of the default CGAT setup, the pipeline requires the following
- Software:
  - python (v3.8.12 with pysam v0.17.0 when built)
  - meme (v5.3.0 when built)
  - HOMER in path (command findMotifs.pl in path)
  - fire in path (https://tavazoielab.c2b2.columbia.edu/FIRE/)
- R modules:
  - Biostrings
  - tidyverse
  - optparse
  - stringr
  - tools
  - universalmotif
  - msa
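Before launching the pipeline, it can help to check that the command-line tools are visible on the PATH. The sketch below is an optional helper, not part of the pipeline. `streme` and `tomtom` are the MEME Suite executables and `findMotifs.pl` is the HOMER command named above; the FIRE executable is left out because its command name is not given here, so add it to the tuple yourself.

```python
import shutil

# Executables to look for; extend with the FIRE command used on your system.
REQUIRED_TOOLS = ("streme", "tomtom", "findMotifs.pl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` returns an empty list when everything is found; otherwise it lists the commands that still need to be installed or added to the PATH.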
The pipeline requires a configured pipeline.yml file.
Make a directory with your project name.
Configure the pipeline:
python [path_to_repo]/pipeline_motif_analysis.py config
A pipeline.log file and a pipeline.yml file will be added to your new directory.
Modify pipeline.yml according to your project (specify annotation database and directory, database for uploading the outputs; specify options for Salmon quantification).
Run the pipeline:
python [path_to_repo]/motif_analysis.py make full -v5
Run the report render (after running full):
python [path_to_repo]/motif_analysis.py make build_report -v5
To run the pipeline on a large set of samples, submit it to the cluster (sharc) using a custom submit_pipeline script.
Scripts related to the lasso.
Once this pipeline has been run, you can run merge_motifs.Rmd to merge all motif sequences from the different pipeline runs and generate a general list of highstab and lowstab motifs. Then the Python scripts RE_recombined_sites.py and/or select_linkers.py can be run to generate linkers.