Solve StrandPhaseR docker image installation (minor)
weber8thomas committed Jun 21, 2022
1 parent 278f52c commit fb04b0a
Showing 115 changed files with 16,416 additions and 0 deletions.
114 changes: 114 additions & 0 deletions .github/workflows/main.yaml
@@ -0,0 +1,114 @@
name: Tests

on:
  push:
    branches:
      - smk_workflow_catalog
    # paths:
    #   - "github-actions-runner/Dockerfile"

jobs:
  build_container:
    name: Build and push image
    runs-on: ubuntu-20.04
    env:
      IMAGE_NAME: mosaicatcher-pipeline

    if: github.ref == 'refs/heads/master'
    steps:
      - uses: actions/checkout@v2

      - name: Read upstream tag without version
        id: gettag
        run: echo "::set-output name=tag::$(head -n 1 github-actions-runner/Dockerfile | awk -F':' '{print $2}' | awk -F'-' 'BEGIN { OFS="-" } {$NF=""; print $0}')"

      - name: Read internal update version
        id: getversion
        run: echo "::set-output name=version::$(grep 'ARG RUNNER_VERSION' github-actions-runner/Dockerfile | awk -F'=' '{print $2}')"

      - name: Build Image
        id: build-image
        uses: redhat-actions/buildah-build@v2
        with:
          image: ${{ env.IMAGE_NAME }}
          tags: latest dev 1.3
          dockerfiles: |
            ./github-actions-runner/Dockerfile
      - name: Push To DockerHub
        id: push-to-dockerhub
        uses: redhat-actions/push-to-registry@v2
        with:
          image: ${{ steps.build-image.outputs.image }}
          tags: ${{ steps.build-image.outputs.tags }}
          registry: docker.io/weber8thomas
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_TOKEN }}

      - name: Use the image
        run: echo "New image has been pushed to ${{ steps.push-to-dockerhub.outputs.registry-paths }}"
  # jobs:
  test_workflow:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      # - name: Setup Snakemake environment
      #   run: |
      #     export PATH="/usr/share/miniconda/bin:$PATH"
      #     conda config --set channel_priority strict
      #     conda install -c conda-forge -q mamba
      #     # ensure that mamba is happy to write into the cache
      #     sudo chown -R runner:docker /usr/share/miniconda/pkgs/cache
      #     # additionally add singularity
      #     # TODO remove version constraint: needed because 3.8.7 fails with missing libz:
      #     # bin/unsquashfs: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
      #     # mamba create -y -n mosaicatcher_env -c conda-forge -c bioconda snakemake pandas pysam tqdm imagemagick "singularity<=3.8.6"
      #     # source activate mosaicatcher_env
      #     # conda list
      #     # which python
      #     # python -c 'import pysam; print(pysam)'
      - name: Downloading data
        uses: snakemake/snakemake-github-action@v1.22.0
        with:
          directory: .test
          snakefile: Snakefile
          stagein: "mamba env remove -n snakemake && mamba create -y -n snakemake -c conda-forge -c bioconda unzip snakemake pandas pysam tqdm imagemagick && source activate snakemake"
          args: "--cores 1 --config mode=download_data dl_external_files=True dl_bam_example=True input_bam_location=TEST_EXAMPLE_DATA/"
      - name: Test data
        uses: snakemake/snakemake-github-action@v1.22.0
        with:
          directory: .test
          snakefile: Snakefile
          stagein: 'mamba env remove -n snakemake && mamba create -y -n snakemake -c conda-forge -c bioconda snakemake pandas pysam tqdm imagemagick "singularity<=3.8.6" && source activate snakemake && ls -lh'
          args: "--cores 1 --config plot=True input_bam_location=TEST_EXAMPLE_DATA/ output_location=TEST_OUTPUT/ --use-conda --use-singularity"

  formatting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Formatting
        uses: github/super-linter@v3.16.1
        env:
          VALIDATE_ALL_CODEBASE: false
          DEFAULT_BRANCH: smk_workflow_catalog
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          VALIDATE_SNAKEMAKE_SNAKEFMT: true

  linting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Downloading data
        uses: snakemake/snakemake-github-action@v1.22.0
        with:
          directory: .test
          snakefile: Snakefile
          stagein: "mamba env remove -n snakemake && mamba create -y -n snakemake -c conda-forge -c bioconda unzip snakemake pandas pysam tqdm imagemagick && source activate snakemake && ls -l && pwd"
          args: "--cores 1 --config mode=download_data dl_external_files=True dl_bam_example=True input_bam_location=TEST_EXAMPLE_DATA/ --touch"
      - name: Linting
        uses: snakemake/snakemake-github-action@v1.22.0
        with:
          directory: ".test"
          snakefile: Snakefile
          stagein: "mamba env remove -n snakemake && mamba create -y -n snakemake -c conda-forge -c bioconda unzip snakemake pandas pysam tqdm imagemagick && source activate snakemake && ls -l && pwd"
          args: "--lint"
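The `gettag` and `getversion` steps of the workflow above extract values from the Dockerfile with `head`, `grep`, and `awk`. As a reading aid, here is a minimal Python sketch of the same field manipulation. The function names and the example image name are hypothetical and not part of the repository; the sketch only mirrors what the shell one-liners do to a line of text.

```python
def upstream_tag_without_version(first_line: str) -> str:
    """Mirror: awk -F':' '{print $2}' | awk -F'-' 'BEGIN { OFS="-" } {$NF=""; print $0}'."""
    fields = first_line.rstrip("\n").split(":")
    # awk prints an empty string when the second field does not exist
    tag = fields[1] if len(fields) > 1 else ""
    parts = tag.split("-")
    # awk blanks the last field but keeps the separator in front of it,
    # so a multi-field tag ends with a trailing "-"
    parts[-1] = ""
    return "-".join(parts)


def runner_version(dockerfile_text: str) -> str:
    """Mirror: grep 'ARG RUNNER_VERSION' | awk -F'=' '{print $2}' (first match only)."""
    for line in dockerfile_text.splitlines():
        if "ARG RUNNER_VERSION" in line:
            fields = line.split("=")
            return fields[1] if len(fields) > 1 else ""
    return ""


if __name__ == "__main__":
    # Hypothetical Dockerfile content, for illustration only
    dockerfile = "FROM example/runner:ubuntu-focal-1.0.0\nARG RUNNER_VERSION=1.3\n"
    print(upstream_tag_without_version(dockerfile.splitlines()[0]))
    print(runner_version(dockerfile))
```

Note that the trailing hyphen is a property of the original `awk` expression, not something this sketch adds.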
87 changes: 87 additions & 0 deletions .gitignore
@@ -0,0 +1,87 @@
# Hidden folders & files
.DS_Store
.vscode/
.snakemake/
.panoptes.db
._.DS_Store
.pytest_cache/
.condarc
files.txt
workflow/.conda/

# Tmp files & execution outputs
*.pyc
*.zip
*.gz
*.db
*.png
*.pdf
*.svg
*.tsv
*.csv

# Links
*@

# Mosaicatcher folders
chroms/
counts/
log/
plots/
segmentation/
segmentation2/
snv_calls/
strand_states/
sv_probabilities/
workflow/config/config_df.tsv
workflow/config/exclude_file.txt
workflow/config/exclude_file

# Docs
docs/build/
build/
*.html
workflow/static/

# Python
__pycache__
workflow/scripts/__pycache__

# Zenodo
workflow/sandbox.zenodo.org/
sandbox.zenodo.org/
TEST_OUTPUT/
TEST_OUTPUT
workflow/test.txt
.snakemake

# Exceptions
!docs/images/*.png
!workflow/data/segdups/segDups_hg38_UCSCtrack.bed.gz
!workflow/data/bin_200kb_all.bed
!config/config.yaml
!config/*

# Dev
discover_big_files_git.sh
builds/
workflow/report_TALL/
*.bam
*.bai
workflow/TEST_EXAMPLE_DATA/
TEST_EXAMPLE_DATA
TEST_EXAMPLE_DATA/
workflow/logs/
workflow/errors/

# git

## Others
# afac/
# workflow/.snakemake
# bam/

## Personal note: files/folders specific to dev branch
# .gitlab-ci.yml // to use with LFS example data in dev branch
# singularity/ folder
# afac/ debugging & dev folder
18 changes: 18 additions & 0 deletions .snakemake-workflow-catalog.yml
@@ -0,0 +1,18 @@
# configuration of display in snakemake workflow catalog: https://snakemake.github.io/snakemake-workflow-catalog

usage:
  mandatory-flags: # optional definition of additional flags
    desc: # describe your flags here in a few sentences (they will be inserted below the example commands)
    flags:
      - "snakemake"
      - "mosaicatcher"
      - "single-cell-genomics"
      - "strand-seq"
      - "structural-variants"
      - "sv-calling"
      # put your flags here
  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
    conda: false # whether pipeline works with --use-conda
    singularity: false # whether pipeline works with --use-singularity
    singularity+conda: true # whether pipeline works with --use-singularity --use-conda
  report: true # add this to confirm that the workflow allows to use 'snakemake --report report.zip' to generate a report containing all results and explanations
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,26 @@
## 1.3 (2022-06-02)

* Check that SM tags correspond to the folder name [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/a4611b70a03675ee5db7816728b28eb9a9875e5c)



## 1.2.3 (2022-05-18)

* Correct issue [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/932d2529815cc31a57f60ca860fadf65212738f4)
* Small correction [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/5ed61ec9d20692d4e14394baea5636a21ae9dfc1)
* Correct SMK download BAM example files [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/d84904a5c1ec9f4901f7dc69d7b879692c1266c6)
* Update README.md [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/881c6612b31e74efbb854b85d4e5328e300e7c2e)


## 1.2.2 (2022-05-18)


* Handle multiple samples in the same folder; change the way the selected cell list is retrieved [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/3f0dc28ec22d88def269c215ef551800b8b1f7e5)


## 1.2.1 (2022-05-17)

* Remove files .gitattributes and .gitlab-ci.yml [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/b6a46ff3d7dd8978743be9c4ee801535aac03eab)
* Download example & external data: implemented rules based on the snakemake.remote.HTTP function that can be called through config.yaml / CLI arguments; updated config.yaml, rules/examples.smk, Snakefile, and README.md [View](https://git.embl.de/tweber/mosaicatcher-update/-/commit/a835f79928bf6ec5c5b93678bd89bc54c59e3206)


21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Thomas Weber (thomas.weber@embl.de)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
114 changes: 114 additions & 0 deletions README.md
@@ -0,0 +1,114 @@
![MosaiCatcher](docs/images/mosaic_logo.png)


[Snakemake](https://github.com/snakemake/snakemake) pipeline for structural variant calling from single-cell Strand-seq data.


# Overview of this workflow

This workflow uses [Snakemake](https://github.com/snakemake/snakemake) to
execute all steps of MosaiCatcher in order. The starting point is a set of single-cell
BAM files from Strand-seq experiments, and the final output is a set of SV predictions in
tabular format as well as in a graphical representation. To get to this point,
the workflow goes through the following steps:

1. Binning of sequencing reads in genomic windows of 100kb via [mosaic](https://github.com/friendsofstrandseq/mosaicatcher)
2. Strand state detection
3. [Optional] Normalization of coverage with respect to a reference sample
4. Multi-variate segmentation of cells ([mosaic](https://github.com/friendsofstrandseq/mosaicatcher))
5. Haplotype resolution via [StrandPhaseR](https://github.com/daewoooo/StrandPhaseR)
6. Bayesian classification of segmentation to find SVs using MosaiClassifier
7. Visualization of results using custom R plots
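
To make step 1 concrete, here is a minimal Python sketch of binning reads into fixed 100 kb windows and counting them per strand. It is an illustration only, not the implementation used by `mosaic`: the input format (tuples of chromosome, position, strand) and the function name are assumptions made for this example.

```python
BIN_SIZE = 100_000  # 100 kb windows, as stated in step 1


def bin_reads(reads, bin_size=BIN_SIZE):
    """Count reads per strand in fixed-size genomic windows.

    `reads` is an iterable of (chrom, position, strand) tuples, where strand
    is "+" or "-" (a hypothetical input format chosen for this sketch).
    Returns {(chrom, bin_start): {"+": n_plus, "-": n_minus}}.
    """
    counts = {}
    for chrom, pos, strand in reads:
        # Start coordinate of the window that contains this position
        bin_start = (pos // bin_size) * bin_size
        per_strand = counts.setdefault((chrom, bin_start), {"+": 0, "-": 0})
        per_strand[strand] += 1
    return counts


if __name__ == "__main__":
    example_reads = [
        ("chr21", 50_000, "+"),
        ("chr21", 99_999, "-"),
        ("chr21", 100_000, "+"),
        ("chr21", 250_000, "-"),
    ]
    for key, value in sorted(bin_reads(example_reads).items()):
        print(key, value)
```

Per-window, per-strand counts of this kind are the sort of input that the later strand state detection and segmentation steps operate on.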


# Quick Start

1. Install [Singularity](https://www.sylabs.io/guides/3.0/user-guide/)
2. Set strict channel priority to prevent conda channel errors
```
conda config --set channel_priority strict
```
3. Create a dedicated conda environment
```
conda create -n mosaicatcher_env -c conda-forge -c bioconda snakemake pandas pysam imagemagick tqdm && conda activate mosaicatcher_env
```
4. Clone the repository
```
git clone https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git && cd mosaicatcher-pipeline
```
5. Download test and reference data
```
snakemake -c1 --config mode=download_data dl_external_files=True dl_bam_example=True input_bam_location=TEST_EXAMPLE_DATA/
```
6. Run on the example data, restricted to one small chromosome (`<disk>` must be replaced by your disk letter/name, `/g` or `/scratch` at EMBL for example)
```
snakemake --cores 12 --config mode=mosaiclassifier plot=True input_bam_location=TEST_EXAMPLE_DATA/ output_location=TEST_OUTPUT/ chromosomes="[chr21]" --use-conda --use-singularity --singularity-args "-B /<disk>:/<disk>" --latency-wait 60
```

7. Generate report on example data
```
snakemake --cores 12 --config mode=mosaiclassifier plot=True input_bam_location=TEST_EXAMPLE_DATA/ output_location=TEST_OUTPUT/ chromosomes="[chr21]" --use-conda --use-singularity --singularity-args "-B /<disk>:/<disk>" --latency-wait 60 --report <REPORT.zip>
```


8. Start running your own analysis
```
snakemake --cores 12 --config mode=mosaiclassifier plot=True input_bam_location=<INPUT_DATA_FOLDER> output_location=<OUTPUT_DATA_FOLDER> --use-conda --use-singularity --singularity-args "-B /<disk>:/<disk>" --latency-wait 60
```
9. Generate report
```
snakemake --cores 12 --config mode=mosaiclassifier plot=True input_bam_location=<INPUT_DATA_FOLDER> output_location=<OUTPUT_DATA_FOLDER> --use-conda --use-singularity --singularity-args "-B /<disk>:/<disk>" --latency-wait 60 --report <REPORT.zip>
```
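
The commands in steps 6 to 9 differ only in their `--config` values, the bound disk, and the optional `--report` target. The Python sketch below shows one way to assemble them programmatically. The helper name and its parameters are hypothetical; only the flags that already appear in the commands above are used.

```python
import shlex


def build_snakemake_command(config, cores=12, disk=None, report=None):
    """Assemble a snakemake command line as a list of arguments.

    `config` maps config keys to values (passed as key=value after --config),
    `disk` is the disk name to bind into the Singularity container,
    `report` is an optional report file name for --report.
    """
    cmd = ["snakemake", "--cores", str(cores), "--config"]
    cmd += [f"{key}={value}" for key, value in config.items()]
    cmd += ["--use-conda", "--use-singularity"]
    if disk is not None:
        cmd += ["--singularity-args", f"-B /{disk}:/{disk}"]
    cmd += ["--latency-wait", "60"]
    if report is not None:
        cmd += ["--report", report]
    return cmd


if __name__ == "__main__":
    example = build_snakemake_command(
        {
            "mode": "mosaiclassifier",
            "plot": "True",
            "input_bam_location": "TEST_EXAMPLE_DATA/",
            "output_location": "TEST_OUTPUT/",
            "chromosomes": "[chr21]",
        },
        disk="scratch",
    )
    # shlex.join quotes arguments that contain spaces or shell metacharacters
    print(shlex.join(example))
```

Returning a list rather than a string means the command can be handed to `subprocess.run` directly, without going through shell quoting.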




# Documentation

* [Usage](docs/Usage.md)
* [Parameters & input](docs/Parameters.md)
* [Output](docs/Output.md) (#TODO)



# 📆 Roadmap

## Technical-related features

- [x] Zenodo automatic download of external files + indexes ([1.2.1](https://github.com/friendsofstrandseq/mosaicatcher-pipeline/releases/tag/1.2.1))
- [x] Multiple samples in the parent folder ([1.2.2](https://github.com/friendsofstrandseq/mosaicatcher-pipeline/releases/tag/1.2.2))
- [x] Automatic testing of BAM SM tag compared to sample folder name ([1.2.3](https://github.com/friendsofstrandseq/mosaicatcher-pipeline/releases/tag/1.2.3))
- [x] On-error/success e-mail ([1.3](https://github.com/friendsofstrandseq/mosaicatcher-pipeline/releases/tag/1.3))
- [x] HPC execution (slurm profile for the moment) ([1.3](https://github.com/friendsofstrandseq/mosaicatcher-pipeline/releases/tag/1.3))
- [ ] Plotting options (enable/disable segmentation back colors)
- [ ] Full singularity image with preinstalled conda envs
- [ ] Portable Encapsulated Project compliant
- [ ] Single BAM folder with side config file

## Bioinformatic-related features

- [ ] Change of reference genome (currently only GRCh38)
- [ ] Upstream QC pipeline and FastQ handle
- [ ] Pooling samples
- [ ] Self-handling of low-coverage cells

## Small issues to fix

- [ ] Move pysam / SM tag comparison script to snakemake rule
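
As a rough illustration of the SM tag check mentioned in the roadmap, the sketch below compares the `SM` fields of the `@RG` lines in a BAM header with the sample folder name. It deliberately works on plain header text (as printed, for example, by `samtools view -H`) so that it stays self-contained; it is not the pysam-based script of the pipeline, and the function names are hypothetical.

```python
def sm_tags_from_header(header_text):
    """Collect the SM values found on @RG lines of a SAM/BAM header."""
    tags = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                tags.add(field[3:])
    return tags


def sm_tag_matches_folder(header_text, folder_name):
    """True when every read group carries exactly the folder name as SM tag."""
    return sm_tags_from_header(header_text) == {folder_name}


if __name__ == "__main__":
    # Hypothetical header, for illustration only
    header = "@HD\tVN:1.6\tSO:coordinate\n@RG\tID:cell1\tSM:sampleA\n"
    print(sm_tag_matches_folder(header, "sampleA"))
```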


# 🛑 Troubleshooting & Current limitations

- Do not change the structure of your input folder after running the pipeline: the first execution builds a config dataframe file (`OUTPUT_DIRECTORY/config/config.tsv`) that contains the list of cells and the associated paths
- Do not change the list of chromosomes after a first execution (e.g., first execution using `count` mode on `chr21`, second execution using `segmentation` mode on all chromosomes)
- ~~Pipeline is unstable on **male** samples (LCL sample for example) for the moment due to the impossibility to run strandphaser (only one haplotype for the X chrom)~~ This was solved thanks to the work of [Hufsah Ashraf](https://github.com/orgs/friendsofstrandseq/people/Hufsah-Ashraf) and [Wolfram Höps](https://github.com/orgs/friendsofstrandseq/people/WHops), which automatically determines the sample sex and uses a [snakemake checkpoint](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution) to allow data-dependent conditional execution. The initial list of chromosomes is now updated according to the sample sex in order to bypass chrX & chrY for male samples, as both are present as a single haplotype.
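
The sex-dependent chromosome handling described in the last point can be sketched as a simple filter. This is a simplified Python illustration, not the pipeline's checkpoint code: the function name and the `"M"`/`"F"` labels are assumptions, and leaving the list unchanged for non-male samples is also an assumption of this sketch.

```python
SEX_CHROMOSOMES = ("chrX", "chrY")


def chromosomes_for_sample(chromosomes, sex):
    """Return the chromosomes to process for a sample of the given sex.

    For male samples ("M"), chrX and chrY are skipped because both are
    present as a single haplotype. Any other value leaves the list
    unchanged (an assumption of this sketch).
    """
    if sex == "M":
        return [chrom for chrom in chromosomes if chrom not in SEX_CHROMOSOMES]
    return list(chromosomes)


if __name__ == "__main__":
    all_chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    print(chromosomes_for_sample(all_chroms, "M"))
```

In the pipeline itself the sex is determined from the data at run time, which is why a checkpoint is needed: the final chromosome list is only known after that step has run.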


# 📕 References


> Strand-seq publication: Falconer, E., Hills, M., Naumann, U. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat Methods 9, 1107–1112 (2012). https://doi.org/10.1038/nmeth.2206
>
> scTRIP/MosaiCatcher original publication: Sanders, A.D., Meiers, S., Ghareghani, M. et al. Single-cell analysis of structural variations and complex rearrangements with tri-channel processing. Nat Biotechnol 38, 343–354 (2020). https://doi.org/10.1038/s41587-019-0366-x
