Commit

more docs!
nebfield committed Jan 28, 2022
1 parent 56493b3 commit 2cfd048
Showing 13 changed files with 249 additions and 26 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
@@ -8,7 +8,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/setup-python@v2
-      - run: pip install sphinx-book-theme
+      - run: pip install sphinx-book-theme sphinx-jsonschema
       - uses: actions/checkout@master
         with:
           fetch-depth: 0 # otherwise, you will fail to push refs to dest repo
5 changes: 5 additions & 0 deletions README.md
@@ -71,6 +71,11 @@ NOTE: the pipeline is distributed and makes use of datasets (e.g. 1000 Genomes a
are provided under specific data licenses (see the [assets](assets/README.md) directory README for more information). It is up to
end-users to ensure that their use conforms to these restrictions.

This work has received funding from EMBL-EBI core funds, the Baker Institute,
the University of Cambridge, Health Data Research UK (HDRUK), and the European
Union’s Horizon 2020 research and innovation programme under grant agreement No
101016775 INTERVENE.

<!-- TODO nf-core: If applicable, make list of people who have also contributed
-->

15 changes: 15 additions & 0 deletions assets/api_examples/call.json
@@ -0,0 +1,15 @@
{
  "target_genomes": [
    {
      "sample": "example",
      "vcf_path": "/path/to/genome.vcf.gz",
      "chrom": 22
    }
  ],
  "nxf_params_file": {
    "scorefile": "/path/to/scorefile.txt",
    "format": "json"
  },
  "nxf_work": "/workspace/unique_work_directory/",
  "id": "unique_id"
}
9 changes: 9 additions & 0 deletions assets/api_examples/input.json
@@ -0,0 +1,9 @@
{
  "target_genomes": [
    {
      "sample": "example",
      "vcf_path": "/path/to/genome.vcf.gz",
      "chrom": 22
    }
  ]
}
6 changes: 6 additions & 0 deletions assets/api_examples/params.json
@@ -0,0 +1,6 @@
{
  "nxf_params_file": {
    "scorefile": "/path/to/scorefile.txt",
    "format": "json"
  }
}
92 changes: 72 additions & 20 deletions docs/api.rst
@@ -1,26 +1,27 @@
API
===

-``pgsc_calc`` is designed to be used in a terminal, but can also be launched
-programmatically on a Kubernetes cluster. To simplify this process,
-``pgsc_calc`` supports specifying target genomes and runtime parameters using
-JSON.
+``pgsc_calc`` has two main use cases:
+
+- A bioinformatician or data scientist wants to calculate polygenic scores
+  using a Unix-like operating system and a terminal
+- A researcher without command-line experience (e.g. a biologist) wants to
+  calculate polygenic scores using a web browser
+
+To simplify the second use case, the workflow is designed to be launched
+programmatically on a `private cloud`_ using an API. API parameters are
+specified using JSON. The web platform is still under development.
+
+.. _private cloud: http://www.embassycloud.org/

+Minimal example
+---------------
+
 Specifying target genomes with JSON
------------------------------------
-
-.. code-block:: json
-
-   {
-     "target_genomes": [
-       {
-         "sample": "example",
-         "vcf_path": "path.vcf.gz",
-         "chrom": 22
-       }
-     ]
-   }
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. literalinclude:: ../assets/api_examples/input.json
+   :language: JSON

Target genomes are specified as a JSON array. Each element of the array must:

@@ -39,7 +40,58 @@ This JSON data must be saved to a file and used with the workflow parameter
with ``.json``.
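
In a terminal, this looks something like the sketch below (``--input`` is a
placeholder; check the workflow's parameter documentation for the real
parameter name):

.. code-block:: bash

   # hypothetical launch command; substitute the workflow's actual JSON parameter
   nextflow run pgscatalog/pgsc_calc --input /path/to/input.json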

 Specifying workflow parameters with JSON
-----------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. literalinclude:: ../assets/api_examples/params.json
   :language: JSON

Some other parameters must be set for the workflow to run, and these are
specified in a JSON object. The object can become complex because many
optional parameters can be set here, but a minimal workflow parameter object
must contain:

- The path to a :term:`scoring file`, OR
- An :term:`accession` in the :term:`PGS Catalog` (the ``--accession``
  parameter replaces ``scorefile``)
- The ``format`` key, which must be set to ``"json"``

The :ref:`JSON schema` specifies optional parameters in full.
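
For illustration, the accession variant of the parameter object might look
like this (the ``accession`` key name is our assumption, mirroring the
``--accession`` parameter; ``PGS000001`` is just an example accession):

.. code-block:: json

   {
     "nxf_params_file": {
       "accession": "PGS000001",
       "format": "json"
     }
   }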

API call
~~~~~~~~

.. literalinclude:: ../assets/api_examples/call.json
   :language: JSON

The complete call also includes some Nextflow configuration. The workflow is
assigned a unique identifier at launch so that its progress can be
monitored. Nextflow has some requirements for the Kubernetes executor: the
work directory must be unique and must reside in a `ReadWriteMany persistent
volume claim`_ that is accessible to the :term:`driver pod` and all
:term:`worker pods`.

.. _ReadWriteMany persistent volume claim: https://www.nextflow.io/docs/latest/kubernetes.html#requirements
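
Any ``ReadWriteMany`` claim will do; below is a minimal sketch of a suitable
manifest (the name and size are illustrative, not taken from this
repository):

.. code-block:: yaml

   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: pgsc-calc-workspace      # hypothetical name
   spec:
     accessModes:
       - ReadWriteMany              # shared by the driver pod and all worker pods
     resources:
       requests:
         storage: 100Gi             # illustrative size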

Schema
------

This documentation is useful for a human, but not for a computer, so we wrote
a document (`a JSON schema`_) that describes the data format. The schema is
used to automatically validate data submitted to the workflow via the API.

.. _a JSON schema: https://raw.githubusercontent.com/PGScatalog/pgsc_calc/master/assets/schema_k8s.json

.. jsonschema:: ../assets/schema_k8s.json
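
A payload can also be checked locally before submission. One way, using the
small CLI bundled with the Python ``jsonschema`` package (any JSON Schema
validator works; paths are relative to the repository root):

.. code-block:: bash

   pip install jsonschema
   # validate the example API call against the workflow's schema
   jsonschema -i assets/api_examples/call.json assets/schema_k8s.json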

Implementation details
----------------------

The API is designed using an event-driven approach with `Argo
Events`_. Briefly, a sensor constantly listens on a Kubernetes cluster for
Kafka messages that request a pipeline launch. Once a message is received, a
Nextflow driver pod is created and the workflow is executed using the `K8S
executor`_. The status of the workflow instance is reported using Nextflow's
`weblog`_ and a second sensor.

.. _Argo Events: https://argoproj.github.io/argo-events/
.. _K8S executor: https://www.nextflow.io/docs/latest/kubernetes.html
.. _weblog: https://www.nextflow.io/docs/latest/tracing.html#weblog-via-http
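
For testing, a payload like ``call.json`` could be published to the cluster's
Kafka topic with a tool like `kcat`_; the broker address and topic name below
are assumptions, not values from this repository:

.. code-block:: bash

   # produce the whole JSON file as a single Kafka message
   kcat -P -b kafka.example.org:9092 -t pgsc-calc-launch assets/api_examples/call.json

.. _kcat: https://github.com/edenhill/kcat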

JSON schema
-----------
5 changes: 4 additions & 1 deletion docs/conf.py
@@ -28,7 +28,10 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.githubpages"
'sphinx.ext.githubpages',
'sphinx.ext.autosectionlabel',
'sphinx.ext.autodoc',
'sphinx-jsonschema'
]

# Add any paths that contain templates here, relative to this directory.
47 changes: 47 additions & 0 deletions docs/glossary.rst
@@ -0,0 +1,47 @@
Glossary
========

.. glossary::
accession
A unique and stable identifier for a database record (e.g. a PGS Catalog
score ID such as ``PGS000001``)

polygenic score
A `polygenic score`_ (PGS) aggregates the effects of many genetic variants
into a single number which predicts genetic predisposition for a
phenotype. PGS are typically composed of hundreds-to-millions of genetic
variants (usually SNPs) which are combined using a weighted sum of allele
dosages multiplied by their corresponding effect sizes, as estimated from
a relevant genome-wide association study (GWAS).

.. _polygenic score: https://www.pgscatalog.org/about/
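
In symbols, the score of individual :math:`i` is usually computed as
(notation ours, not from the PGS Catalog):

.. math::

   \mathrm{PGS}_i = \sum_{j=1}^{M} \beta_j G_{ij}

where :math:`G_{ij}` is the dosage of the effect allele of variant
:math:`j` in individual :math:`i`, :math:`\beta_j` is that variant's
effect size, and :math:`M` is the number of variants in the score.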

PGS Catalog
The `PGS Catalog`_ is an open database of published polygenic scores
(PGS). If you develop and publish polygenic scores, please consider
`submitting them`_ to the Catalog!

.. _PGS Catalog: https://www.pgscatalog.org
.. _submitting them: https://www.pgscatalog.org/submit/

PGS Catalog Calculator
This workflow, ``pgsc_calc``, which automates the calculation of polygenic
scores on target genomes using scoring files from the :term:`PGS Catalog`

Scoring file
A file listing the genetic variants, effect alleles, and weights needed to
calculate a :term:`polygenic score`

SNP
A single nucleotide polymorphism: a variant at a single base position in the
genome

driver pod
pod
`A pod`_ is a description of one or more containers and their associated
computing resources (e.g. CPU cores and RAM). Kubernetes takes this
description and tries to make it exist on the cluster. The driver pod is
responsible for managing a workflow instance: it monitors and submits each
job in the workflow as a separate worker pod.

.. _A pod: https://kubernetes.io/docs/concepts/workloads/pods/

worker pods
Pods created by the :term:`driver pod` to run each job in the workflow
13 changes: 9 additions & 4 deletions docs/index.rst
@@ -7,10 +7,15 @@ Welcome to ``pgsc_calc``'s documentation!
==================================================

.. toctree::
-   :maxdepth: 2
-
-   input
-   api
+   :maxdepth: 2
+
+   install
+   input
+   usage
+   troubleshooting
+   api
+   offline
+   glossary


``pgsc_calc`` is a bioinformatics best-practice analysis pipeline for applying
75 changes: 75 additions & 0 deletions docs/install.rst
@@ -0,0 +1,75 @@
Installation
============

``pgsc_calc`` is made with Nextflow and the nf-core framework. Nextflow needs
to be present on the computer where you want to launch the analysis; the
latest installation `instructions are available`_ online. The only hard
requirements for Nextflow are a Unix-like operating system and Java:

.. _`instructions are available`: https://www.nextflow.io/docs/latest/getstarted.html#installation

.. code-block:: bash

   # Make sure that Java v8+ is installed:
   java -version
   # Install Nextflow
   curl -fsSL get.nextflow.io | bash
   # Add Nextflow binary to your user's PATH:
   mv nextflow ~/bin/
   # OR system-wide installation:
   # sudo mv nextflow /usr/local/bin

Adding Nextflow `to your PATH`_ is important so that you can run ``nextflow``
in a terminal outside of the directory containing the downloaded binary. Your
operating system might not add ``~/bin/`` to your PATH automatically, so you
might need to configure this yourself.

.. _`to your PATH`: https://unix.stackexchange.com/a/26059

.. note::
   You can update Nextflow by running ``nextflow self-update``

Workflow software
-----------------

``pgsc_calc`` needs a lot of different software to run. Instead of making you
install each dependency manually, the workflow supports automatic software
packaging to improve reproducibility. Docker, Singularity, and Conda are
supported:

- `Docker`_

  - Normally used on a local computer or the cloud
  - Runs software inside `containers`_
  - Traditionally requires system root access, and rootless Docker is
    difficult to work with

- `Singularity`_

  - Often used instead of Docker on multi-user HPC systems
  - Runs software inside `containers`_

- `Conda`_

  - A packaging system that manages environments
  - Doesn't use containers, so reproducibility is worse than with Docker or
    Singularity
  - Recommended only as a fallback if Docker or Singularity aren't available

``pgsc_calc`` uses the nf-core framework, so it has theoretical support for
`podman`_, `charliecloud`_, and `shifter`_, but these software packaging tools
aren't tested. You choose a packaging method at runtime, as sketched after the
links below.

.. _`containers`: https://biocontainers-edu.readthedocs.io/en/latest/what_is_container.html
.. _`charliecloud`: https://hpc.github.io/charliecloud/
.. _`shifter`: https://www.nersc.gov/research-and-development/user-defined-images/
.. _`podman`: https://podman.io/
.. _`Docker`: https://docs.docker.com/get-docker/
.. _`Singularity`: https://sylabs.io/
.. _`Conda`: https://conda.io
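
Choosing a packaging method is a one-flag decision at runtime (``-profile``
is standard Nextflow/nf-core usage; other required parameters are omitted
for brevity):

.. code-block:: bash

   # run with Docker...
   nextflow run pgscatalog/pgsc_calc -profile docker
   # ...or Singularity...
   nextflow run pgscatalog/pgsc_calc -profile singularity
   # ...or Conda as a fallback
   nextflow run pgscatalog/pgsc_calc -profile conda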

Workflow code
-------------

Nextflow will automatically fetch ``pgsc_calc`` from GitHub, so you don't have
to do anything else. This process requires an internet connection.
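
For example, a first run that fetches the workflow and checks your setup
using the bundled test profile (assuming Docker is available; swap in the
profile you chose above):

.. code-block:: bash

   nextflow run pgscatalog/pgsc_calc -profile test,docker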

If you would like to run the workflow on a computer with no internet connection,
please see the :doc:`offline instructions<offline>`.
2 changes: 2 additions & 0 deletions docs/offline.rst
@@ -0,0 +1,2 @@
Offline usage
=============
2 changes: 2 additions & 0 deletions docs/troubleshooting.rst
@@ -0,0 +1,2 @@
Troubleshooting
===============
2 changes: 2 additions & 0 deletions docs/usage.rst
@@ -0,0 +1,2 @@
Usage
=====
