Commit

more docs!
nebfield committed Jan 28, 2022
1 parent 56493b3 commit 2cfd048
Showing 13 changed files with 249 additions and 26 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
@@ -8,7 +8,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/setup-python@v2
-      - run: pip install sphinx-book-theme
+      - run: pip install sphinx-book-theme sphinx-jsonschema
       - uses: actions/checkout@master
         with:
           fetch-depth: 0 # otherwise, you will fail to push refs to dest repo
5 changes: 5 additions & 0 deletions README.md
@@ -71,6 +71,11 @@ NOTE: the pipeline is distributed and makes use of datasets (e.g. 1000 Genomes a
are provided under specific data licenses (see the [assets](assets/README.md) directory README for more information). It is up to
end-users to ensure that their use conforms to these restrictions.

This work has received funding from EMBL-EBI core funds, the Baker Institute,
the University of Cambridge, Health Data Research UK (HDRUK), and the European
Union’s Horizon 2020 research and innovation programme under grant agreement No
101016775 INTERVENE.

<!-- TODO nf-core: If applicable, make list of people who have also contributed
-->

15 changes: 15 additions & 0 deletions assets/api_examples/call.json
@@ -0,0 +1,15 @@
{
  "target_genomes": [
    {
      "sample": "example",
      "vcf_path": "/path/to/genome.vcf.gz",
      "chrom": 22
    }
  ],
  "nxf_params_file": {
    "scorefile": "/path/to/scorefile.txt",
    "format": "json"
  },
  "nxf_work": "/workspace/unique_work_directory/",
  "id": "unique_id"
}
9 changes: 9 additions & 0 deletions assets/api_examples/input.json
@@ -0,0 +1,9 @@
{
  "target_genomes": [
    {
      "sample": "example",
      "vcf_path": "/path/to/genome.vcf.gz",
      "chrom": 22
    }
  ]
}
6 changes: 6 additions & 0 deletions assets/api_examples/params.json
@@ -0,0 +1,6 @@
{
  "nxf_params_file": {
    "scorefile": "/path/to/scorefile.txt",
    "format": "json"
  }
}
92 changes: 72 additions & 20 deletions docs/api.rst
@@ -1,26 +1,27 @@
API
===

-``pgsc_calc`` is designed to be used in a terminal, but can also be launched
-programmatically on a Kubernetes cluster. To simplify this process,
-``pgsc_calc`` supports specifying target genomes and runtime parameters using
-JSON.
+``pgsc_calc`` has two main use cases:
+
+- A bioinformatician or data scientist wants to calculate polygenic scores
+  using a Unix-like operating system and a terminal
+- A researcher without command-line experience (e.g. a biologist) wants to
+  calculate polygenic scores using a web browser
+
+To simplify the second use case, the workflow is designed to be launched
+programmatically on a `private cloud`_ using an API. API parameters are
+specified using JSON. The web platform is still under development.
+
+.. _private cloud: http://www.embassycloud.org/

+Minimal example
+---------------
+
 Specifying target genomes with JSON
------------------------------------
-
-.. code-block:: json
-
-   {
-     "target_genomes": [
-       {
-         "sample": "example",
-         "vcf_path": "path.vcf.gz",
-         "chrom": 22
-       }
-     ]
-   }
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. literalinclude:: ../assets/api_examples/input.json
+   :language: JSON

Target genomes are specified as a JSON array. Each element of the array must:

@@ -39,7 +40,58 @@ This JSON data must be saved to a file and used with the workflow parameter
with ``.json``.
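
In a terminal, this looks something like the sketch below (``--input`` is a
placeholder; check the workflow's parameter documentation for the real
parameter name):

.. code-block:: bash

   # hypothetical launch command; substitute the workflow's actual JSON parameter
   nextflow run pgscatalog/pgsc_calc --input /path/to/input.json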

 Specifying workflow parameters with JSON
-----------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. literalinclude:: ../assets/api_examples/params.json
   :language: JSON

Some other parameters must be set for the workflow to run, and these are
specified in a JSON object. The object can become complex because many
optional parameters can be set here, but a minimal workflow parameter object
must contain:

- The path to a :term:`scoring file`, OR
- An :term:`accession` in the :term:`PGS Catalog` (the ``--accession``
  parameter replaces ``scorefile``)
- The ``format`` key, which must be set to ``"json"``

The :ref:`JSON schema` specifies optional parameters in full.
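
For illustration, the accession variant of the parameter object might look
like this (the ``accession`` key name is our assumption, mirroring the
``--accession`` parameter; ``PGS000001`` is just an example accession):

.. code-block:: json

   {
     "nxf_params_file": {
       "accession": "PGS000001",
       "format": "json"
     }
   }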

API call
~~~~~~~~

.. literalinclude:: ../assets/api_examples/call.json
   :language: JSON

The complete call also includes some Nextflow configuration. The workflow is
assigned a unique identifier at launch so that its progress can be
monitored. Nextflow has some requirements for the Kubernetes executor: the
work directory must be unique and must reside in a `ReadWriteMany persistent
volume claim`_ that is accessible to the :term:`driver pod` and all
:term:`worker pods`.

.. _ReadWriteMany persistent volume claim: https://www.nextflow.io/docs/latest/kubernetes.html#requirements
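
Any ``ReadWriteMany`` claim will do; below is a minimal sketch of a suitable
manifest (the name and size are illustrative, not taken from this
repository):

.. code-block:: yaml

   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: pgsc-calc-workspace      # hypothetical name
   spec:
     accessModes:
       - ReadWriteMany              # shared by the driver pod and all worker pods
     resources:
       requests:
         storage: 100Gi             # illustrative size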

Schema
------

This documentation is useful for a human, but not for a computer, so we wrote
a document (`a JSON schema`_) that describes the data format. The schema is
used to automatically validate data submitted to the workflow via the API.

.. _a JSON schema: https://raw.githubusercontent.com/PGScatalog/pgsc_calc/master/assets/schema_k8s.json

.. jsonschema:: ../assets/schema_k8s.json
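
A payload can also be checked locally before submission. One way, using the
small CLI bundled with the Python ``jsonschema`` package (any JSON Schema
validator works; paths are relative to the repository root):

.. code-block:: bash

   pip install jsonschema
   # validate the example API call against the workflow's schema
   jsonschema -i assets/api_examples/call.json assets/schema_k8s.json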

Implementation details
----------------------

The API is designed using an event-driven approach with `Argo
Events`_. Briefly, a sensor constantly listens on a Kubernetes cluster for
Kafka messages that request a pipeline launch. Once a message is received, a
Nextflow driver pod is created and the workflow is executed using the `K8S
executor`_. The status of the workflow instance is reported using Nextflow's
`weblog`_ and a second sensor.

.. _Argo Events: https://argoproj.github.io/argo-events/
.. _K8S executor: https://www.nextflow.io/docs/latest/kubernetes.html
.. _weblog: https://www.nextflow.io/docs/latest/tracing.html#weblog-via-http
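
For testing, a payload like ``call.json`` could be published to the cluster's
Kafka topic with a tool like `kcat`_; the broker address and topic name below
are assumptions, not values from this repository:

.. code-block:: bash

   # produce the whole JSON file as a single Kafka message
   kcat -P -b kafka.example.org:9092 -t pgsc-calc-launch assets/api_examples/call.json

.. _kcat: https://github.com/edenhill/kcat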

JSON schema
-----------
5 changes: 4 additions & 1 deletion docs/conf.py
@@ -28,7 +28,10 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.githubpages"
'sphinx.ext.githubpages',
'sphinx.ext.autosectionlabel',
'sphinx.ext.autodoc',
'sphinx-jsonschema'
]

# Add any paths that contain templates here, relative to this directory.
47 changes: 47 additions & 0 deletions docs/glossary.rst
@@ -0,0 +1,47 @@
Glossary
========

.. glossary::
accession
A unique and stable identifier for a database record (e.g. a PGS Catalog
score ID such as ``PGS000001``)

polygenic score
A `polygenic score`_ (PGS) aggregates the effects of many genetic variants
into a single number which predicts genetic predisposition for a
phenotype. PGS are typically composed of hundreds-to-millions of genetic
variants (usually SNPs) which are combined using a weighted sum of allele
dosages multiplied by their corresponding effect sizes, as estimated from
a relevant genome-wide association study (GWAS).

.. _polygenic score: https://www.pgscatalog.org/about/
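
In symbols, the score of individual :math:`i` is usually computed as
(notation ours, not from the PGS Catalog):

.. math::

   \mathrm{PGS}_i = \sum_{j=1}^{M} \beta_j G_{ij}

where :math:`G_{ij}` is the dosage of the effect allele of variant
:math:`j` in individual :math:`i`, :math:`\beta_j` is that variant's
effect size, and :math:`M` is the number of variants in the score.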

PGS Catalog
The `PGS Catalog`_ is an open database of published polygenic scores
(PGS). If you develop and publish polygenic scores, please consider
`submitting them`_ to the Catalog!

.. _PGS Catalog: https://www.pgscatalog.org
.. _submitting them: https://www.pgscatalog.org/submit/

PGS Catalog Calculator
This workflow, ``pgsc_calc``, which automates the calculation of polygenic
scores on target genomes using scoring files from the :term:`PGS Catalog`

Scoring file
A file listing the genetic variants, effect alleles, and weights needed to
calculate a :term:`polygenic score`

SNP
A single nucleotide polymorphism: a variant at a single base position in the
genome

driver pod
pod
`A pod`_ is a description of one or more containers and their associated
computing resources (e.g. CPU cores and RAM). Kubernetes takes this
description and tries to make it exist on the cluster. The driver pod is
responsible for managing a workflow instance: it monitors and submits each
job in the workflow as a separate worker pod.

.. _A pod: https://kubernetes.io/docs/concepts/workloads/pods/

worker pods
Pods created by the :term:`driver pod` to run each job in the workflow
13 changes: 9 additions & 4 deletions docs/index.rst
@@ -7,10 +7,15 @@ Welcome to ``pgsc_calc``'s documentation!
==================================================

.. toctree::
-   :maxdepth: 2
-
-   input
-   api
+   :maxdepth: 2
+
+   install
+   input
+   usage
+   troubleshooting
+   api
+   offline
+   glossary


``pgsc_calc`` is a bioinformatics best-practice analysis pipeline for applying
75 changes: 75 additions & 0 deletions docs/install.rst
@@ -0,0 +1,75 @@
Installation
============

``pgsc_calc`` is made with Nextflow and the nf-core framework. Nextflow needs
to be present on the computer where you want to launch the analysis; the
latest installation `instructions are available`_ online. The only hard
requirements for Nextflow are a Unix-like operating system and Java:

.. _`instructions are available`: https://www.nextflow.io/docs/latest/getstarted.html#installation

.. code-block:: bash

   # Make sure that Java v8+ is installed:
   java -version
   # Install Nextflow
   curl -fsSL get.nextflow.io | bash
   # Add Nextflow binary to your user's PATH:
   mv nextflow ~/bin/
   # OR system-wide installation:
   # sudo mv nextflow /usr/local/bin

Adding Nextflow `to your PATH`_ is important so that you can run ``nextflow``
in a terminal outside of the directory containing the downloaded binary. Your
operating system might not add ``~/bin/`` to your PATH automatically, so you
might need to configure this yourself.

.. _`to your PATH`: https://unix.stackexchange.com/a/26059

.. note::
   You can update Nextflow by running ``nextflow self-update``

Workflow software
-----------------

``pgsc_calc`` needs a lot of different software to run. Instead of making you
install each dependency manually, the workflow supports automatic software
packaging to improve reproducibility. Docker, Singularity, and Conda are
supported:

- `Docker`_

  - Normally used on a local computer or the cloud
  - Runs software inside `containers`_
  - Traditionally requires system root access, and rootless Docker is
    difficult to work with

- `Singularity`_

  - Often used instead of Docker on multi-user HPC systems
  - Runs software inside `containers`_

- `Conda`_

  - A packaging system that manages environments
  - Doesn't use containers, so reproducibility is worse than with Docker or
    Singularity
  - Recommended only as a fallback if Docker or Singularity aren't available

``pgsc_calc`` uses the nf-core framework, so it has theoretical support for
`podman`_, `charliecloud`_, and `shifter`_, but these software packaging tools
aren't tested. You choose a packaging method at runtime, as sketched after the
links below.

.. _`containers`: https://biocontainers-edu.readthedocs.io/en/latest/what_is_container.html
.. _`charliecloud`: https://hpc.github.io/charliecloud/
.. _`shifter`: https://www.nersc.gov/research-and-development/user-defined-images/
.. _`podman`: https://podman.io/
.. _`Docker`: https://docs.docker.com/get-docker/
.. _`Singularity`: https://sylabs.io/
.. _`Conda`: https://conda.io
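
Choosing a packaging method is a one-flag decision at runtime (``-profile``
is standard Nextflow/nf-core usage; other required parameters are omitted
for brevity):

.. code-block:: bash

   # run with Docker...
   nextflow run pgscatalog/pgsc_calc -profile docker
   # ...or Singularity...
   nextflow run pgscatalog/pgsc_calc -profile singularity
   # ...or Conda as a fallback
   nextflow run pgscatalog/pgsc_calc -profile conda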

Workflow code
-------------

Nextflow will automatically fetch ``pgsc_calc`` from GitHub, so you don't have
to do anything else. This process requires an internet connection.
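
For example, a first run that fetches the workflow and checks your setup
using the bundled test profile (assuming Docker is available; swap in the
profile you chose above):

.. code-block:: bash

   nextflow run pgscatalog/pgsc_calc -profile test,docker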

If you would like to run the workflow on a computer with no internet connection,
please see the :doc:`offline instructions<offline>`.
2 changes: 2 additions & 0 deletions docs/offline.rst
@@ -0,0 +1,2 @@
Offline usage
=============
2 changes: 2 additions & 0 deletions docs/troubleshooting.rst
@@ -0,0 +1,2 @@
Troubleshooting
===============
2 changes: 2 additions & 0 deletions docs/usage.rst
@@ -0,0 +1,2 @@
Usage
=====
