Commit 8ec0d45: update docs
nebfield committed Aug 31, 2022
1 parent 60a44a9
Showing 8 changed files with 104 additions and 80 deletions.
1 change: 1 addition & 0 deletions conf/modules.config
@@ -50,5 +50,6 @@ process {

withName: PLINK2_SCORE {
ext.args2 = ""
maxForks = 1
}
}
2 changes: 1 addition & 1 deletion docs/_templates/globaltoc.html
@@ -5,7 +5,7 @@ <h3>Contents</h3>
<li><a href="{{ pathto('getting-started') }}">Get started</a></li>
<li><a href="{{ pathto('how-to/index') }}">How-to guides</a></li>
<li><a href="{{ pathto('reference/index') }}">Reference guide</a></li>
<li><a href="{{ pathto('explanation/index') }}">Explanation</a></li>
<li><a href="{{ pathto('output') }}">Outputs & Results</a></li>
<li><a href="{{ pathto('troubleshooting') }}">Troubleshooting</a></li>
<li><a href="{{ pathto('glossary') }}">Glossary</a></li>
</ul>
10 changes: 0 additions & 10 deletions docs/explanation/index.rst

This file was deleted.

10 changes: 10 additions & 0 deletions docs/getting-started.rst
@@ -229,3 +229,13 @@ If the workflow didn't execute successfully, have a look at the
:ref:`troubleshoot` section. Remember to replace ``<docker/singularity/conda>``
with the software you have installed on your computer.

4. Next steps & advanced usage
------------------------------

The pipeline ships with default settings that allow it to be run on a
personal computer with smaller datasets (e.g. 1000 Genomes, HGDP).

For information on how to run the pipeline on larger datasets, more powerful computers,
or job schedulers, see :ref:`big job`.

If you are using a newer Mac computer with an M-series chip, see :ref:`arm`.
48 changes: 26 additions & 22 deletions docs/how-to/bigjob.rst
@@ -1,19 +1,20 @@
.. _big job:

How do I run big jobs on a powerful computer?
=============================================
How do I run ``pgsc_calc`` on larger datasets and more powerful computers?
===========================================================================

If you want to calculate many polygenic scores for a very large dataset, like
the UK BioBank, you might need some extra computing power! You might have access
to a powerful workstation, a University cluster, or some cloud compute
resources. This section will show how to set up pgsc_calc to submit work to
these types of systems.
If you want to calculate many polygenic scores for a very large dataset (e.g. UK Biobank),
you will likely need to adjust the pipeline settings. You might have access to a powerful workstation,
a university cluster, or some cloud compute resources. This section will show how to set up
``pgsc_calc`` to submit work to these types of systems by creating and editing `nextflow .config files`_.
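
For example, a custom configuration file can be passed to the workflow with
nextflow's ``-c`` option. A sketch, where the profile and the other options are
placeholders for your own setup:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker -c my_custom.config [other options]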

Configuring pgsc_calc to use more resources locally
---------------------------------------------------
.. _nextflow .config files: https://www.nextflow.io/docs/latest/config.html

Configuring ``pgsc_calc`` to use more resources locally
--------------------------------------------------------

If you have a powerful computer available locally, you can configure the amount
of resources that the workflow uses.
of resources that each job in the workflow uses.
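
A minimal local override might look like the following sketch (an assumption
based on the nf-core-style ``max_cpus`` / ``max_memory`` parameters mentioned
in the troubleshooting guide; the values are illustrative):

.. code-block:: text

    params {
        // adjust to the resources actually available on your machine
        max_cpus   = 16
        max_memory = '64.GB'
        max_time   = '24.h'
    }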

.. code-block:: text
@@ -53,7 +54,13 @@ High performance computing cluster

If you have access to an HPC cluster, you'll need to configure your cluster's
unique parameters to set correct queues, user accounts, and resource
limits. Here's an example for an LSF cluster:
limits.

.. note:: Your institution may already have `a nextflow profile`_ with existing cluster settings
   that can be adapted instead of setting up a custom config using ``-c``.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:

.. code-block:: text
@@ -74,10 +81,10 @@ limits. Here's an example for an LSF cluster:
time = 4.h
}
withName: PLINK2_SCORE {
maxForks = 50
maxForks = 25
}
}
}
In SLURM, a queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
@@ -102,9 +109,6 @@ instead:
.. note:: The name of the nextflow and singularity modules will be different in
your local environment

.. note:: Your institution may already have `a nextflow profile`_, which can be
used instead of setting up a custom config using ``-c``

.. note:: Think about enabling fast variant matching with ``--fast_match``!
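
As a concrete sketch combining the points above (the partition and account
names are placeholders; adjust them to your cluster):

.. code-block:: text

    process {
        executor = 'slurm'
        queue = 'my_partition'
        clusterOptions = '--account=my_account'
    }

The pipeline can then be launched with the custom config and fast matching enabled:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile singularity -c my_cluster.config --fast_match [other options]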

@@ -126,11 +130,11 @@ Other environments

Nextflow also supports submitting jobs to platforms like:

- Google cloud
- Azure cloud
- Amazon cloud
- Kubernetes
- Google cloud (https://www.nextflow.io/docs/latest/google.html)
- Azure cloud (https://www.nextflow.io/docs/latest/azure.html)
- Amazon cloud (https://www.nextflow.io/docs/latest/aws.html)
- Kubernetes (https://www.nextflow.io/docs/latest/kubernetes.html)

Check the `nextflow documentation`_ for configuration specifics.

.. _`nextflow documentation`: https://nextflow.io/docs/latest/google.html
.. _`nextflow documentation`: https://nextflow.io/docs/latest/
86 changes: 50 additions & 36 deletions docs/index.rst
@@ -1,14 +1,33 @@
:orphan:

``pgsc_calc``: a reproducible workflow to calculate polygenic scores
=================================================
====================================================================

The ``pgsc_calc`` workflow makes it easy to calculate a :term:`polygenic score` using
scoring files of PGS published in the `Polygenic Score (PGS) Catalog`_ |:dna:|
The ``pgsc_calc`` workflow makes it easy to calculate a :term:`polygenic score` (PGS) using
scoring files published in the `Polygenic Score (PGS) Catalog`_ |:dna:|
and/or custom scoring files.

The calculator workflow automates PGS downloads from the Catalog,
variant matching between scoring files and target genotyping samplesets,
and the parallel calculation of multiple PGS.

.. _`Polygenic Score (PGS) Catalog`: https://www.pgscatalog.org/

Workflow summary
----------------

Currently the pipeline can:

- Fetch scoring files using the PGS Catalog API in a specified genome build (GRCh37 and GRCh38).
- Read custom scoring files (perform liftover if genotyping data is in a different build).
- Match variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format)
- Automatically combine and split different scoring files for efficient parallel computation of multiple PGS
- Calculate and create aggregate score data for all samples
- Publish a summary report to visualize score distributions and pipeline metadata (variant matching QC)

See the `Features Under Development <Features Under Development_>`_ section for information
about planned updates.

Quick example
-------------

@@ -59,27 +78,6 @@ The workflow should output:

If you want to try the workflow with your own data, have a look at the
:ref:`get started` section.

Workflow summary
----------------

- Fetch scoring files using the PGS Catalog API in a specified genome build (GRCh37 and GRCh38).
- Read custom scoring files (perform liftover if genotyping data is in a different build).
- Match variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format)
- Automatically combine and split different scoring files for efficient parallel computation of multiple PGS
- Calculate and create aggregate score data for all samples
- Publish a summary report to visualize score distributions and pipeline metadata (variant matching QC)

In the future, the calculator will include new features for PGS interpretation:

- *Genetic Ancestry*: calculate similarity of target samples to populations in a
reference dataset (e.g. `1000 Genomes (1000G)`_, `Human Genome Diversity Project (HGDP)`_)
using principal components analysis (PCA).
- *PGS Normalization*: Using reference population data and/or PCA projections to report
individual-level PGS predictions (e.g. percentiles, z-scores) that account for genetic ancestry.

.. _1000 Genomes (1000G): http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html
.. _Human Genome Diversity Project (HGDP): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115999/

Documentation
-------------
@@ -95,22 +93,20 @@ Changelog

The :doc:`Changelog page<changelog>` describes fixes and enhancements for each version.

Citations
---------

If you use ``pgscatalog/pgsc_calc`` in your analysis, please cite:
Features Under Development
--------------------------

PGS Catalog Calculator `(in development)`. PGS Catalog
Team. https://github.com/PGScatalog/pgsc_calc
In the future, the calculator will include new features for PGS interpretation:

Lambert `et al.` (2021) The Polygenic Score Catalog as an open database for
reproducibility and systematic evaluation. Nature Genetics. 53:420–425
doi:`10.1038/s41588-021-00783-5`_.
- *Genetic Ancestry*: calculate similarity of target samples to populations in a
reference dataset (e.g. `1000 Genomes (1000G)`_, `Human Genome Diversity Project (HGDP)`_)
using principal components analysis (PCA).
- *PGS Normalization*: Using reference population data and/or PCA projections to report
individual-level PGS predictions (e.g. percentiles, z-scores) that account for genetic ancestry.

In addition, please remember to cite the other papers and software tools described in the `citations file`_.
.. _1000 Genomes (1000G): http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html
.. _Human Genome Diversity Project (HGDP): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115999/

.. _citations file: https://github.com/PGScatalog/pgsc_calc/blob/master/CITATIONS.md
.. _10.1038/s41588-021-00783-5: https://doi.org/10.1038/s41588-021-00783-5

Credits
-------
@@ -132,6 +128,24 @@ is ongoing including Inouye lab members (Rodrigo Canovas, Scott Ritchie) and oth
manuscript describing the tool is in preparation (see `Citations <Citations_>`_) and we
welcome ongoing community feedback before then.

Citations
~~~~~~~~~

If you use ``pgscatalog/pgsc_calc`` in your analysis, please cite:

PGS Catalog Calculator `(in development)`. PGS Catalog
Team. https://github.com/PGScatalog/pgsc_calc

Lambert `et al.` (2021) The Polygenic Score Catalog as an open database for
reproducibility and systematic evaluation. Nature Genetics. 53:420–425
doi:`10.1038/s41588-021-00783-5`_.

In addition, please remember to cite the other papers and software tools described in the `citations file`_.

.. _citations file: https://github.com/PGScatalog/pgsc_calc/blob/master/CITATIONS.md
.. _10.1038/s41588-021-00783-5: https://doi.org/10.1038/s41588-021-00783-5


Others
~~~~~~

24 changes: 14 additions & 10 deletions docs/explanation/output.rst → docs/output.rst
@@ -1,11 +1,11 @@

.. _interpret:

Understanding workflow output
``pgsc_calc`` Outputs & Results
===============================


The pipeline outputs are writtent to a results directory
The pipeline outputs are written to a results directory
(``--outdir`` default is ``./results/``) that contains three subdirectories:

- ``score/``
@@ -27,13 +27,14 @@ Each row represents an individual, and there should be at least three columns wi

At least one score must be present in this file (the third column). Extra columns might be
present if you calculated more than one score, or if you calculated the PGS on a dataset with a
small sample size (n < 50, in this cases a column named ``[PGS NAME]_AVG`` will be added that normalizes the PGS
using the number of non-missing genotypes to avoid using allele frequency data from the target sample).
small sample size (n < 50, in this case a column named ``[PGS NAME]_AVG`` will be added that
normalizes the PGS using the number of non-missing genotypes to avoid using allele frequency data
from the target sample).
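
To take a quick look at the aggregated scores from the command line (a sketch;
the exact file name is an assumption and may differ between releases):

.. code-block:: console

    $ zcat results/score/aggregated_scores.txt.gz | head | column -t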

A summary report is also available (``report.html``). The report should open in
a web browser and contain useful information about the PGS that were applied,
how well the variants match with the genotyping data, and some simple graphs
displaying the distribution of scores in your dataset(s).
displaying the distribution of scores in your dataset(s) as a density plot.

``match/``
----------
@@ -63,8 +64,11 @@ files, scores are aggregated to produce the final results in ``score/``.
``pipeline_info/``
------------------

Summary reports generated by nextflow describes the execution of the pipeline in
a lot of technical detail. The execution report can be useful to see how long a
job takes to execute, and how much memory/cpu has been allocated (or overallocated)
to specific jobs. The DAG is a diagram that may be useful to understand how
the pipeline processes data.
Summary reports generated by nextflow describe the execution of the pipeline in
a lot of technical detail (see the `nextflow tracing & visualisation`_ docs).
The execution report can be useful to see how long a job takes to execute, and how much
memory/cpu has been allocated (or overallocated) to specific jobs. The DAG is a visualization
of the pipeline that may be useful for understanding how the pipeline processes data and the
ordering of the modules.
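
These reports come from nextflow's built-in tracing options, which can also be
requested explicitly on the command line (a sketch; the DAG file name is
arbitrary):

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker [other options] \
        -with-report -with-trace -with-timeline -with-dag dag.svg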

.. _`nextflow tracing & visualisation`: https://www.nextflow.io/docs/latest/tracing.html
3 changes: 2 additions & 1 deletion docs/troubleshooting.rst
@@ -19,7 +19,8 @@ Did you forget to set ``--max_cpu`` or ``--max_memory?``

You can also edit ``nextflow.config`` to configure cpu and memory permanently. nf-core
provides a `set of example .config files`_, including examples for both institutional
compute clusters (e.g. Cambridge, Sanger) and cloud compute providers (e.g. Google, AWS Tower and Batch).
compute clusters (e.g. Cambridge, Sanger) and cloud compute providers
(e.g. Google, AWS Tower and Batch). See :ref:`big job` for more information.

.. _set of example .config files : https://github.com/nf-core/configs
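
For example (a sketch assuming the nf-core-style resource parameters; set the
values to match your machine):

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker --max_cpus 8 --max_memory '32.GB' [other options]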

