Commit 8ec0d45: update docs
nebfield committed Aug 31, 2022
1 parent 60a44a9
Showing 8 changed files with 104 additions and 80 deletions.
1 change: 1 addition & 0 deletions conf/modules.config
@@ -50,5 +50,6 @@ process {

withName: PLINK2_SCORE {
ext.args2 = ""
maxForks = 1
}
}
2 changes: 1 addition & 1 deletion docs/_templates/globaltoc.html
@@ -5,7 +5,7 @@ <h3>Contents</h3>
<li><a href="{{ pathto('getting-started') }}">Get started</a></li>
<li><a href="{{ pathto('how-to/index') }}">How-to guides</a></li>
<li><a href="{{ pathto('reference/index') }}">Reference guide</a></li>
<li><a href="{{ pathto('explanation/index') }}">Explanation</a></li>
<li><a href="{{ pathto('output') }}">Outputs & Results</a></li>
<li><a href="{{ pathto('troubleshooting') }}">Troubleshooting</a></li>
<li><a href="{{ pathto('glossary') }}">Glossary</a></li>
</ul>
10 changes: 0 additions & 10 deletions docs/explanation/index.rst

This file was deleted.

10 changes: 10 additions & 0 deletions docs/getting-started.rst
@@ -229,3 +229,13 @@ If the workflow didn't execute successfully, have a look at the
:ref:`troubleshoot` section. Remember to replace ``<docker/singularity/conda>``
with the software you have installed on your computer.

4. Next steps & advanced usage
------------------------------

The pipeline ships with default settings that allow it to be run on a
personal computer with smaller datasets (e.g. 1000 Genomes, HGDP).

For information on how to run the pipeline on larger datasets, more powerful computers,
or job schedulers, see :ref:`big job`.

If you are using a newer Mac computer with an M-series chip, see :ref:`arm`.
48 changes: 26 additions & 22 deletions docs/how-to/bigjob.rst
@@ -1,19 +1,20 @@
.. _big job:

How do I run big jobs on a powerful computer?
=============================================
How do I run ``pgsc_calc`` on larger datasets and more powerful computers?
===========================================================================

If you want to calculate many polygenic scores for a very large dataset, like
the UK BioBank, you might need some extra computing power! You might have access
to a powerful workstation, a University cluster, or some cloud compute
resources. This section will show how to set up pgsc_calc to submit work to
these types of systems.
If you want to calculate many polygenic scores for a very large dataset (e.g. UK Biobank),
you will likely need to adjust the pipeline settings. You might have access to a powerful workstation,
a university cluster, or some cloud compute resources. This section will show how to set up
``pgsc_calc`` to submit work to these types of systems by creating and editing `nextflow .config files`_.
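
For example, a custom configuration file can be passed to the workflow with
nextflow's ``-c`` option. A sketch, where the profile and the other options are
placeholders for your own setup:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker -c my_custom.config [other options]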

Configuring pgsc_calc to use more resources locally
---------------------------------------------------
.. _nextflow .config files: https://www.nextflow.io/docs/latest/config.html

Configuring ``pgsc_calc`` to use more resources locally
--------------------------------------------------------

If you have a powerful computer available locally, you can configure the amount
of resources that the workflow uses.
of resources that each job in the workflow uses.
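
A minimal local override might look like the following sketch (an assumption
based on the nf-core-style ``max_cpus`` / ``max_memory`` parameters mentioned
in the troubleshooting guide; the values are illustrative):

.. code-block:: text

    params {
        // adjust to the resources actually available on your machine
        max_cpus   = 16
        max_memory = '64.GB'
        max_time   = '24.h'
    }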

.. code-block:: text
@@ -53,7 +54,13 @@ High performance computing cluster

If you have access to an HPC cluster, you'll need to configure your cluster's
unique parameters to set correct queues, user accounts, and resource
limits. Here's an example for an LSF cluster:
limits.

.. note:: Your institution may already have `a nextflow profile`_ with existing cluster settings
   that can be adapted instead of setting up a custom config using ``-c``.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:

.. code-block:: text
@@ -74,10 +81,10 @@ limits. Here's an example for an LSF cluster:
time = 4.h
}
withName: PLINK2_SCORE {
maxForks = 50
maxForks = 25
}
}
}
In SLURM, a queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
@@ -102,9 +109,6 @@ instead:
.. note:: The name of the nextflow and singularity modules will be different in
your local environment

.. note:: Your institution may already have `a nextflow profile`_, which can be
used instead of setting up a custom config using ``-c``

.. note:: Think about enabling fast variant matching with ``--fast_match``!
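
As a concrete sketch combining the points above (the partition and account
names are placeholders; adjust them to your cluster):

.. code-block:: text

    process {
        executor = 'slurm'
        queue = 'my_partition'
        clusterOptions = '--account=my_account'
    }

The pipeline can then be launched with the custom config and fast matching enabled:

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile singularity -c my_cluster.config --fast_match [other options]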

@@ -126,11 +130,11 @@ Other environments

Nextflow also supports submitting jobs to platforms like:

- Google cloud
- Azure cloud
- Amazon cloud
- Kubernetes
- Google cloud (https://www.nextflow.io/docs/latest/google.html)
- Azure cloud (https://www.nextflow.io/docs/latest/azure.html)
- Amazon cloud (https://www.nextflow.io/docs/latest/aws.html)
- Kubernetes (https://www.nextflow.io/docs/latest/kubernetes.html)

Check the `nextflow documentation`_ for configuration specifics.

.. _`nextflow documentation`: https://nextflow.io/docs/latest/google.html
.. _`nextflow documentation`: https://nextflow.io/docs/latest/
86 changes: 50 additions & 36 deletions docs/index.rst
@@ -1,14 +1,33 @@
:orphan:

``pgsc_calc``: a reproducible workflow to calculate polygenic scores
=================================================
====================================================================

The ``pgsc_calc`` workflow makes it easy to calculate a :term:`polygenic score` using
scoring files of PGS published in the `Polygenic Score (PGS) Catalog`_ |:dna:|
The ``pgsc_calc`` workflow makes it easy to calculate a :term:`polygenic score` (PGS) using
scoring files published in the `Polygenic Score (PGS) Catalog`_ |:dna:|
and/or custom scoring files.

The calculator workflow automates PGS downloads from the Catalog,
variant matching between scoring files and target genotyping samplesets,
and the parallel calculation of multiple PGS.

.. _`Polygenic Score (PGS) Catalog`: https://www.pgscatalog.org/

Workflow summary
----------------

Currently the pipeline can:

- Fetch scoring files using the PGS Catalog API in a specified genome build (GRCh37 and GRCh38).
- Read custom scoring files (perform liftover if genotyping data is in a different build).
- Match variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format)
- Automatically combine and split different scoring files for efficient parallel computation of multiple PGS
- Calculate and create aggregate score data for all samples
- Publish a summary report to visualize score distributions and pipeline metadata (variant matching QC)

See the `Features Under Development <Features Under Development_>`_ section for information
about planned updates.

Quick example
-------------

@@ -59,27 +78,6 @@ The workflow should output:

If you want to try the workflow with your own data, have a look at the
:ref:`get started` section.

Workflow summary
----------------

- Fetch scoring files using the PGS Catalog API in a specified genome build (GRCh37 and GRCh38).
- Read custom scoring files (perform liftover if genotyping data is in a different build).
- Match variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format)
- Automatically combine and split different scoring files for efficient parallel computation of multiple PGS
- Calculate and create aggregate score data for all samples
- Publish a summary report to visualize score distributions and pipeline metadata (variant matching QC)

In the future, the calculator will include new features for PGS interpretation:

- *Genetic Ancestry*: calculate similarity of target samples to populations in a
reference dataset (e.g. `1000 Genomes (1000G)`_, `Human Genome Diversity Project (HGDP)`_)
using principal components analysis (PCA).
- *PGS Normalization*: Using reference population data and/or PCA projections to report
individual-level PGS predictions (e.g. percentiles, z-scores) that account for genetic ancestry.

.. _1000 Genomes (1000G): http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html
.. _Human Genome Diversity Project (HGDP): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115999/

Documentation
-------------
@@ -95,22 +93,20 @@ Changelog

The :doc:`Changelog page<changelog>` describes fixes and enhancements for each version.

Citations
---------

If you use ``pgscatalog/pgsc_calc`` in your analysis, please cite:
Features Under Development
--------------------------

PGS Catalog Calculator `(in development)`. PGS Catalog
Team. https://github.com/PGScatalog/pgsc_calc
In the future, the calculator will include new features for PGS interpretation:

Lambert `et al.` (2021) The Polygenic Score Catalog as an open database for
reproducibility and systematic evaluation. Nature Genetics. 53:420–425
doi:`10.1038/s41588-021-00783-5`_.
- *Genetic Ancestry*: calculate similarity of target samples to populations in a
reference dataset (e.g. `1000 Genomes (1000G)`_, `Human Genome Diversity Project (HGDP)`_)
using principal components analysis (PCA).
- *PGS Normalization*: Using reference population data and/or PCA projections to report
individual-level PGS predictions (e.g. percentiles, z-scores) that account for genetic ancestry.

In addition, please remember to cite the other papers and software tools described in the `citations file`_.
.. _1000 Genomes (1000G): http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html
.. _Human Genome Diversity Project (HGDP): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115999/

.. _citations file: https://github.com/PGScatalog/pgsc_calc/blob/master/CITATIONS.md
.. _10.1038/s41588-021-00783-5: https://doi.org/10.1038/s41588-021-00783-5

Credits
-------
@@ -132,6 +128,24 @@ is ongoing including Inouye lab members (Rodrigo Canovas, Scott Ritchie) and oth
manuscript describing the tool is in preparation (see `Citations <Citations_>`_) and we
welcome ongoing community feedback before then.

Citations
~~~~~~~~~

If you use ``pgscatalog/pgsc_calc`` in your analysis, please cite:

PGS Catalog Calculator `(in development)`. PGS Catalog
Team. https://github.com/PGScatalog/pgsc_calc

Lambert `et al.` (2021) The Polygenic Score Catalog as an open database for
reproducibility and systematic evaluation. Nature Genetics. 53:420–425
doi:`10.1038/s41588-021-00783-5`_.

In addition, please remember to cite the other papers and software tools described in the `citations file`_.

.. _citations file: https://github.com/PGScatalog/pgsc_calc/blob/master/CITATIONS.md
.. _10.1038/s41588-021-00783-5: https://doi.org/10.1038/s41588-021-00783-5


Others
~~~~~~

24 changes: 14 additions & 10 deletions docs/explanation/output.rst → docs/output.rst
@@ -1,11 +1,11 @@

.. _interpret:

Understanding workflow output
``pgsc_calc`` Outputs & Results
===============================


The pipeline outputs are writtent to a results directory
The pipeline outputs are written to a results directory
(``--outdir`` default is ``./results/``) that contains three subdirectories:

- ``score/``
@@ -27,13 +27,14 @@ Each row represents an individual, and there should be at least three columns wi

At least one score must be present in this file (the third column). Extra columns might be
present if you calculated more than one score, or if you calculated the PGS on a dataset with a
small sample size (n < 50, in this cases a column named ``[PGS NAME]_AVG`` will be added that normalizes the PGS
using the number of non-missing genotypes to avoid using allele frequency data from the target sample).
small sample size (n < 50, in this case a column named ``[PGS NAME]_AVG`` will be added that
normalizes the PGS using the number of non-missing genotypes to avoid using allele frequency data
from the target sample).
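
To take a quick look at the aggregated scores from the command line (a sketch;
the exact file name is an assumption and may differ between releases):

.. code-block:: console

    $ zcat results/score/aggregated_scores.txt.gz | head | column -t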

A summary report is also available (``report.html``). The report should open in
a web browser and contain useful information about the PGS that were applied,
how well the variants match with the genotyping data, and some simple graphs
displaying the distribution of scores in your dataset(s).
displaying the distribution of scores in your dataset(s) as a density plot.

``match/``
----------
@@ -63,8 +64,11 @@ files, scores are aggregated to produce the final results in ``score/``.
``pipeline_info/``
------------------

Summary reports generated by nextflow describes the execution of the pipeline in
a lot of technical detail. The execution report can be useful to see how long a
job takes to execute, and how much memory/cpu has been allocated (or overallocated)
to specific jobs. The DAG is a diagram that may be useful to understand how
the pipeline processes data.
Summary reports generated by nextflow describe the execution of the pipeline in
a lot of technical detail (see the `nextflow tracing & visualisation`_ docs).
The execution report can be useful to see how long a job takes to execute, and how much
memory/cpu has been allocated (or overallocated) to specific jobs. The DAG is a visualization
of the pipeline that may be useful for understanding how the pipeline processes data and the
ordering of the modules.
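
These reports come from nextflow's built-in tracing options, which can also be
requested explicitly on the command line (a sketch; the DAG file name is
arbitrary):

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker [other options] \
        -with-report -with-trace -with-timeline -with-dag dag.svg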

.. _`nextflow tracing & visualisation`: https://www.nextflow.io/docs/latest/tracing.html
3 changes: 2 additions & 1 deletion docs/troubleshooting.rst
@@ -19,7 +19,8 @@ Did you forget to set ``--max_cpu`` or ``--max_memory?``

You can also edit ``nextflow.config`` to configure cpu and memory permanently. nf-core
provides a `set of example .config files`_, including examples for both institutional
compute clusters (e.g. Cambridge, Sanger) and cloud compute providers (e.g. Google, AWS Tower and Batch).
compute clusters (e.g. Cambridge, Sanger) and cloud compute providers
(e.g. Google, AWS Tower and Batch). See :ref:`big job` for more information.

.. _set of example .config files : https://github.com/nf-core/configs
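
For example (a sketch assuming the nf-core-style resource parameters; set the
values to match your machine):

.. code-block:: console

    $ nextflow run pgscatalog/pgsc_calc -profile docker --max_cpus 8 --max_memory '32.GB' [other options]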

