Commit

add docs
nebfield committed May 26, 2022
1 parent 5a7f781 commit add194e
Showing 4 changed files with 184 additions and 47 deletions.
26 changes: 15 additions & 11 deletions docs/index.rst
@@ -18,13 +18,13 @@ Introduction
------------

``pgsc_calc`` is a bioinformatics best-practice analysis pipeline for applying
scoring files from the `Polygenic Score (PGS)
Catalog <https://www.pgscatalog.org/>`_ to target genotyped samples |:dna:| |:partying_face:|
scoring files from the `Polygenic Score (PGS) Catalog
<https://www.pgscatalog.org/>`_ to target genotyped samples |:dna:|
|:partying_face:|

.. note::

This project is under active development and may break at any time but should
be ready soon |:tm:|
This project is under very active development and updates are frequent

Quick start
-----------
@@ -39,7 +39,7 @@ Quick start
nextflow run pgscatalog/pgsc_calc -profile test,docker
.. note:: The ``docker`` profile option can be replaced with ``singularity`` or
``conda``.
``conda`` depending on your local environment

.. _`Docker`: https://docs.docker.com/get-docker/
.. _`Singularity`: https://sylabs.io/
@@ -48,14 +48,18 @@ Quick start
Workflow summary
----------------

- Optionally, fetch a scorefile from the PGS Catalog API
- Convert target genomic data from VCF to plink format automatically
- Split target genomic data automatically
- Relabel variants to a common identifier
- Match variants in the scoring file against variants in the target genome
- Calculate scores for each sample
- Optionally, fetch scoring files from the PGS Catalog API
- Convert target genomic data to plink 2 binary fileset format automatically
- Match variants in the scoring files against variants in the target genome
- Create a set of new scoring files from the matched variant data
- Calculate scores for each sample from each scoring file
- Produce a summary report

In the future, the calculator will include:

- Build conversion of scoring files
- Ancestry estimation

Credits
-------

105 changes: 83 additions & 22 deletions docs/input.rst
@@ -5,10 +5,11 @@ To calculate a polygenic score, you need to provide the workflow with two
inputs:

- :term:`target genomic data`
- a :term:`scoring file`
- a :term:`scoring file`, or a list of scoring files

At its simplest, target genomic data might be a single :term:`VCF` file. Larger
and more complex datasets might be split across multiple files. There are two
At its simplest, target genomic data might be a single :term:`VCF` file or a
plink 1 binary fileset (i.e., bed / bim / fam). Larger and more complex datasets
might be split across multiple files, separated by chromosome. There are two
ways to specify the structure of target genomic data:

- A :term:`CSV` file (a "samplesheet")
@@ -19,9 +20,6 @@ Excel or similar spreadsheet software. The use of samplesheets is quite popular
across `nf-core`_ pipelines.

.. _nf-core: https://nf-co.re/

.. warning:: Genomic data must currently be in build 37, but we're working to
support other builds ASAP

Samplesheet
-----------
@@ -52,9 +50,9 @@ A template is `available here`_.
There are four mandatory columns. Two columns, **vcf_path** and **bfile_path**,
are mutually exclusive and specify the genomic files:

- **sample**: A text string containing the name of a sample/experiment, which
can be split across multiple files. Scores generated from files with the same
sample name are cominbed in later stages of the analysis.
- **sample**: A text string containing the name of a dataset, which can be split
across multiple files. Scores generated from files with the same sample name
are combined in later stages of the analysis.
- **vcf_path**: A text string of a file path pointing to a multi-sample
:term:`VCF` file. File names must be unique.
- **bfile_path**: A text string of a file path pointing to the prefix of a plink
@@ -63,28 +61,91 @@ are mutually exclusive and specify the genomic files:
- **chrom**: An integer, range 1-22. If the target genomic data contains
multiple chromosomes, leave empty.
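
The column rules above can be sketched as a small check. This is only an illustration of the rules as described in this section, not the calculator's own validation (which is driven by the input schema); the function names ``validate_row`` and ``validate_samplesheet`` are hypothetical.

```python
import csv
import io


def validate_row(row: dict) -> None:
    """Raise ValueError if a samplesheet row breaks the rules described above."""
    if not row["sample"].strip():
        raise ValueError("sample must not be empty")

    has_vcf = bool(row["vcf_path"].strip())
    has_bfile = bool(row["bfile_path"].strip())
    # vcf_path and bfile_path are mutually exclusive: exactly one must be set
    if has_vcf == has_bfile:
        raise ValueError("set exactly one of vcf_path or bfile_path")

    chrom = row["chrom"].strip()
    # chrom is either empty (multiple chromosomes in one file) or an integer 1-22
    if chrom and not (chrom.isdecimal() and 1 <= int(chrom) <= 22):
        raise ValueError(f"chrom must be empty or an integer from 1 to 22, got {chrom!r}")


def validate_samplesheet(text: str) -> int:
    """Validate every row of a samplesheet and return the number of rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        validate_row(row)
    return len(rows)
```

Running a check like this before launching the workflow can catch a row that sets both paths, or a chromosome outside 1-22, without waiting for the pipeline to start.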

.. _`available here`: https://github.com/PGScatalog/pgsc_calc/tree/master/assets/examples/example_data
.. _`available here`: https://github.com/PGScatalog/pgsc_calc/tree/master/assets/examples/samplesheet.csv

The documentation below is automatically generated from the input schema and
contains additional technical detail.

.. jsonschema:: ../assets/schema_input.json
.. _`example`: https://github.com/PGScatalog/pgsc_calc/blob/master/assets/api_examples/input.json

Scoring file
------------
Scoring files
-------------

PGS Catalog
~~~~~~~~~~~

The calculator natively supports scoring files submitted to the PGS Catalog
using the parameter ``--accession``. Setting this parameter means that the
calculator will query the PGS Catalog API and automatically fetch scoring
files. Multiple accessions can be specified using a comma separated list, e.g.:

.. code-block:: bash

    --accession PGS001229,PGS000014

Scoring files can be specified with ``--scorefile``, which must be a string of a
file path. Scorefiles must be:
Multiple accessions will be merged and processed in parallel. If you want to
calculate a lot of scores for your dataset, it's always more efficient to
specify multiple accessions and to run the calculator once (instead of running
the calculator multiple times, once per accession). Accessions should always
start with the prefix "PGS".
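
As a rough sketch, a value for ``--accession`` could be checked before a run like this. The pattern (``PGS`` followed by digits) is an assumption based on the examples in this section, and ``parse_accessions`` is a hypothetical helper rather than part of the calculator.

```python
import re

# Assumption: accessions look like "PGS" followed by digits, as in the examples
ACCESSION_PATTERN = re.compile(r"PGS\d+")


def parse_accessions(value: str) -> list:
    """Split a comma separated accession string and check each accession."""
    accessions = value.split(",")
    for accession in accessions:
        # fullmatch also rejects stray spaces, e.g. "PGS001229, PGS000014"
        if not ACCESSION_PATTERN.fullmatch(accession):
            raise ValueError(f"invalid accession: {accession!r}")
    return accessions
```

For example, ``parse_accessions("PGS001229,PGS000014")`` returns ``['PGS001229', 'PGS000014']``, while a value containing a space after the comma raises ``ValueError``.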

- PGS Catalog `scoring file format v2.0`_
- Genome build 37
.. warning:: You MUST check that the PGS Catalog accession and target genomic
data are in the same build (e.g. GRCh37) for your calculated scores
to be biologically meaningful. We're working to support automatic
build conversion.

.. note:: We're adding support for custom scoring files, multiple scoring files,
and build conversion very soon!
.. _custom scoring:

.. note:: ``--accession`` can be used to automatically fetch a scorefile via the
PGS Catalog API. The ``--accession`` and ``--scorefile`` parameters are
mutually exclusive.
Custom scoring files
~~~~~~~~~~~~~~~~~~~~

The calculator also supports using custom scoring files that haven't been
submitted to the PGS Catalog. The custom scorefile should have the following format:

.. list-table:: Scorefile template
:widths: 20 20 20 20 20
:header-rows: 1

* - chr_name
- chr_position
- effect_allele
- other_allele
- effect_weight
* - 22
- 17080378
- G
- A
- 0.01045457

Where column names are defined in the PGS Catalog `scoring file format v2.0`_.
The file should be in tab separated values (TSV) format. Example `scorefile
templates`_ are available in the calculator repository. Two additional optional
columns can be set to specify the effect type of each variant:

.. list-table:: Optional effect type columns
:widths: 50 50
:header-rows: 1

.. _`scoring file format v2.0`: https://www.pgscatalog.org/downloads/#scoring_header
* - is_dominant
- is_recessive
* - TRUE
- FALSE

These optional columns follow the structure described in the PGS Catalog
`scoring file format v2.0`_ and should be included after the effect_weight
column. Briefly, a variant with an additive effect type (the default if optional
columns are not set) is specified by setting both columns to FALSE. If the
variant effect type is recessive, set is_recessive to TRUE. If the variant
effect type is dominant, set is_dominant to TRUE. The columns are mutually
exclusive (a variant cannot be dominant and recessive).
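
The effect type rules above can be written out as a short function. This is a sketch of the logic as described here, not the calculator's implementation; ``parse_flag`` and ``effect_type`` are hypothetical names, and accepting lower-case values is an assumption.

```python
def parse_flag(value: str) -> bool:
    """Interpret the TRUE / FALSE strings used in the optional columns."""
    # Assumption: values are compared case-insensitively
    normalised = value.strip().upper()
    if normalised not in {"TRUE", "FALSE"}:
        raise ValueError(f"expected TRUE or FALSE, got {value!r}")
    return normalised == "TRUE"


def effect_type(row: dict) -> str:
    """Return 'additive', 'dominant' or 'recessive' for one scorefile row."""
    # Missing optional columns fall back to FALSE, i.e. the additive default
    is_dominant = parse_flag(row.get("is_dominant", "FALSE"))
    is_recessive = parse_flag(row.get("is_recessive", "FALSE"))
    # The columns are mutually exclusive
    if is_dominant and is_recessive:
        raise ValueError("a variant cannot be both dominant and recessive")
    if is_dominant:
        return "dominant"
    if is_recessive:
        return "recessive"
    return "additive"
```

A row without the optional columns, or with both set to FALSE, is treated as additive.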

The calculator can run using a custom scorefile with the ``--scorefile``
parameter (e.g. ``--scorefile path/to/scorefile.txt``). A custom scorefile can
only contain a single score. If you would like to calculate multiple scores in
parallel, include a wildcard (``*``) with the scorefile parameter
(e.g. ``--scorefile path/to/scorefiles/*.txt``). More detailed examples are
available in the :doc:`Usage </usage>` section of the documentation.

.. _`scorefile templates`: https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/scorefiles
.. _`scoring file format v2.0`: https://www.pgscatalog.org/downloads/#scoring_header
2 changes: 1 addition & 1 deletion docs/troubleshooting.rst
@@ -8,7 +8,7 @@ I get an error about variant matching
- ``--min_overlap`` defaults to 0.75 (75% of variants in scoring file must be
present in target genomes). Try changing this parameter!
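
To make the threshold concrete, the sketch below computes the proportion of scoring file variants found in the target data. It is only an illustration under assumptions: variants are plain identifier strings, matching is exact, and a proportion equal to the threshold passes. How the calculator actually identifies and matches variants is not described here, and the function names are hypothetical.

```python
def overlap(scoring_variants, target_variants) -> float:
    """Proportion of scoring file variants that are present in the target data."""
    scoring = set(scoring_variants)
    if not scoring:
        raise ValueError("the scoring file contains no variants")
    return len(scoring & set(target_variants)) / len(scoring)


def passes_min_overlap(scoring_variants, target_variants, min_overlap=0.75) -> bool:
    # Assumption: a proportion equal to the threshold is accepted
    return overlap(scoring_variants, target_variants) >= min_overlap
```

With four variants in the scoring file and three of them present in the target data, the overlap is 0.75, which meets the default threshold.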

The workflow isn't using much resources (e.g. RAM / CPU)
The workflow isn't using many resources (e.g. RAM / CPU)
--------------------------------------------------------

Did you forget to set ``--max_cpu`` or ``--max_memory``?
98 changes: 85 additions & 13 deletions docs/usage.rst
@@ -3,7 +3,10 @@ Usage

This page describes some typical use cases of the workflow.

.. warning:: Target genomic data must be in build 37 currently
.. warning:: You MUST check that the scorefile and target genomic data are in
the same build (e.g. GRCh37) for your calculated scores to be
biologically meaningful. We're working to support automatic build
conversion.

Calculating scores with a VCF file
----------------------------------
@@ -20,18 +23,20 @@ Firstly, prepare a samplesheet in CSV format with the following structure:
- chrom
* - cineca_synthetic_subset_vcf
- path/to/vcf.gz
-
- 22
-
-

An example samplesheet is available to download `here <https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/bfile_samplesheet.csv>`_.
Secondly, download a polygenic score from the `PGS Catalog`_ and decompress it.
An example samplesheet is available to download `here
<https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/samplesheet.csv>`_.
Secondly, specify one or more PGS Catalog accessions. To specify multiple
accessions, use a comma separated list (with no spaces between accessions).

.. code-block:: bash
nextflow run pgscatalog/pgsc_calc \
-profile docker \
--input example_vcf.csv \
--scorefile example_scorefile.txt
--accession PGS001229,PGS000014
.. _`PGS Catalog`: https://www.pgscatalog.org/

@@ -59,20 +64,87 @@ binary fileset prefix, which is the name of the fileset before the file extensio
* - cineca_synthetic_subset
-
- path/to/bfile_prefix
- 22

An example samplesheet is available to download `here <https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/bfile_samplesheet.csv>`_.
-

Secondly, download a polygenic score from the `PGS Catalog`_ and decompress it.
An example samplesheet is available to download `here
<https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/samplesheet.csv>`_.
Secondly, specify one or more PGS Catalog accessions. To specify multiple
accessions, use a comma separated list (with no spaces between accessions).

.. code-block:: bash
nextflow run pgscatalog/pgsc_calc \
-profile docker \
--input example_bfile.csv \
--scorefile example_scorefile.txt
--accession PGS001229,PGS000014
.. _`binary filesets`: https://www.cog-genomics.org/plink2/formats#bed

Calculating scores with multiple files
--------------------------------------
Calculating scores with split genomic data
------------------------------------------

Sometimes your target genomic data might be split across multiple files. The
calculator supports this type of input data for data split by chromosome. To
work with split genomic data, add rows to the samplesheet (one per file) and
set the **chrom** column. For example:

.. list-table:: Example samplesheet: ``example_split_vcf.csv``
:widths: 25 25 25 25
:header-rows: 1

* - sample
- vcf_path
- bfile_path
- chrom
* - cineca_synthetic_subset_vcf
- path/to/1.vcf.gz
-
- 1
* - ...
- ...
-
- ...
* - cineca_synthetic_subset_vcf
- path/to/22.vcf.gz
-
- 22

You can include as many or as few chromosomes as you want. For example, if your
scoring file only includes variants across 3 chromosomes you can choose to
include only these three chromosomes. Omitting unused chromosomes will make the
calculator slightly faster.
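
Writing 22 rows by hand is tedious, so a samplesheet for split data could be generated with a few lines of Python. This is a minimal sketch: the function name ``split_samplesheet`` and the ``{chrom}`` path template are assumptions made for illustration, and the column order follows the example table above.

```python
import csv
import io


def split_samplesheet(sample: str, path_template: str, chromosomes) -> str:
    """Build samplesheet text with one VCF row per chromosome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "vcf_path", "bfile_path", "chrom"])
    for chrom in chromosomes:
        # bfile_path stays empty because vcf_path and bfile_path are mutually exclusive
        writer.writerow([sample, path_template.format(chrom=chrom), "", chrom])
    return buffer.getvalue()
```

For example, ``split_samplesheet("cineca_synthetic_subset_vcf", "path/to/{chrom}.vcf.gz", range(1, 23))`` produces a header plus one row for each of chromosomes 1 to 22. Passing a shorter list of chromosomes gives a smaller samplesheet.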

Using custom scoring files
--------------------------

If you would like to use a scoring file not in the PGS Catalog, you will need to
format it correctly for the calculator. See the :ref:`custom scoring` section
for an explanation of the scoring file format. Once your scoring file is
prepared, simply replace the ``--accession`` parameter:

.. code-block:: bash

    nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input example_vcf.csv \
        --scorefile /path/to/scorefile.txt

The calculator can calculate multiple scores in parallel efficiently. Just
prepare multiple scoring files, and use a wildcard character (``*``) to set
multiple files:

.. code-block:: bash

    nextflow run pgscatalog/pgsc_calc \
        -profile docker \
        --input example_vcf.csv \
        --scorefile /path/to/scorefile/directory/*.txt

.. note:: It's a good idea to keep scorefiles in a separate and clean
directory. If there are other text files (that aren't scores) in the
same directory, then the calculator will try to use them and break!

.. warning:: The base name of the scoring file (e.g. ``depression.txt`` ->
"depression") is important and used to label scores in the output
report. Please use filenames you'll understand.
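
Because the base name becomes the label, it can help to preview the labels before a run. The sketch below derives a label from each file path with ``pathlib``. It assumes the label is the file name without its final extension, as in the ``depression.txt`` example; flagging duplicate labels is an extra check added here for illustration, not something the calculator is documented to do. ``score_labels`` is a hypothetical name.

```python
from pathlib import Path


def score_labels(paths) -> dict:
    """Map each score label (file base name) to its scoring file path."""
    labels = {}
    for path in map(Path, paths):
        # Path.stem drops the final extension: depression.txt -> depression
        label = path.stem
        if label in labels:
            raise ValueError(f"duplicate score label: {label!r}")
        labels[label] = path
    return labels
```

To preview labels for a wildcard pattern, the paths could come from ``Path("/path/to/scorefile/directory").glob("*.txt")``.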
