add docs

PGScatalog · May 26, 2022 · add194e
1 parent 5a7f781

Showing 4 changed files with 184 additions and 47 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -18,13 +18,13 @@ Introduction
 ------------
 
 ``pgsc_calc`` is a bioinformatics best-practice analysis pipeline for applying
-scoring files from the `Polygenic Score (PGS)
-Catalog <https://www.pgscatalog.org/>`_ to target genotyped samples |:dna:| |:partying_face:|
+scoring files from the `Polygenic Score (PGS) Catalog
+<https://www.pgscatalog.org/>`_ to target genotyped samples |:dna:|
+|:partying_face:|
 
 .. note::
 
-   This project is under active development and may break at any time but should
-   be ready soon |:tm:|
+   This project is under very active development, and updates are frequent.
 
 Quick start
 -----------
@@ -39,7 +39,7 @@ Quick start
     nextflow run pgscatalog/pgsc_calc -profile test,docker
 
 .. note:: The ``docker`` profile option can be replaced with ``singularity`` or
-          ``conda``.
+          ``conda``, depending on your local environment.
 
 .. _`Docker`: https://docs.docker.com/get-docker/
 .. _`Singularity`: https://sylabs.io/
@@ -48,14 +48,18 @@ Quick start
 Workflow summary
 ----------------
 
-- Optionally, fetch a scorefile from the PGS Catalog API
-- Convert target genomic data from VCF to plink format automatically
-- Split target genomic data automatically
-- Relabel variants to a common identifier
-- Match variants in the scoring file against variants in the target genome
-- Calculate scores for each sample
+- Optionally, fetch scoring files from the PGS Catalog API
+- Convert target genomic data to plink 2 binary fileset format automatically
+- Match variants in the scoring files against variants in the target genome
+- Create a set of new scoring files from the matched variant data
+- Calculate scores for each sample from each scoring file
 - Produce a summary report
 
+In the future, the calculator will include:
+
+- Build conversion of scoring files
+- Ancestry estimation
+
 Credits
 -------
 

diff --git a/docs/input.rst b/docs/input.rst
@@ -5,10 +5,11 @@ To calculate a polygenic score, you need to provide the workflow with two
 inputs:
 
 - :term:`target genomic data`
-- a :term:`scoring file`
+- a :term:`scoring file`, or a list of scoring files
 
-At its simplest, target genomic data might be a single :term:`VCF` file. Larger
-and more complex datasets might be split across multiple files. There are two
+At its simplest, target genomic data might be a single :term:`VCF` file or a
+plink 1 binary fileset (i.e., bed / bim / fam). Larger and more complex datasets
+might be split across multiple files, separated by chromosome. There are two
 ways to specify the structure of target genomic data:
 
 - A :term:`CSV` file (a "samplesheet")
@@ -19,9 +20,6 @@ Excel or similar spreadsheet software. The use of samplesheets is quite popular
 across `nf-core`_ pipelines.
 
 .. _nf-core: https://nf-co.re/
-
-.. warning:: Genomic data must currently be in build 37, but we're working to
-             support other builds ASAP
 
 Samplesheet
 -----------
@@ -52,9 +50,9 @@ A template is `available here`_.
 There are four mandatory columns. Two columns, **vcf_path** and **bfile_path**,
 are mutually exclusive and specify the genomic files:
 
-- **sample**: A text string containing the name of a sample/experiment, which
-  can be split across multiple files. Scores generated from files with the same
-  sample name are cominbed in later stages of the analysis.
+- **sample**: A text string containing the name of a dataset, which can be split
+  across multiple files. Scores generated from files with the same sample name
+  are combined in later stages of the analysis.
 - **vcf_path**: A text string of a file path pointing to a multi-sample
   :term:`VCF` file. File names must be unique.
 - **bfile_path**: A text string of a file path pointing to the prefix of a plink
@@ -63,28 +61,91 @@ are mutually exclusive and specify the genomic files:
 - **chrom**: An integer, range 1-22. If the target genomic data contains
   multiple chromosomes, leave empty.
 
-.. _`available here`: https://github.com/PGScatalog/pgsc_calc/tree/master/assets/examples/example_data
+.. _`available here`: https://github.com/PGScatalog/pgsc_calc/tree/master/assets/examples/samplesheet.csv
 
 The documentation below is automatically generated from the input schema and
 contains additional technical detail. 
 
 .. jsonschema:: ../assets/schema_input.json
 .. _`example`: https://github.com/PGScatalog/pgsc_calc/blob/master/assets/api_examples/input.json
 
-Scoring file
-------------
+Scoring files
+-------------
+
+PGS Catalog
+~~~~~~~~~~~
+
+The calculator natively supports scoring files submitted to the PGS Catalog
+using the parameter ``--accession``. Setting this parameter means that the
+calculator will query the PGS Catalog API and automatically fetch scoring
+files. Multiple accessions can be specified using a comma-separated list, e.g.:
+
+.. code-block:: bash
+
+  --accession PGS001229,PGS000014
 
-Scoring files can be specified with ``--scorefile``, which must be a string of a
-file path. Scorefiles must be:
+Multiple accessions will be merged and processed in parallel. If you want to
+calculate a lot of scores for your dataset, it's always more efficient to
+specify multiple accessions and to run the calculator once (instead of running
+the calculator multiple times, once per accession). Accessions should always
+start with the prefix "PGS".
 
-- PGS Catalog `scoring file format v2.0`_
-- Genome build 37
+.. warning:: You MUST check that the PGS Catalog accession and target genomic
+             data are in the same build (e.g. GRCh37) for your calculated scores
+             to be biologically meaningful. We're working to support automatic
+             build conversion.
 
-.. note:: We're adding support for custom scoring files, multiple scoring files,
-          and build conversion very soon!
+.. _custom scoring:
 
-.. note:: ``--accession`` can be used to automatically fetch a scorefile via the
-          PGS Catalog API. The ``--accession`` and ``-scorefile`` parameters are
-          mututally exclusive.
+Custom scoring files
+~~~~~~~~~~~~~~~~~~~~
+
+The calculator also supports using custom scoring files that haven't been
+submitted to the PGS Catalog. The custom scorefile should have the following format:
+
+.. list-table:: Scorefile template
+   :widths: 20 20 20 20 20
+   :header-rows: 1
+
+   * - chr_name
+     - chr_position
+     - effect_allele
+     - other_allele
+     - effect_weight
+   * - 22
+     - 17080378
+     - G
+     - A
+     - 0.01045457
+
+Column names are defined in the PGS Catalog `scoring file format v2.0`_.
+The file should be in tab-separated values (TSV) format. Example `scorefile
+templates`_ are available in the calculator repository. Two additional optional
+columns can be set to specify the effect type of each variant:
+
+.. list-table:: Optional effect type columns
+   :widths: 50 50
+   :header-rows: 1
 
-.. _`scoring file format v2.0`: https://www.pgscatalog.org/downloads/#scoring_header      
+   * - is_dominant
+     - is_recessive
+   * - TRUE
+     - FALSE
+
+These optional columns follow the structure described in the PGS Catalog
+`scoring file format v2.0`_ and should be included after the effect_weight
+column. Briefly, a variant with an additive effect type (the default if optional
+columns are not set) is specified by setting both columns to FALSE. If the
+variant effect type is recessive, set is_recessive to TRUE. If the variant
+effect type is dominant, set is_dominant to TRUE. The columns are mutually
+exclusive (a variant cannot be dominant and recessive).
+
+The calculator can run using a custom scorefile with the ``--scorefile``
+parameter (e.g. ``--scorefile path/to/scorefile.txt``). A custom scorefile can
+only contain a single score. If you would like to calculate multiple scores in
+parallel, include a wildcard (``*``) with the scorefile parameter
+(e.g. ``--scorefile path/to/scorefiles/*.txt``). More detailed examples are
+available in the :doc:`Usage </usage>` section of the documentation. 
+
+.. _`scorefile templates`: https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/scorefiles
+.. _`scoring file format v2.0`: https://www.pgscatalog.org/downloads/#scoring_header
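
The custom scorefile rules added to ``docs/input.rst`` above (five required columns, TSV format, mutually exclusive ``is_dominant`` / ``is_recessive``) can be checked before running the calculator. The following is only a minimal sketch, not part of ``pgsc_calc``: the function name ``check_scorefile`` is hypothetical, and it assumes the file has no ``#`` comment header lines and that effect type flags are written as ``TRUE`` / ``FALSE``.

```python
import csv
import io

# Column names taken from the "Scorefile template" table in the docs.
REQUIRED = ["chr_name", "chr_position", "effect_allele", "other_allele", "effect_weight"]
# Optional effect type columns; both FALSE (or absent) means additive.
OPTIONAL = ["is_dominant", "is_recessive"]


def check_scorefile(text):
    """Return a list of problems found in tab-separated scorefile text.

    Hypothetical helper for illustration; the calculator's own validation
    is not shown in this diff and may differ.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            float(row["effect_weight"])
        except (TypeError, ValueError):
            problems.append(f"line {n}: effect_weight is not numeric")
        # Missing or empty optional columns are treated as FALSE (additive).
        flags = [(row.get(c) or "FALSE").upper() == "TRUE" for c in OPTIONAL]
        if all(flags):
            problems.append(f"line {n}: variant is both dominant and recessive")
    return problems


good = (
    "chr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\n"
    "22\t17080378\tG\tA\t0.01045457\n"
)
bad = (
    "chr_name\tchr_position\teffect_allele\tother_allele\teffect_weight"
    "\tis_dominant\tis_recessive\n"
    "22\t17080378\tG\tA\t0.01045457\tTRUE\tTRUE\n"
)
print(check_scorefile(good))
print(check_scorefile(bad))
```

Returning a list of problems, rather than raising on the first one, makes it easier to fix a hand-written scorefile in a single pass.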
diff --git a/docs/troubleshooting.rst b/docs/troubleshooting.rst
@@ -8,7 +8,7 @@ I get an error about variant matching
 - ``--min_overlap`` defaults to 0.75 (75% of variants in scoring file must be
   present in target genomes). Try changing this parameter!
 
-The workflow isn't using much resources (e.g. RAM / CPU)
+The workflow isn't using many resources (e.g. RAM / CPU)
 --------------------------------------------------------
 
 Did you forget to set ``--max_cpu`` or ``--max_memory``?
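
To make the ``--min_overlap`` threshold in ``docs/troubleshooting.rst`` concrete, here is a simplified sketch of the arithmetic behind "75% of variants in scoring file must be present in target genomes". The function names are hypothetical and variants are compared as plain tuples; the calculator's real matching logic is not shown in this diff and is likely more involved.

```python
def overlap_fraction(score_variants, target_variants):
    """Fraction of scoring-file variants that also appear in the target data."""
    score = set(score_variants)
    if not score:
        raise ValueError("scoring file has no variants")
    return len(score & set(target_variants)) / len(score)


def passes_min_overlap(score_variants, target_variants, min_overlap=0.75):
    """True if enough scoring-file variants were found in the target data."""
    return overlap_fraction(score_variants, target_variants) >= min_overlap


# Variants as (chromosome, position, effect allele, other allele) tuples.
score = [("22", 100, "G", "A"), ("22", 200, "C", "T"),
         ("22", 300, "A", "G"), ("22", 400, "T", "C")]
target = score[:3] + [("22", 999, "A", "C")]

print(overlap_fraction(score, target))
print(passes_min_overlap(score, target))
```

Note that the fraction is taken over the scoring file, not the target data, so extra variants in the target genomes do not lower it.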

diff --git a/docs/usage.rst b/docs/usage.rst
@@ -3,7 +3,10 @@ Usage
 
 This page describes some typical use cases of the workflow.
 
-.. warning:: Target genomic data must be in build 37 currently
+.. warning:: You MUST check that the scorefile and target genomic data are in
+             the same build (e.g. GRCh37) for your calculated scores to be
+             biologically meaningful. We're working to support automatic build
+             conversion.
 
 Calculating scores with a VCF file
 ----------------------------------
@@ -20,18 +23,20 @@ Firstly, prepare a samplesheet in CSV format with the following structure:
      - chrom
    * - cineca_synthetic_subset_vcf
      - path/to/vcf.gz
-     - 
-     - 22
+     -
+     -
 
-An example samplesheet is available to download `here <https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/bfile_samplesheet.csv>`_.       
-Secondly, download a polygenic score from the `PGS Catalog`_ and decompress it.
+An example samplesheet is available to download `here
+<https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/samplesheet.csv>`_.
+Secondly, specify one or more PGS Catalog accessions. To specify multiple
+accessions, use a comma-separated list (with no spaces between accessions).
 
 .. code-block:: bash
 
     nextflow run pgscatalog/pgsc_calc \
         -profile docker \
         --input example_vcf.csv \
-        --scorefile example_scorefile.txt
+        --accession PGS001229,PGS000014
 
 .. _`PGS Catalog`: https://www.pgscatalog.org/
 
@@ -59,20 +64,87 @@ binary fileset prefix, which is the name of the fileset before the file extensio
    * - cineca_synthetic_subset
      -
      - path/to/bfile_prefix
-     - 22
-
-An example samplesheet is available to download `here <https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/bfile_samplesheet.csv>`_.
+     - 
 
-Secondly, download a polygenic score from the `PGS Catalog`_ and decompress it.
+An example samplesheet is available to download `here
+<https://github.com/PGScatalog/pgsc_calc/blob/master/assets/examples/example_data/samplesheet.csv>`_.
+Secondly, specify one or more PGS Catalog accessions. To specify multiple
+accessions, use a comma-separated list (with no spaces between accessions).
 
 .. code-block:: bash
 
     nextflow run pgscatalog/pgsc_calc \
         -profile docker \
         --input example_bfile.csv \
-        --scorefile example_scorefile.txt
+        --accession PGS001229,PGS000014 
 
 .. _`binary filesets`: https://www.cog-genomics.org/plink2/formats#bed
 
-Calculating scores with multiple files
---------------------------------------
+Calculating scores with split genomic data
+------------------------------------------
+
+Sometimes your target genomic data might be split across multiple files. The
+calculator supports target genomic data that is split by chromosome. To work
+with split genomic data, add rows to the samplesheet (one per file) and set
+the **chrom** column. For example:
+
+.. list-table:: Example samplesheet: ``example_split_vcf.csv``
+   :widths: 25 25 25 25
+   :header-rows: 1
+
+   * - sample
+     - vcf_path
+     - bfile_path
+     - chrom
+   * - cineca_synthetic_subset_vcf
+     - path/to/1.vcf.gz
+     -
+     - 1
+   * - ...
+     - ...
+     -
+     - ...
+   * - cineca_synthetic_subset_vcf
+     - path/to/22.vcf.gz
+     -
+     - 22
+
+You can include as many or as few chromosomes as you want. For example, if your
+scoring file only includes variants on three chromosomes, you can choose to
+include only those three chromosomes. Omitting unused chromosomes will make the
+calculator slightly faster.
+
+Using custom scoring files
+--------------------------
+
+If you would like to use a scoring file not in the PGS Catalog, you will need to
+format it correctly for the calculator. See the :ref:`custom scoring` section
+for an explanation of the scoring file format. Once your scoring file is
+prepared, simply replace the ``--accession`` parameter:
+
+.. code-block:: bash
+
+    nextflow run pgscatalog/pgsc_calc \
+        -profile docker \
+        --input example_vcf.csv \
+        --scorefile /path/to/scorefile.txt
+
+The calculator can calculate multiple scores in parallel efficiently. Just
+prepare multiple scoring files, and use a wildcard character (``*``) to set
+multiple files:
+
+.. code-block:: bash
+
+    nextflow run pgscatalog/pgsc_calc \
+        -profile docker \
+        --input example_vcf.csv \
+        --scorefile /path/to/scorefile/directory/*.txt
+
+.. note:: It's a good idea to keep scorefiles in a separate and clean
+          directory. If there are other text files (that aren't scores) in the
+          same directory, then the calculator will try to use them and break!
+
+.. warning:: The base name of the scoring file (e.g. ``depression.txt`` ->
+             "depression") is important and used to label scores in the output
+             report. Please use filenames you'll understand.
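
Writing a split samplesheet by hand (one row per chromosome, as in ``example_split_vcf.csv`` above) is repetitive, so it can be generated. This is a sketch under stated assumptions: the helper ``split_samplesheet`` and the ``{chrom}`` path template are hypothetical, and the column order follows the list-table in ``docs/usage.rst``.

```python
import csv
import io

COLUMNS = ["sample", "vcf_path", "bfile_path", "chrom"]


def split_samplesheet(sample, path_template, chroms):
    """Build samplesheet CSV text with one VCF row per chromosome.

    path_template must contain "{chrom}", e.g. "path/to/{chrom}.vcf.gz".
    """
    rows = []
    for chrom in chroms:
        # The docs describe chrom as an integer in the range 1-22.
        if not 1 <= chrom <= 22:
            raise ValueError(f"chrom must be in 1-22, got {chrom}")
        rows.append({
            "sample": sample,  # same name on every row so scores are combined
            "vcf_path": path_template.format(chrom=chrom),
            "bfile_path": "",  # mutually exclusive with vcf_path
            "chrom": chrom,
        })
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sheet = split_samplesheet(
    "cineca_synthetic_subset_vcf", "path/to/{chrom}.vcf.gz", [1, 22]
)
print(sheet)
```

Passing only the chromosomes your scoring files use (here 1 and 22) matches the advice above about omitting unused chromosomes.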
+