
Add ingest tutorials #195

Merged
merged 9 commits into master on Apr 3, 2024
Conversation

Contributor

@joverlee521 joverlee521 commented Mar 22, 2024

Description of proposed changes

Tutorial contents are based on the first draft of the tutorials.

This PR adds two new tutorial sections with two separate ingest tutorials.

The new sections were motivated by the SAB meeting, which emphasized the need to keep the simple zika-tutorial for new users.

Related issue(s)

Resolves #179
Related to #188

This PR is based on #191

Checklist

  • Checks pass

@joverlee521 joverlee521 requested a review from a team March 22, 2024 23:22
Contributor

@j23414 j23414 left a comment

Looks reasonable to me


The produced ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets.
Note that the headers in this file use the human readable ``Name`` of the
`NCBI Datasets' available fields <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields>`_,
Member

@jameshadfield jameshadfield Mar 26, 2024

When designing the ingest workflow did you consider having an intermediate output which was exactly this (raw_metadata.tsv, or maybe 2 files if the sequences were split out into a FASTA)? This would remove the need for users to run these 5 steps which in turn would encourage comparing the NCBI data vs the curated data as a normal part of ingest.

The files aren't too large, so I don't think space is a concern, but if it were we could easily mark them with temp() and then this section would be "run with --notemp to get the raw NCBI data to compare against the curated data".
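
As a rough illustration of the temp() idea being discussed, an ingest rule could mark its raw download as temporary like this (the rule shape and file names here are a hypothetical sketch, not the actual pathogen-repo-guide rules):

```snakemake
rule fetch_ncbi_dataset_package:
    output:
        # temp() deletes this intermediate once downstream rules finish;
        # running Snakemake with --notemp keeps it around for inspection.
        dataset_package=temp("data/ncbi_dataset.zip"),
    params:
        ncbi_taxon_id=config["ncbi_taxon_id"],
    shell:
        """
        datasets download virus genome taxon {params.ncbi_taxon_id} \
            --filename {output.dataset_package}
        """
```

With that in place, "run with --notemp to get the raw NCBI data" becomes the documented escape hatch rather than a separate set of manual steps.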

Contributor Author

No, I had not... This is definitely left over from working with the SARS-CoV-2 data, where I had focused on using only the metadata columns needed for the workflow to avoid large files.

Member

For huge datasets this is the right call, but we don't need to build every pipeline to handle the data requirements of SC2.

Contributor Author

I updated the default config in the pathogen-repo-guide to include all of the fields of the raw metadata so that people can remove fields as needed.

Contributor Author

Updated tutorial to use the new Snakemake target from the pathogen-repo-guide in 43b0f77

@@ -0,0 +1,37 @@
1. Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately.
Member

There are many ways to get to the same outcome, and docs like these always have to balance covering all the ways to get there vs detailing just one. (Examples are everywhere: we detail the nextstrain CLI rather than the snakemake commands it's actually running, and we assume non-ambient runtimes.) Sometimes adding little pointers to indicate the links between the different methods would be really helpful. In this example, we are running commands which largely recreate the steps we just ran in the "running an ingest workflow" section; steps 2+3 are identical to snakemake --cores 1 --notemp fetch_ncbi_dataset_package.

I'd change the introduction slightly to indicate this:

- If you want to see the uncurated NCBI Datasets data to decide what changes
- you would like to make to the workflow, you can run the following:
+ If you want to see the uncurated NCBI Datasets data to decide what changes
+ you would like to make to the workflow, you can download the raw NCBI data
+ by manually running commands very similar to those the pipeline used earlier
+ when "running an ingest workflow"

I'd be interested in adopting a consistent approach throughout the docs where we add little hints to explain links between what we're detailing and the other ways it can be / is done. Using this as an example (and based off the text at the end of this snippet), I'm imagining a hint section after the commands such as:

These commands are actually run by the ingest pipeline with some minor differences. The ingest pipeline restricts the columns to those defined in config["ncbi_datasets_fields"], modifies the header names using config["curate"]["field_map"] (which maps the human-readable strings NCBI uses to ones more typical in Nextstrain pipelines), and changes the actual data via a number of curation steps.
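
The header-renaming step described here can be sketched in plain Python (the field_map entries below are illustrative examples, not the actual config values):

```python
import csv
import io

# Illustrative subset of a config["curate"]["field_map"]: maps the
# human-readable NCBI Datasets column names to Nextstrain-style names.
FIELD_MAP = {
    "Accession": "accession",
    "Isolate Collection date": "date",
    "Geographic Location": "location",
}

def rename_headers(tsv_text: str, field_map: dict) -> str:
    """Rename TSV headers per field_map, leaving unmapped headers as-is."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    rows[0] = [field_map.get(h, h) for h in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

raw = "Accession\tGeographic Location\nOQ123456\tBrazil\n"
print(rename_headers(raw, FIELD_MAP))
# First output line: accession<TAB>location
```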

I can add similar comments throughout these docs if you are interested in adopting this style, but wanted to discuss here first.

Contributor Author

docs like these always have to balance covering all the ways to get there vs detailing just one.

I have the opposite expectation for a tutorial: it should teach how to get a task done using one straightforward path. I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user. We've heard this from people in office hours as "There are so many ways to achieve X. What's the one way Nextstrain recommends doing it?"

I definitely think it's helpful for people who want to know more to have links to other explanation/how-to docs that include details on what's happening in the black box and how to do a task multiple ways. Although this does go back to the issue you raised in discussion of discoverability of these docs...

Member

Yeah, I think there's a range of views out there (including in our group).

I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user.

I'm not asking for documenting all of the ways, just adding little hints so that people can join the dots between concepts that may appear separate / independent. In this case it's providing clarifying hints that this section is very related to the section they've just run, even though on the surface they seem completely different.

Contributor Author

Added your suggested hints in 6019d50. Feel free to add more!


.. include:: ../../snippets/uncurated-ncbi-dataset.rst

We'll walk through an example custom config to include an additional column in the curated output.
Member

[minor] - can you add a small subheading here to demarcate this from the snippet contents?

Contributor Author

Added subheading in 6019d50

==========

* Run the `zika phylogenetic workflow <https://github.com/nextstrain/zika/tree/main/phylogenetic>`_ with new ingested data as input
by running
Member

Note that this won't include any of the data we (may) have just added in "Advanced usage: Customizing the ingest workflow" because the phylo workflow doesn't know about ingest/results/merged-metadata.tsv

Contributor Author

Ah, right! I'll update to use ingest/results/merged-metadata.tsv.

Contributor Author

Ah, I remembered that I didn't want to point to the new merged-metadata.tsv because I didn't want to go into detail of how to add the new columns to the Auspice config of the phylogenetic workflow. That seems like it needs to be a whole other tutorial...

Member

Maybe just a line saying "If you've customized the ingest workflow then you may need to modify the phylo workflow to use your ingested data file if it's not results/metadata.tsv, and other modifications to your phylo workflow may be needed, as appropriate." Or something

Contributor Author

Updated in f021e76


$ mkdir ingest/build-configs/tutorial

2. Create a new config file ``ingest/build-configs/tutorial/config.yaml``
Member

Seeing this big YAML block made me think "whoa, where did all these values come from?" There are a couple of changes you may want to consider here:

  • I'd move the text below the YAML to above it. So instead of starting with a big config file¹, start with "the parts of the config we want to change are ncbi_datasets_fields (because...), field_map (because...) and metadata_column (because).
  • Then detail that we copy certain sections from the existing config then add the changes just described.

¹ The config YAML alone takes up the full height of a laptop screen
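
To make this concrete, a slimmed-down tutorial config along the lines being discussed might look like the sketch below (the keys mirror the ones named above, but the values and exact key names are illustrative, not the real pathogen-repo-guide defaults):

```yaml
# ingest/build-configs/tutorial/config.yaml (illustrative sketch)
# Extra NCBI Datasets field to download beyond the defaults
ncbi_datasets_fields:
  - isolate-lineage-source

curate:
  # Map the human-readable NCBI header to a Nextstrain-style column name
  field_map:
    isolate-lineage-source: sample_type
  # Add the new column to the curated metadata output
  metadata_columns:
    - sample_type
```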

Contributor Author

Updated in 9255b1a


We highly encourage you to go through the commands and custom scripts used in the ``curate`` rule within ``ingest/rules/curate.smk``
to gain a deeper understanding of how they work.
We will give a brief overview of each step and their relevant config parameters defined in ``ingest/defaults/config.yaml`` to help you get started.
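
As a rough illustration of the kind of record-by-record processing the curate rule chains together (this is a simplified stand-in for the piped augur curate commands, not the actual augur implementation):

```python
import json

def normalize_strings(record):
    # Analogous to `augur curate normalize-strings`: trim stray whitespace.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def rename_fields(record, field_map):
    # Analogous to the field_map-driven header renaming step.
    return {field_map.get(k, k): v for k, v in record.items()}

def curate(ndjson_lines, field_map):
    """Chain curation steps over an NDJSON stream, like the piped curate rule."""
    for line in ndjson_lines:
        record = json.loads(line)
        record = normalize_strings(record)
        record = rename_fields(record, field_map)
        yield json.dumps(record)

raw = ['{"Accession": " OQ123456 ", "Geographic Location": "Brazil"}']
for line in curate(raw, {"Accession": "accession"}):
    print(line)
```

Each real curate step reads NDJSON on stdin and writes NDJSON on stdout, which is why they compose into a single shell pipeline in the rule.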
Member

@jameshadfield jameshadfield Mar 26, 2024

Here we are mostly taking a template curate pipeline and describing the config values you can change. The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config. This reminds me of a similar situation in visualisation where some people build libraries of charts while others expose the methods to build such charts (e.g. see Leland Wilkinson quote here).

Contributor Author

The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config.

Hmm, that seems to be stepping into Snakemake tutorial territory...

Member

I guess I'm wondering how feasible it is for a user to develop an ingest pipeline without knowing Snakemake. (This is the analogue of designing/documenting a charting library.) If that's going to work for a number of users then great! The phrasing you used here ("highly encourage you to go through the [snakemake rules]", "we will give you a brief overview" etc) made it seem to me like knowing Snakemake was a prerequisite.

Anyway, the docs as they stand are an improvement so I wouldn't want this to hold up their merge. Rather something to think about as we move forward.

Contributor Author

made it seem to me like knowing Snakemake was a prerequisite.

Yup! I have Snakemake listed as a prerequisite for the "Creating an ingest workflow" tutorial at the top:

Additionally, to follow this tutorial, you will need:
* An understanding of `Snakemake <https://snakemake.readthedocs.io/en/stable/>`_ workflows.

==========

* Learn more about :doc:`augur curate commands <augur:usage/cli/curate/index>`
* Learn how to create a phylogenetic workflow (TKTK)
Member

We shouldn't use TKTK in live docs (assuming you aren't going to add the tutorial in this PR). I'd replace this with something like "We're planning to write a similar tutorial for the phylogenetic pipeline, but until that's ready, the best place to learn about this is ..."

Contributor Author

Yup! Thank you for catching that!

Contributor Author

Updated in f021e76


1. Add your pathogen's NCBI taxonomy ID to the ``ncbi_taxon_id`` parameter
2. If there are other NCBI Datasets fields you would like to include in the download, you can add them to the ``ncbi_datasets_fields`` parameter
3. Skip to the :ref:`curation-steps`.
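
For instance, those first two edits to ``ingest/defaults/config.yaml`` might look like the excerpt below. The taxonomy ID shown is Zika virus's (64320), used purely as an example, and the field mnemonics are illustrative; check the NCBI Datasets field reference for the actual names:

```yaml
# ingest/defaults/config.yaml (excerpt, illustrative)
ncbi_taxon_id: "64320"   # NCBI taxonomy ID for your pathogen (Zika shown)
ncbi_datasets_fields:
  - accession
  - isolate-collection-date
  - geo-location
```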
Member

At least in zika we do some extra steps (format_ncbi_datasets_ndjson) - do you want to discuss them here? (Maybe I've missed it)

Contributor Author

Ah, I had excluded it because it's not a configurable step and seemed like an implementation detail that's not immediately useful to first time users.

Base automatically changed from update-glossary-workflow to master March 28, 2024 20:15
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Mar 28, 2024
Provides an easy way for first time users to get the uncurated metadata
from NCBI Datasets commands by running the ingest workflow with the
specified target `data/ncbi_dataset_report.tsv`.

Afterwards, users can easily remove fields that are not needed as part of the workflow to reduce the file size and save space.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Mar 29, 2024
Provides an easy way for first time users to get the full
uncurated metadata from NCBI Datasets commands by running the
ingest workflow with the specified target `dump_ncbi_dataset_report`.
They can then inspect and explore the raw data to determine if they
want to configure the workflow to use additional fields from NCBI.

The rule is added to `fetch_from_ncbi.smk` to make it easy to run
without additional configs. Note that it is not run as part of the
default workflow and only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

Co-authored-by: James Hadfield <hadfield.james@gmail.com>
This commit's main purpose is to split out the tutorials for more
complex workflows into a separate section from our existing "Quickstart"
zika-tutorial. This new section of tutorials was motivated by the
SAB meeting¹ that emphasized the need to keep the simple zika-tutorial
for new users.

The new section only holds the new "Running an ingest workflow"
docs, but will be expanded in the future to cover how to run
and customize other workflows within an existing pathogen
repository.

¹ https://docs.google.com/document/d/1zDwbn16ZRlMMcKLGWWVJ3lYPNwo5_v_WmgBsHv30_lU/edit?disco=AAABKGP5kEc
Intended to mirror the new "Using a pathogen repo" tutorials with
tutorials on how to set up each individual workflow in a pathogen repo.
Based on contents of the initial draft of the tutorial
https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.r95jmyuit0s0

Split out the steps to get the uncurated NCBI Dataset data into a
snippet that can be shared between the two ingest tutorials.
Explain what each step of the example command is doing to give readers
a better understanding.
The pathogen-repo-guide will be updated to include the target
`dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset
metadata.¹

This commit updates the tutorial to use this new target so that the
user does not need to manually run the extra commands to see the raw
metadata.

¹ nextstrain/pathogen-repo-guide#38
The previous commit updates the creating-an-ingest-workflow tutorial
to use a custom Snakemake target to generate the uncurated NCBI Dataset
metadata so we no longer need the separate snippet.

I am still providing the extra commands to generate the uncurated
data in the running-an-ingest-workflow tutorial because I wanted users
to be able to see the uncurated data even if the existing pathogen
workflow does _not_ include the custom Snakemake target.
Include our intention to write phylogenetic tutorials
Replace big YAML block with written instructions to create the custom
config file based on suggestion from @jameshadfield in review.

#195 (comment)
@joverlee521
Contributor Author

Rebased to include the latest changes merged in #191

@joverlee521
Contributor Author

Merging since I got a round of 👍 in today's priorities meeting.

@joverlee521 joverlee521 merged commit 65c2849 into master Apr 3, 2024
4 checks passed
@joverlee521 joverlee521 deleted the add-ingest-tutorials branch April 3, 2024 20:21
Comment on lines +53 to +58
.. code-block::

$ nextstrain build ingest
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Building DAG of jobs…
[...a lot of output...]
Member

All of these new code-blocks which show an example shell session with a prompt and command and optionally output should be .. code-block:: console to properly handle the various bits inside with highlighting and the copy/paste button.
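
Concretely, that means the example session from the diff above would be written as:

```rst
.. code-block:: console

   $ nextstrain build ingest
   Building DAG of jobs…
   [...a lot of output...]
```

The ``console`` lexer knows to highlight the ``$`` prompt, the command, and the output differently.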

Contributor Author

That's good to know! I just tried updating to .. code-block:: console locally; I see a difference in syntax highlighting, but the copy/paste behavior is still the same, where it only copies the command.

Contributor Author

Adding missing code-block languages in #197

Member

…but the copy/paste behavior is still the same where it only copies the command.

Ah! I thought we were using the syntax-aware prompt/output exclusion method of sphinx-copybutton, but we're using the pattern-matching exclusion method.

All good then.

joverlee521 added a commit that referenced this pull request Apr 19, 2024
Be explicit about code-block language to get the proper syntax
highlighting. Prompted by post-merge review
#195 (comment)

Successfully merging this pull request may close these issues.

Add tutorials for ingest pipelines