
Add ingest tutorials #195

Merged
merged 9 commits into master on Apr 3, 2024
Conversation

Contributor

@joverlee521 joverlee521 commented Mar 22, 2024

Description of proposed changes

Tutorial contents are based on the first draft of the tutorials.

This PR adds two new tutorial sections with two separate ingest tutorials.

The new sections were motivated by the SAB meeting, which emphasized the need to keep the simple zika-tutorial for new users.

Related issue(s)

Resolves #179
Related to #188

This PR is based on #191

Checklist

  • Checks pass

@joverlee521 joverlee521 requested a review from a team March 22, 2024 23:22
Contributor

@j23414 j23414 left a comment

Looks reasonable to me


The produced ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets.
Note that the headers in this file use the human readable ``Name`` of the
`NCBI Datasets' available fields <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields>`_,
Member

@jameshadfield jameshadfield Mar 26, 2024

When designing the ingest workflow did you consider having an intermediate output which was exactly this (raw_metadata.tsv, or maybe 2 files if the sequences were split out into a FASTA)? This would remove the need for users to run these 5 steps which in turn would encourage comparing the NCBI data vs the curated data as a normal part of ingest.

The files aren't too large, so I don't think space is a concern, but if it were we could easily mark them with temp() and then this section would be "run with --notemp to get the raw NCBI data to compare against the curated data".
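
As a rough illustration of the temp() idea being discussed, an ingest rule could mark its raw download as temporary like this (the rule shape and file names here are a hypothetical sketch, not the actual pathogen-repo-guide rules):

```snakemake
rule fetch_ncbi_dataset_package:
    output:
        # temp() deletes this intermediate once downstream rules finish;
        # running Snakemake with --notemp keeps it around for inspection.
        dataset_package=temp("data/ncbi_dataset.zip"),
    params:
        ncbi_taxon_id=config["ncbi_taxon_id"],
    shell:
        """
        datasets download virus genome taxon {params.ncbi_taxon_id} \
            --filename {output.dataset_package}
        """
```

With that in place, "run with --notemp to get the raw NCBI data" becomes the documented escape hatch rather than a separate set of manual steps.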

Contributor Author

No, I had not... This is definitely left over from working with the SARS-CoV-2 data, where I had focused on using only the metadata columns needed for the workflow to avoid large files.

Member

For huge datasets this is the right call, but we don't need to build every pipeline to handle the data requirements of SC2.

Contributor Author

I updated the default config in the pathogen-repo-guide to include all of the fields of the raw metadata so that people can remove fields as needed.

Contributor Author

Updated tutorial to use the new Snakemake target from the pathogen-repo-guide in 43b0f77

@@ -0,0 +1,37 @@
1. Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately.
Member

There are many ways to get to the same outcome, and docs like these always have to balance covering all the ways to get there vs detailing just one. (Examples are everywhere: we detail the nextstrain CLI rather than the snakemake commands it's actually running, and we assume non-ambient runtimes.) Sometimes adding little pointers to indicate the links between the different methods would be really helpful. In this example, we are running commands which largely recreate the steps we just ran in the "running an ingest workflow" section; steps 2+3 are identical to snakemake --cores 1 --notemp fetch_ncbi_dataset_package.

I'd change the introduction slightly to indicate this:

- If you want to see the uncurated NCBI Datasets data to decide what changes
- you would like to make to the workflow, you can run the following:
+ If you want to see the uncurated NCBI Datasets data to decide what changes
+ you would like to make to the workflow, you can download the raw NCBI data
+ by manually running commands very similar to those the pipeline used earlier
+ when "running an ingest workflow"

I'd be interested in adopting a consistent approach throughout the docs where we add little hints to explain links between what we're detailing and the other ways it can be / is done. Using this as an example (and based off the text at the end of this snippet), I'm imagining a hint section after the commands such as:

These commands are actually run by the ingest pipeline with some minor differences. The ingest pipeline restricts the columns to those defined in config["ncbi_datasets_fields"], modifies the header names using config["curate"]["field_map"] (which maps the human-readable strings NCBI uses to ones more typical in Nextstrain pipelines), and changes the actual data via a number of curation steps.
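
The header-renaming step described here can be sketched in plain Python (the field_map entries below are illustrative examples, not the actual config values):

```python
import csv
import io

# Illustrative subset of a config["curate"]["field_map"]: maps the
# human-readable NCBI Datasets column names to Nextstrain-style names.
FIELD_MAP = {
    "Accession": "accession",
    "Isolate Collection date": "date",
    "Geographic Location": "location",
}

def rename_headers(tsv_text: str, field_map: dict) -> str:
    """Rename TSV headers per field_map, leaving unmapped headers as-is."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    rows[0] = [field_map.get(h, h) for h in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

raw = "Accession\tGeographic Location\nOQ123456\tBrazil\n"
print(rename_headers(raw, FIELD_MAP))
# First output line: accession<TAB>location
```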

I can add similar comments throughout these docs if you are interested in adopting this style, but wanted to discuss here first.

Contributor Author

docs like these always have to balance covering all the ways to get there vs detailing just one.

I have the opposite expectation for a tutorial: it should teach how to get a task done using one straightforward path. I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user. We've heard this from people in office hours as "There are so many ways to achieve X. What's the one way Nextstrain recommends doing it?"

I definitely think it's helpful for people who want to know more to have links to other explanation/how-to docs that include details on what's happening in the black box and how to do a task multiple ways. Although this does go back to the issue you raised in discussion of discoverability of these docs...

Member

Yeah, I think there's a range of views out there (including in our group).

I think including multiple/all ways to achieve the task can be distracting and overwhelming for a new user.

I'm not asking for documenting all of the ways, just adding little hints so that people can join the dots between concepts that may appear separate / independent. In this case it's providing clarifying hints that this section is very related to the section they've just run, even though on the surface they seem completely different.

Contributor Author

Added your suggested hints in 6019d50. Feel free to add more!


.. include:: ../../snippets/uncurated-ncbi-dataset.rst

We'll walk through an example custom config to include an additional column in the curated output.
Member

[minor] - can you add a small subheading here to demarcate this from the snippet contents?

Contributor Author

Added subheading in 6019d50

==========

* Run the `zika phylogenetic workflow <https://github.com/nextstrain/zika/tree/main/phylogenetic>`_ with new ingested data as input
by running
Member

Note that this won't include any of the data we (may) have just added in "Advanced usage: Customizing the ingest workflow" because the phylo workflow doesn't know about ingest/results/merged-metadata.tsv

Contributor Author

Ah, right! I'll update to use ingest/results/merged-metadata.tsv.

Contributor Author

Ah, I remembered that I didn't want to point to the new merged-metadata.tsv because I didn't want to go into detail of how to add the new columns to the Auspice config of the phylogenetic workflow. That seems like it needs to be a whole other tutorial...

Member

Maybe just a line saying "If you've customized the ingest workflow then you may need to modify the phylo workflow to use your ingested data file if it's not results/metadata.tsv, and other modifications to your phylo workflow may be needed, as appropriate." Or something

Contributor Author

Updated in f021e76


$ mkdir ingest/build-configs/tutorial

2. Create a new config file ``ingest/build-configs/tutorial/config.yaml``
Member

Seeing this big YAML block made me think "whoa, where did all these values come from?" There are a couple of changes you may want to consider here:

  • I'd move the text below the YAML to above it. So instead of starting with a big config file¹, start with "the parts of the config we want to change are ncbi_datasets_fields (because...), field_map (because...) and metadata_column (because).
  • Then detail that we copy certain sections from the existing config then add the changes just described.

¹ The config YAML alone takes up the full height of a laptop screen
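
To make this concrete, a slimmed-down tutorial config along the lines being discussed might look like the sketch below (the keys mirror the ones named above, but the values and exact key names are illustrative, not the real pathogen-repo-guide defaults):

```yaml
# ingest/build-configs/tutorial/config.yaml (illustrative sketch)
# Extra NCBI Datasets field to download beyond the defaults
ncbi_datasets_fields:
  - isolate-lineage-source

curate:
  # Map the human-readable NCBI header to a Nextstrain-style column name
  field_map:
    isolate-lineage-source: sample_type
  # Add the new column to the curated metadata output
  metadata_columns:
    - sample_type
```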

Contributor Author

Updated in 9255b1a


We highly encourage you to go through the commands and custom scripts used in the ``curate`` rule within ``ingest/rules/curate.smk``
to gain a deeper understanding of how they work.
We will give a brief overview of each step and their relevant config parameters defined in ``ingest/defaults/config.yaml`` to help you get started.
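
As a rough illustration of the kind of record-by-record processing the curate rule chains together (this is a simplified stand-in for the piped augur curate commands, not the actual augur implementation):

```python
import json

def normalize_strings(record):
    # Analogous to `augur curate normalize-strings`: trim stray whitespace.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def rename_fields(record, field_map):
    # Analogous to the field_map-driven header renaming step.
    return {field_map.get(k, k): v for k, v in record.items()}

def curate(ndjson_lines, field_map):
    """Chain curation steps over an NDJSON stream, like the piped curate rule."""
    for line in ndjson_lines:
        record = json.loads(line)
        record = normalize_strings(record)
        record = rename_fields(record, field_map)
        yield json.dumps(record)

raw = ['{"Accession": " OQ123456 ", "Geographic Location": "Brazil"}']
for line in curate(raw, {"Accession": "accession"}):
    print(line)
```

Each real curate step reads NDJSON on stdin and writes NDJSON on stdout, which is why they compose into a single shell pipeline in the rule.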
Member

@jameshadfield jameshadfield Mar 26, 2024

Here we are mostly taking a template curate pipeline and describing the config values you can change. The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config. This reminds me of a similar situation in visualisation where some people build libraries of charts while others expose the methods to build such charts (e.g. see Leland Wilkinson quote here).

Contributor Author

The other approach would be to detail the curate commands themselves and teach people how to add them to the bash command + abstract their parameters to the YAML config.

Hmm, that seems to be stepping into Snakemake tutorial territory...

Member

I guess I'm wondering how feasible it is for a user to develop an ingest pipeline without knowing Snakemake. (This is the analogue of designing/documenting a charting library.) If that's going to work for a number of users then great! The phrasing you used here ("highly encourage you to go through the [snakemake rules]", "we will give you a brief overview" etc) made it seem to me like knowing Snakemake was a prerequisite.

Anyway, the docs as they stand are an improvement so I wouldn't want this to hold up their merge. Rather something to think about as we move forward.

Contributor Author

made it seem to me like knowing Snakemake was a prerequisite.

Yup! I have Snakemake listed as a prerequisite for the "Creating an ingest workflow" tutorial at the top:

Additionally, to follow this tutorial, you will need:
* An understanding of `Snakemake <https://snakemake.readthedocs.io/en/stable/>`_ workflows.

==========

* Learn more about :doc:`augur curate commands <augur:usage/cli/curate/index>`
* Learn how to create a phylogenetic workflow (TKTK)
Member

We shouldn't use TKTK in live docs (assuming you aren't going to add the tutorial in this PR). I'd replace this with something like "We're planning to write a similar tutorial for the phylogenetic pipeline, but until that's ready, the best place to learn about this is ..."

Contributor Author

Yup! Thank you for catching that!

Contributor Author

Updated in f021e76


1. Add your pathogen's NCBI taxonomy ID to the ``ncbi_taxon_id`` parameter
2. If there are other NCBI Datasets fields you would like to include in the download, you can add them to the ``ncbi_datasets_fields`` parameter
3. Skip to the :ref:`curation-steps`.
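
For instance, those first two edits to ``ingest/defaults/config.yaml`` might look like the excerpt below. The taxonomy ID shown is Zika virus's (64320), used purely as an example, and the field mnemonics are illustrative; check the NCBI Datasets field reference for the actual names:

```yaml
# ingest/defaults/config.yaml (excerpt, illustrative)
ncbi_taxon_id: "64320"   # NCBI taxonomy ID for your pathogen (Zika shown)
ncbi_datasets_fields:
  - accession
  - isolate-collection-date
  - geo-location
```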
Member

At least in zika we do some extra steps (format_ncbi_datasets_ndjson) - do you want to discuss them here? (Maybe I've missed it)

Contributor Author

Ah, I had excluded it because it's not a configurable step and seemed like an implementation detail that's not immediately useful to first time users.

Base automatically changed from update-glossary-workflow to master March 28, 2024 20:15
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Mar 28, 2024
Provides an easy way for first time users to get the uncurated metadata
from NCBI Datasets commands by running the ingest workflow with the
specified target `data/ncbi_dataset_report.tsv`.

Afterwards, users can easily remove fields that are not needed as part of the workflow to reduce the file size and save space.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Mar 29, 2024
Provides an easy way for first time users to get the full
uncurated metadata from NCBI Datasets commands by running the
ingest workflow with the specified target `dump_ncbi_dataset_report`.
They can then inspect and explore the raw data to determine if they
want to configure the workflow to use additional fields from NCBI.

The rule is added to `fetch_from_ncbi.smk` to make it easy to run
without additional configs. Note that it is not run as part of the
default workflow and only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

Co-authored-by: James Hadfield <hadfield.james@gmail.com>
This commit's main purpose is to split out the tutorials for more
complex workflows into a separate section from our existing "Quickstart"
zika-tutorial. This new section of tutorials was motivated by the
SAB meeting¹ that emphasized the need to keep the simple zika-tutorial
for new users.

The new section only holds the new "Running an ingest workflow"
docs, but will be expanded in the future to cover how to run
and customize other workflows within an existing pathogen
repository.

¹ https://docs.google.com/document/d/1zDwbn16ZRlMMcKLGWWVJ3lYPNwo5_v_WmgBsHv30_lU/edit?disco=AAABKGP5kEc
Intended to mirror the new "Using a pathogen repo" tutorials with
tutorials on how to set up each individual workflow in a pathogen repo.
Based on contents of the initial draft of the tutorial
https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.r95jmyuit0s0

Split out the steps to get the uncurated NCBI Dataset data into a
snippet that can be shared between the two ingest tutorials.
Explain what each step of the example command is doing to give readers
a better understanding.
The pathogen-repo-guide will be updated to include the target
`dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset
metadata.¹

This commit updates the tutorial to use this new target so that the
user does not need to manually run the extra commands to see the raw
metadata.

¹ nextstrain/pathogen-repo-guide#38
The previous commit updates the creating-an-ingest-workflow tutorial
to use a custom Snakemake target to generate the uncurated NCBI Dataset
metadata so we no longer need the separate snippet.

I am still providing the extra commands to generate the uncurated
data in the running-an-ingest-workflow tutorial because I wanted users
to be able to see the uncurated data even if the existing pathogen
workflow does _not_ include the custom Snakemake target.
Include our intention to write phylogenetic tutorials
Replace big YAML block with written instructions to create the custom
config file based on suggestion from @jameshadfield in review.

#195 (comment)
@joverlee521
Contributor Author

Rebased to include the latest changes merged in #191

@joverlee521
Contributor Author

Merging since I got a round of 👍 in today's priorities meeting.

@joverlee521 joverlee521 merged commit 65c2849 into master Apr 3, 2024
4 checks passed
@joverlee521 joverlee521 deleted the add-ingest-tutorials branch April 3, 2024 20:21
Comment on lines +53 to +58
.. code-block::

$ nextstrain build ingest
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Building DAG of jobs…
[...a lot of output...]
Member

All of these new code-blocks which show an example shell session with a prompt and command and optionally output should be .. code-block:: console to properly handle the various bits inside with highlighting and the copy/paste button.
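
Concretely, that means the example session from the diff above would be written as:

```rst
.. code-block:: console

   $ nextstrain build ingest
   Building DAG of jobs…
   [...a lot of output...]
```

The ``console`` lexer knows to highlight the ``$`` prompt, the command, and the output differently.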

Contributor Author

That's good to know! I just tried updating to .. code-block:: console locally; I see a difference in syntax highlighting, but the copy/paste behavior is still the same, where it only copies the command.

Contributor Author

Adding missing code-block languages in #197

Member

…but the copy/paste behavior is still the same where it only copies the command.

Ah! I thought we were using the syntax-aware prompt/output exclusion method of sphinx-copybutton, but we're using the pattern-matching exclusion method.

All good then.

joverlee521 added a commit that referenced this pull request Apr 19, 2024
Be explicit about code-block language to get the proper syntax
highlighting. Prompted by post-merge review
#195 (comment)

Successfully merging this pull request may close these issues.

Add tutorials for ingest pipelines