Add ingest #10

kimandrews · 2024-01-19T23:59:21Z

Description of proposed changes

Add ingest folder from pathogen-repo-template and make measles-specific modifications, following these steps:

Add ingest folder from pathogen-repo-template
Add measles-specific config parameters
Remove files that are not currently being used for this workflow, including files related to Entrez data-fetching, Nextclade, and automation
Remove strain name-fixing in anticipation of using accession numbers as identifiers for phylogenetic analysis
Add NCBI Datasets "virus-name" field to metadata because it includes measles genotype identifiers

ingest/rules/fetch_from_ncbi.smk

huddlej

This workflow ran for me without issue, @kimandrews. I added a couple of minor inline comments. Looking at these metadata and sequences for the first time, some thoughts I had were:

We're throwing out a lot of data by filtering to 5000 bp sequences. It looks like the majority of sequences represent one or two genes (H and sometimes F, e.g., https://www.ncbi.nlm.nih.gov/nuccore/KJ183860.1). We should consider eventually building gene trees to use those smaller sequences.
There are 6 samples with a host annotation that is not human (e.g., https://www.ncbi.nlm.nih.gov/nuccore/MG954083.1) and over 4000 without a host annotation at all. None of these samples have sequences that are long enough to get included in the current tree, but we should consider whether we want this workflow to produce human-specific measles trees or multi-host trees especially if we end up making gene trees. Since it's possible that the records without host annotations are not from human hosts, we would want a human-specific workflow to apply an additional host filter in the phylogenetic workflow.

A bigger issue I noticed when trying to use the outputs of this workflow for the phylogenetic workflow is that the use of the strain column as the metadata id column leads to duplicated records when running augur filter and causes the phylogenetic workflow to crash. I expect we'll need ingest to use the accession as the strain column, but that's probably something you've discussed with @j23414 and @joverlee521 already?

huddlej · 2024-01-24T19:48:52Z

ingest/profiles/nextstrain_automation/defaults.yaml

+
+# Nextstrain AWS S3 Bucket with pathogen prefix
+# Replace <pathogen> with the pathogen repo name.
+s3_dst: "s3://nextstrain-data/files/workflows/<pathogen>"


This line needs to be updated to refer to measles instead of <pathogen>.

As discussed, the ingest/profiles directory has been removed for now (601edee) and will be added as a later change.

ingest/vendored/.github/workflows/ci.yaml

ingest/config/defaults.yaml

joverlee521

It's nice to see this come along @kimandrews ! I'm really glad the zika examples that @j23414 worked through was helpful!

As we discussed in person, I think we can reduce confusion by removing the config parameters and rules that will not be used now. This includes the Entrez route for fetching sequences and the Nextclade related rules in rules/nextclade.smk.

ingest/bin/fix-measles-strain-names.py

ingest/rules/fetch_from_ncbi.smk

ingest/config/defaults.yaml

joverlee521 · 2024-01-24T22:39:39Z

the use of the strain column as the metadata id column leads to duplicated records when running augur filter and causes the phylogenetic workflow to crash. I expect we'll need ingest to use the accession as the strain column, but that's probably something you've discussed with @j23414 and @joverlee521 already?

Overall, we are going down the route of using accession as the --metadata-id-column for Augur commands instead of the default strain field. So this will require changes in the phylo workflow that should be included in future PRs.

jameshadfield

Thanks Kim! Really useful for me to read through what a fresh ingest implementation using the new template repo looks like

ingest/README.md

ingest/rules/curate.smk

ingest/config/defaults.yaml

jameshadfield · 2024-01-25T03:04:00Z

ingest/config/defaults.yaml

@@ -0,0 +1,117 @@
+# This configuration file should contain all required configuration parameters


[a general comment on the config, or perhaps the forthcoming tutorial, with the intention of discussion rather than specific changes]

Discussing this with Kim today most of the values were copied over from Zika. This includes the extracted NCBI fields (as thus, as discussed in another comment thread below, we're missing some important fields). It also includes the curate fields, which are necessarily data-dependent (e.g. expected date fields). Is the best-practice to fill in this config field-by-field after inspecting the raw NCBI data, or copy over from another repo and then asses the output?

I had set up the template for the first route to force users to become familiar with the raw NCBI data and make conscious decisions to fill in the config. However, looking at this now, it might be too steep of a learning for new users.

Adding default values was also requested in nextstrain/pathogen-repo-guide#4 and Kim also provided feedback that it was easier to be able to run and assess the output before making changes.
Seems like the template should just include the suggested default config params and users can then change them as needed.

Seems like the template should just include the suggested default config params and users can then change them as needed.

👍 I guess the tutorial is the best place to talk through how one can asses the data to make the config specific to their data? Another option is to start with ~no curation and use that outcome to inform what curation steps to use.

Good discussion! I went ahead and created an pathogen-repo-guide Issue to further discuss a ~no curation option: nextstrain/pathogen-repo-guide#30

ingest/bin/fix-measles-strain-names.py

jameshadfield · 2024-01-25T03:23:41Z

ingest/bin/fix-measles-strain-names.py

Kim & I briefly discussed what the point of normalizing these names is today. The majority of sequences don't have a strain name (and thus in our current curate usage we are using the accession). For those with actual strain names, these rules (taken from fauna + zika) aren't producing a nice uniform set of names as there's just too much variability. Briefly scanning through the data there is no coherent formatting to (non-accession) strain names; for instance here's a few examples:

FR_DF_16_1944 V2484482 up/2014_536_T TOG_PLA_OGO_13_202 MV/India_Assam/7586AD/2018/D8 MKAK/MEP/2016/1883 MO_2891 MeaNS_sample_ID_55662 zhongshan20140428 B95a

One option is to ignore strain name manipulation (and not replace empty strain names with accessions), use accession as the canonical ID for both Augur and Auspice, and then include strain names as extra metadata in the Auspice JSON for the nodes where it exists. Combined with nextstrain/augur#1383 and Auspice's ability to search across all metadata I think this would make for a nice outcome.

Strain name manipulation has been removed, and empty strain names are no longer replaced (56bfad6)

Nice!

For posterities sake, one thing I didn't quite realise is that those strain names above were being "corrected", e.g.

original: up/2014-536-T curated: up/2014_536_T

which is another reason to skip name curation.

From slack thread, deciding to not replace empty strain names with accessions may have downstream effects in the phylogenetic workflow (e.g. If we display strain name on tips, then Auspice will generate a random string for the name that is displayed in the tree...which is undesirable behaviour.

From @joverlee521 on the same thread:

I think we can leave it as-is and just display the accession as the sequence name in the tree.
There's WIP to allow Auspice to display arbitrary fields as the name so we can then toggle between accession and strain name in the future.

@kimandrews, up-to-you if you want to revert c9dfc64 or leave it. We can also revisit this later, during your phylogenetic PR.

ingest/rules/nextclade.smk

ingest/profiles/nextstrain_automation/defaults.yaml

Prompted by nextstrain/measles#10 (comment) These have become common commands used across Nextstrain pathogen repos¹ since they were added to the Docker runtime² ¹ https://github.com/search?q=org%3Anextstrain+AND+%28%22csvtk%22+OR+%22tsv-select%22+OR+%22seqkit%22%29&type=code ² nextstrain/docker-base@992c575

joverlee521

Thanks for working through the first round of feedback @kimandrews!

I've left a couple more suggestions to remove more sections of the workflow that are not being used.

ingest/README.md

ingest/Snakefile

joverlee521 · 2024-01-31T22:13:38Z

ingest/rules/fetch_from_ncbi.smk

+2. Fetch from Entrez (https://www.ncbi.nlm.nih.gov/books/NBK25501/)
+    - requires `entrez_search_term` config
+    - Returns all available data via a GenBank file
+    - Requires a custom script to parse the necessary fields from the GenBank file


This section of the docstring can be removed since this workflow does not include Entrez rules anymore.

This change was made in cc13eb6

jameshadfield · 2024-02-01T02:31:00Z

ingest/README.md

+
+This produces a `results` directory with the following outputs:
+- sequences.fasta
+- metadata.tsv


It also includes all_metadata.tsv.

In phylo workflows we have {data,results,auspice} and it's understood that the files in results can be numerous and change frequently with workflow updates. For ingest we only have {data,results}. My understanding is that more complex ingest workflows will populate results with many files. So maybe we could change the wording here to indicate that of the files in results these two are the ones that should be used for downstream analysis.

Ah, the all_metadata.tsv looks like an intermediate data file, not the final result metadata.tsv.

@kimandrews, you can change the path from results/all_metadata.tsv to data/all_metadata.tsv so there's less confusion on what the final output files should be (e.g. results/sequences.fasta and results/metadata.tsv.

measles/ingest/rules/curate.smk

Line 64 in 90a70cd

metadata="results/all_metadata.tsv",

measles/ingest/rules/curate.smk

Lines 119 to 123 in 90a70cd

rule subset_metadata:

input:

metadata="results/all_metadata.tsv",

output:

subset_metadata="results/metadata.tsv",

Done in f3fe0c1

jameshadfield · 2024-02-01T02:43:07Z

ingest/rules/curate.smk

+# the input as NDJSON records from stdin and output NDJSON records to stdout.
+# The final step of the pipeline should convert the NDJSON records to two
+# separate files: a metadata TSV and a sequences FASTA.
+rule curate:


Consider the following an idea to implement later, it doesn't have to be part of this PR

The virus name column contains information the phylo build will certainly want to show. It may even be the default colouring in auspice (or may be used as the basis for nextclade clade assignment which will then become the default colouring). The structure of values here is diverse, but the majority follow a straightforward pattern. Using cat results/metadata.tsv | cut -f 3 | sort | uniq -c | sort -r the top 10 are:

6586 Measles morbillivirus 6557 Measles virus genotype D8 4883 Measles virus genotype B3 1546 Measles virus genotype D4 1341 Measles virus genotype H1 668 Measles virus genotype D9 158 Measles virus genotype H1a 137 Measles virus genotype A 131 Measles virus genotype D5 72 Measles virus genotype B2

I don't think it'd be too hard to write a small script to extract as many genotypes as possible from this.

[also non-blocking comment]

I concur with @jameshadfield regarding the inclusion of genotype as a user-desired feature to display, and a first step for creating a nextclade dataset. (exciting! :D ) You can open an issue titled "Parse genotype from NCBI data." Once initiated, you can proceed by creating a distinct branch and subsequent pull request to maintain a focused scope (and focused review process :) ).

To address this task, you have the option to either develop a new script from the ground up or to adapt existing scripts such as fix-zika-strain-names.py or vendored/transform-authors. These scripts already read in from ndjson, but you'd have to modify it to parse virus-name (or virus_name) and infer a new "genotype" field, and invoke it somewhere in the curate rule. I'm happy to discuss this when you get to this task or if you have any questions.

I opened an issue for this here

jameshadfield · 2024-02-01T03:10:00Z

ingest/config/defaults.yaml

+    isolate-lineage: strain
+    geo-region: region
+    geo-location: location
+    isolate-collection-date: date


There are many more sample collection dates (or dates which we can effectively use as sample collection dates) beyond this field.

For strain names following the WHO schema, the "XX.YY" corresponds to

Date of rash onset if known, otherwise date of specimen collection by epidemiological week (1-53) and year.

So, e.g., Accession JX187583, which doesn't have a date in our workflow, has sample name MVs/Parma.ITA/47.08 in nuccore so it'd have date of 2008-11-XX | 2008-12-XX

Confusingly, looking at the raw NCBI dataset output that sample name does not appear¹, so there seems to be no way to extract it using the datasets command. I'm not an expert on the new datasets CLI (entrez, for all its messiness, is what I use), but I'd think it's worth working out what's going on here. I've cross posted this to slack.

¹ Using the unzipped output from rule fetch_ncbi_dataset_package, then grep 'JX187583' data/ncbi_dataset/ncbi_dataset/data/data_report.jsonl | gron.

Thanks for pointing to the WHO schema for measles strain names! I've put in a request to the NCBI datasets team to add the strain field. We'll see if it's possible on their in end.

In the meantime, we have many records that do have a strain name that are missing dates.

So I'd think regardless of whether we use datasets or Entrez, it'd be worth writing a custom script to parse potential dates out of strain names.

Sounds good. Similar to my comment above I'd think it fine to defer for a later PR if @kimandrews wanted to merge this and get the phylo part working with this new ingest data. Your call.

Thanks for the feedback and clarifications on how measles data is being reported in GenBank.

Of the current samples that meet our minimum length requirement (5000bp; n=937) and do not have dates listed (n=157), only 3 samples have strain names from which the sampling date could be inferred.

I'm weighing the following options for moving forward:

Write custom code that will recover sampling dates for the 3 samples

Wait for NCBI Datasets to include the strain attribute, then write custom code that identifies which column the strain name is in (i.e. either the "strain" or "isolate" column) and subsequently uses strain name to infer sampling dates for samples with missing dates

Don't use NCBI Datasets; write custom code to parse the GenBank records

Only use samples that have sampling dates in the "date" column (don't try to infer missing dates from the strain name)

Other?

As reported by the datasets program. I haven't created any custom code for parsing GenBank records yet.

I'd recommend writing this up as an issue to tackle either later on or simultaneously with the phylo build. I think all 4 options could be comfortably considered out-of-scope for this PR. It seems that 'strain' will be parsed by NCBI sometime this year, and I don't it's unreasonable to wait for that, which gives a fair bit of time to think through the parsing approach and how it fits into the project. I don't know much about the temporal signal in measles phylogenetics, so I'd think to explore that as part of the phylo build in order to gauge how important filling in these missing data could be.

Number of samples with strain names might change when we change the minimum length filters for gene specific builds in #13.

However, I also definitely support pushing this into a later issue.

Yeah, without the minimum length filter, there are definitely more samples with strain names that could be used to recover empty dates. Seems worthwhile to eventually get this sorted out in a later PR.

I created an issue about creating custom code to parse measles strain names

Copy the ingest directory from pathogen-repo-template: https://github.com/nextstrain/pathogen-repo-template/tree/b8ae886b25877a218ad50380fb44f8825d50aedb/ingest The `ingest/vendored` subdirectory is not copied over since that folder should be added with `git-subrepo`. Future commits will change this to work with measles data.

…/vendored subrepo: subdir: "ingest/vendored" merged: "a0faef5" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "a0faef5" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"

Add taxon id and other config parameters related to the curate pipeline. Remove config parameters related to Nextclade because we do not currently have a Nextclade measles dataset.

Harmonizing with modifications to the [pathogen-repo-guide](nextstrain/pathogen-repo-guide@14e5dca)

Remove nextclade-related if-else statements from Snakefile. This also requires editing final rule in curate.smk to output results/metadata.tsv instead of results/subset_metadata.tsv

joverlee521

I've left one final comment to address for the vendored subrepo, but the other changes look good to me.

ingest/vendored/.gitrepo

subrepo: subdir: "ingest/vendored" merged: "e83c214" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "e83c214" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"

Follow up to review in the measles PR nextstrain/measles#10 (comment)

Change the output folder for all_metadata.tsv from ./results to ./data Follow up to [#10 (comment)](#10 (comment))

kimandrews requested review from j23414 and joverlee521 January 23, 2024 00:14

j23414 reviewed Jan 23, 2024

View reviewed changes

ingest/rules/fetch_from_ncbi.smk Outdated Show resolved Hide resolved

kimandrews requested a review from a team January 23, 2024 19:21

huddlej reviewed Jan 24, 2024

View reviewed changes

joverlee521 requested changes Jan 24, 2024

View reviewed changes

ingest/bin/fix-measles-strain-names.py Outdated Show resolved Hide resolved

ingest/rules/fetch_from_ncbi.smk Outdated Show resolved Hide resolved

ingest/config/defaults.yaml Outdated Show resolved Hide resolved

jameshadfield reviewed Jan 25, 2024

View reviewed changes

joverlee521 mentioned this pull request Jan 25, 2024

ambient-setup: Add csvtk, seqkit, and tsv-utils nextstrain/docs.nextstrain.org#183

Merged

1 task

kimandrews force-pushed the add-ingest branch from f0a5194 to 0e39ced Compare January 30, 2024 00:27

This was referenced Jan 30, 2024

Use accession number in phylogenetic workflow #11

Closed

Consider a human-host-specific workflow #12

Open

Consider building gene-specific phylogenies #13

Closed

kimandrews requested a review from a team January 31, 2024 19:57

kimandrews marked this pull request as ready for review January 31, 2024 19:59

joverlee521 self-requested a review January 31, 2024 21:38

joverlee521 requested changes Jan 31, 2024

View reviewed changes

jameshadfield reviewed Feb 1, 2024

View reviewed changes

joverlee521 mentioned this pull request Feb 2, 2024

Request for additional attributes for Virus Data Report ncbi/datasets#310

Closed

This was referenced Feb 7, 2024

Create custom code to parse Measles strain names #14

Open

Update top-level .gitignore #15

Merged

kimandrews added 6 commits February 9, 2024 16:07

Add ingest config parameters

e830f65

Add taxon id and other config parameters related to the curate pipeline. Remove config parameters related to Nextclade because we do not currently have a Nextclade measles dataset.

Remove Entrez rules and config parameters

375d9ec

Remove nextclade-related rules

9f6cc30

Remove ingest/profiles

7e10c4b

kimandrews added 9 commits February 9, 2024 16:07

Add abbr_authors to metadata_columns

07416e6

Fix typo in ingest/rules/curate.smk

0fca27e

Add NCBI Datasets "virus-name" field

9ebfebc

Remove nextclade_field_map.tsv

9afc28f

Don't replace empty strain names

c9dfc64

Renamed "config" directory to "defaults"

d2737ed

Harmonizing with modifications to the [pathogen-repo-guide](nextstrain/pathogen-repo-guide@14e5dca)

ingest/README.md: remove Profiles, change config to defaults

6976bd4

Remove nextclade from Snakefile

3fb3e9e

Remove nextclade-related if-else statements from Snakefile. This also requires editing final rule in curate.smk to output results/metadata.tsv instead of results/subset_metadata.tsv

Remove Entrez from docstring of fetch_from_ncbi.smk

90a70cd

kimandrews force-pushed the add-ingest branch from cc13eb6 to 90a70cd Compare February 10, 2024 00:08

j23414 mentioned this pull request Feb 11, 2024

Provide a no-curation option for NCBI Ingest nextstrain/pathogen-repo-guide#30

Closed

joverlee521 approved these changes Feb 12, 2024

View reviewed changes

ingest/vendored/.gitrepo Outdated Show resolved Hide resolved

kimandrews added 2 commits February 14, 2024 11:27

Update Changelog

e766152

kimandrews merged commit 75df4da into main Feb 14, 2024
6 checks passed

kimandrews deleted the add-ingest branch February 14, 2024 19:33

kimandrews mentioned this pull request Feb 15, 2024

Parse genotype from NCBI data #16

Closed

joverlee521 mentioned this pull request Feb 15, 2024

Update ingest to output to data/ for intermediate files nextstrain/pathogen-repo-guide#31

Closed

joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Feb 17, 2024

ingest: Move intermediate files to data directory

8451d68

Follow up to review in the measles PR nextstrain/measles#10 (comment)

joverlee521 mentioned this pull request Feb 17, 2024

ingest: Move intermediate files to data directory nextstrain/pathogen-repo-guide#32

Merged

1 task

kimandrews added a commit that referenced this pull request Feb 20, 2024

ingest: Move intermediate file to data directory

f3fe0c1

Change the output folder for all_metadata.tsv from ./results to ./data Follow up to [#10 (comment)](#10 (comment))

kimandrews mentioned this pull request Feb 20, 2024

ingest: Move intermediate file to data directory #17

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ingest #10

Add ingest #10

kimandrews commented Jan 19, 2024 •

edited

Loading

huddlej left a comment

huddlej Jan 24, 2024

kimandrews Jan 30, 2024

joverlee521 left a comment

joverlee521 commented Jan 24, 2024

jameshadfield left a comment

jameshadfield Jan 25, 2024

joverlee521 Jan 25, 2024

jameshadfield Jan 25, 2024

j23414 Feb 11, 2024

jameshadfield Jan 25, 2024

kimandrews Jan 31, 2024

jameshadfield Feb 1, 2024

j23414 Feb 11, 2024

joverlee521 left a comment

joverlee521 Jan 31, 2024

kimandrews Feb 8, 2024

jameshadfield Feb 1, 2024

j23414 Feb 11, 2024

kimandrews Feb 22, 2024

jameshadfield Feb 1, 2024

j23414 Feb 11, 2024

kimandrews Feb 15, 2024

jameshadfield Feb 1, 2024 •

edited

Loading

joverlee521 Feb 2, 2024

joverlee521 Feb 2, 2024

jameshadfield Feb 2, 2024

kimandrews Feb 5, 2024

kimandrews Feb 6, 2024

jameshadfield Feb 6, 2024

joverlee521 Feb 6, 2024

kimandrews Feb 6, 2024

kimandrews Feb 8, 2024

joverlee521 left a comment

		@@ -0,0 +1,117 @@
		# This configuration file should contain all required configuration parameters

	rule subset_metadata:
	input:
	metadata="results/all_metadata.tsv",
	output:
	subset_metadata="results/metadata.tsv",

Add ingest #10

Add ingest #10

Conversation

kimandrews commented Jan 19, 2024 • edited Loading

Description of proposed changes

huddlej left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joverlee521 left a comment

Choose a reason for hiding this comment

joverlee521 commented Jan 24, 2024

jameshadfield left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joverlee521 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameshadfield Feb 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joverlee521 left a comment

Choose a reason for hiding this comment

kimandrews commented Jan 19, 2024 •

edited

Loading

jameshadfield Feb 1, 2024 •

edited

Loading