Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ingest #10

Merged
merged 17 commits into from
Feb 14, 2024
Merged

Add ingest #10

merged 17 commits into from
Feb 14, 2024

Conversation

kimandrews
Copy link
Contributor

@kimandrews kimandrews commented Jan 19, 2024

Description of proposed changes

Add ingest folder from pathogen-repo-template and make measles-specific modifications, following these steps:

  • Add ingest folder from pathogen-repo-template
  • Add measles-specific config parameters
  • Remove files that are not currently being used for this workflow, including files related to Entrez data-fetching, Nextclade, and automation
  • Remove strain name-fixing in anticipation of using accession numbers as identifiers for phylogenetic analysis
  • Add NCBI Datasets "virus-name" field to metadata because it includes measles genotype identifiers

@kimandrews kimandrews requested a review from a team January 23, 2024 19:21
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workflow ran for me without issue, @kimandrews. I added a couple of minor inline comments. Looking at these metadata and sequences for the first time, some thoughts I had were:

  1. We're throwing out a lot of data by filtering to 5000 bp sequences. It looks like the majority of sequences represent one or two genes (H and sometimes F, e.g., https://www.ncbi.nlm.nih.gov/nuccore/KJ183860.1). We should consider eventually building gene trees to use those smaller sequences.
  2. There are 6 samples with a host annotation that is not human (e.g., https://www.ncbi.nlm.nih.gov/nuccore/MG954083.1) and over 4000 without a host annotation at all. None of these samples have sequences that are long enough to get included in the current tree, but we should consider whether we want this workflow to produce human-specific measles trees or multi-host trees especially if we end up making gene trees. Since it's possible that the records without host annotations are not from human hosts, we would want a human-specific workflow to apply an additional host filter in the phylogenetic workflow.

A bigger issue I noticed when trying to use the outputs of this workflow for the phylogenetic workflow is that the use of the strain column as the metadata id column leads to duplicated records when running augur filter and causes the phylogenetic workflow to crash. I expect we'll need ingest to use the accession as the strain column, but that's probably something you've discussed with @j23414 and @joverlee521 already?


# Nextstrain AWS S3 Bucket with pathogen prefix
# Replace <pathogen> with the pathogen repo name.
s3_dst: "s3://nextstrain-data/files/workflows/<pathogen>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line needs to be updated to refer to measles instead of <pathogen>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, the ingest/profiles directory has been removed for now (601edee) and will be added as a later change.

ingest/vendored/.github/workflows/ci.yaml Outdated Show resolved Hide resolved
ingest/config/defaults.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice to see this come along @kimandrews ! I'm really glad the zika examples that @j23414 worked through was helpful!

As we discussed in person, I think we can reduce confusion by removing the config parameters and rules that will not be used now. This includes the Entrez route for fetching sequences and the Nextclade related rules in rules/nextclade.smk.

ingest/bin/fix-measles-strain-names.py Outdated Show resolved Hide resolved
ingest/rules/fetch_from_ncbi.smk Outdated Show resolved Hide resolved
ingest/config/defaults.yaml Outdated Show resolved Hide resolved
@joverlee521
Copy link
Contributor

the use of the strain column as the metadata id column leads to duplicated records when running augur filter and causes the phylogenetic workflow to crash. I expect we'll need ingest to use the accession as the strain column, but that's probably something you've discussed with @j23414 and @joverlee521 already?

Overall, we are going down the route of using accession as the --metadata-id-column for Augur commands instead of the default strain field. So this will require changes in the phylo workflow that should be included in future PRs.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kim! Really useful for me to read through what a fresh ingest implementation using the new template repo looks like

ingest/README.md Show resolved Hide resolved
ingest/README.md Show resolved Hide resolved
ingest/rules/curate.smk Outdated Show resolved Hide resolved
ingest/config/defaults.yaml Outdated Show resolved Hide resolved
@@ -0,0 +1,117 @@
# This configuration file should contain all required configuration parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[a general comment on the config, or perhaps the forthcoming tutorial, with the intention of discussion rather than specific changes]

Discussing this with Kim today most of the values were copied over from Zika. This includes the extracted NCBI fields (as thus, as discussed in another comment thread below, we're missing some important fields). It also includes the curate fields, which are necessarily data-dependent (e.g. expected date fields). Is the best-practice to fill in this config field-by-field after inspecting the raw NCBI data, or copy over from another repo and then asses the output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had set up the template for the first route to force users to become familiar with the raw NCBI data and make conscious decisions to fill in the config. However, looking at this now, it might be too steep of a learning for new users.

Adding default values was also requested in nextstrain/pathogen-repo-guide#4 and Kim also provided feedback that it was easier to be able to run and assess the output before making changes.
Seems like the template should just include the suggested default config params and users can then change them as needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the template should just include the suggested default config params and users can then change them as needed.

👍 I guess the tutorial is the best place to talk through how one can asses the data to make the config specific to their data? Another option is to start with ~no curation and use that outcome to inform what curation steps to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion! I went ahead and created an pathogen-repo-guide Issue to further discuss a ~no curation option: nextstrain/pathogen-repo-guide#30

ingest/bin/fix-measles-strain-names.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kim & I briefly discussed what the point of normalizing these names is today. The majority of sequences don't have a strain name (and thus in our current curate usage we are using the accession). For those with actual strain names, these rules (taken from fauna + zika) aren't producing a nice uniform set of names as there's just too much variability. Briefly scanning through the data there is no coherent formatting to (non-accession) strain names; for instance here's a few examples:

FR_DF_16_1944
V2484482
up/2014_536_T
TOG_PLA_OGO_13_202
MV/India_Assam/7586AD/2018/D8
MKAK/MEP/2016/1883
MO_2891
MeaNS_sample_ID_55662
zhongshan20140428
B95a

One option is to ignore strain name manipulation (and not replace empty strain names with accessions), use accession as the canonical ID for both Augur and Auspice, and then include strain names as extra metadata in the Auspice JSON for the nodes where it exists. Combined with nextstrain/augur#1383 and Auspice's ability to search across all metadata I think this would make for a nice outcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strain name manipulation has been removed, and empty strain names are no longer replaced (56bfad6)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

For posterities sake, one thing I didn't quite realise is that those strain names above were being "corrected", e.g.

original: up/2014-536-T
curated:  up/2014_536_T

which is another reason to skip name curation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From slack thread, deciding to not replace empty strain names with accessions may have downstream effects in the phylogenetic workflow (e.g. If we display strain name on tips, then Auspice will generate a random string for the name that is displayed in the tree...which is undesirable behaviour.

From @joverlee521 on the same thread:

I think we can leave it as-is and just display the accession as the sequence name in the tree.
There's WIP to allow Auspice to display arbitrary fields as the name so we can then toggle between accession and strain name in the future.

@kimandrews, up-to-you if you want to revert c9dfc64 or leave it. We can also revisit this later, during your phylogenetic PR.

ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/profiles/nextstrain_automation/defaults.yaml Outdated Show resolved Hide resolved
ingest/profiles/nextstrain_automation/defaults.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the first round of feedback @kimandrews!

I've left a couple more suggestions to remove more sections of the workflow that are not being used.

ingest/README.md Outdated Show resolved Hide resolved
ingest/README.md Outdated Show resolved Hide resolved
ingest/Snakefile Outdated Show resolved Hide resolved
Comment on lines 23 to 26
2. Fetch from Entrez (https://www.ncbi.nlm.nih.gov/books/NBK25501/)
- requires `entrez_search_term` config
- Returns all available data via a GenBank file
- Requires a custom script to parse the necessary fields from the GenBank file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section of the docstring can be removed since this workflow does not include Entrez rules anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was made in cc13eb6


This produces a `results` directory with the following outputs:
- sequences.fasta
- metadata.tsv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also includes all_metadata.tsv.

In phylo workflows we have {data,results,auspice} and it's understood that the files in results can be numerous and change frequently with workflow updates. For ingest we only have {data,results}. My understanding is that more complex ingest workflows will populate results with many files. So maybe we could change the wording here to indicate that of the files in results these two are the ones that should be used for downstream analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the all_metadata.tsv looks like an intermediate data file, not the final result metadata.tsv.

@kimandrews, you can change the path from results/all_metadata.tsv to data/all_metadata.tsv so there's less confusion on what the final output files should be (e.g. results/sequences.fasta and results/metadata.tsv.

metadata="results/all_metadata.tsv",

rule subset_metadata:
input:
metadata="results/all_metadata.tsv",
output:
subset_metadata="results/metadata.tsv",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f3fe0c1

# the input as NDJSON records from stdin and output NDJSON records to stdout.
# The final step of the pipeline should convert the NDJSON records to two
# separate files: a metadata TSV and a sequences FASTA.
rule curate:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following an idea to implement later, it doesn't have to be part of this PR

The virus name column contains information the phylo build will certainly want to show. It may even be the default colouring in auspice (or may be used as the basis for nextclade clade assignment which will then become the default colouring). The structure of values here is diverse, but the majority follow a straightforward pattern. Using cat results/metadata.tsv | cut -f 3 | sort | uniq -c | sort -r the top 10 are:

6586 Measles morbillivirus
6557 Measles virus genotype D8
4883 Measles virus genotype B3
1546 Measles virus genotype D4
1341 Measles virus genotype H1
 668 Measles virus genotype D9
 158 Measles virus genotype H1a
 137 Measles virus genotype A
 131 Measles virus genotype D5
  72 Measles virus genotype B2

I don't think it'd be too hard to write a small script to extract as many genotypes as possible from this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[also non-blocking comment]

I concur with @jameshadfield regarding the inclusion of genotype as a user-desired feature to display, and a first step for creating a nextclade dataset. (exciting! :D ) You can open an issue titled "Parse genotype from NCBI data." Once initiated, you can proceed by creating a distinct branch and subsequent pull request to maintain a focused scope (and focused review process :) ).

To address this task, you have the option to either develop a new script from the ground up or to adapt existing scripts such as fix-zika-strain-names.py or vendored/transform-authors. These scripts already read in from ndjson, but you'd have to modify it to parse virus-name (or virus_name) and infer a new "genotype" field, and invoke it somewhere in the curate rule. I'm happy to discuss this when you get to this task or if you have any questions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue for this here

isolate-lineage: strain
geo-region: region
geo-location: location
isolate-collection-date: date
Copy link
Member

@jameshadfield jameshadfield Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many more sample collection dates (or dates which we can effectively use as sample collection dates) beyond this field.

For strain names following the WHO schema, the "XX.YY" corresponds to

Date of rash onset if known, otherwise date of specimen collection by epidemiological week (1-53) and year.

So, e.g., Accession JX187583, which doesn't have a date in our workflow, has sample name MVs/Parma.ITA/47.08 in nuccore so it'd have date of 2008-11-XX | 2008-12-XX

Confusingly, looking at the raw NCBI dataset output that sample name does not appear¹, so there seems to be no way to extract it using the datasets command. I'm not an expert on the new datasets CLI (entrez, for all its messiness, is what I use), but I'd think it's worth working out what's going on here. I've cross posted this to slack.

¹ Using the unzipped output from rule fetch_ncbi_dataset_package, then grep 'JX187583' data/ncbi_dataset/ncbi_dataset/data/data_report.jsonl | gron.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing to the WHO schema for measles strain names! I've put in a request to the NCBI datasets team to add the strain field. We'll see if it's possible on their in end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meantime, we have many records that do have a strain name that are missing dates.

So I'd think regardless of whether we use datasets or Entrez, it'd be worth writing a custom script to parse potential dates out of strain names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Similar to my comment above I'd think it fine to defer for a later PR if @kimandrews wanted to merge this and get the phylo part working with this new ingest data. Your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback and clarifications on how measles data is being reported in GenBank.

Of the current samples that meet our minimum length requirement (5000bp; n=937) and do not have dates listed (n=157), only 3 samples have strain names from which the sampling date could be inferred.

I'm weighing the following options for moving forward:

  1. Write custom code that will recover sampling dates for the 3 samples
  2. Wait for NCBI Datasets to include the strain attribute, then write custom code that identifies which column the strain name is in (i.e. either the "strain" or "isolate" column) and subsequently uses strain name to infer sampling dates for samples with missing dates
  3. Don't use NCBI Datasets; write custom code to parse the GenBank records
  4. Only use samples that have sampling dates in the "date" column (don't try to infer missing dates from the strain name)
  5. Other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As reported by the datasets program. I haven't created any custom code for parsing GenBank records yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend writing this up as an issue to tackle either later on or simultaneously with the phylo build. I think all 4 options could be comfortably considered out-of-scope for this PR. It seems that 'strain' will be parsed by NCBI sometime this year, and I don't it's unreasonable to wait for that, which gives a fair bit of time to think through the parsing approach and how it fits into the project. I don't know much about the temporal signal in measles phylogenetics, so I'd think to explore that as part of the phylo build in order to gauge how important filling in these missing data could be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of samples with strain names might change when we change the minimum length filters for gene specific builds in #13.

However, I also definitely support pushing this into a later issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, without the minimum length filter, there are definitely more samples with strain names that could be used to recover empty dates. Seems worthwhile to eventually get this sorted out in a later PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue about creating custom code to parse measles strain names

Copy the ingest directory from pathogen-repo-template:

https://github.com/nextstrain/pathogen-repo-template/tree/b8ae886b25877a218ad50380fb44f8825d50aedb/ingest

The `ingest/vendored` subdirectory is not copied over since that folder should
be added with `git-subrepo`.

Future commits will change this to work with measles data.
…/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "a0faef5"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "a0faef5"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Add taxon id and other config parameters related to the curate pipeline.
Remove config parameters related to Nextclade because we do not currently have a Nextclade measles dataset.
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left one final comment to address for the vendored subrepo, but the other changes look good to me.

ingest/vendored/.gitrepo Outdated Show resolved Hide resolved
subrepo:
  subdir:   "ingest/vendored"
  merged:   "e83c214"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "e83c214"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
@kimandrews kimandrews merged commit 75df4da into main Feb 14, 2024
6 checks passed
@kimandrews kimandrews deleted the add-ingest branch February 14, 2024 19:33
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Feb 17, 2024
kimandrews added a commit that referenced this pull request Feb 20, 2024
Change the output folder for all_metadata.tsv
from ./results to ./data
Follow up to [#10 (comment)](#10 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants