Fetch records from GenBank via Entrez
This uses the Entrez API rather than the approach employed by our other
ingest pipelines (e.g. https://github.com/nextstrain/monkeypox/blob/master/ingest/bin/genbank-url)
as a lot of important metadata for HBV is contained in the features
block of the GenBank record.
jameshadfield committed Jun 27, 2023
1 parent 80bdf40 commit 0339f27
Showing 3 changed files with 57 additions and 9 deletions.
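The commit message's motivation can be illustrated with a toy example (not part of this commit): in a GenBank record, metadata such as country and collection date lives in qualifiers inside the `source` feature rather than in top-level fields, which is why a features-aware fetch is preferable to a plain URL download. A minimal stdlib-only sketch, using made-up qualifier values:

```python
import re

# A trimmed, fabricated source-feature block in GenBank layout
# (illustrative values, not taken from a real HBV record).
FEATURE_BLOCK = """\
     source          1..3215
                     /organism="Hepatitis B virus"
                     /mol_type="genomic DNA"
                     /country="Japan"
                     /collection_date="2001"
"""

def parse_source_qualifiers(text):
    """Return a dict of /key="value" qualifiers from a feature block."""
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))

quals = parse_source_qualifiers(FEATURE_BLOCK)
print(quals["country"], quals["collection_date"])  # → Japan 2001
```

In the pipeline itself this extraction is done from the parsed GenBank records rather than by regex; the sketch only shows where the metadata sits.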
10 changes: 2 additions & 8 deletions README.md
````diff
@@ -4,14 +4,6 @@ Currently a WIP
 
 Based on Katie Kistler's work in [blab/adaptive-evolution](https://github.com/blab/adaptive-evolution)
 
-## Obtain data
-
-1. Obtain `./ingest/genbank_sequences.gb`
-   - example search: [(Hepatitis B virus) AND (complete genome)](https://www.ncbi.nlm.nih.gov/nuccore/?term=(Hepatitis+B+virus)+AND+(complete+genome)), which returns ~11k samples
-   - Could also explore `"Hepatitis B virus"[porgn:__txid10407]`
-   - download as "Complete Record" / "File" / "GenBank (full)", a ~100Mb file.
-
-
 ## Ingest
 
 You can run the ingest part of the snakemake pipeline via:
@@ -20,6 +12,8 @@ You can run the ingest part of the snakemake pipeline via:
 snakemake --cores 4 -npf ingest/results/aligned.fasta ingest/results/sequences.fasta ingest/results/metadata.tsv
 ```
 
+Ingest will fetch genomes from NCBI's Entrez API and write to `./ingest/data/genbank.gb`.
+As of mid 2023 there are around ~11k genomes and the full GenBank file is ~150Mb.
 
 ## Phylo
 
````
13 changes: 12 additions & 1 deletion ingest/ingest.smk
```diff
@@ -1,6 +1,17 @@
+rule fetch_genbank:
+    params:
+        term = 'Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]'
+    output:
+        genbank = "ingest/data/genbank.gb",
+    retries: 1 # Requires snakemake 7.7.0 or later
+    shell:
+        """
+        ingest/scripts/fetch-genbank.py --term {params.term:q} --output {output.genbank}
+        """
+
 rule parse_genbank:
     input:
-        genbank = "ingest/data/genbank_sequences.gb",
+        genbank = "ingest/data/genbank.gb",
     output:
         sequences = "ingest/results/genbank.fasta",
         metadata = "ingest/results/genbank.tsv",
```
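For context on what the downstream `parse_genbank` rule does with `genbank.gb`: a GenBank flat file is a series of records terminated by `//`, with the sequence listed under an `ORIGIN` header. The repo's actual parser is not shown in this commit; the following is only a rough stdlib sketch over a fabricated two-record file:

```python
import re

# A tiny, fabricated two-record GenBank flat file (not real HBV data).
GENBANK_TEXT = """\
LOCUS       AB000001                  20 bp    DNA     circular VRL 01-JAN-2001
ACCESSION   AB000001
ORIGIN
        1 ttccacaacc ttccaccaaa
//
LOCUS       AB000002                  10 bp    DNA     circular VRL 01-JAN-2001
ACCESSION   AB000002
ORIGIN
        1 aacctcctcc
//
"""

def parse_records(text):
    """Yield (accession, sequence) for each //-terminated record."""
    for chunk in text.split("//\n"):
        if "ACCESSION" not in chunk:
            continue
        accession = re.search(r"^ACCESSION\s+(\S+)", chunk, flags=re.M).group(1)
        # Everything after ORIGIN is sequence interleaved with coordinates.
        origin = chunk.split("ORIGIN", 1)[1]
        yield accession, re.sub(r"[\s\d]", "", origin)

records = dict(parse_records(GENBANK_TEXT))
print(records)  # → {'AB000001': 'ttccacaaccttccaccaaa', 'AB000002': 'aacctcctcc'}
```

In practice `Bio.SeqIO.parse(handle, "genbank")`, as used by the fetch script below, handles all of this (including the features block) robustly.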
43 changes: 43 additions & 0 deletions ingest/scripts/fetch-genbank.py
```python
#!/usr/bin/env python3

import json
import argparse
from Bio import SeqIO, Entrez

# To use the efetch API, the docs indicate only around 200 IDs should be provided per request
# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
BATCH_SIZE = 200

# As of mid 2023 there are ~11k records for HBV
MAX_IDS = 100_000

Entrez.email = "hello@nextstrain.org"

def parse_args():
    parser = argparse.ArgumentParser(description='Fetch records from GenBank via Entrez')
    parser.add_argument('--term', help='GenBank search term. Replace spaces with "+"', default='Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]')
    parser.add_argument('--output', required=True, type=str, help='Output file (GenBank format)')
    return parser.parse_args()

def get_sequence_ids(term):
    handle = Entrez.esearch(db="nucleotide", term=term, retmode="json", retmax=MAX_IDS)
    data = json.loads(handle.read())
    ids = data['esearchresult']['idlist']
    print(f"Search term '{term}' returned {len(ids)} IDs. Processing in batches of n={BATCH_SIZE}.")
    idx = 0
    while idx < len(ids):
        yield ids[idx:idx + BATCH_SIZE]
        idx += BATCH_SIZE

def fetch(id_list):
    handle = Entrez.efetch(db="nucleotide", id=','.join(id_list), rettype="gb", retmode="text")
    return SeqIO.parse(handle, "genbank")

if __name__ == "__main__":
    args = parse_args()

    genomes = []
    for id_list in get_sequence_ids(args.term):
        genomes.extend(fetch(id_list))

    SeqIO.write(genomes, args.output, "genbank")
```
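The ID batching inside `get_sequence_ids` is independent of Entrez and can be sanity-checked on its own. A stand-alone equivalent of that loop (the `batched` helper name is hypothetical, but the slicing logic is the same):

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = [str(i) for i in range(5)]
print(list(batched(ids, 2)))  # → [['0', '1'], ['2', '3'], ['4']]
```

With the real `BATCH_SIZE = 200`, ~11k IDs therefore become ~55 efetch requests.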
