Fetch records from GenBank via Entrez
This uses the Entrez API rather than the approach employed by our other
ingest pipelines (e.g. https://github.com/nextstrain/monkeypox/blob/master/ingest/bin/genbank-url)
as a lot of important metadata for HBV is contained in the features
block of the GenBank record.
jameshadfield committed Jun 27, 2023
1 parent 80bdf40 commit 0339f27
Showing 3 changed files with 57 additions and 9 deletions.
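The commit message's motivation can be illustrated with a toy example (not part of this commit): in a GenBank record, metadata such as country and collection date lives in qualifiers inside the `source` feature rather than in top-level fields, which is why a features-aware fetch is preferable to a plain URL download. A minimal stdlib-only sketch, using made-up qualifier values:

```python
import re

# A trimmed, fabricated source-feature block in GenBank layout
# (illustrative values, not taken from a real HBV record).
FEATURE_BLOCK = """\
     source          1..3215
                     /organism="Hepatitis B virus"
                     /mol_type="genomic DNA"
                     /country="Japan"
                     /collection_date="2001"
"""

def parse_source_qualifiers(text):
    """Return a dict of /key="value" qualifiers from a feature block."""
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))

quals = parse_source_qualifiers(FEATURE_BLOCK)
print(quals["country"], quals["collection_date"])  # → Japan 2001
```

In the pipeline itself this extraction is done from the parsed GenBank records rather than by regex; the sketch only shows where the metadata sits.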
10 changes: 2 additions & 8 deletions README.md
````diff
@@ -4,14 +4,6 @@ Currently a WIP
 
 Based on Katie Kistler's work in [blab/adaptive-evolution](https://github.com/blab/adaptive-evolution)
 
-## Obtain data
-
-1. Obtain `./ingest/genbank_sequences.gb`
-   - example search: [(Hepatitis B virus) AND (complete genome)](https://www.ncbi.nlm.nih.gov/nuccore/?term=(Hepatitis+B+virus)+AND+(complete+genome)), which returns ~11k samples
-   - Could also explore `"Hepatitis B virus"[porgn:__txid10407]`
-   - download as "Complete Record" / "File" / "GenBank (full)", a ~100Mb file.
-
-
 ## Ingest
 
 You can run the ingest part of the snakemake pipeline via:
@@ -20,6 +12,8 @@ You can run the ingest part of the snakemake pipeline via:
 snakemake --cores 4 -npf ingest/results/aligned.fasta ingest/results/sequences.fasta ingest/results/metadata.tsv
 ```
 
+Ingest will fetch genomes from NCBI's Entrez API and write to `./ingest/data/genbank.gb`.
+As of mid 2023 there are around ~11k genomes and the full GenBank file is ~150Mb.
 
 ## Phylo
 
````
13 changes: 12 additions & 1 deletion ingest/ingest.smk
```diff
@@ -1,6 +1,17 @@
+rule fetch_genbank:
+    params:
+        term = 'Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]'
+    output:
+        genbank = "ingest/data/genbank.gb",
+    retries: 1 # Requires snakemake 7.7.0 or later
+    shell:
+        """
+        ingest/scripts/fetch-genbank.py --term {params.term:q} --output {output.genbank}
+        """
+
 rule parse_genbank:
     input:
-        genbank = "ingest/data/genbank_sequences.gb",
+        genbank = "ingest/data/genbank.gb",
     output:
         sequences = "ingest/results/genbank.fasta",
         metadata = "ingest/results/genbank.tsv",
```
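For context on what the downstream `parse_genbank` rule does with `genbank.gb`: a GenBank flat file is a series of records terminated by `//`, with the sequence listed under an `ORIGIN` header. The repo's actual parser is not shown in this commit; the following is only a rough stdlib sketch over a fabricated two-record file:

```python
import re

# A tiny, fabricated two-record GenBank flat file (not real HBV data).
GENBANK_TEXT = """\
LOCUS       AB000001                  20 bp    DNA     circular VRL 01-JAN-2001
ACCESSION   AB000001
ORIGIN
        1 ttccacaacc ttccaccaaa
//
LOCUS       AB000002                  10 bp    DNA     circular VRL 01-JAN-2001
ACCESSION   AB000002
ORIGIN
        1 aacctcctcc
//
"""

def parse_records(text):
    """Yield (accession, sequence) for each //-terminated record."""
    for chunk in text.split("//\n"):
        if "ACCESSION" not in chunk:
            continue
        accession = re.search(r"^ACCESSION\s+(\S+)", chunk, flags=re.M).group(1)
        # Everything after ORIGIN is sequence interleaved with coordinates.
        origin = chunk.split("ORIGIN", 1)[1]
        yield accession, re.sub(r"[\s\d]", "", origin)

records = dict(parse_records(GENBANK_TEXT))
print(records)  # → {'AB000001': 'ttccacaaccttccaccaaa', 'AB000002': 'aacctcctcc'}
```

In practice `Bio.SeqIO.parse(handle, "genbank")`, as used by the fetch script below, handles all of this (including the features block) robustly.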
43 changes: 43 additions & 0 deletions ingest/scripts/fetch-genbank.py
```python
#!/usr/bin/env python3

import json
import argparse
from Bio import SeqIO, Entrez

# To use the efetch API, the docs indicate only around 200 IDs should be provided per request
# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
BATCH_SIZE = 200

# As of mid 2023 there are ~11k records for HBV
MAX_IDS = 100_000

Entrez.email = "hello@nextstrain.org"

def parse_args():
    parser = argparse.ArgumentParser(description='Fetch records from GenBank via Entrez')
    parser.add_argument('--term', help='GenBank search term. Replace spaces with "+"', default='Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]')
    parser.add_argument('--output', required=True, type=str, help='Output file (GenBank format)')
    return parser.parse_args()

def get_sequence_ids(term):
    handle = Entrez.esearch(db="nucleotide", term=term, retmode="json", retmax=MAX_IDS)
    data = json.loads(handle.read())
    ids = data['esearchresult']['idlist']
    print(f"Search term '{term}' returned {len(ids)} IDs. Processing in batches of n={BATCH_SIZE}.")
    idx = 0
    while idx < len(ids):
        yield ids[idx:idx + BATCH_SIZE]
        idx += BATCH_SIZE

def fetch(id_list):
    handle = Entrez.efetch(db="nucleotide", id=','.join(id_list), rettype="gb", retmode="text")
    return SeqIO.parse(handle, "genbank")

if __name__ == "__main__":
    args = parse_args()

    genomes = []
    for id_list in get_sequence_ids(args.term):
        genomes.extend(fetch(id_list))

    SeqIO.write(genomes, args.output, "genbank")
```
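The ID batching inside `get_sequence_ids` is independent of Entrez and can be sanity-checked on its own. A stand-alone equivalent of that loop (the `batched` helper name is hypothetical, but the slicing logic is the same):

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = [str(i) for i in range(5)]
print(list(batched(ids, 2)))  # → [['0', '1'], ['2', '3'], ['4']]
```

With the real `BATCH_SIZE = 200`, ~11k IDs therefore become ~55 efetch requests.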
