nextstrain.org/hbv/ingest

This is the ingest pipeline for hepatitis B virus (HBV) sequences.

NOTE: This ingest pipeline is in development and the inferred metadata (especially, but not limited to, "clade_nextclade") should not be used for scientific results.

Software requirements

Follow the standard installation instructions for Nextstrain's suite of software tools.

Usage

NOTE: These command examples assume you are within the ingest directory.

snakemake --cores 4

This produces a number of intermediate files in data/ as well as three files in results/ for downstream analysis:

results/metadata.tsv
results/sequences.fasta
results/aligned.fasta
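
A quick consistency check between the metadata and sequence outputs can be sketched as below. This is a minimal illustration, not part of the pipeline: the helper names are hypothetical, and the ID column name `accession` is an assumption that should be checked against the header of your own results/metadata.tsv.

```python
import csv


def read_fasta_ids(path):
    """Return the first whitespace-delimited token of every FASTA header line."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def read_metadata_ids(path, id_column="accession"):
    """Return the values of `id_column` from a tab-separated file with a header row.

    The default column name "accession" is an assumption, not confirmed by the README.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row[id_column] for row in reader]


def compare_ids(metadata_path, fasta_path, id_column="accession"):
    """Report which IDs appear in only one of the two files."""
    meta = set(read_metadata_ids(metadata_path, id_column))
    seqs = set(read_fasta_ids(fasta_path))
    return {
        "only_in_metadata": sorted(meta - seqs),
        "only_in_fasta": sorted(seqs - meta),
    }
```

From within the ingest directory, `compare_ids("results/metadata.tsv", "results/sequences.fasta")` would return two empty lists if every record has both metadata and a sequence.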

Steps involved

GenBank data as inputs

GenBank sequences and metadata are fetched via an NCBI Entrez query. As of mid-2024 there are around 11.5k genomes and the full GenBank file is ~150 MB.

Genomes rotated to use a consistent origin

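The HBV genome is circular, so GenBank records can start at different positions; genomes are rotated so that they all share the same origin. A Jupyter notebook explores the process behind this: see ../notebooks/alignment-qc.ipynb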
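
The idea of rotating a circular genome to a shared origin can be sketched as follows. This is an illustration only, not the method used by this pipeline (the notebook above documents that). The function names are hypothetical, and anchoring on an exact-match motif is an assumption made for simplicity; a real workflow would more likely locate the origin by aligning to a reference.

```python
def rotate_to_origin(sequence, origin_index):
    """Rotate a circular sequence so that `origin_index` (0-based) becomes position 0."""
    if not sequence:
        return sequence
    k = origin_index % len(sequence)
    return sequence[k:] + sequence[:k]


def find_circular(sequence, motif):
    """Return the 0-based start of `motif` in a circular sequence, or -1 if absent.

    The search runs over the sequence extended by its own prefix, so matches
    that span the junction between the linear end and start are also found.
    """
    if not motif or len(motif) > len(sequence):
        return -1
    extended = sequence + sequence[: len(motif) - 1]
    return extended.find(motif)


def rotate_to_motif(sequence, motif):
    """Rotate a circular sequence so that it begins with the first match of `motif`."""
    index = find_circular(sequence, motif)
    if index < 0:
        raise ValueError("motif not found in circular sequence")
    return rotate_to_origin(sequence, index)
```

Two records of the same circular genome that were linearised at different positions become directly comparable after rotation: `rotate_to_motif("TTGACG", "CGTT")` and `rotate_to_motif("GACGTT", "CGTT")` both give `"CGTTGA"`.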

Accuracy of Nextclade inference

Nextclade v3 is used to align all genomes and assign genotypes based on a guide tree we have created.

Preliminary stats can be seen in ingest/data/metadata.summary.txt after an ingest build has completed.

Configuration

Configuration parameters are in defaults/config.yaml. These may be overridden by using Snakemake's --configfile or --config options.
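
The precedence of these overrides can be sketched roughly as below, assuming Snakemake layers values in the order defaults, then --configfile, then --config, merging nested sections key by key. The `deep_update` helper and every key name here are hypothetical and serve only as an illustration; consult defaults/config.yaml for the real parameters.

```python
import copy


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` merged in; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
defaults = {"example_section": {"example_option": 1, "other_option": "a"}}
from_configfile = {"example_section": {"example_option": 2}}  # stands in for --configfile
from_cli = {"example_section": {"other_option": "b"}}         # stands in for --config

# Later layers win, but untouched keys from earlier layers survive.
effective = deep_update(deep_update(defaults, from_configfile), from_cli)
```

Under these assumptions, an override only needs to list the keys it changes; everything else keeps its default value.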

Environment Variables

None are currently required.

`ingest/vendored`

This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest.

See vendored/README.md for instructions on how to update the vendored scripts.
