MGnify Genome Downloader

A single Python script for downloading genome data from the MGnify database via the EBI Metagenomics API. It downloads several kinds of genome files, including protein sequences, nucleotide sequences, annotations, and other analysis results.

Features

Automated genome ID collection: Fetch all available genome IDs from the MGnify API
Multiple file type support: Download various file types including FAA, FNA, GFF, and annotation files
Organized directory structure: Automatically organize downloaded files by type into subdirectories
Resume capability: Continue interrupted downloads from a specific point
Batch processing: Process thousands of genomes efficiently with configurable delays
Error handling: Robust error handling with detailed logging and failure tracking
Flexible input: Use existing genome ID lists or fetch from API

Installation

Prerequisites

Python 3.6 or higher
Required Python packages:
```
pip install requests
```

Download

Clone this repository or download the mgnify_downloader.py script:

git clone <repository-url>
cd mgnify-downloader

Quick Start

1. Download genome IDs and FAA files (default)

python mgnify_downloader.py

This will:

Fetch all genome IDs from MGnify API
Save IDs to mgnify_genome_ids.txt
Download FAA files to mgnify_files/faa/
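The ID collection step can be pictured as a loop over a paginated listing. The following is a minimal sketch, not the script's actual code. It assumes the listing endpoint returns JSON:API-style pages with a `data` array whose items carry an `id`, plus a `links.next` URL that is empty on the last page. The names `collect_genome_ids`, `save_ids`, and `fetch_json` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Assumed listing endpoint under the base URL given in "API Information".
GENOMES_URL = "https://www.ebi.ac.uk/metagenomics/api/v1/genomes"


def collect_genome_ids(fetch_json: Callable[[str], Dict],
                       start_url: str = GENOMES_URL) -> List[str]:
    """Follow `links.next` until it is empty, collecting every item's `id`."""
    ids: List[str] = []
    url: Optional[str] = start_url
    while url:
        page = fetch_json(url)
        ids.extend(item["id"] for item in page.get("data", []))
        # A missing or null `links.next` ends the loop (assumed page layout).
        url = (page.get("links") or {}).get("next")
    return ids


def save_ids(ids: List[str], path: str = "mgnify_genome_ids.txt") -> None:
    """Write one accession per line, the format the --ids option expects."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(ids) + "\n")
```

With `requests` installed, `fetch_json` could be as simple as `lambda url: requests.get(url, timeout=30).json()`. Passing the fetcher in as a parameter keeps the pagination logic testable without network access.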

2. Use existing genome ID list

python mgnify_downloader.py --ids your_genome_list.txt --type faa,fna -o /path/to/output

3. Download multiple file types

python mgnify_downloader.py --ids genome_list.txt --type faa,fna,gff,eggNOG.tsv

Usage

Command Line Options

python mgnify_downloader.py [OPTIONS]

Required Arguments

None. --ids FILE is optional: it gives the path to a file containing genome accession IDs (one per line). If it is omitted, genome IDs are fetched from the API.

Optional Arguments

--type TYPES: Comma-separated list of file types to download (default: faa)
-o, --output DIR: Output directory for downloaded files (default: mgnify_files)
--delay SECONDS: Delay between genome downloads in seconds (default: 0.2)
--start-from INDEX: Index to start downloading from for resuming (default: 0)
--max-files NUMBER: Maximum number of genomes to process (useful for testing)
--genome-ids-output FILE: Output file for genome IDs when collecting from API (default: mgnify_genome_ids.txt)
--list-types: List available file types and their subdirectories
--verbose, -v: Enable verbose logging
--help, -h: Show help message
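The options listed above map naturally onto an `argparse` parser. The sketch below shows one way such a parser could be declared. The flags and defaults are taken from this README, but the function name `build_parser` and the `dest` names are assumptions, not the script's actual source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; --help/-h is added by argparse itself."""
    parser = argparse.ArgumentParser(prog="mgnify_downloader.py")
    parser.add_argument("--ids", metavar="FILE", default=None,
                        help="file with one genome accession per line")
    parser.add_argument("--type", dest="types", metavar="TYPES", default="faa",
                        help="comma-separated file types")
    parser.add_argument("-o", "--output", metavar="DIR", default="mgnify_files")
    parser.add_argument("--delay", metavar="SECONDS", type=float, default=0.2)
    parser.add_argument("--start-from", metavar="INDEX", type=int, default=0)
    parser.add_argument("--max-files", metavar="NUMBER", type=int, default=None)
    parser.add_argument("--genome-ids-output", metavar="FILE",
                        default="mgnify_genome_ids.txt")
    parser.add_argument("--list-types", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```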

Available File Types

| File Type | Directory | Description |
| --- | --- | --- |
| `faa` | `faa/` | Predicted CDS (amino acids) |
| `fna` | `fna/` | Nucleic Acid Sequence (DNA) |
| `fna.fai` | `fna/` | Nucleic Acid Sequence index |
| `gff` | `gff/` | Genome Annotation |
| `dbcan.gff` | `annotations/` | Genome dbCAN Annotation |
| `eggNOG.tsv` | `annotations/` | eggNOG annotation |
| `InterProScan.tsv` | `annotations/` | InterProScan annotation |
| `kegg_pathways.tsv` | `annotations/` | KEGG Pathway Completeness |
| `rRNAs.fasta` | `rna/` | Genome rRNA Sequence |

To see this list in the terminal:

python mgnify_downloader.py --list-types
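The type-to-directory mapping can be expressed as a small lookup table. The sketch below is illustrative only. The directory names come from the table above, while the file naming rule is inferred from the examples under "Output Structure": a dot for sequence and GFF files, an underscore otherwise. The names for `dbcan.gff`, `InterProScan.tsv`, and `kegg_pathways.tsv` files are therefore assumptions, and `parse_types` and `destination` are hypothetical helpers.

```python
from pathlib import Path
from typing import List

# File type -> subdirectory, copied from the table above.
TYPE_DIRS = {
    "faa": "faa",
    "fna": "fna",
    "fna.fai": "fna",
    "gff": "gff",
    "dbcan.gff": "annotations",
    "eggNOG.tsv": "annotations",
    "InterProScan.tsv": "annotations",
    "kegg_pathways.tsv": "annotations",
    "rRNAs.fasta": "rna",
}

# Types named "<accession>.<type>"; all others are assumed to be
# "<accession>_<type>", as the eggNOG and rRNA examples suggest.
DOT_JOINED = {"faa", "fna", "fna.fai", "gff"}


def parse_types(value: str) -> List[str]:
    """Split a --type value such as 'faa,fna' and reject unknown types."""
    types = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [t for t in types if t not in TYPE_DIRS]
    if unknown:
        raise ValueError("unknown file type(s): " + ", ".join(unknown))
    return types


def destination(output_dir: str, accession: str, file_type: str) -> Path:
    """Return the path a downloaded file would be saved to."""
    separator = "." if file_type in DOT_JOINED else "_"
    filename = f"{accession}{separator}{file_type}"
    return Path(output_dir) / TYPE_DIRS[file_type] / filename
```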

Examples

Basic Usage

Download only protein sequences (FAA files):

python mgnify_downloader.py --ids genome_list.txt --type faa

Download sequence files (both nucleotide and protein):

python mgnify_downloader.py --ids genome_list.txt --type fna,faa

Download annotation files:

python mgnify_downloader.py --ids genome_list.txt --type eggNOG.tsv,InterProScan.tsv,kegg_pathways.tsv

Advanced Usage

Custom output directory:

python mgnify_downloader.py --ids genome_list.txt --type faa,fna -o /data/mgnify_genomes

Resume interrupted download:

python mgnify_downloader.py --ids genome_list.txt --start-from 1500 --type faa

Test with limited files:

python mgnify_downloader.py --ids genome_list.txt --max-files 10 --type faa

Slower download pace:

python mgnify_downloader.py --ids genome_list.txt --delay 0.5 --type faa

Complete workflow from scratch:

# First, collect all genome IDs and download FAA files
python mgnify_downloader.py --type faa -o /data/mgnify

# Then, download additional file types using the collected IDs
python mgnify_downloader.py --ids mgnify_genome_ids.txt --type fna,gff -o /data/mgnify

Output Structure

The script organizes downloaded files into subdirectories by type:

output_directory/
├── faa/ # Protein sequences
│ ├── MGYG000000001.faa
│ ├── MGYG000000002.faa
│ └── ...
├── fna/ # Nucleotide sequences and indices
│ ├── MGYG000000001.fna
│ ├── MGYG000000001.fna.fai
│ └── ...
├── gff/ # Genome annotations
│ ├── MGYG000000001.gff
│ └── ...
├── annotations/ # Various annotation files
│ ├── MGYG000000001_eggNOG.tsv
│ ├── MGYG000000001_InterProScan.tsv
│ └── ...
├── rna/ # RNA sequences
│ ├── MGYG000000001_rRNAs.fasta
│ └── ...
└── failed_downloads.json # List of failed downloads (if any)

Input Format

The genome IDs file should contain one accession ID per line:

MGYG000000001
MGYG000000002
MGYG000000003
...
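Reading this format and applying --start-from and --max-files amounts to a filter and a slice. The sketch below assumes that blank lines are ignored and that --start-from is a zero-based index into the list; neither is confirmed by the script. `parse_ids` and `select_batch` are hypothetical names.

```python
from typing import List, Optional


def parse_ids(text: str) -> List[str]:
    """Return the non-empty lines of an ID file, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_batch(ids: List[str], start_from: int = 0,
                 max_files: Optional[int] = None) -> List[str]:
    """Skip the first `start_from` IDs, then keep at most `max_files`."""
    end = None if max_files is None else start_from + max_files
    return ids[start_from:end]
```

Under these assumptions, resuming with `--start-from 1500` would skip the first 1500 accessions in the file and continue with the 1501st.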

Error Handling and Logging

Comprehensive logging: All operations are logged with timestamps
Failure tracking: Failed downloads are saved to failed_downloads.json
Resume capability: Skip already downloaded files automatically
Graceful interruption: Handle Ctrl+C interruptions cleanly
Network resilience: Retry failed requests and continue with next files
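The retry and failure-tracking behaviour described above could be structured roughly as follows. This is a sketch under stated assumptions: the number of attempts, the linear backoff, and the JSON layout of failed_downloads.json are all invented for illustration, and `run_with_retries` and `record_failures` are hypothetical names.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List


def run_with_retries(task: Callable[[], None], attempts: int = 3,
                     backoff: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `task` up to `attempts` times; return True once it succeeds."""
    for attempt in range(1, attempts + 1):
        try:
            task()
            return True
        except Exception:
            # KeyboardInterrupt is not an Exception subclass, so Ctrl+C
            # still propagates to the caller instead of being retried.
            if attempt < attempts:
                sleep(backoff * attempt)
    return False


def record_failures(failed: List[Dict[str, str]], output_dir: str) -> Path:
    """Write the failed items to failed_downloads.json (assumed layout)."""
    path = Path(output_dir) / "failed_downloads.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(failed, indent=2), encoding="utf-8")
    return path
```

A caller would append an entry such as `{"accession": "MGYG000000001", "type": "faa"}` whenever `run_with_retries` returns False, move on to the next file, and call `record_failures` once at the end.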

Log Levels

INFO (default): Shows progress and summary information
DEBUG (with --verbose): Shows detailed request information

Performance Considerations

Rate limiting: Built-in delays between requests to respect server limits
Chunked downloads: Large files are downloaded in chunks to handle memory efficiently
Sequential processing: The file types requested for each genome are downloaded one after another
Skip existing files: Automatically skip files that already exist
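Chunked downloading and skipping existing files can be combined in one helper. The sketch below is an illustration rather than the script's code. It expects a `requests.Session`-like object, and the chunk size, the timeout, and the temporary `.part` file are assumptions added here so that an interrupted transfer never leaves a truncated file under the final name. `download_file` is a hypothetical name.

```python
from pathlib import Path


def download_file(session, url: str, dest: Path, chunk_size: int = 8192) -> bool:
    """Stream `url` to `dest`; return False if `dest` already exists.

    `session` is expected to behave like requests.Session.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    response = session.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(partial, "wb") as handle:
        # Only one chunk is held in memory at a time.
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                handle.write(chunk)
    partial.replace(dest)  # rename only after the full body was written
    return True
```

In practice this would be called as `download_file(requests.Session(), url, dest)`. Because existing files are skipped, re-running the same command after an interruption only fetches what is still missing.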

Recommended Settings

For large datasets (>1000 genomes): --delay 0.2 (default)
For smaller datasets: --delay 0.1
For slow connections: --delay 0.5
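The effect of --delay can be seen in a sketch of the main loop: every requested file type is handled for one genome, then the loop pauses before moving to the next genome. This is an assumed structure, not the script's source. `process_genomes` and `handle` are hypothetical names, and whether the script also sleeps after the last genome is not known.

```python
import time
from typing import Callable, Sequence


def process_genomes(ids: Sequence[str], types: Sequence[str],
                    handle: Callable[[str, str], None],
                    delay: float = 0.2,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Call handle(accession, file_type) for each pair, pausing between genomes."""
    for index, accession in enumerate(ids):
        for file_type in types:
            handle(accession, file_type)
        if index < len(ids) - 1:  # no pause needed after the final genome
            sleep(delay)
```

With this structure the total time spent waiting is roughly `delay` multiplied by the number of genomes, so 10,000 genomes at the default 0.2 seconds add about 33 minutes on top of the transfer time.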

Troubleshooting

Common Issues

"File not found" errors: Some genomes may not have all file types available
Network timeouts: Use larger delay values or retry with resume functionality
Disk space: Ensure sufficient disk space (FAA files are typically 1-10 MB each)
Memory usage: Files are downloaded in chunks, so memory use should stay low even for large files and large ID lists

Getting Help

Check available file types:

python mgnify_downloader.py --list-types

Enable verbose logging:

python mgnify_downloader.py --verbose [other options]

Test with small dataset:

python mgnify_downloader.py --ids genome_list.txt --max-files 5 --type faa

API Information

This script uses the EBI Metagenomics API:

Base URL: https://www.ebi.ac.uk/metagenomics/api/v1
Rate limiting: Built-in delays to respect server limits
Authentication: Not required for public data
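A download URL can be assembled from the base URL, the genome accession, and the file name. The path layout in the sketch below (`/genomes/<accession>/downloads/<filename>`) is an assumption about the API and should be checked against the official MGnify API documentation. `download_url` is a hypothetical helper.

```python
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v1"


def download_url(accession: str, filename: str, base_url: str = BASE_URL) -> str:
    """Build a file URL, assuming a /genomes/<accession>/downloads/<filename> layout."""
    return f"{base_url.rstrip('/')}/genomes/{accession}/downloads/{filename}"


if __name__ == "__main__":
    print(download_url("MGYG000000001", "MGYG000000001.faa"))
```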

License

MIT

Contributing

Citation

If you use this tool in your research, please cite the MGnify database:

Richardson, L.J., Allen, B., Baldi, G. et al. MGnify: the microbiome analysis resource in 2023. Nucleic Acids Res (2023). https://doi.org/10.1093/nar/gkac1080

Changelog

Version 0.0.1

Initial release with basic download functionality
Support for multiple file types
Organized directory structure
Resume capability
Comprehensive error handling
