A single Python script for downloading genome data from the MGnify database via the EBI Metagenomics API. It downloads several types of genome files, including protein sequences, nucleotide sequences, annotations, and other analysis results.
- Automated genome ID collection: Fetch all available genome IDs from the MGnify API
- Multiple file type support: Download various file types including FAA, FNA, GFF, and annotation files
- Organized directory structure: Automatically organize downloaded files by type into subdirectories
- Resume capability: Continue interrupted downloads from a specific point
- Batch processing: Process thousands of genomes efficiently with configurable delays
- Error handling: Robust error handling with detailed logging and failure tracking
- Flexible input: Use existing genome ID lists or fetch from API
- Python 3.6 or higher
- Required Python packages:
pip install requests
Clone this repository or download the mgnify_downloader.py script:
git clone <repository-url>
cd mgnify-downloader
Run the script with default settings:
python mgnify_downloader.py
This will:
- Fetch all genome IDs from MGnify API
- Save IDs to mgnify_genome_ids.txt
- Download FAA files to mgnify_files/faa/
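The ID collection step can be sketched as a loop over paginated API responses. This is a minimal sketch, not the script's actual code: it assumes the genome listing lives at `/genomes` under the base URL and that each page is a JSON:API document with a `data` list and a `links.next` URL. The function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional

# Assumed endpoint path; verify against the MGnify API documentation.
GENOMES_URL = "https://www.ebi.ac.uk/metagenomics/api/v1/genomes"


def iter_genome_ids(fetch_page: Callable[[str], dict],
                    first_url: str = GENOMES_URL) -> Iterator[str]:
    """Yield accession IDs page by page until no 'next' link is left."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        for item in page.get("data", []):
            yield item["id"]
        # Assumed JSON:API pagination: links.next is a URL or null.
        url = (page.get("links") or {}).get("next")


def fetch_json(url: str) -> dict:
    """Fetch one page over HTTP (requests is imported lazily)."""
    import requests
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def save_ids(ids: Iterable[str], path: str = "mgnify_genome_ids.txt") -> None:
    """Write one accession ID per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for accession in ids:
            handle.write(accession + "\n")
```

Under these assumptions, `save_ids(iter_genome_ids(fetch_json))` would produce the ID file. Passing the page fetcher in as a parameter keeps the pagination logic testable without network access.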
Use an existing list of genome IDs:
python mgnify_downloader.py --ids your_genome_list.txt --type faa,fna -o /path/to/output
Download multiple file types:
python mgnify_downloader.py --ids genome_list.txt --type faa,fna,gff,eggNOG.tsv
General syntax:
python mgnify_downloader.py [OPTIONS]
- --ids FILE: Path to a file containing genome accession IDs (one per line). If not provided, genome IDs are fetched from the API.
- --type TYPES: Comma-separated list of file types to download (default: faa)
- -o, --output DIR: Output directory for downloaded files (default: mgnify_files)
- --delay SECONDS: Delay between genome downloads in seconds (default: 0.2)
- --start-from INDEX: Index to start downloading from, for resuming (default: 0)
- --max-files NUMBER: Maximum number of genomes to process (useful for testing)
- --genome-ids-output FILE: Output file for genome IDs when collecting from the API (default: mgnify_genome_ids.txt)
- --list-types: List available file types and their subdirectories
- --verbose, -v: Enable verbose logging
- --help, -h: Show help message
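For readers who want to see how these options fit together, the following is a sketch of an equivalent `argparse` definition. The flag names and defaults are taken from the option list above. The function name, the `dest` names and the help strings are assumptions, and the real script may define its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="mgnify_downloader.py",
        description="Download genome files from MGnify.",
    )
    parser.add_argument("--ids", metavar="FILE", default=None,
                        help="file with one genome accession ID per line")
    parser.add_argument("--type", dest="types", metavar="TYPES", default="faa",
                        help="comma-separated file types to download")
    parser.add_argument("-o", "--output", metavar="DIR", default="mgnify_files",
                        help="output directory")
    parser.add_argument("--delay", metavar="SECONDS", type=float, default=0.2,
                        help="delay between genome downloads")
    parser.add_argument("--start-from", metavar="INDEX", type=int, default=0,
                        help="index to resume from")
    parser.add_argument("--max-files", metavar="NUMBER", type=int, default=None,
                        help="maximum number of genomes to process")
    parser.add_argument("--genome-ids-output", metavar="FILE",
                        default="mgnify_genome_ids.txt",
                        help="where to save IDs collected from the API")
    parser.add_argument("--list-types", action="store_true",
                        help="list available file types and exit")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose logging")
    return parser
```

`argparse` adds `--help, -h` automatically, so it is not declared here.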
| File Type | Directory | Description |
|---|---|---|
| faa | faa/ | Predicted CDS (amino acids) |
| fna | fna/ | Nucleic Acid Sequence (DNA) |
| fna.fai | fna/ | Nucleic Acid Sequence index |
| gff | gff/ | Genome Annotation |
| dbcan.gff | annotations/ | Genome dbCAN Annotation |
| eggNOG.tsv | annotations/ | eggNOG annotation |
| InterProScan.tsv | annotations/ | InterProScan annotation |
| kegg_pathways.tsv | annotations/ | KEGG Pathway Completeness |
| rRNAs.fasta | rna/ | Genome rRNA Sequence |
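The table can be read as a lookup from file type to subdirectory. The sketch below validates a `--type` value against it and builds a local path. The file-name rule (a dot for `faa`, `fna`, `fna.fai` and `gff`, an underscore otherwise) is inferred from the directory tree shown later in this README. For `dbcan.gff` and `kegg_pathways.tsv`, which the tree does not show, the underscore is an assumption. All names are hypothetical.

```python
from pathlib import Path
from typing import List

# File type -> subdirectory, copied from the table above.
FILE_TYPE_DIRS = {
    "faa": "faa",
    "fna": "fna",
    "fna.fai": "fna",
    "gff": "gff",
    "dbcan.gff": "annotations",
    "eggNOG.tsv": "annotations",
    "InterProScan.tsv": "annotations",
    "kegg_pathways.tsv": "annotations",
    "rRNAs.fasta": "rna",
}

# Types named "<accession>.<type>"; all others are assumed "<accession>_<type>".
DOT_JOINED = {"faa", "fna", "fna.fai", "gff"}


def parse_types(spec: str) -> List[str]:
    """Split a comma-separated --type value and reject unknown types."""
    types = [part.strip() for part in spec.split(",") if part.strip()]
    unknown = [t for t in types if t not in FILE_TYPE_DIRS]
    if unknown:
        raise ValueError("unknown file type(s): " + ", ".join(unknown))
    return types


def local_path(output_dir: str, accession: str, file_type: str) -> Path:
    """Return the path a downloaded file would be stored under."""
    separator = "." if file_type in DOT_JOINED else "_"
    name = f"{accession}{separator}{file_type}"
    return Path(output_dir) / FILE_TYPE_DIRS[file_type] / name
```

For example, `local_path("mgnify_files", "MGYG000000001", "eggNOG.tsv")` points into the `annotations/` subdirectory.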
To see this list in the terminal:
python mgnify_downloader.py --list-types
- Download only protein sequences (FAA files):
  python mgnify_downloader.py --ids genome_list.txt --type faa
- Download sequence files (both nucleotide and protein):
  python mgnify_downloader.py --ids genome_list.txt --type fna,faa
- Download annotation files:
  python mgnify_downloader.py --ids genome_list.txt --type eggNOG.tsv,InterProScan.tsv,kegg_pathways.tsv
- Custom output directory:
  python mgnify_downloader.py --ids genome_list.txt --type faa,fna -o /data/mgnify_genomes
- Resume interrupted download:
  python mgnify_downloader.py --ids genome_list.txt --start-from 1500 --type faa
- Test with limited files:
  python mgnify_downloader.py --ids genome_list.txt --max-files 10 --type faa
- Slower download pace:
  python mgnify_downloader.py --ids genome_list.txt --delay 0.5 --type faa
- Complete workflow from scratch:
  # First, collect all genome IDs and download FAA files
  python mgnify_downloader.py --type faa -o /data/mgnify
  # Then, download additional file types using the collected IDs
  python mgnify_downloader.py --ids mgnify_genome_ids.txt --type fna,gff -o /data/mgnify
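The `--start-from`, `--max-files` and `--delay` options used in these examples can be pictured as a slice over the ID list followed by a paced loop. This is a sketch under two assumptions: `--start-from` is a 0-based index into the ID list, and `--max-files` counts genomes from that index onward. The function names are hypothetical.

```python
import time
from typing import Callable, List, Optional, Sequence


def select_batch(ids: Sequence[str], start_from: int = 0,
                 max_files: Optional[int] = None) -> List[str]:
    """Return the IDs to process: skip `start_from`, then cap at `max_files`."""
    if start_from < 0:
        raise ValueError("start_from must be >= 0")
    batch = list(ids[start_from:])
    if max_files is not None:
        batch = batch[:max_files]
    return batch


def run_batch(ids: Sequence[str], process: Callable[[str], None],
              delay: float = 0.2,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Process each genome in order, pausing `delay` seconds after each one."""
    done = 0
    for accession in ids:
        process(accession)
        done += 1
        sleep(delay)
    return done
```

With these semantics, resuming after an interruption at genome index 1500 means passing `start_from=1500`, matching the `--start-from 1500` example above.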
The script organizes downloaded files into subdirectories by type:
output_directory/
├── faa/ # Protein sequences
│ ├── MGYG000000001.faa
│ ├── MGYG000000002.faa
│ └── ...
├── fna/ # Nucleotide sequences and indices
│ ├── MGYG000000001.fna
│ ├── MGYG000000001.fna.fai
│ └── ...
├── gff/ # Genome annotations
│ ├── MGYG000000001.gff
│ └── ...
├── annotations/ # Various annotation files
│ ├── MGYG000000001_eggNOG.tsv
│ ├── MGYG000000001_InterProScan.tsv
│ └── ...
├── rna/ # RNA sequences
│ ├── MGYG000000001_rRNAs.fasta
│ └── ...
└── failed_downloads.json # List of failed downloads (if any)
The genome IDs file should contain one accession ID per line:
MGYG000000001
MGYG000000002
MGYG000000003
...
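A reader for this format could look like the sketch below. Skipping blank lines follows directly from the one-ID-per-line format. Ignoring `#` comment lines and dropping duplicate IDs are assumptions added for robustness, not documented behavior, and the function name is hypothetical.

```python
from typing import List


def read_genome_ids(path: str) -> List[str]:
    """Read accession IDs from a text file, one per line, preserving order."""
    ids: List[str] = []
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            accession = line.strip()
            # Assumption: blank lines and '#' comments are ignored.
            if not accession or accession.startswith("#"):
                continue
            # Assumption: repeated IDs are only processed once.
            if accession not in seen:
                seen.add(accession)
                ids.append(accession)
    return ids
```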
- Comprehensive logging: All operations are logged with timestamps
- Failure tracking: Failed downloads are saved to failed_downloads.json
- Resume capability: Skip already downloaded files automatically
- Graceful interruption: Handle Ctrl+C interruptions cleanly
- Network resilience: Retry failed requests and continue with next files
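The retry and failure-tracking behavior listed above can be sketched as follows. The number of attempts, the linear backoff and the layout of the JSON records are all assumptions, since the README does not specify them, and the function names are hypothetical.

```python
import json
import time
from typing import Callable, Dict, List


def with_retries(action: Callable[[], None], attempts: int = 3,
                 backoff: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `action`, retrying on errors; return True on success, False otherwise."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return True
        except Exception:
            # KeyboardInterrupt is not an Exception subclass, so Ctrl+C
            # still propagates and can be handled by the caller.
            if attempt < attempts:
                sleep(backoff * attempt)  # wait longer after each failure
    return False


def save_failures(failures: List[Dict[str, str]],
                  path: str = "failed_downloads.json") -> None:
    """Write the collected failure records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(failures, handle, indent=2)
```

A caller would append a record such as `{"accession": ..., "file_type": ...}` whenever `with_retries` returns `False`, move on to the next file, and call `save_failures` once at the end of the run.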
- INFO (default): Shows progress and summary information
- DEBUG (with --verbose): Shows detailed request information
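The two log levels map naturally onto Python's standard `logging` module. A minimal sketch, assuming a logger named `mgnify_downloader` and a timestamped format (both hypothetical):

```python
import logging


def configure_logging(verbose: bool = False) -> logging.Logger:
    """Return a logger at DEBUG level when verbose, INFO otherwise."""
    logger = logging.getLogger("mgnify_downloader")  # hypothetical name
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    if not logger.handlers:
        # Attach a single stream handler; the guard avoids duplicates
        # if the function is called more than once.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger
```

With this setup, `logger.debug(...)` messages about individual requests only appear when `verbose` is true, while progress and summary messages logged with `logger.info(...)` always appear.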
- Rate limiting: Built-in delays between requests to respect server limits
- Chunked downloads: Large files are downloaded in chunks to handle memory efficiently
- Multiple file types per genome: All requested file types for a genome are downloaded one after another before moving to the next genome
- Skip existing files: Automatically skip files that already exist
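Chunked downloading and skipping of existing files can be sketched with `requests` in streaming mode. The chunk size, the timeout and the temporary `.part` file are assumptions made for this illustration, and the function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], dest: Path) -> int:
    """Write chunks to a temporary file, then move it into place."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # assumption: temp-file suffix
    written = 0
    with open(tmp, "wb") as handle:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                handle.write(chunk)
                written += len(chunk)
    os.replace(tmp, dest)  # an interrupted run never leaves a partial dest
    return written


def download_file(url: str, dest: Path, chunk_size: int = 8192) -> bool:
    """Download url to dest; return False if dest already exists and is non-empty."""
    if dest.exists() and dest.stat().st_size > 0:
        return False
    import requests  # imported lazily so the helper above works without it
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        write_chunks(response.iter_content(chunk_size=chunk_size), dest)
    return True
```

Because the body is consumed through `iter_content`, only one chunk is held in memory at a time, regardless of file size.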
- For large datasets (>1000 genomes): --delay 0.2 (default)
- For smaller datasets: --delay 0.1
- For slow connections: --delay 0.5
- "File not found" errors: Some genomes may not have all file types available
- Network timeouts: Use larger delay values or retry with resume functionality
- Disk space: Ensure sufficient disk space (FAA files are typically 1-10MB each)
- Memory usage: Files are downloaded in chunks, so memory use stays low even for large datasets
- Check available file types:
  python mgnify_downloader.py --list-types
- Enable verbose logging:
  python mgnify_downloader.py --verbose [other options]
- Test with a small dataset:
  python mgnify_downloader.py --ids genome_list.txt --max-files 5 --type faa
This script uses the EBI Metagenomics API:
- Base URL: https://www.ebi.ac.uk/metagenomics/api/v1
- Rate limiting: Built-in delays to respect server limits
- Authentication: Not required for public data
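As an illustration of how request URLs might be built from the base URL, the sketch below assumes that genome downloads are exposed under `/genomes/<accession>/downloads/<file name>`. This path layout is an assumption that should be checked against the MGnify API documentation, and the function names are hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v1"


def genome_downloads_url(accession: str) -> str:
    """URL listing the downloadable files of one genome (assumed path layout)."""
    return f"{BASE_URL}/genomes/{quote(accession)}/downloads"


def genome_file_url(accession: str, file_name: str) -> str:
    """URL of a single downloadable file (assumed path layout)."""
    return f"{genome_downloads_url(accession)}/{quote(file_name)}"
```

No authentication headers are attached, in line with the note that public data does not require authentication.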
License: MIT
If you use this tool in your research, please cite the MGnify database:
Richardson, L.J., Allen, B., Baldi, G. et al. MGnify: the microbiome analysis resource in 2023. Nucleic Acids Res (2023). https://doi.org/10.1093/nar/gkac1080
- Initial release with basic download functionality
- Support for multiple file types
- Organized directory structure
- Resume capability
- Comprehensive error handling