MGnify Genome Downloader

A single Python script for downloading genome data from the MGnify database via the EBI Metagenomics API. It downloads several types of genome files, including protein sequences, nucleotide sequences, annotations, and other analysis results.

Features

  • Automated genome ID collection: Fetch all available genome IDs from the MGnify API
  • Multiple file type support: Download various file types including FAA, FNA, GFF, and annotation files
  • Organized directory structure: Automatically organize downloaded files by type into subdirectories
  • Resume capability: Continue interrupted downloads from a specific point
  • Batch processing: Process thousands of genomes efficiently with configurable delays
  • Error handling: Robust error handling with detailed logging and failure tracking
  • Flexible input: Use existing genome ID lists or fetch from API

Installation

Prerequisites

  • Python 3.6 or higher
  • Required Python packages:
    pip install requests

Download

Clone this repository or download the mgnify_downloader.py script:

git clone <repository-url>
cd MGnify-Downloader

Quick Start

1. Download genome IDs and FAA files (default)

python mgnify_downloader.py

This will:

  • Fetch all genome IDs from MGnify API
  • Save IDs to mgnify_genome_ids.txt
  • Download FAA files to mgnify_files/faa/

2. Use existing genome ID list

python mgnify_downloader.py --ids your_genome_list.txt --type faa,fna -o /path/to/output

3. Download multiple file types

python mgnify_downloader.py --ids genome_list.txt --type faa,fna,gff,eggNOG.tsv

Usage

Command Line Options

python mgnify_downloader.py [OPTIONS]

Input Arguments

  • --ids FILE: Path to a file containing genome accession IDs (one per line). Optional: if omitted, genome IDs are fetched from the API.

Optional Arguments

  • --type TYPES: Comma-separated list of file types to download (default: faa)
  • -o, --output DIR: Output directory for downloaded files (default: mgnify_files)
  • --delay SECONDS: Delay between genome downloads in seconds (default: 0.2)
  • --start-from INDEX: Index to start downloading from for resuming (default: 0)
  • --max-files NUMBER: Maximum number of genomes to process (useful for testing)
  • --genome-ids-output FILE: Output file for genome IDs when collecting from API (default: mgnify_genome_ids.txt)
  • --list-types: List available file types and their subdirectories
  • --verbose, -v: Enable verbose logging
  • --help, -h: Show help message
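The interaction between --start-from and --max-files can be pictured as a slice over the ID list. The sketch below is an illustration, not the script's actual code: the function name `select_batch` is hypothetical, and it assumes that the index is zero-based and that --max-files counts genomes from the resume point.

```python
from typing import List, Optional


def select_batch(ids: List[str], start_from: int = 0,
                 max_files: Optional[int] = None) -> List[str]:
    """Return the genome IDs a single run would process (hypothetical helper)."""
    remaining = ids[start_from:]            # skip everything before the resume index
    if max_files is not None:
        remaining = remaining[:max_files]   # cap the run, e.g. for a quick test
    return remaining


ids = [f"MGYG{n:09d}" for n in range(1, 6)]
print(select_batch(ids, start_from=2, max_files=2))
# ['MGYG000000003', 'MGYG000000004']
```

Under these assumptions, resuming at index 1500 skips the first 1500 entries of the ID file and starts with the 1501st.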

Available File Types

File Type          Directory      Description
faa                faa/           Predicted CDS (amino acids)
fna                fna/           Nucleic Acid Sequence (DNA)
fna.fai            fna/           Nucleic Acid Sequence index
gff                gff/           Genome Annotation
dbcan.gff          annotations/   Genome dbCAN Annotation
eggNOG.tsv         annotations/   eggNOG annotation
InterProScan.tsv   annotations/   InterProScan annotation
kegg_pathways.tsv  annotations/   KEGG Pathway Completeness
rRNAs.fasta        rna/           Genome rRNA Sequence

To see this list in the terminal:

python mgnify_downloader.py --list-types

Examples

Basic Usage

  1. Download only protein sequences (FAA files):

    python mgnify_downloader.py --ids genome_list.txt --type faa
  2. Download sequence files (both nucleotide and protein):

    python mgnify_downloader.py --ids genome_list.txt --type fna,faa
  3. Download annotation files:

    python mgnify_downloader.py --ids genome_list.txt --type eggNOG.tsv,InterProScan.tsv,kegg_pathways.tsv

Advanced Usage

  1. Custom output directory:

    python mgnify_downloader.py --ids genome_list.txt --type faa,fna -o /data/mgnify_genomes
  2. Resume interrupted download:

    python mgnify_downloader.py --ids genome_list.txt --start-from 1500 --type faa
  3. Test with limited files:

    python mgnify_downloader.py --ids genome_list.txt --max-files 10 --type faa
  4. Slower download pace:

    python mgnify_downloader.py --ids genome_list.txt --delay 0.5 --type faa
  5. Complete workflow from scratch:

    # First, collect all genome IDs and download FAA files
    python mgnify_downloader.py --type faa -o /data/mgnify
    
    # Then, download additional file types using the collected IDs
    python mgnify_downloader.py --ids mgnify_genome_ids.txt --type fna,gff -o /data/mgnify

Output Structure

The script organizes downloaded files into subdirectories by type:

output_directory/
├── faa/                      # Protein sequences
│   ├── MGYG000000001.faa
│   ├── MGYG000000002.faa
│   └── ...
├── fna/                      # Nucleotide sequences and indices
│   ├── MGYG000000001.fna
│   ├── MGYG000000001.fna.fai
│   └── ...
├── gff/                      # Genome annotations
│   ├── MGYG000000001.gff
│   └── ...
├── annotations/              # Various annotation files
│   ├── MGYG000000001_eggNOG.tsv
│   ├── MGYG000000001_InterProScan.tsv
│   └── ...
├── rna/                      # RNA sequences
│   ├── MGYG000000001_rRNAs.fasta
│   └── ...
└── failed_downloads.json     # List of failed downloads (if any)
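The layout above implies a simple mapping from (genome ID, file type) to a destination path. The following is a minimal sketch of that mapping, not the script's own code. The directory table is copied from "Available File Types". The file-name rule is inferred from the tree: the tree shows `.` for faa, fna, fna.fai and gff, and `_` for eggNOG.tsv, InterProScan.tsv and rRNAs.fasta. Applying `_` to dbcan.gff and kegg_pathways.tsv is an assumption, and `target_path` is a hypothetical name.

```python
from pathlib import Path

# File type -> subdirectory, as listed under "Available File Types".
SUBDIRS = {
    "faa": "faa", "fna": "fna", "fna.fai": "fna", "gff": "gff",
    "dbcan.gff": "annotations", "eggNOG.tsv": "annotations",
    "InterProScan.tsv": "annotations", "kegg_pathways.tsv": "annotations",
    "rRNAs.fasta": "rna",
}

# Types shown in the tree as "<id>.<type>"; all others are written "<id>_<type>".
# For dbcan.gff and kegg_pathways.tsv the underscore form is an assumption.
DOT_JOINED = {"faa", "fna", "fna.fai", "gff"}


def target_path(output_dir: str, genome_id: str, file_type: str) -> Path:
    """Build the destination path for one file (hypothetical helper)."""
    sep = "." if file_type in DOT_JOINED else "_"
    return Path(output_dir) / SUBDIRS[file_type] / f"{genome_id}{sep}{file_type}"


print(target_path("mgnify_files", "MGYG000000001", "eggNOG.tsv").as_posix())
# mgnify_files/annotations/MGYG000000001_eggNOG.tsv
```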

Input Format

The genome IDs file should contain one accession ID per line:

MGYG000000001
MGYG000000002
MGYG000000003
...
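A file in this format can be parsed in a few lines. This sketch assumes that blank lines and surrounding whitespace should be ignored; how the script itself treats them is not documented here, and `parse_genome_ids` is a hypothetical name.

```python
from typing import Iterable, List


def parse_genome_ids(lines: Iterable[str]) -> List[str]:
    """Return the non-empty, whitespace-stripped accessions in order."""
    ids = []
    for line in lines:
        accession = line.strip()
        if accession:              # ignore blank lines
            ids.append(accession)
    return ids


sample = ["MGYG000000001\n", "\n", "  MGYG000000002  \n", "MGYG000000003"]
print(parse_genome_ids(sample))
# ['MGYG000000001', 'MGYG000000002', 'MGYG000000003']
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `with open("genome_list.txt") as fh: ids = parse_genome_ids(fh)`.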

Error Handling and Logging

  • Comprehensive logging: All operations are logged with timestamps
  • Failure tracking: Failed downloads are saved to failed_downloads.json
  • Resume capability: Skip already downloaded files automatically
  • Graceful interruption: Handle Ctrl+C interruptions cleanly
  • Network resilience: Retry failed requests and continue with next files
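As an illustration of failure tracking and skipping existing files, the sketch below writes a failure report and checks whether a target still needs downloading. Only the file name failed_downloads.json comes from this README. The record fields (`genome_id`, `file_type`, `error`), the empty-file check, and both function names are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def needs_download(dest: Path) -> bool:
    """True unless a non-empty file already exists at dest (assumed rule)."""
    return not (dest.exists() and dest.stat().st_size > 0)


def write_failures(failures: List[Dict[str, str]], output_dir: Path) -> Path:
    """Save failure records as JSON so a later run can revisit them."""
    report = output_dir / "failed_downloads.json"
    report.write_text(json.dumps(failures, indent=2), encoding="utf-8")
    return report


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    failures = [{"genome_id": "MGYG000000002", "file_type": "faa",
                 "error": "HTTP 404"}]          # hypothetical record layout
    report = write_failures(failures, out)
    reloaded = json.loads(report.read_text(encoding="utf-8"))
    still_needed = needs_download(out / "faa" / "MGYG000000001.faa")
```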

Log Levels

  • INFO (default): Shows progress and summary information
  • DEBUG (with --verbose): Shows detailed request information

Performance Considerations

  • Rate limiting: Built-in delays between requests to respect server limits
  • Chunked downloads: Large files are downloaded in chunks to handle memory efficiently
  • Sequential processing: The file types requested for each genome are downloaded one after another; downloads do not run concurrently
  • Skip existing files: Automatically skip files that already exist
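Chunked downloading can be sketched as writing an iterable of byte chunks to disk without holding the whole file in memory. This is not the script's implementation: `save_chunks` is a hypothetical helper, and writing to a temporary `.part` file before renaming is a design choice of this example, so that an interrupted run does not leave a truncated file that later looks complete.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], dest: Path) -> int:
    """Stream chunks to dest and return the number of bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")   # temporary name while writing
    written = 0
    with open(part, "wb") as fh:
        for chunk in chunks:
            if chunk:                            # skip empty keep-alive chunks
                fh.write(chunk)
                written += len(chunk)
    os.replace(part, dest)                       # rename once fully written
    return written


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "faa" / "MGYG000000001.faa"
    size = save_chunks([b">seq1\n", b"", b"MKV\n"], dest)
    content = dest.read_bytes()
```

With the requests library, the chunks would typically come from a response opened with `stream=True`, for example `response.iter_content(chunk_size=8192)`.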

Recommended Settings

  • For large datasets (>1000 genomes): --delay 0.2 (default)
  • For smaller datasets: --delay 0.1
  • For slow connections: --delay 0.5
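To get a feel for what a delay setting costs, the pause time can be estimated as genome count times delay. The sketch assumes one pause per genome and ignores transfer time, so it gives a lower bound rather than a prediction. `minimum_wait_seconds` is a hypothetical name.

```python
def minimum_wait_seconds(genome_count: int, delay: float) -> float:
    """Total time spent pausing, assuming one pause per genome."""
    return genome_count * delay


for delay in (0.1, 0.2, 0.5):
    minutes = minimum_wait_seconds(10_000, delay) / 60
    print(f"--delay {delay}: about {minutes:.0f} min of pauses for 10,000 genomes")
```

For 10,000 genomes, the default of 0.2 seconds adds roughly half an hour of pauses, while 0.5 seconds adds well over an hour.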

Troubleshooting

Common Issues

  1. "File not found" errors: Some genomes may not have all file types available
  2. Network timeouts: Use larger delay values or retry with resume functionality
  3. Disk space: Ensure sufficient disk space (FAA files are typically 1-10 MB each)
  4. Memory usage: Files are downloaded in chunks, so memory use should stay low even for large datasets

Getting Help

  1. Check available file types:

    python mgnify_downloader.py --list-types
  2. Enable verbose logging:

    python mgnify_downloader.py --verbose [other options]
  3. Test with small dataset:

    python mgnify_downloader.py --ids genome_list.txt --max-files 5 --type faa

API Information

This script uses the EBI Metagenomics API:

  • Base URL: https://www.ebi.ac.uk/metagenomics/api/v1
  • Rate limiting: Built-in delays to respect server limits
  • Authentication: Not required for public data
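Collecting every genome ID means walking the API's paginated listing. The sketch below assumes a `/genomes` endpoint under the base URL and a JSON:API-style response with a `data` list of objects carrying `id` and a `links.next` URL that is null on the last page. These details should be checked against the live API. The function takes the HTTP call as a parameter so the paging logic can be run offline, and `collect_genome_ids` is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional

BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v1"


def collect_genome_ids(fetch_json: Callable[[str], Dict],
                       first_url: str = f"{BASE_URL}/genomes") -> List[str]:
    """Follow 'next' links until the listing is exhausted (assumed schema)."""
    ids: List[str] = []
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        ids.extend(item["id"] for item in page.get("data", []))
        url = page.get("links", {}).get("next")   # None on the last page
    return ids


# Offline demonstration with two fake pages instead of real HTTP requests.
fake_pages = {
    f"{BASE_URL}/genomes": {"data": [{"id": "MGYG000000001"}],
                            "links": {"next": "page-2"}},
    "page-2": {"data": [{"id": "MGYG000000002"}], "links": {"next": None}},
}
ids = collect_genome_ids(fake_pages.__getitem__)
print(ids)
# ['MGYG000000001', 'MGYG000000002']
```

Against the real service, `fetch_json` could be something like `lambda url: requests.get(url, timeout=30).json()`, with a pause between pages to respect the rate limits noted above.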

License

MIT

Contributing

Citation

If you use this tool in your research, please cite the MGnify database:

Richardson, L., Allen, B., Baldi, G. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res 51(D1), D753-D759 (2023). https://doi.org/10.1093/nar/gkac1080

Changelog

Version 0.0.1

  • Initial release with basic download functionality
  • Support for multiple file types
  • Organized directory structure
  • Resume capability
  • Comprehensive error handling
