MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction
MicroGenomer is a foundation model for transferable microbial genome representations, enabling multi-scale genomic comprehension and ecophysiological trait prediction. It adopts a hierarchical training strategy that integrates large-scale genomic sequence pre-training (234.5 billion base pairs), domain-specific mid-training with the GTDB-curated marker gene set, and task-specific post-training. Surpassing existing gene-scale encoders, MicroGenomer generates robust embeddings at the whole-genome scale.
Figure 1. Overview of MicroGenomer.
Download the GitHub repository and extract the files to a designated folder.
MicroGenomer inference and genomic representation extraction take a comma-separated values (CSV) file as input. The file contains only annotated coding sequences (CDS), the genomic regions that encode functional proteins in microbial genomes. The input data format is as follows:
- genome_id: Unique identifier for the current genome
- aa_seq: DNA sequence corresponding to CDS in the current genome
- unique_id: Unique ID for identifying the DNA sequence
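A quick sanity check of a file against this layout can catch problems before inference. The following is a minimal sketch, not part of MicroGenomer: `check_input_csv` is a hypothetical helper, and the rules it applies (all three columns present, non-empty values, no repeated `unique_id`) are assumptions drawn from the field descriptions above.

```python
import csv
import io

# Column names taken from the input format described above.
REQUIRED_COLUMNS = {"unique_id", "aa_seq", "genome_id"}


def check_input_csv(handle):
    """Return a list of problems found in an input CSV (hypothetical helper).

    `handle` is any open text file or file-like object.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        unique_id = row["unique_id"]
        if unique_id in seen_ids:
            problems.append(f"line {line_number}: duplicate unique_id {unique_id!r}")
        seen_ids.add(unique_id)
        # Assumption: every row needs a sequence and a genome identifier.
        if not row["aa_seq"]:
            problems.append(f"line {line_number}: empty aa_seq")
        if not row["genome_id"]:
            problems.append(f"line {line_number}: empty genome_id")
    return problems


if __name__ == "__main__":
    sample = "unique_id,aa_seq,genome_id\ng1@m1,ATGAAATAA,g1\n"
    print(check_input_csv(io.StringIO(sample)))
```

For a file on disk, the same function could be called as `check_input_csv(open(path, newline=""))`.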
Example:
unique_id,aa_seq,genome_id
RS_GCF_013393365.1@TIGR00168,TTGCCGTCCGTAG...GTAGTAAATAA,RS_GCF_013393365.1
RS_GCF_013393365.1@TIGR00382,TTGGCAAAAGATA...TGAAATCGTAA,RS_GCF_013393365.1
...
# Create a Conda Python environment
conda create -n MicroGenomer python=3.10
conda activate MicroGenomer
git clone https://github.com/BGIResearch/MicroGenomer.git
cd MicroGenomer
pip install -r requirements.txt
- Download the weights for MicroGenomer.
- Place the MicroGenomer-470M and downstream_tasks folders in the weights/ directory.
Six downstream tasks are available: maximum growth rate, oxygen tolerance, salinity tolerance, optimal pH, optimal temperature, and probiotic prediction. The model also supports extracting embeddings.
bash run.sh \
--input_path '/path/to/input/' \
--output_dir '/path/to/output/' \
[--task 'downstream_task'] \
[--level 'level_of_probiotic']
- --input_path: Path to the input file/folder.
- --output_dir: Path for saving output files.
- --task: Downstream task option. Options: test_growth, test_oxygen, test_salinity, test_pH, test_temperature, test_probiotic, extract_embed or none.
- --level: Level of probiotic prediction, for the test_probiotic task only. Options: family, genus. Default is family.
The input path can be a single CSV file or a folder. If it is a folder, all CSV files in that folder are processed in batches. If the task parameter is extract_embed or empty, the model only extracts embeddings and does not run any downstream task.
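The file-or-folder rule and the option constraints above can be sketched in Python. This is an illustration under stated assumptions, not the project's implementation: `collect_csv_files` and `build_run_command` are hypothetical helpers, the folder scan is assumed to be non-recursive and limited to a lowercase `.csv` suffix, and an empty task is expressed by omitting `--task`.

```python
from pathlib import Path

# Task and level names as listed in the usage section above.
TASKS = {
    "test_growth", "test_oxygen", "test_salinity", "test_pH",
    "test_temperature", "test_probiotic", "extract_embed",
}
LEVELS = {"family", "genus"}


def collect_csv_files(input_path):
    """Single file -> [that file]; folder -> its CSV files, sorted by name.

    Assumption: the scan is non-recursive and matches only '*.csv'.
    """
    path = Path(input_path)
    if path.is_dir():
        return sorted(path.glob("*.csv"))
    return [path]


def build_run_command(input_path, output_dir, task=None, level=None):
    """Assemble the argument list for run.sh without executing it."""
    command = [
        "bash", "run.sh",
        "--input_path", str(input_path),
        "--output_dir", str(output_dir),
    ]
    if task is not None:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task!r}")
        command += ["--task", task]
    if level is not None:
        # --level is documented for the test_probiotic task only.
        if task != "test_probiotic":
            raise ValueError("--level applies to test_probiotic only")
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        command += ["--level", level]
    return command


if __name__ == "__main__":
    print(build_run_command("/path/to/input/", "/path/to/output/",
                            task="test_probiotic", level="genus"))
```

The returned list could be passed to `subprocess.run(command, check=True)` from the repository root, where `run.sh` lives.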
- First, pull the Docker image:
docker pull sunhaotong0605/microgenomer:1.0
- Run inference for different downstream tasks:
docker run --rm --gpus all \
-v /path/to/input/:/data/input \
-v /path/to/output/:/data/output \
-e INPUT_PATH="/data/input" \
-e OUTPUT_PATH="/data/output" \
sunhaotong0605/microgenomer:1.0 \
sh -c 'bash start.sh $INPUT_PATH $OUTPUT_PATH test_growth'
Please replace /path/to/input/, /path/to/output/, and the downstream task option (e.g., test_growth) with your actual paths and desired task.
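When the container is launched from a script, the path and task substitution can be done programmatically. The sketch below mirrors the `docker run` command above; `build_docker_command` is a hypothetical helper, and it assumes the image accepts the same task names as `run.sh`. It does not pass a probiotic level, since the command above shows only three positional arguments to `start.sh`.

```python
import os
import shlex

# Assumption: the Docker entry point accepts the same task names as run.sh.
TASKS = {
    "test_growth", "test_oxygen", "test_salinity", "test_pH",
    "test_temperature", "test_probiotic", "extract_embed",
}


def build_docker_command(input_dir, output_dir, task="test_growth",
                         image="sunhaotong0605/microgenomer:1.0"):
    """Assemble the docker argument list without executing it."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    # Bind mounts need absolute host paths; relative ones are made absolute.
    host_input = os.path.abspath(input_dir)
    host_output = os.path.abspath(output_dir)
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{host_input}:/data/input",
        "-v", f"{host_output}:/data/output",
        "-e", "INPUT_PATH=/data/input",
        "-e", "OUTPUT_PATH=/data/output",
        image,
        # $INPUT_PATH and $OUTPUT_PATH are expanded by sh inside the container.
        "sh", "-c", f"bash start.sh $INPUT_PATH $OUTPUT_PATH {task}",
    ]


if __name__ == "__main__":
    command = build_docker_command("/path/to/input/", "/path/to/output/",
                                   task="test_oxygen")
    # Print a copy-pasteable shell line instead of launching the container.
    print(shlex.join(command))
```

Building the command as a list keeps the `sh -c` payload as a single argument, so the variables are not expanded by the host shell.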
This project is licensed under the MIT License. See LICENSE for more details.
Kang Q, Guo Y, Hu B, et al. MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction. bioRxiv. doi: https://doi.org/10.64898/2025.12.28.696777.
