Model configurations and weights should be downloaded and placed in the model/pretrained/[Model Name] directory.
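For reference, a minimal sketch of the resulting layout; the per-model folder names below are assumptions following the [Model Name] convention and should be matched to whatever the loading code expects:

```bash
# Create one subfolder per pretrained model (names are illustrative).
mkdir -p model/pretrained/RNAFM \
         model/pretrained/RNABERT \
         model/pretrained/RNAMSM \
         model/pretrained/RNAErnie \
         model/pretrained/SpliceBERT \
         model/pretrained/DNABERT \
         model/pretrained/DNABERT2 \
         model/pretrained/NucleotideTransformer
```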
RNAFM: The pretrained model can be downloaded from the following link: https://proj.cse.cuhk.edu.hk/rnafm/api/download?filename=RNA-FM_pretrained.pth
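For example, the checkpoint can be fetched from the command line; the destination folder name RNAFM is an assumption following the [Model Name] convention:

```bash
# Download the RNA-FM checkpoint into the pretrained-model directory
# (the folder name is an assumption; adjust to your layout).
mkdir -p model/pretrained/RNAFM
wget -O model/pretrained/RNAFM/RNA-FM_pretrained.pth \
  "https://proj.cse.cuhk.edu.hk/rnafm/api/download?filename=RNA-FM_pretrained.pth"
```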
RNABERT and RNAMSM: Model weights can be downloaded from Link and Link. These links are referenced from the RNAErnie repository.
RNAErnie: The original model used in the RNAErnie publication is based on the PaddlePaddle framework, which is incompatible with the other models. We therefore use the PyTorch version released by the authors (https://huggingface.co/LLM-EDA/RNAErnie/tree/main).
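One possible way to fetch the PyTorch weights, assuming the huggingface_hub CLI is installed; the destination folder name is again an assumption:

```bash
# Requires: pip install huggingface_hub
# Downloads the PyTorch RNAErnie release into the pretrained-model directory.
huggingface-cli download LLM-EDA/RNAErnie --local-dir model/pretrained/RNAErnie
```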
SpliceBERT: The model weights are available on Zenodo.
DNABERT: DNABERT provides a series of models using different k-mer settings. We use the most popular version, DNA_bert_3.
DNABERT2: The models are available at this link.
Nucleotide Transformer: The authors provided a series of models. We use the best version reported in the article, nucleotide-transformer-v2-500m-multi-species.
All analyses were conducted on a cluster node equipped with 32 CPU cores and 4 NVIDIA A100 40GB GPUs. At least one GPU is needed to run a single task.
A Linux system is required.
We recommend using conda and pip to manage the software environments:
conda env create -f environment_1019.yml
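After creating the environment, activate it before running any scripts. ENV_NAME below is a placeholder for the name: field in environment_1019.yml, and the sanity check assumes PyTorch is part of the environment:

```bash
# ENV_NAME is a placeholder; use the "name:" field from environment_1019.yml.
conda activate ENV_NAME
# Quick check that the environment can see a GPU.
python -c "import torch; print(torch.cuda.is_available())"
```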
The datasets required for the analyses can be sourced from the Data Availability sections of our previous publications. For convenience, we have also uploaded the essential data files to Google Drive; download them and place them in the ./dataset folder.
All data files should be placed in a subfolder of the ./dataset folder, e.g. ./dataset/m6a_data/
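A sketch of the expected layout after downloading; m6a_data matches the example above, while the other subfolder names are illustrative assumptions:

```bash
# Expected layout (one subfolder per task):
# dataset/
# ├── m6a_data/        <- from the example above
# │   └── ...          <- downloaded data files
# └── ...              <- further task-specific subfolders
mkdir -p dataset/m6a_data
```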
An example script for training and testing the models on the ncRNA classification (nRC) task is provided at scripts/cls/seq_cls_nRC_1e-4.sh
An example script for training and testing the models on the m6A prediction task is provided at scripts/m6A/m6a_miCLIP_101_1e-4.sh; see the usage sketch below.
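Both are plain shell scripts; a minimal usage sketch, assuming they are launched from the repository root:

```bash
# ncRNA classification example
bash scripts/cls/seq_cls_nRC_1e-4.sh
# m6A prediction example
bash scripts/m6A/m6a_miCLIP_101_1e-4.sh
```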
Due to its size (over 100 GB), the splicing dataset cannot be uploaded directly. Please refer to the Methods section for instructions on how to generate it. We also provide intermediate files through Google Drive to facilitate this process.
- Run scripts/makedata_splice.sh to create the datasets.
- An example of training and testing all models is available at scripts/splice/splice_3.sh (see the sketch below).
The entire process, depending on the GPU, may require several days to complete.
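Putting the two steps together, a sketch of the full splicing pipeline, assuming execution from the repository root:

```bash
# Step 1: build the splicing datasets from the intermediate files.
bash scripts/makedata_splice.sh
# Step 2: train and test all models; expect a multi-day runtime on one GPU.
bash scripts/splice/splice_3.sh > splice_output.txt 2> splice_error.txt
```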
An example of training and testing all models is available at scripts/mrl/mrl_1e-3.sh
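As with the other tasks, the script can be launched directly, here already using the stdout/stderr separation recommended in the next section:

```bash
bash scripts/mrl/mrl_1e-3.sh > mrl_output.txt 2> mrl_error.txt
```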
- We extract the test results from the textual output of the program and compile them into a table.
- For cleaner output, we recommend separating standard output from error messages when executing the scripts, for example:
bash scripts/run_splice_train_test_53.sh > output.txt 2> error_output.txt
- Our tests were performed on a Slurm cluster, where the two output streams can be separated easily. An example is available at scripts/cls/HPC_run_seq_cls_1.sh (a minimal Slurm sketch follows below).
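For context, a minimal Slurm submission sketch showing how the streams are split; the #SBATCH values are assumptions, not the settings used in HPC_run_seq_cls_1.sh:

```bash
#!/bin/bash
#SBATCH --job-name=seq_cls_nRC   # illustrative job name
#SBATCH --gres=gpu:1             # at least one GPU is required per task
#SBATCH --cpus-per-task=8        # illustrative CPU count
#SBATCH --output=%x_%j.out       # Slurm writes stdout here...
#SBATCH --error=%x_%j.err        # ...and stderr here, keeping them separate
bash scripts/cls/seq_cls_nRC_1e-4.sh
# Submit with: sbatch <this file>
```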
- Then, to convert the output into a table, run the parse_output.py script located in the analyzer folder:
cd analyzer
python parse_output.py -i analyzer/nRC_1_4.out
The generated table will serve as the source data for subsequent plotting. An example of this process can be found in the Jupyter Notebook located at 'analyzer/analyze.ipynb'.
The source code is structured into several folders:
- dataset: Contains scripts and utilities for creating and loading datasets.
- evaluator: Houses the functionality for loading models, conducting training, and performing evaluations.
- logs: Designated output directory for all log files.
- model: Includes the definitions and implementations of the various models used.
- scripts: Provides reference scripts to guide the execution of the project.
The main entry points of the program are the following Python scripts: seq_cls.py, m6a_cls.py, splice_cls.py, and mrl_pred.py. These scripts can be customized to accommodate specific testing requirements.
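The reference scripts ultimately invoke these entry points, so custom runs can call them directly. The exact command-line flags are defined in each script; the --help flag below is an assumption that the entry points use a standard argument parser:

```bash
# List the available options before customizing a run.
python seq_cls.py --help
```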