Code repository for the annotation of silver and bronze corpora related to the CoDiet project.
This repository provides the scripts required to generate Bronze and Silver annotated test sets for the CoDiet dataset. The annotation process integrates multiple pipelines, including dictionary-based matching, MetaMap, enzyme annotation, PhenoBERT, MicrobELP, and BERN2.
Please note that the Codabench link will be made public once the manuscript is accepted. If you would like to contribute your model prior to publication, feel free to contact us to obtain access to a private URL.
Create and activate the main Conda environment:

```bash
conda create -n CoDiet_machine
conda activate CoDiet_machine
conda install pip
pip install pandas numpy openpyxl
```

Clone the repository and download the dataset:

```bash
git clone https://github.com/omicsNLP/CoDietCorpus.git
cd CoDietCorpus
wget https://zenodo.org/records/17610205/files/CoDiet-Gold-private.zip
unzip ./CoDiet-Gold-private.zip
```

Extract the input passages:

```bash
python ./scripts/input_text.py
```

This script creates a directory named `passages_input` inside `./output`. It will contain one text file per passage extracted from each PMCID. Each file follows the naming convention `PMCID_PASSAGE_NUMBER.txt` and includes the raw passage text from the original article.
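As a minimal illustration of this naming convention (not the actual `input_text.py` implementation, and assuming zero-based passage numbering), the per-passage files could be produced like this:

```python
from pathlib import Path
import tempfile

def write_passage_files(pmcid, passages, out_dir):
    """Write one .txt file per passage, named PMCID_PASSAGE_NUMBER.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, text in enumerate(passages):
        path = out_dir / f"{pmcid}_{i}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path.name)
    return written

# Example with a hypothetical PMCID and two passages
with tempfile.TemporaryDirectory() as tmp:
    names = write_passage_files("PMC1234567", ["Title text.", "Abstract text."], tmp)
print(names)  # ['PMC1234567_0.txt', 'PMC1234567_1.txt']
```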
```bash
python ./scripts/dictionary_matching.py
```

This step takes the files in `passages_input` along with the dictionaries stored in the `./data/dictionary` directory.
It generates a new directory called dictionary_output inside ./output, containing the dictionary-based annotations for the following nine categories:
- computational
- dataType
- dietMethod
- diseasePhenotype
- foodRelated
- methodology
- modelOrganism
- populationCharacteristic
- sampleType

Each output file includes all matches found for these categories.
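A minimal sketch of how dictionary-based matching of this kind can work (the real matcher and term lists live in `./scripts/dictionary_matching.py` and `./data/dictionary`; the terms below are invented for illustration):

```python
import re

# Toy dictionary: category -> terms (the real lists live in ./data/dictionary)
DICTIONARY = {
    "foodRelated": ["olive oil", "whole grain"],
    "sampleType": ["plasma", "serum"],
}

def dictionary_match(text):
    """Return (category, matched text, start, end) for every
    case-insensitive whole-word match against the dictionary."""
    matches = []
    for category, terms in DICTIONARY.items():
        for term in terms:
            for m in re.finditer(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                matches.append((category, m.group(0), m.start(), m.end()))
    return sorted(matches, key=lambda x: x[2])

hits = dictionary_match("Plasma samples were collected after the olive oil intervention.")
print(hits)
```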
```bash
python ./scripts/priority_dictionary_matching.py
```

This step takes the files in `passages_input` along with the dictionaries stored in the `./data/priority_dictionary` directory.
It generates a new directory called priority_dictionary_output inside ./output, containing the dictionary-based annotations for the following eight categories:
- computational
- dietMethod
- diseasePhenotype
- foodRelated
- metabolite
- populationCharacteristic
- proteinEnzyme
- sampleType

Each output file includes all matches found for these categories.
```bash
python ./scripts/AnnotationEnzymes.py
```

This step takes the original BioC files from `./CoDiet-Gold-private` and creates a new directory called `./enzyme_annotated` inside `./output`. The directory contains the same BioC files, now including annotations added to the annotation field.
In this stage, the system identifies and labels Enzyme mentions for the proteinEnzyme category.
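For illustration, adding an entity annotation to the annotation field of a BioC-JSON passage might look like the following sketch (the passage text, IDs, and helper function are hypothetical, not taken from `AnnotationEnzymes.py`):

```python
import json

def add_annotation(passage, ann_id, text, offset, length, category):
    """Append an entity annotation to a BioC-JSON passage dict."""
    passage.setdefault("annotations", []).append({
        "id": str(ann_id),
        "text": text,
        "infons": {"type": category},
        "locations": [{"offset": offset, "length": length}],
    })

# Minimal BioC-style passage (illustrative, not taken from the real corpus)
passage = {"offset": 0, "text": "Lipase activity increased.", "annotations": []}
add_annotation(passage, 1, "Lipase", 0, 6, "proteinEnzyme")
print(json.dumps(passage["annotations"], indent=2))
```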
MetaMap must be installed and configured properly. If the MetaMap instance is not running, start it from the MetaMap installation directory:

```bash
./bin/skrmedpostctl start
./bin/wsdserverctl start
```

Then install ParallelPyMetaMap and run the wrapper script:

```bash
git clone https://github.com/biomedicalinformaticsgroup/ParallelPyMetaMap.git
pip install ./ParallelPyMetaMap
python ./scripts/ppmm.py
```

This step processes the files in `passages_input` using MetaMap. MetaMap is run with a restricted vocabulary consisting of the following semantic types: `['food', 'bdsu', 'lbpr', 'inpr', 'resa']`. The resulting MetaMap outputs are used to generate annotations for the following categories:
- foodRelated
- sampleType
- dataType
- methodology
All annotations are written to `output_ParallelPyMetaMap_text_mo` within the `./output` directory.
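A sketch of how the restricted semantic types might be mapped to these categories. The mapping below is an assumption for illustration only; the actual assignment is defined in `./scripts/ppmm.py`:

```python
# Hypothetical mapping from MetaMap semantic types to CoDiet categories;
# the real mapping lives in ./scripts/ppmm.py.
SEMTYPE_TO_CATEGORY = {
    "food": "foodRelated",    # Food
    "bdsu": "sampleType",     # Body Substance
    "inpr": "dataType",       # Intellectual Product
    "lbpr": "methodology",    # Laboratory Procedure
    "resa": "methodology",    # Research Activity
}

def categorise(semtypes):
    """Map a MetaMap concept's semantic types to CoDiet categories."""
    return sorted({SEMTYPE_TO_CATEGORY[s] for s in semtypes if s in SEMTYPE_TO_CATEGORY})

print(categorise(["lbpr", "resa"]))  # ['methodology']
```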
```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Using the single-core CPU implementation:

```bash
python ./scripts/microELP.py
```

or the multiprocessing implementation:

```bash
python ./scripts/parallel_microELP.py
```

This step takes the original BioC files from `./CoDiet-Gold-private` and creates a new directory called `./microbELP_result` inside `./output`. The directory contains the same BioC files, now including annotations added to the annotation field.
In this stage, the system identifies and labels microbiome mentions.
Create a separate environment for PhenoBERT:

```bash
conda deactivate
conda create -n CoDiet_phenobert python=3.10
conda activate CoDiet_phenobert
conda install pip
pip install gdown
```

Set up PhenoBERT:

```bash
git clone https://github.com/EclipseCN/PhenoBERT.git
gdown --folder "https://drive.google.com/drive/folders/1jIqW19JJPzYuyUadxB5Mmfh-pWRiEopH"
mv ./PhenoBERT_data/models/* ./PhenoBERT/phenobert/models/
mv ./PhenoBERT_data/embeddings/* ./PhenoBERT/phenobert/embeddings/
rm -rf ./PhenoBERT_data/
mkdir ./output/phenobert_output
cd PhenoBERT
pip install -r requirements.txt
python setup.py
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install stanza==1.6.1 numpy==1.24.3
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
```

Run the annotation:

```bash
cd phenobert/utils
python ./annotate.py -i ../../../output/passages_input/ -o ../../../output/phenobert_output/
cd ../../..
conda deactivate
```

This step takes the files in `passages_input` and annotates them with phenotype-related entities using PhenoBERT. The resulting annotated files are saved in a directory named `phenobert_output` inside `./output`.
8️⃣ BERN2 Annotation
If not already done, exit the PhenoBERT directory:

```bash
cd ../../..
```

Requirements:

- ~70GB of free disk space
- For GPU: ≥63.5GB RAM and ≥5GB GPU memory
- Linux or WSL (Windows Subsystem for Linux)
- Create environment and install dependencies

```bash
conda create -n CoDiet_bern2 python=3.7
conda activate CoDiet_bern2
conda install pytorch==1.9.0 cudatoolkit=10.2 -c pytorch
conda install faiss-gpu libfaiss-avx2 -c conda-forge
conda install pip
pip install gdown
```

- Download BERN2
```bash
git clone https://github.com/dmis-lab/BERN2.git
cd BERN2
pip install -r requirements.txt
gdown "https://drive.google.com/file/d/147b3OhU4IdQi121ZBUSqO1XKdKoXE5DK"
md5sum resources_v1.1.b.tar.gz
# make sure the md5sum is 'c0db4e303d1ccf6bf56b42eda2fe05d0' before extracting
tar -zxvf resources_v1.1.b.tar.gz
rm resources_v1.1.b.tar.gz
```

- Install CRF (required for GNormPlus)
```bash
cd resources/GNormPlusJava
tar -zxvf CRF++-0.58.tar.gz
mv CRF++-0.58 CRF
cd CRF
./configure --prefix="$HOME"
make
make install
cd ../../..
```

- Start the BERN2 server
GPU (Linux/WSL):

```bash
export CUDA_VISIBLE_DEVICES=0
cd scripts
nohup bash run_bern2.sh &
cd ../..
```

CPU:

```bash
cd scripts
nohup bash run_bern2_cpu.sh &
cd ../..
```

- Run inference
```bash
python ./scripts/bern2.py
bash ./BERN2/scripts/stop_bern2.sh
conda deactivate
```

This step processes the files in `passages_input` using the local BERN2 server. All predictions are saved in a directory named `bern2_output` inside `./output`. BERN2 generates annotations for the following four categories:
- proteinEnzyme
- geneSNP
- diseasePhenotype
- modelOrganism
Each output file includes the extracted entities.
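As an illustration, converting BERN2-style output into these categories could look like the sketch below. The field names (`obj`, `mention`, `span`) are modelled on BERN2's JSON output, but the type-to-category mapping shown is an assumption, not the one implemented in `./scripts/bern2.py`:

```python
# Hypothetical mapping from BERN2 entity types to CoDiet categories
BERN2_TO_CATEGORY = {
    "gene": "geneSNP",
    "mutation": "geneSNP",
    "disease": "diseasePhenotype",
    "species": "modelOrganism",
}

def convert(bern2_annotations):
    """Keep and relabel only the entity types used by the corpus."""
    out = []
    for ann in bern2_annotations:
        category = BERN2_TO_CATEGORY.get(ann["obj"])
        if category:
            out.append({"text": ann["mention"], "category": category,
                        "begin": ann["span"]["begin"], "end": ann["span"]["end"]})
    return out

# Toy input: one kept type ("disease") and one ignored type ("drug")
sample = [{"obj": "disease", "mention": "obesity", "span": {"begin": 10, "end": 17}},
          {"obj": "drug", "mention": "metformin", "span": {"begin": 30, "end": 39}}]
print(convert(sample))
```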
Switch back to the main environment and run the collation step:

```bash
conda activate CoDiet_machine
python ./scripts/bronze.py
```

This step collects all annotation results generated in `./output` and collates them into their corresponding BioC file. It also applies an early BERT-based model developed for metabolite NER (from a forthcoming publication) to identify metabolite mentions. All extracted entities are merged and added to the annotation field of the original BioC files from `CoDiet-Gold-private`. The fully annotated files, with the thirteen categories, are saved in the `./bronze` directory.
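A simplified sketch of the collation idea: merge annotation lists from several pipelines and drop exact duplicates (the real `bronze.py` works on BioC files and handles many more cases):

```python
def merge_annotations(*sources):
    """Merge annotation lists from several pipelines, dropping exact duplicates."""
    seen, merged = set(), []
    for source in sources:
        for ann in source:
            key = (ann["start"], ann["end"], ann["category"])
            if key not in seen:
                seen.add(key)
                merged.append(ann)
    return sorted(merged, key=lambda a: (a["start"], a["end"]))

# Toy annotations: the dictionary and BERN2 pipelines agree on one span
dict_hits = [{"start": 0, "end": 6, "category": "sampleType", "text": "Plasma"}]
bern2_hits = [{"start": 0, "end": 6, "category": "sampleType", "text": "Plasma"},
              {"start": 20, "end": 27, "category": "diseasePhenotype", "text": "obesity"}]
merged = merge_annotations(dict_hits, bern2_hits)
print(len(merged))  # 2
```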
```bash
python ./scripts/silver.py
```

This step takes the annotated BioC files generated in `./bronze` and applies rule-based logic to resolve overlapping or conflicting annotations. The script selects the most appropriate annotations according to these rules and saves the refined BioC files in the `./silver` directory.
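One common heuristic for such conflicts is to prefer the longest span. The sketch below illustrates that idea only; it is not the actual rule set in `silver.py`:

```python
def resolve_overlaps(annotations):
    """Greedy resolution: prefer longer spans, drop annotations
    that overlap an already-kept one."""
    kept = []
    for ann in sorted(annotations, key=lambda a: a["end"] - a["start"], reverse=True):
        if all(ann["end"] <= k["start"] or ann["start"] >= k["end"] for k in kept):
            kept.append(ann)
    return sorted(kept, key=lambda a: a["start"])

# Toy conflict: "oil" is nested inside "olive oil"
anns = [{"start": 0, "end": 9, "category": "foodRelated", "text": "olive oil"},
        {"start": 6, "end": 9, "category": "foodRelated", "text": "oil"}]
print(resolve_overlaps(anns))  # keeps only the longer "olive oil" span
```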
Published literature can be subject to copyright, with restrictions on redistribution. Users should be mindful of the data storage requirements and of how the derived products are presented and shared. Many publishers provide guidance on the use of their content for redistribution and research.