DOI: 10.1101/2021.01.08.425887 · DOI: 10.5281/zenodo.17305411 · Codabench: CoDiet


🍎 CoDietCorpus

Code repository for the annotation of silver and bronze corpora related to the CoDiet project.

This repository provides the scripts required to generate Bronze and Silver annotated test sets for the CoDiet dataset. The annotation process integrates multiple pipelines, including dictionary-based matching, MetaMap, enzyme annotation, PhenoBERT, MicrobELP, and BERN2.

Please note that the Codabench link will be made public once the manuscript is accepted. If you would like to contribute your model prior to publication, feel free to contact us to obtain access to a private URL.


♻️ Environment Setup

Create and activate the main Conda environment:

conda create -n CoDiet_machine
conda activate CoDiet_machine
conda install pip
pip install pandas numpy openpyxl

🗃️ Clone the Repository

git clone https://github.com/omicsNLP/CoDietCorpus.git
cd CoDietCorpus

⬇️ Download the Data

wget https://zenodo.org/records/17610205/files/CoDiet-Gold-private.zip
unzip ./CoDiet-Gold-private.zip

🚀 Run the Annotation Scripts

1️⃣ Input Text Processing

python ./scripts/input_text.py

This script creates a directory named passages_input inside ./output. It will contain one text file per passage extracted from each PMCID. Each file follows the naming convention PMCID_PASSAGE_NUMBER.txt and includes the raw passage text from the original article.
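As a quick sanity check on the naming convention, a passage file name can be split back into its PMCID and passage number. The helper below is an illustrative sketch (not part of the repository's scripts) and assumes PMCIDs carry the usual PMC prefix:

```python
import re

def parse_passage_filename(filename):
    """Split a file name of the form PMCID_PASSAGE_NUMBER.txt
    (e.g. PMC1234567_3.txt) into its PMCID string and integer
    passage number."""
    match = re.fullmatch(r"(PMC\d+)_(\d+)\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected passage file name: {filename}")
    return match.group(1), int(match.group(2))
```

For example, `parse_passage_filename("PMC1234567_3.txt")` returns `("PMC1234567", 3)`.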

2️⃣ Dictionary Matching

python ./scripts/dictionary_matching.py

This step takes the files in passages_input along with the dictionaries stored in the ./data/dictionary directory. It generates a new directory called dictionary_output inside ./output, containing the dictionary-based annotations for the following nine categories:

  • computational
  • dataType
  • dietMethod
  • diseasePhenotype
  • foodRelated
  • methodology
  • modelOrganism
  • populationCharacteristic
  • sampleType

Each output file includes all matches found for these categories.
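Conceptually, the dictionary step reduces to exact-phrase lookup of each category's term list against every passage. The sketch below is a simplified illustration under that assumption, not the repository's actual matcher:

```python
def dictionary_match(text, dictionaries):
    """Case-insensitive exact-phrase matching: for each category's term
    list, report every (start, end, term, category) hit in the passage."""
    lowered = text.lower()
    hits = []
    for category, terms in dictionaries.items():
        for term in terms:
            start = lowered.find(term.lower())
            while start != -1:
                hits.append((start, start + len(term), term, category))
                start = lowered.find(term.lower(), start + 1)
    return sorted(hits)
```

For example, matching "Mice fed a high-fat diet" against `{"modelOrganism": ["mice"], "dietMethod": ["high-fat diet"]}` yields one hit per category, each with character offsets into the passage.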

3️⃣ Priority Dictionary Matching

python ./scripts/priority_dictionary_matching.py

This step takes the files in passages_input along with the dictionaries stored in the ./data/priority_dictionary directory. It generates a new directory called priority_dictionary_output inside ./output, containing the dictionary-based annotations for the following eight categories:

  • computational
  • dietMethod
  • diseasePhenotype
  • foodRelated
  • metabolite
  • populationCharacteristic
  • proteinEnzyme
  • sampleType

Each output file includes all matches found for these categories.

4️⃣ Enzyme Annotation

python ./scripts/AnnotationEnzymes.py

⚠️ Note: This script is an adaptation of the eNzymER library.

This step takes the original BioC files from ./CoDiet-Gold-private and creates a new directory called ./enzyme_annotated inside ./output. The directory contains the same BioC files, now including annotations added to the annotation field.

In this stage, the system identifies and labels enzyme mentions for the proteinEnzyme category.
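In BioC JSON, each passage carries an annotations list whose entries record the mention text, a type infon, and document-level offsets. A minimal sketch of appending one enzyme mention to a passage dict (illustrative only; field names follow the standard BioC JSON layout, not necessarily this repository's exact code):

```python
def add_bioc_annotation(passage, start, length, mention, category):
    """Append one BioC-style annotation to a passage dict loaded from
    BioC JSON; `start` is passage-relative, while the stored location
    offset is document-level (passage offset + start)."""
    annotation = {
        "id": str(len(passage["annotations"])),
        "infons": {"type": category},
        "text": mention,
        "locations": [{"offset": passage["offset"] + start, "length": length}],
    }
    passage["annotations"].append(annotation)
    return annotation
```

For instance, a mention at passage-relative position 7 in a passage with offset 100 is stored with document-level offset 107.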

5️⃣ MetaMap Annotation

MetaMap must be installed and configured properly. If the MetaMap instance is not running, start it from the MetaMap installation folder:

./bin/skrmedpostctl start
./bin/wsdserverctl start

Then run:

git clone https://github.com/biomedicalinformaticsgroup/ParallelPyMetaMap.git
pip install ./ParallelPyMetaMap
python ./scripts/ppmm.py

This step processes the files in passages_input using MetaMap. MetaMap is run with a restricted vocabulary consisting of the following semantic types: ['food', 'bdsu', 'lbpr', 'inpr', 'resa']. The resulting MetaMap outputs are used to generate annotations for the following categories:

  • foodRelated
  • sampleType
  • dataType
  • methodology

All annotations are written to output_ParallelPyMetaMap_text_mo within the ./output directory.

⚠️ Warning: Ensure MetaMap config matches the script, or update accordingly.
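The routing from restricted semantic types to output categories can be pictured as a small lookup table. The mapping below is an assumption for illustration (the authoritative mapping lives in ./scripts/ppmm.py and may differ):

```python
# Hypothetical semantic-type-to-category mapping (UMLS semantic type
# abbreviations in comments); ./scripts/ppmm.py holds the real one.
SEMTYPE_TO_CATEGORY = {
    "food": "foodRelated",   # Food
    "bdsu": "sampleType",    # Body Substance
    "lbpr": "methodology",   # Laboratory Procedure
    "inpr": "dataType",      # Intellectual Product
    "resa": "methodology",   # Research Activity
}

def categories_for_semtypes(semtypes):
    """Return the annotation categories triggered by a MetaMap concept,
    ignoring semantic types outside the restricted vocabulary."""
    return sorted({SEMTYPE_TO_CATEGORY[s] for s in semtypes if s in SEMTYPE_TO_CATEGORY})
```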

6️⃣ MicrobELP Annotation

git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP

Run the single-core implementation:

python ./scripts/microELP.py

or the multiprocessing implementation:

python ./scripts/parallel_microELP.py

This step takes the original BioC files from ./CoDiet-Gold-private and creates a new directory called ./microbELP_result inside ./output. The directory contains the same BioC files, now including annotations added to the annotation field.

In this stage, the system identifies and labels microbiome mentions.

7️⃣ PhenoBERT Annotation

Create a separate environment for PhenoBERT:

conda deactivate
conda create -n CoDiet_phenobert python=3.10
conda activate CoDiet_phenobert
conda install pip
pip install gdown

Set up PhenoBERT:

git clone https://github.com/EclipseCN/PhenoBERT.git
gdown --folder "https://drive.google.com/drive/folders/1jIqW19JJPzYuyUadxB5Mmfh-pWRiEopH"

mv ./PhenoBERT_data/models/* ./PhenoBERT/phenobert/models/
mv ./PhenoBERT_data/embeddings/* ./PhenoBERT/phenobert/embeddings/
rm -rf ./PhenoBERT_data/
mkdir ./output/phenobert_output

cd PhenoBERT
pip install -r requirements.txt
python setup.py
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install stanza==1.6.1 numpy==1.24.3
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"

cd phenobert/utils
python ./annotate.py -i ../../../output/passages_input/ -o ../../../output/phenobert_output/
cd ../../..
conda deactivate

This step takes the files in passages_input and annotates them with phenotype-related entities using PhenoBERT. The resulting annotated files are saved in a directory named phenobert_output inside ./output.

If not already done, exit the PhenoBERT directory:

cd ../../..

8️⃣ BERN2 Annotation

⚠️ Note: This set of instructions is an adaptation of the official BERN2 README.

⚠️ Prerequisites: This installation requires:

  • ~70GB of free disk space
  • For GPU: ≥63.5GB RAM and ≥5GB GPU memory
  • Linux or WSL (Windows Subsystem for Linux)

1. Create the environment and install dependencies
conda create -n CoDiet_bern2 python=3.7
conda activate CoDiet_bern2
conda install pytorch==1.9.0 cudatoolkit=10.2 -c pytorch
conda install faiss-gpu libfaiss-avx2 -c conda-forge
conda install pip
pip install gdown

2. Download BERN2

git clone https://github.com/dmis-lab/BERN2.git
cd BERN2
pip install -r requirements.txt

gdown "https://drive.google.com/uc?id=147b3OhU4IdQi121ZBUSqO1XKdKoXE5DK"
md5sum resources_v1.1.b.tar.gz
# make sure the md5sum is 'c0db4e303d1ccf6bf56b42eda2fe05d0'
tar -zxvf resources_v1.1.b.tar.gz
rm resources_v1.1.b.tar.gz

3. Install CRF++ (required for GNormPlus)

cd resources/GNormPlusJava
tar -zxvf CRF++-0.58.tar.gz
mv CRF++-0.58 CRF
cd CRF
./configure --prefix="$HOME"
make
make install
cd ../../..
4. Start the BERN2 server

GPU (Linux/WSL)

export CUDA_VISIBLE_DEVICES=0
cd scripts
nohup bash run_bern2.sh &
cd ../..

CPU

cd scripts
nohup bash run_bern2_cpu.sh &
cd ../..
5. Run inference

python ./scripts/bern2.py

Once inference finishes, stop the BERN2 server and deactivate the environment:

bash ./BERN2/scripts/stop_bern2.sh
conda deactivate

This step processes the files in passages_input using the local BERN2 server. All predictions are saved in a directory named bern2_output inside ./output. BERN2 generates annotations for the following four categories:

  • proteinEnzyme
  • geneSNP
  • diseasePhenotype
  • modelOrganism

Each output file includes the extracted entities.
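BERN2 itself labels mentions with its own entity types (gene, disease, species, mutation, and so on), so a post-processing pass has to relabel them into the four CoDiet categories. The mapping below is an illustrative assumption about that pass, written against BERN2's JSON output keys; the actual logic sits in ./scripts/bern2.py:

```python
# Hypothetical BERN2-type-to-CoDiet-category mapping; ./scripts/bern2.py
# holds the authoritative version.
BERN2_TYPE_TO_CATEGORY = {
    "gene": "proteinEnzyme",
    "mutation": "geneSNP",
    "disease": "diseasePhenotype",
    "species": "modelOrganism",
}

def relabel_bern2(annotations):
    """Keep only mentions whose BERN2 entity type (the 'obj' field in
    BERN2's JSON output) maps to a CoDiet category, relabelling them."""
    kept = []
    for ann in annotations:
        category = BERN2_TYPE_TO_CATEGORY.get(ann.get("obj"))
        if category is not None:
            kept.append({**ann, "obj": category})
    return kept
```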

9️⃣ Combine Predictions & Infer Metabolites to generate the Bronze dataset

conda activate CoDiet_machine
python ./scripts/bronze.py

This step collects all annotation results generated in ./output and collates them into their corresponding BioC file. It also applies an early BERT-based model developed for metabolite NER (from a forthcoming publication) to identify metabolite mentions. All extracted entities are merged and added to the annotation field of the original BioC files from CoDiet-Gold-private. The fully annotated files, with the thirteen categories, are saved in the ./bronze directory.
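At its core, the collation step concatenates the per-pipeline annotation lists for each passage and drops exact duplicates before writing them back into the BioC annotation field. A minimal sketch of that merge, with annotations reduced to (start, end, category) triples for illustration:

```python
def collate_annotations(per_pipeline):
    """Merge annotation lists produced by several pipelines, dropping
    exact duplicates (same span, same category) in first-seen order."""
    seen = set()
    merged = []
    for pipeline_hits in per_pipeline:
        for hit in pipeline_hits:  # hit = (start, end, category)
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged
```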

🔟 Bronze to Silver Conversion

python ./scripts/silver.py

This step takes the annotated BioC files generated in ./bronze and applies rule-based logic to resolve overlapping or conflicting annotations. The script selects the most appropriate annotations according to these rules and saves the refined BioC files in the ./silver directory.
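One common rule for this kind of cleanup is longest-span-wins: when two annotations overlap, keep the longer one. The sketch below shows that single rule as an illustration; it is not necessarily the rule set implemented in silver.py:

```python
def resolve_overlaps(annotations):
    """Longest-span-wins overlap resolution over (start, end, category)
    triples: longer annotations claim their span first, shorter
    overlapping ones are discarded."""
    ordered = sorted(annotations, key=lambda a: (-(a[1] - a[0]), a[0]))
    kept = []
    for start, end, category in ordered:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, category))
    return sorted(kept)
```

For example, given overlapping hits (0, 4, "foodRelated") and (0, 13, "dietMethod"), only the longer dietMethod span survives.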


⚠️ Important - Please Read!

Published literature may be subject to copyright, with restrictions on redistribution. Users should be mindful of data storage requirements and of how derived products are presented and shared. Many publishers provide guidance on the use of their content for research and redistribution.


🌍 Repositories Used


👥 Code Contributors


👉 Antoine
  
👉 Joram

