DOI: 10.1101/2021.01.08.425887 · DOI: 10.5281/zenodo.17305411 · Codabench: CoDiet


🍎 CoDietCorpus

Code repository for the annotation of silver and bronze corpora related to the CoDiet project.

This repository provides the scripts required to generate Bronze and Silver annotated test sets for the CoDiet dataset. The annotation process integrates multiple pipelines, including dictionary-based matching, MetaMap, enzyme annotation, PhenoBERT, MicrobELP, and BERN2.

Please note that the Codabench link will be made public once the manuscript is accepted. If you would like to contribute your model prior to publication, feel free to contact us to obtain access to a private URL.


♻️ Environment Setup

Create and activate the main Conda environment:

conda create -n CoDiet_machine
conda activate CoDiet_machine
conda install pip
pip install pandas numpy openpyxl

🗃️ Clone the Repository

git clone https://github.com/omicsNLP/CoDietCorpus.git
cd CoDietCorpus

⬇️ Download the Data

wget https://zenodo.org/records/17610205/files/CoDiet-Gold-private.zip
unzip ./CoDiet-Gold-private.zip

🚀 Run the Annotation Scripts

1️⃣ Input Text Processing

python ./scripts/input_text.py

This script creates a directory named passages_input inside ./output. It will contain one text file per passage extracted from each PMCID. Each file follows the naming convention PMCID_PASSAGE_NUMBER.txt and includes the raw passage text from the original article.
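As a quick sanity check on the naming convention, a passage file name can be split back into its PMCID and passage number. The helper below is an illustrative sketch (not part of the repository's scripts) and assumes PMCIDs carry the usual PMC prefix:

```python
import re

def parse_passage_filename(filename):
    """Split a file name of the form PMCID_PASSAGE_NUMBER.txt
    (e.g. PMC1234567_3.txt) into its PMCID string and integer
    passage number."""
    match = re.fullmatch(r"(PMC\d+)_(\d+)\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected passage file name: {filename}")
    return match.group(1), int(match.group(2))
```

For example, `parse_passage_filename("PMC1234567_3.txt")` returns `("PMC1234567", 3)`.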

2️⃣ Dictionary Matching

python ./scripts/dictionary_matching.py

This step takes the files in passages_input along with the dictionaries stored in the ./data/dictionary directory. It generates a new directory called dictionary_output inside ./output, containing the dictionary-based annotations for the following nine categories:

  • computational
  • dataType
  • dietMethod
  • diseasePhenotype
  • foodRelated
  • methodology
  • modelOrganism
  • populationCharacteristic
  • sampleType

Each output file includes all matches found for these categories.
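Conceptually, the dictionary step reduces to exact-phrase lookup of each category's term list against every passage. The sketch below is a simplified illustration under that assumption, not the repository's actual matcher:

```python
def dictionary_match(text, dictionaries):
    """Case-insensitive exact-phrase matching: for each category's term
    list, report every (start, end, term, category) hit in the passage."""
    lowered = text.lower()
    hits = []
    for category, terms in dictionaries.items():
        for term in terms:
            start = lowered.find(term.lower())
            while start != -1:
                hits.append((start, start + len(term), term, category))
                start = lowered.find(term.lower(), start + 1)
    return sorted(hits)
```

For example, matching "Mice fed a high-fat diet" against `{"modelOrganism": ["mice"], "dietMethod": ["high-fat diet"]}` yields one hit per category, each with character offsets into the passage.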

3️⃣ Priority Dictionary Matching

python ./scripts/priority_dictionary_matching.py

This step takes the files in passages_input along with the dictionaries stored in the ./data/priority_dictionary directory. It generates a new directory called priority_dictionary_output inside ./output, containing the dictionary-based annotations for the following eight categories:

  • computational
  • dietMethod
  • diseasePhenotype
  • foodRelated
  • metabolite
  • populationCharacteristic
  • proteinEnzyme
  • sampleType

Each output file includes all matches found for these categories.

4️⃣ Enzyme Annotation

python ./scripts/AnnotationEnzymes.py

⚠️ Note: This script is an adaptation of the eNzymER library.

This step takes the original BioC files from ./CoDiet-Gold-private and creates a new directory called ./enzyme_annotated inside ./output. The directory contains the same BioC files, now including annotations added to the annotation field.

In this stage, the system identifies and labels enzyme mentions for the proteinEnzyme category.
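In BioC JSON, each passage carries an annotations list whose entries record the mention text, a type infon, and document-level offsets. A minimal sketch of appending one enzyme mention to a passage dict (illustrative only; field names follow the standard BioC JSON layout, not necessarily this repository's exact code):

```python
def add_bioc_annotation(passage, start, length, mention, category):
    """Append one BioC-style annotation to a passage dict loaded from
    BioC JSON; `start` is passage-relative, while the stored location
    offset is document-level (passage offset + start)."""
    annotation = {
        "id": str(len(passage["annotations"])),
        "infons": {"type": category},
        "text": mention,
        "locations": [{"offset": passage["offset"] + start, "length": length}],
    }
    passage["annotations"].append(annotation)
    return annotation
```

For instance, a mention at passage-relative position 7 in a passage with offset 100 is stored with document-level offset 107.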

5️⃣ MetaMap Annotation

MetaMap must be installed and configured properly. If the MetaMap instance is not running, start it from the MetaMap installation folder:

./bin/skrmedpostctl start
./bin/wsdserverctl start

Then run:

git clone https://github.com/biomedicalinformaticsgroup/ParallelPyMetaMap.git
pip install ./ParallelPyMetaMap
python ./scripts/ppmm.py

This step processes the files in passages_input using MetaMap. MetaMap is run with a restricted vocabulary consisting of the following semantic types: ['food', 'bdsu', 'lbpr', 'inpr', 'resa']. The resulting MetaMap outputs are used to generate annotations for the following categories:

  • foodRelated
  • sampleType
  • dataType
  • methodology

All annotations are written to output_ParallelPyMetaMap_text_mo within the ./output directory.

⚠️ Warning: Ensure MetaMap config matches the script, or update accordingly.
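The routing from restricted semantic types to output categories can be pictured as a small lookup table. The mapping below is an assumption for illustration (the authoritative mapping lives in ./scripts/ppmm.py and may differ):

```python
# Hypothetical semantic-type-to-category mapping (UMLS semantic type
# abbreviations in comments); ./scripts/ppmm.py holds the real one.
SEMTYPE_TO_CATEGORY = {
    "food": "foodRelated",   # Food
    "bdsu": "sampleType",    # Body Substance
    "lbpr": "methodology",   # Laboratory Procedure
    "inpr": "dataType",      # Intellectual Product
    "resa": "methodology",   # Research Activity
}

def categories_for_semtypes(semtypes):
    """Return the annotation categories triggered by a MetaMap concept,
    ignoring semantic types outside the restricted vocabulary."""
    return sorted({SEMTYPE_TO_CATEGORY[s] for s in semtypes if s in SEMTYPE_TO_CATEGORY})
```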

6️⃣ MicrobELP Annotation

git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP

Run the single-core implementation:

python ./scripts/microELP.py

or the multiprocessing implementation:

python ./scripts/parallel_microELP.py

This step takes the original BioC files from ./CoDiet-Gold-private and creates a new directory called ./microbELP_result inside ./output. The directory contains the same BioC files, now including annotations added to the annotation field.

In this stage, the system identifies and labels microbiome mentions.

7️⃣ PhenoBERT Annotation

Create a separate environment for PhenoBERT:

conda deactivate
conda create -n CoDiet_phenobert python=3.10
conda activate CoDiet_phenobert
conda install pip
pip install gdown

Set up PhenoBERT:

git clone https://github.com/EclipseCN/PhenoBERT.git
gdown --folder "https://drive.google.com/drive/folders/1jIqW19JJPzYuyUadxB5Mmfh-pWRiEopH"

mv ./PhenoBERT_data/models/* ./PhenoBERT/phenobert/models/
mv ./PhenoBERT_data/embeddings/* ./PhenoBERT/phenobert/embeddings/
rm -rf ./PhenoBERT_data/
mkdir ./output/phenobert_output

cd PhenoBERT
pip install -r requirements.txt
python setup.py
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install stanza==1.6.1 numpy==1.24.3
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"

cd phenobert/utils
python ./annotate.py -i ../../../output/passages_input/ -o ../../../output/phenobert_output/
cd ../../..
conda deactivate

This step takes the files in passages_input and annotates them with phenotype-related entities using PhenoBERT. The resulting annotated files are saved in a directory named phenobert_output inside ./output.

If not already done, exit the PhenoBERT directory:

cd ../../..

8️⃣ BERN2 Annotation

⚠️ Note: This set of instructions is an adaptation of the official BERN2 README.

⚠️ Prerequisites: This installation requires:

  • ~70GB of free disk space
  • For GPU: ≥63.5GB RAM and ≥5GB GPU memory
  • Linux or WSL (Windows Subsystem for Linux)

1. Create the environment and install dependencies
conda create -n CoDiet_bern2 python=3.7
conda activate CoDiet_bern2
conda install pytorch==1.9.0 cudatoolkit=10.2 -c pytorch
conda install faiss-gpu libfaiss-avx2 -c conda-forge
conda install pip
pip install gdown

2. Download BERN2

git clone https://github.com/dmis-lab/BERN2.git
cd BERN2
pip install -r requirements.txt

gdown "https://drive.google.com/uc?id=147b3OhU4IdQi121ZBUSqO1XKdKoXE5DK"
md5sum resources_v1.1.b.tar.gz
# make sure the md5sum is 'c0db4e303d1ccf6bf56b42eda2fe05d0'
tar -zxvf resources_v1.1.b.tar.gz
rm resources_v1.1.b.tar.gz

3. Install CRF++ (required for GNormPlus)

cd resources/GNormPlusJava
tar -zxvf CRF++-0.58.tar.gz
mv CRF++-0.58 CRF
cd CRF
./configure --prefix="$HOME"
make
make install
cd ../../..
4. Start the BERN2 server

GPU (Linux/WSL)

export CUDA_VISIBLE_DEVICES=0
cd scripts
nohup bash run_bern2.sh &
cd ../..

CPU

cd scripts
nohup bash run_bern2_cpu.sh &
cd ../..
5. Run inference

python ./scripts/bern2.py

Once inference finishes, stop the BERN2 server and deactivate the environment:

bash ./BERN2/scripts/stop_bern2.sh
conda deactivate

This step processes the files in passages_input using the local BERN2 server. All predictions are saved in a directory named bern2_output inside ./output. BERN2 generates annotations for the following four categories:

  • proteinEnzyme
  • geneSNP
  • diseasePhenotype
  • modelOrganism

Each output file includes the extracted entities.
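BERN2 itself labels mentions with its own entity types (gene, disease, species, mutation, and so on), so a post-processing pass has to relabel them into the four CoDiet categories. The mapping below is an illustrative assumption about that pass, written against BERN2's JSON output keys; the actual logic sits in ./scripts/bern2.py:

```python
# Hypothetical BERN2-type-to-CoDiet-category mapping; ./scripts/bern2.py
# holds the authoritative version.
BERN2_TYPE_TO_CATEGORY = {
    "gene": "proteinEnzyme",
    "mutation": "geneSNP",
    "disease": "diseasePhenotype",
    "species": "modelOrganism",
}

def relabel_bern2(annotations):
    """Keep only mentions whose BERN2 entity type (the 'obj' field in
    BERN2's JSON output) maps to a CoDiet category, relabelling them."""
    kept = []
    for ann in annotations:
        category = BERN2_TYPE_TO_CATEGORY.get(ann.get("obj"))
        if category is not None:
            kept.append({**ann, "obj": category})
    return kept
```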

9️⃣ Combine Predictions & Infer Metabolites to generate the Bronze dataset

conda activate CoDiet_machine
python ./scripts/bronze.py

This step collects all annotation results generated in ./output and collates them into their corresponding BioC file. It also applies an early BERT-based model developed for metabolite NER (from a forthcoming publication) to identify metabolite mentions. All extracted entities are merged and added to the annotation field of the original BioC files from CoDiet-Gold-private. The fully annotated files, with the thirteen categories, are saved in the ./bronze directory.
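At its core, the collation step concatenates the per-pipeline annotation lists for each passage and drops exact duplicates before writing them back into the BioC annotation field. A minimal sketch of that merge, with annotations reduced to (start, end, category) triples for illustration:

```python
def collate_annotations(per_pipeline):
    """Merge annotation lists produced by several pipelines, dropping
    exact duplicates (same span, same category) in first-seen order."""
    seen = set()
    merged = []
    for pipeline_hits in per_pipeline:
        for hit in pipeline_hits:  # hit = (start, end, category)
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged
```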

🔟 Bronze to Silver Conversion

python ./scripts/silver.py

This step takes the annotated BioC files generated in ./bronze and applies rule-based logic to resolve overlapping or conflicting annotations. The script selects the most appropriate annotations according to these rules and saves the refined BioC files in the ./silver directory.
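One common rule for this kind of cleanup is longest-span-wins: when two annotations overlap, keep the longer one. The sketch below shows that single rule as an illustration; it is not necessarily the rule set implemented in silver.py:

```python
def resolve_overlaps(annotations):
    """Longest-span-wins overlap resolution over (start, end, category)
    triples: longer annotations claim their span first, shorter
    overlapping ones are discarded."""
    ordered = sorted(annotations, key=lambda a: (-(a[1] - a[0]), a[0]))
    kept = []
    for start, end, category in ordered:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, category))
    return sorted(kept)
```

For example, given overlapping hits (0, 4, "foodRelated") and (0, 13, "dietMethod"), only the longer dietMethod span survives.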


⚠️ Important - Please Read!

Published literature may be subject to copyright, with restrictions on redistribution. Users should be mindful of data storage requirements and of how derived products are presented and shared. Many publishers provide guidance on the use of their content for research and redistribution.


🌍 Repositories Used


👥 Code Contributors


👉 Antoine
  
👉 Joram

