This repository provides a structured, four-step pipeline to collect, filter, process, and annotate scientific literature that applies deep learning and large language models (LLMs) to problems in virology and epidemiology. It integrates data from trusted sources like PubMed, bioRxiv, and medRxiv, with a strong focus on:
- Collecting metadata from scientific databases
- Semantic filtering of results to ensure relevance
- Extracting structured information using LLMs
- Human validation and correction of the extracted data
This pipeline produces high-quality datasets that support downstream tasks such as biomedical NLP research, systematic review automation, and AI benchmarking.
The project is organized into four modular steps, each in a dedicated folder:
| Folder Name | Description |
|---|---|
| `step_01_metadata_collection/` | Scripts and outputs related to collecting metadata from PubMed, bioRxiv, and medRxiv. |
| `step_02_semantic_filtering/` | Python scripts to semantically filter abstracts using sentence embeddings. |
| `step_03_text_extraction_llm/` | Code to extract and structure full-text data from PDFs or XMLs using OCR or parsing + LLaMA. |
| `step_04_human_annotated_data/` | Ground-truth corrections and final structured data annotated by humans. |
Folder: step_01_metadata_collection/
This step retrieves metadata and abstracts from PubMed, bioRxiv, and medRxiv using different tools:
- PubMed: Data is collected manually using the PubMed web interface.
- bioRxiv and medRxiv: Metadata is retrieved using the R package `medrxivr`.
All three sources use the same keyword queries and topic filters to ensure consistent data collection across repositories.
📘 Metadata Details
PubMed Metadata Fields:
- PMID: PubMed ID, a unique identifier for each publication
- Title: Title of the publication
- Authors: List of authors in the format Last Name, First Initials
- Citation: Citation details, including volume, issue, and pages
- First Author: The first listed author
- Journal/Book: Name of the journal or book
- Publication Year: Year of publication
- Create Date: Date the record was created in PubMed
- PMCID: PubMed Central ID
- NIHMS ID: NIH Manuscript Submission ID
- DOI: Digital Object Identifier
- Abstract: Abstract text
bioRxiv / medRxiv Metadata Fields:
- Authors: List of authors
- Year of publication: Year when the article was posted
- Title of article: Title of the article
- category: Scientific category (e.g., Immunology, Epidemiology)
- date: Posting date on the preprint server
- Abstract: Article abstract
- doi: Digital Object Identifier for the article
Metadata from PubMed is collected via manual queries using the official PubMed web interface.
- A full record of query construction and methodology is documented here: Query Design Document
Before running the scripts, make sure R and RStudio are installed:
- Install R: https://cran.r-project.org/
- Install RStudio: https://posit.co/download/rstudio-desktop/
Install required R packages:
install.packages("medrxivr")
install.packages("dplyr")
install.packages("readxl")
install.packages("writexl")
The script `medrxiv_metadata_fetcher.R` fetches metadata from bioRxiv or medRxiv using keyword lists and filters the results by relevant scientific categories.
How to run:
```bash
Rscript medrxiv_metadata_fetcher.R
```
Or provide the full path:
```bash
Rscript "C:/Users/YourName/Desktop/project_folder/medrxiv_metadata_fetcher.R"
```
What it does:
- Downloads metadata using `mx_api_content()` with user-defined date ranges and source
- Saves raw data as `.rds` files
- Executes topic-specific keyword searches via `query_list`
- Filters by categories (e.g., Epidemiology, Immunology, Bioinformatics)
- Extracts fields like title, authors, category, date, abstract, and DOI
- Adds derived fields like publication year and publication name
- Outputs `.xlsx` files for each topic
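The fetcher itself is an R script, but the metadata ultimately comes from the public api.biorxiv.org endpoint that medrxivr wraps. For orientation, a minimal Python sketch of the same retrieval; the date range and the selected fields are illustrative, not the script's exact settings:
```python
# Illustrative only: the repository's fetcher is R, but medrxivr wraps this API.
import requests

def fetch_metadata(server="medrxiv", from_date="2020-01-01", to_date="2024-12-31"):
    """Page through api.biorxiv.org and collect one dict per preprint."""
    records, cursor = [], 0
    while True:
        url = f"https://api.biorxiv.org/details/{server}/{from_date}/{to_date}/{cursor}"
        batch = requests.get(url, timeout=30).json().get("collection", [])
        if not batch:
            break
        # Keep the same fields the R script exports.
        records += [{k: item.get(k) for k in
                     ("title", "authors", "category", "date", "abstract", "doi")}
                    for item in batch]
        cursor += len(batch)  # the API pages in batches of up to 100 records
    return records
```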
PDF Download: The script attempts to download full-text PDFs using mx_download() from the medrxivr package. If mx_download() fails to retrieve some files (e.g., due to server restrictions or silent failures), the script automatically falls back to a custom curl-based downloader. This backup method retries only the missing files and uses appropriate headers (including a User-Agent and Referer) to ensure compatibility with servers like medRxiv and bioRxiv.
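The repository's fallback uses curl; a comparable retry loop sketched in Python with the requests library, where the header values and output directory are illustrative assumptions:
```python
# Sketch of the curl-style fallback described above, using requests.
# Header values and the output directory are illustrative assumptions.
import requests
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; metadata-pipeline/1.0)",
    "Referer": "https://www.medrxiv.org/",
}

def download_missing(pdf_urls, out_dir="pdfs"):
    """Retry only the files that are absent or empty on disk."""
    Path(out_dir).mkdir(exist_ok=True)
    for url in pdf_urls:
        target = Path(out_dir) / url.rsplit("/", 1)[-1]
        if target.exists() and target.stat().st_size > 0:
            continue  # already downloaded successfully
        resp = requests.get(url, headers=HEADERS, timeout=60)
        if resp.ok and resp.content:
            target.write_bytes(resp.content)
```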
Example output files:
- `virology-llm.xlsx`
- `virology-transformer.xlsx`
- `virology-deep-learning.xlsx`
This script consolidates topic-specific outputs and eliminates duplicates.
How to run: edit the input folder path at the top of the script, then run it in R:
```r
folder_path <- "C:/Users/yourname/Desktop/bio_lib"
```
What it does:
- Combines `.xlsx` files into one dataset
- Identifies and logs duplicate DOIs to `duplicates_by_doi.xlsx`
- Outputs the cleaned dataset to `aggregated_deduplicated.xlsx`
Final Output:
- A single, deduplicated file containing metadata from all topic areas
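The aggregation script itself is R; purely as an illustration, the same consolidate-and-deduplicate logic in Python with pandas, assuming the folder layout and file names described above and a shared `doi` column:
```python
# The aggregation script is R; this pandas sketch mirrors its logic.
# Assumes the step-01 outputs sit in bio_lib/ and share a "doi" column.
import glob
import pandas as pd

frames = [pd.read_excel(f) for f in glob.glob("bio_lib/*.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Log every row involved in a DOI collision, then keep the first occurrence.
dupes = combined[combined.duplicated(subset="doi", keep=False)]
dupes.to_excel("duplicates_by_doi.xlsx", index=False)
combined.drop_duplicates(subset="doi").to_excel(
    "aggregated_deduplicated.xlsx", index=False)
```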
Folder: step_02_semantic_filtering/
This step applies embedding-based semantic filtering to refine the initial search results. It uses pretrained sentence transformers to evaluate the relevance of abstracts to the target domain.
Goals:
- Remove papers with irrelevant or ambiguous keyword matches
- Prioritize highly relevant documents for LLM processing
Typical script: `semantic_filtering.py`
Dependencies:
- Python ≥ 3.7
- `sentence-transformers`
- `pandas`
- `numpy`
- `scikit-learn`
Installation:
```bash
pip install sentence-transformers pandas numpy scikit-learn
```
Suggested model: `all-MiniLM-L6-v2` from sentence-transformers
How to run:
```bash
python semantic_filtering.py
```
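A minimal sketch of the embedding-based filtering; the query string, input file name, and 0.4 similarity threshold are illustrative assumptions, not the script's actual settings:
```python
# Minimal sketch of the semantic filter; the query string, input file name,
# and 0.4 threshold are illustrative, not the script's actual settings.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "deep learning and large language models applied to virology and epidemiology"

df = pd.read_excel("aggregated_deduplicated.xlsx")  # step-01 output
abstracts = df["Abstract"].fillna("").tolist()

# Cosine similarity between the query embedding and each abstract embedding.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(abstracts, convert_to_tensor=True)
df["similarity"] = util.cos_sim(query_emb, doc_embs).squeeze(0).cpu().numpy()

# Keep only abstracts above the relevance threshold.
df[df["similarity"] >= 0.4].to_excel("filtered_relevant.xlsx", index=False)
```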
Folder: step_03_text_extraction_llm/
This step uses a local large language model (LLaMA 3.2B via Ollama) to extract structured fields from full paper texts. The input can be either PDF (processed via OCR) or XML (parsed as plain text).
Goals:
- Process full text from PDFs or XMLs
- Strip reference sections
- Run LLM extraction of predefined fields from full text
Dependencies:
- Python ≥ 3.8
- `pdf2image`
- `pytesseract`
- `ollama`
- `pandas`
- `re`, `json`, `logging`, `pathlib` (Python standard library, no installation needed)
Installation:
```bash
pip install pdf2image pytesseract ollama pandas
```
System Requirements:
- Poppler installed and added to PATH
- Tesseract OCR installed and accessible
- Ollama installed and running LLaMA 3.2B locally
How to run:
```bash
python Textextraction_performancemetrics_medrxiv.py
python Textextraction_additionalfields_medrxiv.py
```
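Conceptually, both scripts follow the same PDF → OCR → LLM flow. A condensed sketch; the prompt wording and the `llama3.2` model tag are illustrative assumptions, not the repository's exact code:
```python
# Condensed sketch of the PDF -> OCR -> LLaMA flow; the prompt wording and the
# "llama3.2" model tag are illustrative assumptions, not the repository's code.
import ollama
import pytesseract
from pdf2image import convert_from_path

def extract_fields(pdf_path: str) -> str:
    # OCR every page (pdf2image needs Poppler; pytesseract needs Tesseract).
    pages = convert_from_path(pdf_path)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)

    # Crude heuristic to strip the reference section before prompting.
    text = text.split("\nReferences", 1)[0]

    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": "Return research_aim, ai_methodology, and dataset_name "
                       "as JSON for this paper:\n" + text,
        }],
    )
    return response["message"]["content"]
```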
Output:
- `Extracted_LLaMA_Output.csv`: Structured file with extracted fields from papers
Folder: step_04_human_annotated_data/
This step includes manually reviewed or corrected outputs from the LLM extraction phase. It covers detailed validation and correction of the following structured fields:
- `research_aim`: The primary goal or aim of the research
- `research_problem`: The specific research problem addressed
- `ai_objective`: What AI is being used for in the research
- `ai_methodology`: A brief description of the AI-based approach used
- `ai_method_details`: Details on techniques, architectures, or models
- `ai_method_type`: The AI method(s) used (as a list if multiple)
- `type_of_underlying_data`: The raw input data used for experiments
- `dataset_name`: Name(s) of the dataset(s) used
- `disease_name`: Viral or infectious disease studied (list if multiple)
- `virology_subdomain`: Subfield of virology addressed
- `was_performance_measured`: Boolean flag
- `performance_results`: Summary of the results
- `performance_measurement_details`: Description of how performance was evaluated
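For orientation, a single record in this schema might look as follows; every value is invented for illustration and does not come from the dataset:
```python
# Hypothetical example record; every value below is invented for illustration.
example_row = {
    "research_aim": "Early detection of influenza outbreaks from clinical notes",
    "ai_objective": "Outbreak signal classification",
    "ai_method_type": ["transformer", "LSTM"],
    "disease_name": ["influenza"],
    "was_performance_measured": True,
    "performance_results": "F1 = 0.87 on a held-out test set",
}
```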
Key contents:
- `Extracted_LLaMA_Output_biorxiv_groundtruth_tobecompleted.csv`
Purpose:
- Serve as ground truth for evaluation
- Enable supervised fine-tuning or training of models
- Clone the repository:
```bash
git clone https://github.com/jd-coderepos/virology-ai-papers.git
cd virology-ai-papers
```
- Install the required R and Python packages as outlined above.
- Follow the instructions in each step's folder.
Use cases:
- Literature mining in virology and public health
- Evaluation of LLM capabilities in scientific information extraction
- Systematic review automation in life sciences
If you use this dataset or codebase, please consider citing or referencing this repository in your work.