This repository provides a structured, four-step pipeline to collect, filter, process, and annotate scientific literature that applies deep learning and large language models (LLMs) to problems in virology and epidemiology. It integrates data from trusted sources like PubMed, bioRxiv, and medRxiv, with a strong focus on:
- Collecting metadata from scientific databases
- Semantic filtering of results to ensure relevance
- Extracting structured information using LLMs
- Human validation and correction of the extracted data
This pipeline produces high-quality datasets that support downstream tasks such as biomedical NLP research, systematic review automation, and AI benchmarking.
The project is organized into four modular steps, each in a dedicated folder:
| Folder Name | Description |
|---|---|
| `step_01_metadata_collection/` | Scripts and outputs related to collecting metadata from PubMed, bioRxiv, and medRxiv. |
| `step_02_semantic_filtering/` | Python scripts to semantically filter abstracts using sentence embeddings. |
| `step_03_text_extraction_llm/` | Code to extract and structure full-text data from PDFs or XMLs using OCR or parsing + LLaMA. |
| `step_04_human_annotated_data/` | Ground-truth corrections and final structured data annotated by humans. |
Folder: step_01_metadata_collection/
This step retrieves metadata and abstracts from PubMed, bioRxiv, and medRxiv using different tools:
- PubMed: Data is collected manually using the PubMed web interface.
- bioRxiv and medRxiv: Metadata is retrieved using the R package `medrxivr`.
All three sources use the same keyword queries and topic filters to ensure consistent data collection across repositories.
📘 Metadata Details
PubMed Metadata Fields:
- PMID: PubMed ID, a unique identifier for each publication
- Title: Title of the publication
- Authors: List of authors in the format Last Name, First Initials
- Citation: Citation details, including volume, issue, and pages
- First Author: The first listed author
- Journal/Book: Name of the journal or book
- Publication Year: Year of publication
- Create Date: Date the record was created in PubMed
- PMCID: PubMed Central ID
- NIHMS ID: NIH Manuscript Submission ID
- DOI: Digital Object Identifier
- Abstract: Abstract text
bioRxiv / medRxiv Metadata Fields:
- Authors: List of authors
- Year of publication: Year when the article was posted
- Title of article: Title of the article
- category: Scientific category (e.g., Immunology, Epidemiology)
- date: Posting date on the preprint server
- Abstract: Article abstract
- doi: Digital Object Identifier for the article
Metadata from PubMed is collected via manual queries using the official PubMed web interface.
- A full record of query construction and methodology is documented here: Query Design Document
Before running the scripts, make sure R and RStudio are installed:
- Install R: https://cran.r-project.org/
- Install RStudio: https://posit.co/download/rstudio-desktop/
Install required R packages:
install.packages("medrxivr")
install.packages("dplyr")
install.packages("readxl")
install.packages("writexl")
The script `medrxiv_metadata_fetcher.R` fetches metadata from bioRxiv or medRxiv using keyword lists and filters the results by relevant scientific categories.
How to run:
```bash
Rscript medrxiv_metadata_fetcher.R
```
Or provide the full path:
```bash
Rscript "C:/Users/YourName/Desktop/project_folder/medrxiv_metadata_fetcher.R"
```
What it does:
- Downloads metadata using `mx_api_content()` with user-defined date ranges and source
- Saves raw data as `.rds` files
- Executes topic-specific keyword searches via `query_list`
- Filters by categories (e.g., Epidemiology, Immunology, Bioinformatics)
- Extracts fields like title, authors, category, date, abstract, and DOI
- Adds derived fields like publication year and publication name
- Outputs `.xlsx` files for each topic
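The fetcher itself is an R script, but the metadata ultimately comes from the public api.biorxiv.org endpoint that medrxivr wraps. For orientation, a minimal Python sketch of the same retrieval; the date range and the selected fields are illustrative, not the script's exact settings:
```python
# Illustrative only: the repository's fetcher is R, but medrxivr wraps this API.
import requests

def fetch_metadata(server="medrxiv", from_date="2020-01-01", to_date="2024-12-31"):
    """Page through api.biorxiv.org and collect one dict per preprint."""
    records, cursor = [], 0
    while True:
        url = f"https://api.biorxiv.org/details/{server}/{from_date}/{to_date}/{cursor}"
        batch = requests.get(url, timeout=30).json().get("collection", [])
        if not batch:
            break
        # Keep the same fields the R script exports.
        records += [{k: item.get(k) for k in
                     ("title", "authors", "category", "date", "abstract", "doi")}
                    for item in batch]
        cursor += len(batch)  # the API pages in batches of up to 100 records
    return records
```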
PDF Download: The script attempts to download full-text PDFs using mx_download() from the medrxivr package. If mx_download() fails to retrieve some files (e.g., due to server restrictions or silent failures), the script automatically falls back to a custom curl-based downloader. This backup method retries only the missing files and uses appropriate headers (including a User-Agent and Referer) to ensure compatibility with servers like medRxiv and bioRxiv.
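The repository's fallback uses curl; a comparable retry loop sketched in Python with the requests library, where the header values and output directory are illustrative assumptions:
```python
# Sketch of the curl-style fallback described above, using requests.
# Header values and the output directory are illustrative assumptions.
import requests
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; metadata-pipeline/1.0)",
    "Referer": "https://www.medrxiv.org/",
}

def download_missing(pdf_urls, out_dir="pdfs"):
    """Retry only the files that are absent or empty on disk."""
    Path(out_dir).mkdir(exist_ok=True)
    for url in pdf_urls:
        target = Path(out_dir) / url.rsplit("/", 1)[-1]
        if target.exists() and target.stat().st_size > 0:
            continue  # already downloaded successfully
        resp = requests.get(url, headers=HEADERS, timeout=60)
        if resp.ok and resp.content:
            target.write_bytes(resp.content)
```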
Example output files:
- `virology-llm.xlsx`
- `virology-transformer.xlsx`
- `virology-deep-learning.xlsx`
This script consolidates topic-specific outputs and eliminates duplicates.
How to run: edit the input folder path at the top of the script, then run it in R:
```r
folder_path <- "C:/Users/yourname/Desktop/bio_lib"
```
What it does:
- Combines `.xlsx` files into one dataset
- Identifies and logs duplicate DOIs to `duplicates_by_doi.xlsx`
- Outputs the cleaned dataset to `aggregated_deduplicated.xlsx`
Final Output:
- A single, deduplicated file containing metadata from all topic areas
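The aggregation script itself is R; purely as an illustration, the same consolidate-and-deduplicate logic in Python with pandas, assuming the folder layout and file names described above and a shared `doi` column:
```python
# The aggregation script is R; this pandas sketch mirrors its logic.
# Assumes the step-01 outputs sit in bio_lib/ and share a "doi" column.
import glob
import pandas as pd

frames = [pd.read_excel(f) for f in glob.glob("bio_lib/*.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Log every row involved in a DOI collision, then keep the first occurrence.
dupes = combined[combined.duplicated(subset="doi", keep=False)]
dupes.to_excel("duplicates_by_doi.xlsx", index=False)
combined.drop_duplicates(subset="doi").to_excel(
    "aggregated_deduplicated.xlsx", index=False)
```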
Folder: step_02_semantic_filtering/
This step applies embedding-based semantic filtering to refine the initial search results. It uses pretrained sentence transformers to evaluate the relevance of abstracts to the target domain.
Goals:
- Remove papers with irrelevant or ambiguous keyword matches
- Prioritize highly relevant documents for LLM processing
Typical script: `semantic_filtering.py`
Dependencies:
- Python ≥ 3.7
- `sentence-transformers`
- `pandas`
- `numpy`
- `scikit-learn`
Installation:
```bash
pip install sentence-transformers pandas numpy scikit-learn
```
Suggested model: `all-MiniLM-L6-v2` from sentence-transformers
How to run:
```bash
python semantic_filtering.py
```
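A minimal sketch of the embedding-based filtering; the query string, input file name, and 0.4 similarity threshold are illustrative assumptions, not the script's actual settings:
```python
# Minimal sketch of the semantic filter; the query string, input file name,
# and 0.4 threshold are illustrative, not the script's actual settings.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "deep learning and large language models applied to virology and epidemiology"

df = pd.read_excel("aggregated_deduplicated.xlsx")  # step-01 output
abstracts = df["Abstract"].fillna("").tolist()

# Cosine similarity between the query embedding and each abstract embedding.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(abstracts, convert_to_tensor=True)
df["similarity"] = util.cos_sim(query_emb, doc_embs).squeeze(0).cpu().numpy()

# Keep only abstracts above the relevance threshold.
df[df["similarity"] >= 0.4].to_excel("filtered_relevant.xlsx", index=False)
```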
Folder: step_03_text_extraction_llm/
This step uses a local large language model (LLaMA 3.2B via Ollama) to extract structured fields from full paper texts. The input can be either PDF (processed via OCR) or XML (parsed as plain text).
Goals:
- Process full text from PDFs or XMLs
- Strip reference sections
- Run LLM extraction of predefined fields from full text
Dependencies:
- Python ≥ 3.8
- `pdf2image`
- `pytesseract`
- `ollama`
- `pandas`
- `re`, `json`, `logging`, `pathlib` (Python standard library, no installation needed)
Installation:
```bash
pip install pdf2image pytesseract ollama pandas
```
System Requirements:
- Poppler installed and added to PATH
- Tesseract OCR installed and accessible
- Ollama installed and running LLaMA 3.2B locally
How to run:
```bash
python Textextraction_performancemetrics_medrxiv.py
python Textextraction_additionalfields_medrxiv.py
```
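Conceptually, both scripts follow the same PDF → OCR → LLM flow. A condensed sketch; the prompt wording and the `llama3.2` model tag are illustrative assumptions, not the repository's exact code:
```python
# Condensed sketch of the PDF -> OCR -> LLaMA flow; the prompt wording and the
# "llama3.2" model tag are illustrative assumptions, not the repository's code.
import ollama
import pytesseract
from pdf2image import convert_from_path

def extract_fields(pdf_path: str) -> str:
    # OCR every page (pdf2image needs Poppler; pytesseract needs Tesseract).
    pages = convert_from_path(pdf_path)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)

    # Crude heuristic to strip the reference section before prompting.
    text = text.split("\nReferences", 1)[0]

    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": "Return research_aim, ai_methodology, and dataset_name "
                       "as JSON for this paper:\n" + text,
        }],
    )
    return response["message"]["content"]
```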
Output:
- `Extracted_LLaMA_Output.csv`: Structured file with extracted fields from papers
Folder: step_04_human_annotated_data/
This step includes manually reviewed or corrected outputs from the LLM extraction phase. It covers detailed validation and correction of the following structured fields:
- `research_aim`: The primary goal or aim of the research
- `research_problem`: The specific research problem addressed
- `ai_objective`: What AI is being used for in the research
- `ai_methodology`: A brief description of the AI-based approach used
- `ai_method_details`: Details on techniques, architectures, or models
- `ai_method_type`: The AI method(s) used (as a list if multiple)
- `type_of_underlying_data`: The raw input data used for experiments
- `dataset_name`: Name(s) of the dataset(s) used
- `disease_name`: Viral or infectious disease studied (list if multiple)
- `virology_subdomain`: Subfield of virology addressed
- `was_performance_measured`: Boolean flag
- `performance_results`: Summary of the results
- `performance_measurement_details`: Description of how performance was evaluated
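For orientation, a single record in this schema might look as follows; every value is invented for illustration and does not come from the dataset:
```python
# Hypothetical example record; every value below is invented for illustration.
example_row = {
    "research_aim": "Early detection of influenza outbreaks from clinical notes",
    "ai_objective": "Outbreak signal classification",
    "ai_method_type": ["transformer", "LSTM"],
    "disease_name": ["influenza"],
    "was_performance_measured": True,
    "performance_results": "F1 = 0.87 on a held-out test set",
}
```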
Key contents:
- `Extracted_LLaMA_Output_biorxiv_groundtruth_tobecompleted.csv`
Purpose:
- Serve as ground truth for evaluation
- Enable supervised fine-tuning or training of models
- Clone the repository:
```bash
git clone https://github.com/jd-coderepos/virology-ai-papers.git
cd virology-ai-papers
```
- Install the required R and Python packages as outlined above.
- Follow the instructions in each step's folder.
Use cases:
- Literature mining in virology and public health
- Evaluation of LLM capabilities in scientific information extraction
- Systematic review automation in life sciences
If you use this dataset or codebase, please consider citing or referencing this repository in your work.