A configurable Python tool for extracting research data from PubMed for any disease or research topic

Proveer/pubmed-research-extractor

Generic Research Data Extractor

Author: Proveer Ghosh

A simple, configurable tool for extracting research data from PubMed for any disease or research topic.

Quick Start

  1. Install dependencies:

    pip install -r requirements.txt

  2. Run with default CIDP configuration:

    python generic_research_extractor.py

  3. Quick test (recommended first):

    python quick_test.py

Configuration

Using Default Configuration (CIDP)

config.py ships pre-configured for CIDP (chronic inflammatory demyelinating polyneuropathy) research. Replace the email address in config.py with your own, then run the extractor.

Using Example Configurations

Example configurations for other diseases are available in example_configs.py:

  • Multiple Sclerosis (MS)
  • Parkinson's Disease
  • Alzheimer's Disease
  • Rheumatoid Arthritis
  • Type 2 Diabetes

Example usage:

from generic_research_extractor import GenericDataExtractor
from example_configs import MSConfig

extractor = GenericDataExtractor(MSConfig)
results = extractor.extract_data()

Creating Custom Configuration

Create a new configuration class with the required fields:

class MyDiseaseConfig:
    EMAIL = "your@email.com"
    RESEARCH_FOCUS = "MyDisease"
    RESEARCH_FULL_NAME = "My Disease Full Name"
    START_DATE = "2020/01/01"
    END_DATE = "2024/12/31"
    MAX_RESULTS_PER_QUERY = 100
    OUTPUT_DIR = "mydisease_results"
    
    SEARCH_QUERIES = {
        "general": [
            "MyDisease OR \"My Disease Full Name\"",
            "MyDisease AND (diagnosis OR treatment)"
        ],
        "treatment_studies": [
            "MyDisease AND (treatment OR therapy)",
            "MyDisease AND (clinical trial OR RCT)"
        ]
    }
    
    # Optional: Treatment terms for filtering
    TREATMENT_TERMS = [
        "treatment1", "treatment2", "therapy"
    ]
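
Before running an extraction, it can help to check that a custom class defines every field shown above. The sketch below is an illustration only, not part of the tool: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the field list is simply copied from the example class, with TREATMENT_TERMS left out because it is marked optional.

```python
from typing import List

# Field names copied from the example configuration above.
# TREATMENT_TERMS is omitted because the README marks it optional.
REQUIRED_FIELDS = [
    "EMAIL", "RESEARCH_FOCUS", "RESEARCH_FULL_NAME", "START_DATE",
    "END_DATE", "MAX_RESULTS_PER_QUERY", "OUTPUT_DIR", "SEARCH_QUERIES",
]


def missing_fields(config_cls: type) -> List[str]:
    """Return the required field names that config_cls does not define."""
    return [name for name in REQUIRED_FIELDS if not hasattr(config_cls, name)]


class IncompleteConfig:
    # Deliberately partial, to show what the check reports.
    EMAIL = "your@email.com"
    RESEARCH_FOCUS = "MyDisease"


if __name__ == "__main__":
    print(missing_fields(IncompleteConfig))
```

An empty list means the class has every field used in the example; whether the extractor itself needs all of them is not confirmed here.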

Output

The tool generates CSV files organized by research category:

  • {disease}_all_combined.csv - All extracted articles
  • {disease}_general.csv - General research articles
  • {disease}_treatment_studies.csv - Treatment-focused studies
  • {disease}_clinical_trials.csv - Clinical trial articles
  • {disease}_treatment_focused.csv - Articles filtered by treatment terms
  • {disease}_summary_statistics.csv - Summary statistics
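
After a run, a quick way to sanity-check the files listed above is to count the rows in each one. This is a minimal sketch using only the standard library. It assumes the CSV files sit directly inside the configured OUTPUT_DIR and that each starts with a header row; `count_rows` is a hypothetical helper, not something the tool provides.

```python
import csv
from pathlib import Path
from typing import Dict


def count_rows(output_dir: str) -> Dict[str, int]:
    """Map each CSV file name in output_dir to its number of data rows."""
    counts = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed present)
            counts[path.name] = sum(1 for _ in reader)
    return counts


if __name__ == "__main__":
    # "mydisease_results" matches OUTPUT_DIR in the custom configuration example.
    for name, rows in count_rows("mydisease_results").items():
        print(f"{name}: {rows} rows")
```

If the directory does not exist or holds no CSV files, the function returns an empty dictionary rather than raising.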

Core Files

  • generic_research_extractor.py - Main extractor class
  • pubmedScrapper.py - PubMed scraping functionality
  • config.py - Default configuration (CIDP)
  • example_configs.py - Example configurations for other diseases
  • quick_test.py - Quick testing script

Key Features

  • Generic & Configurable: Works with any disease or research topic
  • Smart Filtering: Uses TREATMENT_TERMS to create focused datasets
  • Clinical Trial Detection: Automatically identifies NCT IDs (ClinicalTrials.gov registry numbers)
  • Multiple Categories: Organizes searches by research type
  • CSV Export: Saves data in multiple CSV files for analysis
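
The filtering and clinical trial detection features above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the tool's actual implementation: `find_nct_ids` and `mentions_treatment` are hypothetical names, and the `title` and `abstract` keys are assumed for illustration. The regular expression relies only on the ClinicalTrials.gov identifier format, "NCT" followed by eight digits.

```python
import re
from typing import Dict, List

# ClinicalTrials.gov registry numbers are "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b", re.IGNORECASE)


def find_nct_ids(text: str) -> List[str]:
    """Return unique NCT IDs in text, upper-cased, in order of first appearance."""
    seen = []
    for match in NCT_PATTERN.findall(text):
        nct_id = match.upper()
        if nct_id not in seen:
            seen.append(nct_id)
    return seen


def mentions_treatment(article: Dict[str, str], terms: List[str]) -> bool:
    """True if any term occurs (case-insensitively) in the title or abstract."""
    haystack = (article.get("title", "") + " " + article.get("abstract", "")).lower()
    return any(term.lower() in haystack for term in terms)
```

Note that this uses plain substring matching, so a term such as "therapy" also matches "chemotherapy"; a stricter filter would need word-boundary matching.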

Requirements

  • Python 3.7+
  • Dependencies listed in requirements.txt
  • Valid email for PubMed API access
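
To show where the email address and the configured date range would fit, here is a sketch that builds an NCBI E-utilities ESearch URL from configuration values. It is an assumption that the tool talks to E-utilities in this way; the source does not say how pubmedScrapper.py issues requests. `build_esearch_url` is a hypothetical helper, while the query parameters (`db`, `term`, `retmax`, `datetype`, `mindate`, `maxdate`, `email`, `retmode`) are standard ESearch parameters. The function only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, email: str, start_date: str,
                      end_date: str, max_results: int) -> str:
    """Build a PubMed ESearch URL restricted to a publication-date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "datetype": "pdat",      # filter on publication date
        "mindate": start_date,   # YYYY/MM/DD, as in START_DATE
        "maxdate": end_date,     # YYYY/MM/DD, as in END_DATE
        "email": email,          # identifies the caller to NCBI
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url(
        "MyDisease AND (treatment OR therapy)",
        "your@email.com", "2020/01/01", "2024/12/31", 100,
    ))
```

`urlencode` percent-encodes spaces, slashes and parentheses, so query strings from SEARCH_QUERIES can be passed in unchanged.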

Contributing

Contributions are welcome! Please feel free to:

  • Submit bug reports and feature requests
  • Add new disease configurations
  • Improve search query strategies
  • Enhance documentation

License

MIT License - feel free to use and modify for your research needs.
