Author: Proveer Ghosh
A simple, configurable tool for extracting research data from PubMed for any disease or research topic.
- Install dependencies:
  pip install -r requirements.txt
- Run with default CIDP configuration:
  python generic_research_extractor.py
- Quick test (recommended first):
  python quick_test.py
The tool is pre-configured for CIDP (chronic inflammatory demyelinating polyneuropathy) research in config.py. Update the email address there and run.
Example configurations for other diseases are available in example_configs.py:
- Multiple Sclerosis (MS)
- Parkinson's Disease
- Alzheimer's Disease
- Rheumatoid Arthritis
- Type 2 Diabetes
Example usage:
from generic_research_extractor import GenericDataExtractor
from example_configs import MSConfig
extractor = GenericDataExtractor(MSConfig)
results = extractor.extract_data()

Create a new configuration class with the required fields:
class MyDiseaseConfig:
    EMAIL = "your@email.com"
    RESEARCH_FOCUS = "MyDisease"
    RESEARCH_FULL_NAME = "My Disease Full Name"
    START_DATE = "2020/01/01"
    END_DATE = "2024/12/31"
    MAX_RESULTS_PER_QUERY = 100
    OUTPUT_DIR = "mydisease_results"
    SEARCH_QUERIES = {
        "general": [
            "MyDisease OR \"My Disease Full Name\"",
            "MyDisease AND (diagnosis OR treatment)"
        ],
        "treatment_studies": [
            "MyDisease AND (treatment OR therapy)",
            "MyDisease AND (clinical trial OR RCT)"
        ]
    }
    # Optional: Treatment terms for filtering
    TREATMENT_TERMS = [
        "treatment1", "treatment2", "therapy"
    ]

The tool generates CSV files organized by research category:
- {disease}_all_combined.csv: All extracted articles
- {disease}_general.csv: General research articles
- {disease}_treatment_studies.csv: Treatment-focused studies
- {disease}_clinical_trials.csv: Clinical trial articles
- {disease}_treatment_focused.csv: Articles filtered by treatment terms
- {disease}_summary_statistics.csv: Summary statistics
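
After a run, it can be handy to check which of these files were actually written. The following is a minimal sketch, not part of the tool: `expected_output_files` is a hypothetical helper, and the `{disease}` prefix is passed in explicitly because this README does not say how the tool derives it from the configuration (a lowercased RESEARCH_FOCUS is only a guess).

```python
from pathlib import Path

# File suffixes copied from the output list in this README.
OUTPUT_SUFFIXES = [
    "all_combined",
    "general",
    "treatment_studies",
    "clinical_trials",
    "treatment_focused",
    "summary_statistics",
]


def expected_output_files(output_dir, prefix):
    """Build the expected CSV paths for a given output directory and prefix.

    Hypothetical helper: `prefix` stands in for {disease}; how the tool
    chooses it is an assumption, so it is left to the caller.
    """
    return [Path(output_dir) / f"{prefix}_{suffix}.csv" for suffix in OUTPUT_SUFFIXES]


# Example values mirror MyDiseaseConfig.OUTPUT_DIR; the prefix is assumed.
paths = expected_output_files("mydisease_results", "mydisease")
for path in paths:
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```

Files reported as missing may simply mean that no articles matched that category, so treat the result as a prompt to look closer rather than as an error.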
Project files:
- generic_research_extractor.py: Main extractor class
- pubmedScrapper.py: PubMed scraping functionality
- config.py: Default configuration (CIDP)
- example_configs.py: Example configurations for other diseases
- quick_test.py: Quick testing script

Features:
- Generic & Configurable: Works with any disease or research topic
- Smart Filtering: Uses TREATMENT_TERMS to create focused datasets
- Clinical Trial Detection: Automatically identifies NCT IDs
- Multiple Categories: Organizes searches by research type
- CSV Export: Saves data in multiple CSV files for analysis
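
To illustrate the filtering and clinical trial detection features listed above, here is a rough sketch of how such checks could work. It is not the tool's actual implementation: `find_nct_ids` and `mentions_treatment` are hypothetical names, and the matching rules (ClinicalTrials.gov IDs written as "NCT" plus eight digits, case-insensitive substring matching for treatment terms) are assumptions.

```python
import re

# ClinicalTrials.gov identifiers are written as "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_ids(text):
    """Return the unique NCT IDs found in text, in order of first appearance."""
    return list(dict.fromkeys(NCT_PATTERN.findall(text)))


def mentions_treatment(text, treatment_terms):
    """Return True if any treatment term occurs in text (case-insensitive).

    Assumption: plain substring matching; the real tool may match differently.
    """
    lowered = text.lower()
    return any(term.lower() in lowered for term in treatment_terms)


abstract = "Patients received therapy in trial NCT01234567 (see also NCT01234567)."
print(find_nct_ids(abstract))
print(mentions_treatment(abstract, ["treatment1", "therapy"]))
```

Substring matching is deliberately loose: a term such as "therapy" also matches "immunotherapy", which may or may not be what you want when choosing TREATMENT_TERMS.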

Requirements:
- Python 3.7+
- Dependencies listed in requirements.txt
- Valid email for PubMed API access
Contributions are welcome! Please feel free to:
- Submit bug reports and feature requests
- Add new disease configurations
- Improve search query strategies
- Enhance documentation
MIT License - feel free to use and modify for your research needs.