Track, index, and analyze scientific papers from PubMed and bioRxiv for specific drug targets, compounds, or therapeutic areas.
- Multi-source ingestion: PubMed and bioRxiv APIs
- Semantic search: BioBERT embeddings for intelligent literature search
- Trend analysis: Track research trends over time
- RAG pipeline: LLM-powered synthesis and summarization
```
pip install -r requirements.txt
```
Copy the example environment file and edit it:
```
cp .env.example .env
```
Edit `.env` and set your PubMed email:
```
PUBMED_EMAIL=your-email@example.com
PUBMED_API_KEY=your-api-key-here  # Optional, for higher rate limits
```
To get a PubMed API key:
- Visit: https://www.ncbi.nlm.nih.gov/account/
- Create an NCBI account
- Generate an API key from your account settings
Run the test script to validate all components:
```
python test_pipeline.py
```
This will:
- Test PubMed client with a sample query
- Test bioRxiv client with a sample query
- Generate embeddings for fetched papers
- Store papers in ChromaDB and perform semantic search
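Under the hood, the PubMed client talks to NCBI's public E-utilities API. A minimal, self-contained sketch of building an `esearch` request is shown below; the endpoint and query parameters are NCBI's documented ones, but the helper function itself is illustrative, not the project's actual client code:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(term, email, api_key=None, retmax=100, days=30):
    """Build a PubMed esearch URL restricted to the last `days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "reldate": days,      # relative date window in days
        "datetype": "pdat",   # filter on publication date
        "retmode": "json",
        "email": email,       # NCBI asks clients to identify themselves
    }
    if api_key:
        params["api_key"] = api_key  # raises the rate limit from 3 to 10 req/s
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

url = build_esearch_url("EGFR inhibitor", "you@example.com", retmax=50)
```

The returned JSON contains PMIDs, which a second `efetch` call resolves to titles and abstracts.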
```
streamlit run app.py
```
The application will open in your browser at http://localhost:8501
Purpose: Retrieve and index papers from PubMed and bioRxiv
Steps:
- Enter a search query (e.g., "EGFR inhibitor", "CAR-T cell therapy")
- Select data sources (PubMed, bioRxiv, or both)
- Configure max results per source (10-500)
- Set how many days back to search (7-365)
- Click "Fetch and Index Papers"
What happens:
- Papers are fetched from selected sources
- Abstracts are embedded using BioBERT
- Papers are stored in ChromaDB with deduplication
- Database statistics are updated
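Deduplication typically hinges on assigning each paper a stable ID, so that re-fetching the same paper overwrites rather than duplicates it. A minimal sketch of that idea; the ID scheme and helper names here are illustrative assumptions, not necessarily what the indexer actually uses:

```python
import hashlib

def paper_id(paper: dict) -> str:
    """Derive a stable ID: prefer DOI, then PMID, else hash title+source."""
    if paper.get("doi"):
        return f"doi:{paper['doi'].lower()}"
    if paper.get("pmid"):
        return f"pmid:{paper['pmid']}"
    key = f"{paper.get('source', '')}|{paper.get('title', '')}".encode()
    return "sha1:" + hashlib.sha1(key).hexdigest()

def dedupe(papers):
    """Keep the first occurrence of each ID; an upsert keyed on the same
    ID in the vector store has the same net effect."""
    seen, unique = set(), []
    for p in papers:
        pid = paper_id(p)
        if pid not in seen:
            seen.add(pid)
            unique.append(p)
    return unique
```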
Purpose: Find relevant papers using natural language queries
Steps:
- Enter a search query in natural language (e.g., "novel therapeutic approaches for cancer")
- Set number of results (5-50)
- Optionally filter by source
- Click "Search"
Features:
- Semantic similarity matching (not keyword-based)
- Expandable results with full abstracts
- Similarity scores for each result
- Author and keyword information
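Semantic matching ranks papers by vector similarity between the query embedding and each stored abstract embedding, rather than by keyword overlap. A self-contained sketch using cosine similarity; real embeddings come from the BioBERT model and the ranking happens inside ChromaDB, so the toy vectors and helpers here are stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=5):
    """docs: list of (paper_id, embedding). Returns best matches first."""
    scored = [(pid, cosine(query_vec, emb)) for pid, emb in docs]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```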
Purpose: Visualize research trends over time
Steps:
- Select start and end dates
- Click "Analyze Trends"
Visualizations:
- Publications over time (line chart)
- Distribution by source (pie chart)
- Top query tags (bar chart)
- Top keywords (horizontal bar chart)
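The publications-over-time chart reduces to counting papers per time bucket before plotting. A stdlib sketch of that aggregation step; the `date` field name is an assumption, not necessarily the app's actual schema:

```python
from collections import Counter
from datetime import date

def monthly_counts(papers):
    """Count papers per YYYY-MM bucket from their publication dates."""
    counts = Counter(p["date"].strftime("%Y-%m") for p in papers)
    return sorted(counts.items())  # chronological (month, count) pairs

papers = [
    {"date": date(2024, 1, 5)},
    {"date": date(2024, 1, 20)},
    {"date": date(2024, 2, 2)},
]
# monthly_counts(papers) -> [("2024-01", 2), ("2024-02", 1)]
```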
Example indexing queries:
- EGFR inhibitor cancer
- CAR-T cell therapy leukemia
- Alzheimer's disease amyloid beta
- CRISPR gene editing
- mRNA vaccine COVID-19
- PD-1 checkpoint inhibitor melanoma

Example semantic search queries:
- novel therapeutic approaches for cancer
- resistance mechanisms to EGFR inhibitors
- clinical trials for Alzheimer's treatment
- safety concerns with CAR-T therapy
- biomarkers for early cancer detection
Edit .env to use a different sentence-transformers model:
```
EMBEDDING_MODEL=allenai/scibert_scivocab_uncased
```
Popular biomedical models:
- `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb` (default, 768 dim)
- `allenai/scibert_scivocab_uncased` (768 dim)
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract` (768 dim)
By default, papers are stored in ./chroma_db. To change:
```
CHROMA_PERSIST_DIR=/path/to/your/database
```
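In code, the persistence directory is typically resolved from the environment with a fallback default. A minimal sketch; the helper function is illustrative, though the variable name matches the setting above:

```python
import os

def chroma_persist_dir(default="./chroma_db"):
    """Resolve the ChromaDB storage path from CHROMA_PERSIST_DIR, if set."""
    return os.environ.get("CHROMA_PERSIST_DIR", default)
```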
- Make sure you've fetched and indexed papers first (Tab 1)
- Check database statistics in the sidebar
- First-time model download can be slow
- Consider using a smaller model
- Reduce batch size in `src/embedder.py`
- Add `PUBMED_API_KEY` to `.env` for higher limits
- Without key: 3 requests/second
- With key: 10 requests/second
- bioRxiv API only supports date-range queries
- Try a broader date range or more general search terms
- Python 3.8+
- 4GB RAM minimum (8GB recommended for large datasets)
- Internet connection for API access and model download
- ~2GB disk space for embedding models
- All data is stored locally in ChromaDB
- No data is sent to third parties (except API queries)
- PubMed and bioRxiv APIs are public and free to use
- Batch processing: Fetch papers in batches of 100-200 for optimal performance
- Incremental indexing: Use smaller date ranges and fetch regularly
- Cache models: Models are cached after first download
- Database persistence: ChromaDB persists automatically, no need to re-index
Error:
```
chromadb.errors.InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768
```
Cause: You changed the embedding model in .env but the database was created with a different model.
Solution:
```
python3 reset_database.py
```
Or manually delete the database:
```
rm -rf ./chroma_db
```
Then restart the application. You'll need to re-index your papers.
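A defensive check can surface this mismatch with a clearer message before indexing begins: compare the model's output dimensionality with the collection's expected dimension. A minimal pure-Python sketch; in the app the vector comes from sentence-transformers and the dimension from the ChromaDB collection, so this helper is an illustration, not existing project code:

```python
def check_dimensions(embedding, collection_dim):
    """Fail fast with a clear message instead of a deep ChromaDB error."""
    if len(embedding) != collection_dim:
        raise ValueError(
            f"Embedding dim {len(embedding)} != collection dim {collection_dim}; "
            "reset the database (reset_database.py) after changing EMBEDDING_MODEL."
        )
```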
Error:
```
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain
```
Cause: Your network uses self-signed SSL certificates.
Solution: Set VERIFY_SSL=false in your .env file:
```
VERIFY_SSL=false
```
Cause: Database is empty or papers haven't been indexed yet.
Solution:
- Go to the "Fetch & Index" tab
- Enter a search query (e.g., "EGFR inhibitor cancer")
- Select data sources (PubMed, bioRxiv)
- Click "Fetch and Index Papers"
- Wait for indexing to complete
- Then use the "Semantic Search" tab
Cause: Making too many requests without an API key.
Solution: Get a free NCBI API key:
- Visit https://www.ncbi.nlm.nih.gov/account/
- Create an account or sign in
- Generate an API key
- Add it to `.env`: `PUBMED_API_KEY=your-key-here`
With an API key, you get 10 requests/second instead of 3.
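Clients commonly enforce these limits by spacing requests a minimum interval apart. A minimal sketch of that pattern; the class is illustrative, not the project's actual PubMed client:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests
    (3/s without an NCBI key, 10/s with one)."""
    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.last = 0.0

    def wait(self):
        """Sleep just long enough to honor the rate limit, then record the call."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()

limiter = RateLimiter(3)  # bump to 10 when PUBMED_API_KEY is set
```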
Cause: First-time download of embedding model (300-400MB).
Solution:
- Wait for the download to complete (only happens once)
- Model is cached for future use
- Or use a smaller model in `.env`: `EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2`

Error:
ModuleNotFoundError: No module named 'streamlit'
Solution:
```
pip3 install -r requirements.txt
```
Cause: bioRxiv API only supports date-range queries and returns limited results.
Solution:
- Try broader date ranges (90-180 days)
- Use more general search terms
- bioRxiv has fewer papers than PubMed, so fewer results are normal
Warnings:
Failed to send telemetry event...
Cause: ChromaDB trying to send analytics (already suppressed in code).
Impact: These are harmless warnings and don't affect functionality.
Solution: Warnings are already suppressed in the code. You can ignore them.
If you want to switch embedding models:
- Update `.env`: `EMBEDDING_MODEL=your-new-model-name`
- Reset the database: `python3 reset_database.py`
- Restart the application: `streamlit run app.py`
- Re-index papers using the "Fetch & Index" tab
Fast & Compatible (384 dim):
- `sentence-transformers/all-MiniLM-L6-v2` (default)
- Best for: General use, fast performance

Biomedical-Specific (768 dim):
- `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb`
- Best for: Better accuracy on biomedical terms (requires more memory)

Scientific (768 dim):
- `allenai/scibert_scivocab_uncased`
- Best for: Scientific literature
The sidebar in the Streamlit app shows:
- Total papers indexed
- Papers by source
- Papers by query tag
For a completely fresh start:
```
rm -rf ./chroma_db
rm -rf ./.streamlit/cache
streamlit run app.py
```
Reduce batch size in `src/embedder.py` (line 50):
```
batch_size=16  # Instead of 32
```
Reduce `n_results` in search to return fewer results:
```
n_results=5  # Instead of 10
```
Switch to a smaller embedding model (384 dim instead of 768 dim).
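The batch-size tweak above controls how many abstracts are encoded per call; the splitting itself is plain list chunking. A generic sketch, where the commented-out `model.encode` call marks the spot sentence-transformers would run (not the project's actual code):

```python
def chunks(items, batch_size=16):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# for batch in chunks(abstracts, batch_size=16):
#     embeddings.extend(model.encode(batch))
```

Smaller batches trade throughput for a lower peak memory footprint during encoding.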
If you encounter issues not listed here:
- Check the terminal output for detailed error messages
- Review the `README.md` and `SETUP_GUIDE.md`
- Ensure all dependencies are installed: `pip3 install -r requirements.txt`
- Try resetting the database: `python3 reset_database.py`
For persistent issues, check:
- Python version (3.8+ required)
- Available disk space (2GB+ recommended)
- Available RAM (4GB+ recommended)
