Literature Intelligence Tool

Track, index, and analyze scientific papers from PubMed and bioRxiv for specific drug targets, compounds, or therapeutic areas.

[Process overview diagram]

Features

  • Multi-source ingestion: PubMed and bioRxiv APIs
  • Semantic search: BioBERT embeddings for intelligent literature search
  • Trend analysis: Track research trends over time
  • RAG pipeline: LLM-powered synthesis and summarization

Setup and Usage Guide

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

Copy the example environment file and edit it:

cp .env.example .env

Edit .env and set your PubMed email:

PUBMED_EMAIL=your-email@example.com
PUBMED_API_KEY=your-api-key-here  # Optional, for higher rate limits

To get a PubMed API key, create or sign in to an NCBI account at https://www.ncbi.nlm.nih.gov/account/ and generate a key in your account settings (see issue 4 in the Troubleshooting Guide below).

3. Test the Pipeline

Run the test script to validate all components:

python test_pipeline.py

This will (a standalone API sketch follows the list):

  • Test PubMed client with a sample query
  • Test bioRxiv client with a sample query
  • Generate embeddings for fetched papers
  • Store papers in ChromaDB and perform semantic search
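
For a quick independent check of the upstream APIs, the PubMed half of this flow reduces to a single NCBI E-utilities call. The sketch below is illustrative and is not the project's actual test code:

import requests

# ESearch returns the PMIDs matching a query; retmax caps the result count.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": "EGFR inhibitor", "retmax": 5, "retmode": "json"},
    timeout=30,
)
resp.raise_for_status()
pmids = resp.json()["esearchresult"]["idlist"]
print(f"Found {len(pmids)} PMIDs: {pmids}")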

4. Launch the Application

streamlit run app.py

The application will open in your browser at http://localhost:8501

Application Features

Tab 1: Fetch & Index

Purpose: Retrieve and index papers from PubMed and bioRxiv

Steps:

  1. Enter a search query (e.g., "EGFR inhibitor", "CAR-T cell therapy")
  2. Select data sources (PubMed, bioRxiv, or both)
  3. Configure max results per source (10-500)
  4. Set how many days back to search (7-365)
  5. Click "Fetch and Index Papers"

What happens (sketched in code after the list):

  • Papers are fetched from selected sources
  • Abstracts are embedded using BioBERT
  • Papers are stored in ChromaDB with deduplication
  • Database statistics are updated
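
Conceptually, the indexing step boils down to a few lines of sentence-transformers plus ChromaDB. This is a simplified sketch, not the repo's exact implementation; the shape of the papers list and the collection name are assumptions:

from sentence_transformers import SentenceTransformer
import chromadb

papers = [{"id": "PMID:12345", "abstract": "...", "source": "pubmed"}]  # assumed record shape

model = SentenceTransformer("pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("papers")  # collection name is illustrative

# upsert() deduplicates by ID: re-fetching the same paper overwrites instead of duplicating.
collection.upsert(
    ids=[p["id"] for p in papers],
    embeddings=model.encode([p["abstract"] for p in papers]).tolist(),
    documents=[p["abstract"] for p in papers],
    metadatas=[{"source": p["source"]} for p in papers],
)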

Tab 2: Semantic Search

Purpose: Find relevant papers using natural language queries

Steps:

  1. Enter a search query in natural language (e.g., "novel therapeutic approaches for cancer")
  2. Set number of results (5-50)
  3. Optionally filter by source
  4. Click "Search"

Features (see the query sketch after this list):

  • Semantic similarity matching (not keyword-based)
  • Expandable results with full abstracts
  • Similarity scores for each result
  • Author and keyword information
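
Under the hood, a semantic search embeds the query with the same model and asks ChromaDB for its nearest neighbours. A minimal sketch, using the same assumed collection as the indexing sketch above; the where clause illustrates source filtering:

from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer("pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb")
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("papers")

results = collection.query(
    query_embeddings=model.encode(["novel therapeutic approaches for cancer"]).tolist(),
    n_results=10,
    where={"source": "pubmed"},  # optional metadata filter
)

# ChromaDB returns one result set per query; smaller distance means a closer match.
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  {doc[:80]}")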

Tab 3: Trends Analysis

Purpose: Visualize research trends over time

Steps:

  1. Select start and end dates
  2. Click "Analyze Trends"

Visualizations (a minimal charting sketch follows the list):

  • Publications over time (line chart)
  • Distribution by source (pie chart)
  • Top query tags (bar chart)
  • Top keywords (horizontal bar chart)
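
These charts can be built from the indexed metadata with pandas and Streamlit, both already in the stack. In the sketch below, the date and source metadata keys are assumptions about how the app stores papers:

import pandas as pd
import streamlit as st
import chromadb

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("papers")

# Pull all metadata out of the collection and count publications per month.
metas = collection.get(include=["metadatas"])["metadatas"]
df = pd.DataFrame(metas)
df["month"] = pd.to_datetime(df["date"]).dt.to_period("M").astype(str)

st.line_chart(df.groupby("month").size())   # publications over time
st.bar_chart(df["source"].value_counts())   # distribution by source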

Example Queries

Fetch & Index Queries

  • EGFR inhibitor cancer
  • CAR-T cell therapy leukemia
  • Alzheimer's disease amyloid beta
  • CRISPR gene editing
  • mRNA vaccine COVID-19
  • PD-1 checkpoint inhibitor melanoma

Semantic Search Queries

  • novel therapeutic approaches for cancer
  • resistance mechanisms to EGFR inhibitors
  • clinical trials for Alzheimer's treatment
  • safety concerns with CAR-T therapy
  • biomarkers for early cancer detection

Advanced Configuration

Custom Embedding Model

Edit .env to use a different sentence-transformers model:

EMBEDDING_MODEL=allenai/scibert_scivocab_uncased

Popular biomedical models:

  • pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb (default, 768 dim)
  • allenai/scibert_scivocab_uncased (768 dim)
  • microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract (768 dim)

ChromaDB Storage Location

By default, papers are stored in ./chroma_db. To change:

CHROMA_PERSIST_DIR=/path/to/your/database

Troubleshooting

"No papers found" when searching

  • Make sure you've fetched and indexed papers first (Tab 1)
  • Check database statistics in the sidebar

Slow embedding generation

  • First-time model download can be slow
  • Consider using a smaller model
  • Reduce batch size in src/embedder.py

PubMed API rate limit errors

  • Add PUBMED_API_KEY to .env for higher limits
  • Without key: 3 requests/second
  • With key: 10 requests/second

bioRxiv returns no results

  • The bioRxiv API only supports date-range queries (see the sketch below)
  • Try a broader date range or more general search terms
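
For reference, the bioRxiv details endpoint is paged purely by date range, so any keyword filtering has to happen client-side. A minimal sketch against the public API:

import requests

# Endpoint shape: /details/biorxiv/{from}/{to}/{cursor}; each page returns up to 100 records.
url = "https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-03-31/0"
batch = requests.get(url, timeout=30).json()["collection"]

# No server-side keyword search: filter titles/abstracts locally.
hits = [p for p in batch if "crispr" in (p.get("abstract") or "").lower()]
print(f"{len(hits)} of {len(batch)} preprints in this page mention CRISPR")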

System Requirements

  • Python 3.8+
  • 4GB RAM minimum (8GB recommended for large datasets)
  • Internet connection for API access and model download
  • ~2GB disk space for embedding models

Data Privacy

  • All data is stored locally in ChromaDB
  • No data is sent to third parties (except API queries)
  • PubMed and bioRxiv APIs are public and free to use

Performance Tips

  1. Batch processing: Fetch papers in batches of 100-200 for optimal performance
  2. Incremental indexing: Use smaller date ranges and fetch regularly
  3. Cache models: Models are cached after first download
  4. Database persistence: ChromaDB persists automatically, no need to re-index

Troubleshooting Guide

Common Issues and Solutions

1. Embedding Dimension Mismatch

Error:

chromadb.errors.InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768

Cause: You changed the embedding model in .env but the database was created with a different model.

Solution:

python3 reset_database.py

Or manually delete the database:

rm -rf ./chroma_db

Then restart the application. You'll need to re-index your papers.
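
If you are unsure which dimensionality the existing database was built with, you can inspect one stored embedding before resetting. A diagnostic sketch, assuming the default storage path and a collection named papers:

import chromadb

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("papers")

# Fetch a single record with its embedding and report the stored vector length.
record = collection.get(limit=1, include=["embeddings"])
if record["embeddings"] is not None and len(record["embeddings"]) > 0:
    print("stored embedding dimension:", len(record["embeddings"][0]))
else:
    print("collection is empty")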


2. SSL Certificate Verification Failed

Error:

[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain

Cause: Your network uses self-signed SSL certificates.

Solution: Set VERIFY_SSL=false in your .env file:

VERIFY_SSL=false
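
For context, such a flag is usually threaded through to the HTTP layer along these lines. This is a sketch of the common pattern, not necessarily this repo's exact wiring:

import os
import requests
import urllib3

verify = os.getenv("VERIFY_SSL", "true").lower() != "false"
if not verify:
    # Suppress the InsecureRequestWarning that requests emits for unverified HTTPS.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi",
                    timeout=30, verify=verify)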

3. No Papers Found in Search

Cause: Database is empty or papers haven't been indexed yet.

Solution:

  1. Go to the "Fetch & Index" tab
  2. Enter a search query (e.g., "EGFR inhibitor cancer")
  3. Select data sources (PubMed, bioRxiv)
  4. Click "Fetch and Index Papers"
  5. Wait for indexing to complete
  6. Then use the "Semantic Search" tab

4. PubMed API Rate Limit Errors

Cause: Making too many requests without an API key.

Solution: Get a free NCBI API key:

  1. Visit https://www.ncbi.nlm.nih.gov/account/
  2. Create an account or sign in
  3. Generate an API key
  4. Add it to .env: PUBMED_API_KEY=your-key-here

With an API key, you get 10 requests/second instead of 3.
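
The key is passed as an extra query parameter on every E-utilities call. A throttled sketch; the sleep intervals follow directly from the 3 vs 10 requests/second limits:

import os
import time
import requests

api_key = os.getenv("PUBMED_API_KEY")
params = {"db": "pubmed", "term": "EGFR inhibitor", "retmode": "json"}
if api_key:
    params["api_key"] = api_key  # lifts the limit from 3 to 10 requests/second

for _ in range(3):
    requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                 params=params, timeout=30)
    time.sleep(0.1 if api_key else 0.34)  # stay under the applicable rate limit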


5. Model Download is Slow

Cause: First-time download of embedding model (300-400MB).

Solution:

  • Wait for the download to complete (only happens once)
  • Model is cached for future use
  • Or use a smaller model in .env (see the dimension check below):
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
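
Before switching, you can confirm a candidate model's output dimension up front (relevant to the dimension mismatch in issue 1); sentence-transformers exposes it directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384 here; the BioBERT/SciBERT models report 768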

6. Streamlit Not Starting

Error:

ModuleNotFoundError: No module named 'streamlit'

Solution:

pip3 install -r requirements.txt

7. bioRxiv Returns No Results

Cause: bioRxiv API only supports date-range queries and returns limited results.

Solution:

  • Try broader date ranges (90-180 days)
  • Use more general search terms
  • bioRxiv has fewer papers than PubMed, so fewer results are normal

8. ChromaDB Telemetry Warnings

Warnings:

Failed to send telemetry event...

Cause: ChromaDB trying to send analytics (already suppressed in code).

Impact: These are harmless warnings and don't affect functionality.

Solution: Warnings are already suppressed in the code. You can ignore them.


Changing Embedding Models

If you want to switch embedding models:

  1. Update .env:

    EMBEDDING_MODEL=your-new-model-name
  2. Reset database:

    python3 reset_database.py
  3. Restart application:

    streamlit run app.py
  4. Re-index papers using the "Fetch & Index" tab

Recommended Models

Fast & Compatible (384 dim):

  • sentence-transformers/all-MiniLM-L6-v2
  • Best for: General use, fast performance

Biomedical-Specific (768 dim):

  • pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb
  • Best for: Better accuracy on biomedical terms (requires more memory)

Scientific (768 dim):

  • allenai/scibert_scivocab_uncased
  • Best for: Scientific literature

Database Management

Check Database Stats

The sidebar in the Streamlit app shows (also retrievable programmatically; see the sketch after this list):

  • Total papers indexed
  • Papers by source
  • Papers by query tag
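
Those numbers can be pulled from ChromaDB directly, which is handy for scripting. A sketch; the query_tag metadata key is an assumption about how the app labels papers:

from collections import Counter
import chromadb

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("papers")

print("total papers:", collection.count())
metas = collection.get(include=["metadatas"])["metadatas"]
print("by source:", Counter(m.get("source") for m in metas))
print("by query tag:", Counter(m.get("query_tag") for m in metas))  # key name assumed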

Clean Start

For a completely fresh start:

rm -rf ./chroma_db
rm -rf ./.streamlit/cache
streamlit run app.py

Performance Optimization

Slow Embedding Generation

Reduce batch size in src/embedder.py (line 50):

batch_size=16  # down from 32; lower values reduce peak memory at some throughput cost

Slow Search

Reduce n_results in search (return fewer results):

n_results=5  # down from 10; fewer neighbours to retrieve and render

Memory Issues

Switch to a smaller embedding model (384 dim instead of 768 dim).


Getting Help

If you encounter issues not listed here:

  1. Check the terminal output for detailed error messages
  2. Review the README.md and SETUP_GUIDE.md
  3. Ensure all dependencies are installed: pip3 install -r requirements.txt
  4. Try resetting the database: python3 reset_database.py

For persistent issues, check:

  • Python version (3.8+ required)
  • Available disk space (2GB+ recommended)
  • Available RAM (4GB+ recommended)
