Track, index, and analyze scientific papers from PubMed and bioRxiv for specific drug targets, compounds, or therapeutic areas.
- Multi-source ingestion: PubMed and bioRxiv APIs
- Semantic search: BioBERT embeddings for intelligent literature search
- Trend analysis: Track research trends over time
- RAG pipeline: LLM-powered synthesis and summarization
```
pip install -r requirements.txt
```
Copy the example environment file and edit it:
```
cp .env.example .env
```
Edit `.env` and set your PubMed email:
```
PUBMED_EMAIL=your-email@example.com
PUBMED_API_KEY=your-api-key-here  # Optional, for higher rate limits
```
To get a PubMed API key:
- Visit: https://www.ncbi.nlm.nih.gov/account/
- Create an NCBI account
- Generate an API key from your account settings
Run the test script to validate all components:
```
python test_pipeline.py
```
This will:
- Test PubMed client with a sample query
- Test bioRxiv client with a sample query
- Generate embeddings for fetched papers
- Store papers in ChromaDB and perform semantic search
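Under the hood, the PubMed client talks to NCBI's public E-utilities API. A minimal, self-contained sketch of building an `esearch` request is shown below; the endpoint and query parameters are NCBI's documented ones, but the helper function itself is illustrative, not the project's actual client code:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(term, email, api_key=None, retmax=100, days=30):
    """Build a PubMed esearch URL restricted to the last `days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "reldate": days,      # relative date window in days
        "datetype": "pdat",   # filter on publication date
        "retmode": "json",
        "email": email,       # NCBI asks clients to identify themselves
    }
    if api_key:
        params["api_key"] = api_key  # raises the rate limit from 3 to 10 req/s
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

url = build_esearch_url("EGFR inhibitor", "you@example.com", retmax=50)
```

The returned JSON contains PMIDs, which a second `efetch` call resolves to titles and abstracts.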
```
streamlit run app.py
```
The application will open in your browser at http://localhost:8501
Purpose: Retrieve and index papers from PubMed and bioRxiv
Steps:
- Enter a search query (e.g., "EGFR inhibitor", "CAR-T cell therapy")
- Select data sources (PubMed, bioRxiv, or both)
- Configure max results per source (10-500)
- Set how many days back to search (7-365)
- Click "Fetch and Index Papers"
What happens:
- Papers are fetched from selected sources
- Abstracts are embedded using BioBERT
- Papers are stored in ChromaDB with deduplication
- Database statistics are updated
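Deduplication typically hinges on assigning each paper a stable ID, so that re-fetching the same paper overwrites rather than duplicates it. A minimal sketch of that idea; the ID scheme and helper names here are illustrative assumptions, not necessarily what the indexer actually uses:

```python
import hashlib

def paper_id(paper: dict) -> str:
    """Derive a stable ID: prefer DOI, then PMID, else hash title+source."""
    if paper.get("doi"):
        return f"doi:{paper['doi'].lower()}"
    if paper.get("pmid"):
        return f"pmid:{paper['pmid']}"
    key = f"{paper.get('source', '')}|{paper.get('title', '')}".encode()
    return "sha1:" + hashlib.sha1(key).hexdigest()

def dedupe(papers):
    """Keep the first occurrence of each ID; an upsert keyed on the same
    ID in the vector store has the same net effect."""
    seen, unique = set(), []
    for p in papers:
        pid = paper_id(p)
        if pid not in seen:
            seen.add(pid)
            unique.append(p)
    return unique
```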
Purpose: Find relevant papers using natural language queries
Steps:
- Enter a search query in natural language (e.g., "novel therapeutic approaches for cancer")
- Set number of results (5-50)
- Optionally filter by source
- Click "Search"
Features:
- Semantic similarity matching (not keyword-based)
- Expandable results with full abstracts
- Similarity scores for each result
- Author and keyword information
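Semantic matching ranks papers by vector similarity between the query embedding and each stored abstract embedding, rather than by keyword overlap. A self-contained sketch using cosine similarity; real embeddings come from the BioBERT model and the ranking happens inside ChromaDB, so the toy vectors and helpers here are stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=5):
    """docs: list of (paper_id, embedding). Returns best matches first."""
    scored = [(pid, cosine(query_vec, emb)) for pid, emb in docs]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```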
Purpose: Visualize research trends over time
Steps:
- Select start and end dates
- Click "Analyze Trends"
Visualizations:
- Publications over time (line chart)
- Distribution by source (pie chart)
- Top query tags (bar chart)
- Top keywords (horizontal bar chart)
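The publications-over-time chart reduces to counting papers per time bucket before plotting. A stdlib sketch of that aggregation step; the `date` field name is an assumption, not necessarily the app's actual schema:

```python
from collections import Counter
from datetime import date

def monthly_counts(papers):
    """Count papers per YYYY-MM bucket from their publication dates."""
    counts = Counter(p["date"].strftime("%Y-%m") for p in papers)
    return sorted(counts.items())  # chronological (month, count) pairs

papers = [
    {"date": date(2024, 1, 5)},
    {"date": date(2024, 1, 20)},
    {"date": date(2024, 2, 2)},
]
# monthly_counts(papers) -> [("2024-01", 2), ("2024-02", 1)]
```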
Example indexing queries:
- EGFR inhibitor cancer
- CAR-T cell therapy leukemia
- Alzheimer's disease amyloid beta
- CRISPR gene editing
- mRNA vaccine COVID-19
- PD-1 checkpoint inhibitor melanoma

Example semantic search queries:
- novel therapeutic approaches for cancer
- resistance mechanisms to EGFR inhibitors
- clinical trials for Alzheimer's treatment
- safety concerns with CAR-T therapy
- biomarkers for early cancer detection
Edit .env to use a different sentence-transformers model:
```
EMBEDDING_MODEL=allenai/scibert_scivocab_uncased
```
Popular biomedical models:
- `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb` (default, 768 dim)
- `allenai/scibert_scivocab_uncased` (768 dim)
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract` (768 dim)
By default, papers are stored in ./chroma_db. To change:
```
CHROMA_PERSIST_DIR=/path/to/your/database
```
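In code, the persistence directory is typically resolved from the environment with a fallback default. A minimal sketch; the helper function is illustrative, though the variable name matches the setting above:

```python
import os

def chroma_persist_dir(default="./chroma_db"):
    """Resolve the ChromaDB storage path from CHROMA_PERSIST_DIR, if set."""
    return os.environ.get("CHROMA_PERSIST_DIR", default)
```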
- Make sure you've fetched and indexed papers first (Tab 1)
- Check database statistics in the sidebar
- First-time model download can be slow
- Consider using a smaller model
- Reduce batch size in `src/embedder.py`
- Add `PUBMED_API_KEY` to `.env` for higher limits
- Without key: 3 requests/second
- With key: 10 requests/second
- bioRxiv API only supports date-range queries
- Try a broader date range or more general search terms
- Python 3.8+
- 4GB RAM minimum (8GB recommended for large datasets)
- Internet connection for API access and model download
- ~2GB disk space for embedding models
- All data is stored locally in ChromaDB
- No data is sent to third parties (except API queries)
- PubMed and bioRxiv APIs are public and free to use
- Batch processing: Fetch papers in batches of 100-200 for optimal performance
- Incremental indexing: Use smaller date ranges and fetch regularly
- Cache models: Models are cached after first download
- Database persistence: ChromaDB persists automatically, no need to re-index
Error:
```
chromadb.errors.InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768
```
Cause: You changed the embedding model in .env but the database was created with a different model.
Solution:
```
python3 reset_database.py
```
Or manually delete the database:
```
rm -rf ./chroma_db
```
Then restart the application. You'll need to re-index your papers.
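A defensive check can surface this mismatch with a clearer message before indexing begins: compare the model's output dimensionality with the collection's expected dimension. A minimal pure-Python sketch; in the app the vector comes from sentence-transformers and the dimension from the ChromaDB collection, so this helper is an illustration, not existing project code:

```python
def check_dimensions(embedding, collection_dim):
    """Fail fast with a clear message instead of a deep ChromaDB error."""
    if len(embedding) != collection_dim:
        raise ValueError(
            f"Embedding dim {len(embedding)} != collection dim {collection_dim}; "
            "reset the database (reset_database.py) after changing EMBEDDING_MODEL."
        )
```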
Error:
```
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain
```
Cause: Your network uses self-signed SSL certificates.
Solution: Set VERIFY_SSL=false in your .env file:
```
VERIFY_SSL=false
```
Cause: Database is empty or papers haven't been indexed yet.
Solution:
- Go to the "Fetch & Index" tab
- Enter a search query (e.g., "EGFR inhibitor cancer")
- Select data sources (PubMed, bioRxiv)
- Click "Fetch and Index Papers"
- Wait for indexing to complete
- Then use the "Semantic Search" tab
Cause: Making too many requests without an API key.
Solution: Get a free NCBI API key:
- Visit https://www.ncbi.nlm.nih.gov/account/
- Create an account or sign in
- Generate an API key
- Add it to `.env`: `PUBMED_API_KEY=your-key-here`
With an API key, you get 10 requests/second instead of 3.
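Clients commonly enforce these limits by spacing requests a minimum interval apart. A minimal sketch of that pattern; the class is illustrative, not the project's actual PubMed client:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests
    (3/s without an NCBI key, 10/s with one)."""
    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.last = 0.0

    def wait(self):
        """Sleep just long enough to honor the rate limit, then record the call."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()

limiter = RateLimiter(3)  # bump to 10 when PUBMED_API_KEY is set
```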
Cause: First-time download of embedding model (300-400MB).
Solution:
- Wait for the download to complete (only happens once)
- Model is cached for future use
- Or use a smaller model in `.env`: `EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2`

Error:
ModuleNotFoundError: No module named 'streamlit'
Solution:
```
pip3 install -r requirements.txt
```
Cause: bioRxiv API only supports date-range queries and returns limited results.
Solution:
- Try broader date ranges (90-180 days)
- Use more general search terms
- bioRxiv has fewer papers than PubMed, so fewer results are normal
Warnings:
Failed to send telemetry event...
Cause: ChromaDB trying to send analytics (already suppressed in code).
Impact: These are harmless warnings and don't affect functionality.
Solution: Warnings are already suppressed in the code. You can ignore them.
If you want to switch embedding models:
- Update `.env`: `EMBEDDING_MODEL=your-new-model-name`
- Reset the database: `python3 reset_database.py`
- Restart the application: `streamlit run app.py`
- Re-index papers using the "Fetch & Index" tab
Fast & Compatible (384 dim):
- `sentence-transformers/all-MiniLM-L6-v2` (default)
- Best for: General use, fast performance

Biomedical-Specific (768 dim):
- `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb`
- Best for: Better accuracy on biomedical terms (requires more memory)

Scientific (768 dim):
- `allenai/scibert_scivocab_uncased`
- Best for: Scientific literature
The sidebar in the Streamlit app shows:
- Total papers indexed
- Papers by source
- Papers by query tag
For a completely fresh start:
```
rm -rf ./chroma_db
rm -rf ./.streamlit/cache
streamlit run app.py
```
Reduce batch size in `src/embedder.py` (line 50):
```
batch_size=16  # Instead of 32
```
Reduce `n_results` in search to return fewer results:
```
n_results=5  # Instead of 10
```
Switch to a smaller embedding model (384 dim instead of 768 dim).
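The batch-size tweak above controls how many abstracts are encoded per call; the splitting itself is plain list chunking. A generic sketch, where the commented-out `model.encode` call marks the spot sentence-transformers would run (not the project's actual code):

```python
def chunks(items, batch_size=16):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# for batch in chunks(abstracts, batch_size=16):
#     embeddings.extend(model.encode(batch))
```

Smaller batches trade throughput for a lower peak memory footprint during encoding.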
If you encounter issues not listed here:
- Check the terminal output for detailed error messages
- Review the `README.md` and `SETUP_GUIDE.md`
- Ensure all dependencies are installed: `pip3 install -r requirements.txt`
- Try resetting the database: `python3 reset_database.py`
For persistent issues, check:
- Python version (3.8+ required)
- Available disk space (2GB+ recommended)
- Available RAM (4GB+ recommended)
