Skip to content

uclamii/CGT_FDR_Prototype

Repository files navigation

CGT FDR Prototype

This repository contains a comprehensive prototype for a Cancer Genetic Testing (CGT) chatbot system that helps patients understand their genetic test results and cancer risks. The system uses advanced language models with Retrieval-Augmented Generation (RAG) to provide clear, accurate, and empathetic responses to patient questions about genetic testing and hereditary cancer syndromes.

πŸš€ Features

Chatbot Implementations

  • chatbot_raw.py: Direct implementation using Microsoft Phi-4 model with comprehensive evaluation
  • chatbot_rag.py: RAG implementation combining Phi-4 with medical PDF guidelines
  • Memory-optimized: Designed to prevent system crashes with efficient memory management
  • Question Categorization: Automatic classification into 5 genetic counseling domains

Comprehensive Analysis & Reporting

  • analysis.py: Statistical analysis and report generation including:
    • Comprehensive Text Reports: Detailed analysis with executive summary, statistical comparisons, and category breakdowns
    • Summary Statistics CSV: Overall performance metrics comparing RAG vs Raw approaches
    • Category Analysis CSV: Detailed breakdown by question category with statistical significance testing
    • Publication-Ready Reports: Professional formatting suitable for grant reporting and publications
    • Statistical Rigor: Paired t-tests, Wilcoxon tests, Cohen's d effect sizes, and multiple comparison corrections

Comprehensive Answer Evaluation & Analysis

  • answer_evaluator.py: Comprehensive evaluation metrics including:
    • BERTScore vs Gold Standard: Precision, recall, and F1 scores against authoritative medical answers
    • Semantic similarity between questions and answers using sentence transformers
    • Readability metrics (Flesch reading ease and grade level)
    • Answer length analysis (character count, word count, sentence count, average sentence length)
    • Sentiment analysis (polarity and subjectivity)
    • Gold standard coverage tracking

Question Categorization System

Questions are automatically categorized into 5 genetic counseling domains:

  • Genetic Variant Interpretation (5 questions)
  • Inheritance Patterns (7 questions)
  • Family Risk Assessment (11 questions)
  • Gene-Specific Recommendations (14 questions)
  • Support and Resources (12 questions)

Data Processing & Migration Tools

  • regenerate_csvs.py: Convert existing CSV files to current format with comprehensive metrics
  • Answer cleaning utilities: Remove formatting artifacts and improve quality
  • Re-evaluation capabilities: Assess impact of data cleaning on metrics
  • Batch processing: Handle large question sets efficiently

Analysis & Visualization

  • Comprehensive Statistical Analysis: Professional reports with publication-ready formatting
  • Visual Dashboard Generation: Interactive 5-panel dashboard with statistical visualizations
  • Statistical comparison: Paired t-tests and effect size calculations between approaches
  • Category-specific analysis: Performance metrics by question domain
  • Integrated Output Location: All analysis outputs (reports, CSVs, and dashboards) saved to analysis/ directory

πŸ—‚οΈ Architecture

Raw Chatbot Approach

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Question  │────▢│   Phi-4     │────▢│   Answer    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚   Model     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
                                              β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚ Comprehensive   β”‚
                                    β”‚ Evaluation      β”‚
                                    β”‚ β€’ BERTScore     β”‚
                                    β”‚ β€’ Readability   β”‚
                                    β”‚ β€’ Categorizationβ”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

RAG Approach

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Question  │────▢│  Vector     │────▢│  Medical    β”‚
β”‚ +Category   β”‚     β”‚  Search     β”‚     β”‚ Guidelines  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                              β”‚
                                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Answer    │◀────│   Phi-4     │◀────│  Context    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚   Model     β”‚     β”‚   Prompt    β”‚
      β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Comprehensive   │────▢│ Statistical │────▢│  Dashboard  β”‚
β”‚ Evaluation      β”‚     β”‚  Analysis   β”‚     β”‚ & Reports   β”‚
β”‚ β€’ BERTScore     β”‚     β”‚ β€’ t-tests   β”‚     β”‚ β€’ Category  β”‚
β”‚ β€’ Gold Standard β”‚     β”‚ β€’ Effect    β”‚     β”‚ β€’ Comparisonβ”‚
β”‚ β€’ 16 Metrics    β”‚     β”‚   Size      β”‚     β”‚ β€’ Visuals   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two-Step Answer Generation

  1. RAG Context: Retrieves relevant medical guidelines from PDF documents
  2. LLM Knowledge: Falls back to model's general medical knowledge if needed
  3. Quality Assurance: Comprehensive evaluation with 16 metrics including BERTScore
  4. Gold Standard Comparison: Evaluates against authoritative medical answers

πŸ“‹ Requirements

Core Dependencies

torch>=1.9.0
transformers>=4.20.0
langchain>=0.0.200
langchain-community>=0.0.20
sentence-transformers>=2.2.0
chromadb>=0.3.0
textstat>=0.7.0
textblob>=0.17.0
bert-score>=0.3.13
PyPDF2>=3.0.0

Analysis Dependencies

pandas>=1.5.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
scipy>=1.9.0
plotly>=5.0.0

πŸ› οΈ Installation

  1. Clone the repository:
git clone [repository-url]
cd CGT_FDR_Prototype
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up directory structure:
CGT_FDR_Prototype/
β”œβ”€β”€ guidelines/              # PDF medical guidelines (NCCN, UpToDate, etc.)
β”œβ”€β”€ qa_outputs/             # Generated results with comprehensive metrics
β”‚   β”œβ”€β”€ questions_answers_raw.csv      # Raw chatbot results
β”‚   β”œβ”€β”€ questions_answers_rag.csv      # RAG chatbot results  
β”‚   └── questions_answers_comparison.csv # Merged comparison for dashboard
β”œβ”€β”€ analysis/               # Analysis outputs and reports
β”‚   β”œβ”€β”€ comprehensive_analysis_report.txt  # Detailed text report
β”‚   β”œβ”€β”€ summary_statistics.csv            # Overall performance metrics CSV
β”‚   β”œβ”€β”€ category_analysis.csv             # Category-specific analysis CSV
β”‚   β”œβ”€β”€ cgt_analysis_dashboard.png        # Visual dashboard (high-resolution)
β”‚   └── cgt_analysis_dashboard.pdf        # Visual dashboard (publication-ready)
β”œβ”€β”€ config/                 # Configuration files
β”‚   └── question_categories.json      # Question categories definition
β”œβ”€β”€ helper/                 # Helper functions and utilities
β”‚   └── clean_results.py            # Answer cleaning functionality
β”œβ”€β”€ chroma_db/              # Vector database for RAG
β”œβ”€β”€ archive/                # Archived older versions
β”œβ”€β”€ questions.txt           # Input questions (49 total)
β”œβ”€β”€ questions_answers.txt   # Gold standard answers for BERTScore
β”œβ”€β”€ chatbot_raw.py         # Raw chatbot implementation
β”œβ”€β”€ chatbot_rag.py         # RAG chatbot implementation (recommended)
β”œβ”€β”€ answer_evaluator.py    # Comprehensive evaluation metrics
β”œβ”€β”€ analysis.py            # Comprehensive statistical analysis and reporting
β”œβ”€β”€ dashboard.py           # Visual dashboard generation with statistical plots
└── README.md

πŸš€ Usage

1. Generate Answers

RAG Chatbot (Recommended):

python chatbot_rag.py
  • Generates qa_outputs/questions_answers_rag.csv
  • Uses PDF guidelines for context-aware responses
  • Includes 16 comprehensive evaluation metrics

Raw Chatbot (Baseline):

python chatbot_raw.py
  • Generates qa_outputs/questions_answers_raw.csv
  • Direct model responses without guidelines
  • Same comprehensive evaluation metrics for comparison

2. Generate Comprehensive Analysis

Statistical Analysis & Report Generation:

python analysis.py
  • Creates comprehensive statistical analysis comparing RAG vs Raw performance
  • Generates publication-ready text report with detailed findings and recommendations
  • Outputs to analysis/ directory:
    • comprehensive_analysis_report.txt: Complete analysis with executive summary, statistics, and category breakdowns
    • summary_statistics.csv: Overall performance metrics with statistical significance testing
    • category_analysis.csv: Category-specific performance analysis with effect sizes

Analysis Features:

  • Statistical Testing: Paired t-tests, Wilcoxon tests, and effect size calculations
  • Category Analysis: Performance breakdown by genetic counseling domain
  • BERTScore Integration: Evaluation against gold standard medical answers
  • Publication Quality: Professional formatting suitable for grant reports and academic publications

3. Generate Visual Dashboard

Interactive Dashboard Creation:

python dashboard.py
  • Creates comprehensive visual dashboard comparing RAG vs Raw performance
  • Generates publication-ready visualizations with statistical significance testing
  • Outputs to analysis/ directory:
    • cgt_analysis_dashboard.png: High-resolution dashboard image (300 DPI)
    • cgt_analysis_dashboard.pdf: Publication-ready PDF version

Dashboard Features:

  • 5-Panel Layout: Overview metrics, category performance, BERTScore comparisons, performance matrix, and summary table
  • Statistical Visualization: Statistical significance markers, effect sizes, and confidence intervals
  • BERTScore Integration: Visual comparison against gold standard medical answers
  • Category Analysis: Performance breakdown by genetic counseling domain with insights

4. Convert Existing Data (if needed)

Upgrade existing CSV files to current format:

python regenerate_csvs.py
  • Converts old CSV format to current format with 16 metrics
  • Adds BERTScore evaluation against gold standard
  • Creates comparison CSV for dashboard use

5. Key Output Files

Chatbot Results:

File Description Metrics
questions_answers_raw.csv Raw chatbot results 16 comprehensive metrics
questions_answers_rag.csv RAG chatbot results 16 comprehensive metrics
questions_answers_comparison.csv Side-by-side comparison Raw vs RAG with _raw/_rag suffixes

Analysis Reports:

File Description Content
comprehensive_analysis_report.txt Detailed statistical analysis Executive summary, statistical comparisons, category analysis, recommendations
summary_statistics.csv Overall performance metrics Statistical significance, effect sizes, improvement percentages
category_analysis.csv Category-specific breakdowns Performance by genetic counseling domain with statistical tests
cgt_analysis_dashboard.png Visual dashboard (high-res) 5-panel dashboard with charts, statistics, and category analysis
cgt_analysis_dashboard.pdf Visual dashboard (publication) Professional PDF version suitable for reports and presentations

πŸ“Š Evaluation Metrics

Quality Metrics (16 Total)

  • Semantic Similarity: Relevance between question and answer (0-1)
  • Answer Length: Character count analysis
  • Word Count: Token-level analysis
  • Readability: Flesch reading ease and Flesch-Kincaid grade level
  • Sentence Analysis: Count and average sentence length
  • Sentiment: Polarity (-1 to 1) and subjectivity (0-1)
  • BERTScore vs Gold Standard: Precision, recall, and F1 scores
  • Gold Standard Coverage: Boolean flag for availability of authoritative answers

BERTScore Evaluation

  • Uses distilbert-base-uncased model for semantic similarity
  • Compares generated answers against gold standard medical answers from questions_answers.txt
  • Provides precision, recall, and F1 scores for comprehensive evaluation
  • Critical for grant reporting and publication requirements

Question Categories

All 49 questions are automatically categorized into:

  • Genetic Variant Interpretation: Understanding test results and implications
  • Inheritance Patterns: Heredity, family transmission, reproductive options
  • Family Risk Assessment: Testing recommendations for relatives
  • Gene-Specific Recommendations: Cancer risks and screening by gene (BRCA1/2, Lynch syndrome genes)
  • Support and Resources: Insurance, emotional support, finding care

🎯 Performance Optimizations

Memory Management

  • Optimized token generation limits (200-300 tokens)
  • Memory clearing after each question
  • Efficient model loading with device_map="auto"
  • CPU-based embeddings to preserve GPU memory

Quality Improvements

  • Advanced prompting strategies for medical context
  • Minimum answer length requirements (30+ characters)
  • Improved context retrieval (5 relevant document chunks)
  • Two-step generation with fallback mechanisms

Evaluation Features

  • BERTScore integration for semantic similarity vs gold standard
  • Comprehensive readability analysis for patient communication
  • Category-specific performance tracking
  • Statistical significance testing capabilities

πŸ“ˆ Expected Results

Based on comprehensive testing, the RAG approach typically shows:

  • Improved BERTScore vs gold standard medical answers
  • Better semantic similarity (improved relevance to questions)
  • Longer, more detailed answers with medical context
  • Better use of authoritative medical guidelines
  • More consistent quality across all 5 question categories
  • Higher statistical significance in paired comparisons

πŸ”§ Configuration

Model Settings

MODEL_NAME = "microsoft/phi-4"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
BERTSCORE_MODEL = "distilbert-base-uncased"
MAX_NEW_TOKENS = 300  # Increased for detailed answers
MIN_NEW_TOKENS = 50   # Ensures comprehensive responses
TEMPERATURE = 0.7
TOP_P = 0.9

RAG Settings

CHUNK_SIZE = 1000      # Increased for better context
CHUNK_OVERLAP = 200    # Improved coherence
RETRIEVAL_K = 5        # More comprehensive document retrieval

Question Categories Configuration

Question categories are stored in config/question_categories.json and can be modified to:

  • Add new question categories
  • Update existing questions in categories
  • Modify category names

The config file structure:

{
  "question_categories": {
    "Category Name": [
      "Question 1",
      "Question 2",
      ...
    ]
  }
}

Current categories:

  • Genetic Variant Interpretation (5 questions)
  • Inheritance Patterns (7 questions)
  • Family Risk Assessment (11 questions)
  • Gene-Specific Recommendations (14 questions)
  • Support and Resources (12 questions)

Evaluation Settings

FIELDNAMES = [
    "Question", "Answer", "Category",
    "semantic_similarity", "answer_length", "word_count", "char_count",
    "flesch_reading_ease", "flesch_kincaid_grade", "sentence_count", "avg_sentence_length",
    "sentiment_polarity", "sentiment_subjectivity",
    "bertscore_precision", "bertscore_recall", "bertscore_f1", "has_gold_standard"
]

🚨 Troubleshooting

Memory Issues

  • Reduce max_new_tokens in generation settings
  • Use CPU for embeddings: model_kwargs={'device': 'cpu'}
  • Clear cache regularly: torch.cuda.empty_cache() or torch.mps.empty_cache()

BERTScore Issues

  • Ensure questions_answers.txt exists with gold standard answers
  • Verify BERTScore model download: bert-score package
  • Check GPU/CPU availability for BERTScore computation

Quality Issues

  • Verify PDF documents in guidelines/ directory
  • Check question format in questions.txt (49 questions expected)
  • Review gold standard answers in questions_answers.txt
  • Validate CSV format with 16 metrics

File Structure Issues

  • Run regenerate_csvs.py to convert old format CSVs
  • Check for _enhanced suffix files (these are legacy versions)
  • Verify qa_outputs/ directory exists and is writable
  • Analysis outputs are saved with simple names (no timestamps) - new runs overwrite previous results

πŸ“Š Data Files Structure

Input Files

  • questions.txt: 49 genetic counseling questions
  • questions_answers.txt: Gold standard answers for BERTScore evaluation
  • guidelines/*.pdf: Medical guidelines (NCCN, UpToDate, etc.)

Output Files

  • qa_outputs/questions_answers_raw.csv: Raw chatbot results (16 metrics)
  • qa_outputs/questions_answers_rag.csv: RAG chatbot results (16 metrics)
  • qa_outputs/questions_answers_comparison.csv: Merged comparison with _raw/_rag suffixes

Archive

  • archive/: Contains older versions of scripts that have been updated
  • See archive/README.md for details on what was archived and why

πŸ“ Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Ensure all 16 evaluation metrics are properly implemented
  5. Test with both raw and RAG approaches
  6. Submit a pull request

πŸ“„ License

[Add your license information here]

🀝 Support

For questions or issues:

  1. Check the troubleshooting section
  2. Review the evaluation metrics documentation
  3. Verify BERTScore and gold standard setup
  4. Open an issue with detailed error information and system specifications

Note: This system is designed for research and development purposes in genetic counseling and testing. The evaluation metrics, including BERTScore comparison against gold standard medical answers, are designed to meet grant reporting and publication requirements. Always consult with healthcare professionals for actual medical advice.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages