This repository contains a comprehensive prototype for a Cancer Genetic Testing (CGT) chatbot system that helps patients understand their genetic test results and cancer risks. The system uses advanced language models with Retrieval-Augmented Generation (RAG) to provide clear, accurate, and empathetic responses to patient questions about genetic testing and hereditary cancer syndromes.
chatbot_raw.py: Direct implementation using Microsoft Phi-4 model with comprehensive evaluationchatbot_rag.py: RAG implementation combining Phi-4 with medical PDF guidelines- Memory-optimized: Designed to prevent system crashes with efficient memory management
- Question Categorization: Automatic classification into 5 genetic counseling domains
analysis.py: Statistical analysis and report generation including:- Comprehensive Text Reports: Detailed analysis with executive summary, statistical comparisons, and category breakdowns
- Summary Statistics CSV: Overall performance metrics comparing RAG vs Raw approaches
- Category Analysis CSV: Detailed breakdown by question category with statistical significance testing
- Publication-Ready Reports: Professional formatting suitable for grant reporting and publications
- Statistical Rigor: Paired t-tests, Wilcoxon tests, Cohen's d effect sizes, and multiple comparison corrections
answer_evaluator.py: Comprehensive evaluation metrics including:- BERTScore vs Gold Standard: Precision, recall, and F1 scores against authoritative medical answers
- Semantic similarity between questions and answers using sentence transformers
- Readability metrics (Flesch reading ease and grade level)
- Answer length analysis (character count, word count, sentence count, average sentence length)
- Sentiment analysis (polarity and subjectivity)
- Gold standard coverage tracking
Questions are automatically categorized into 5 genetic counseling domains:
- Genetic Variant Interpretation (5 questions)
- Inheritance Patterns (7 questions)
- Family Risk Assessment (11 questions)
- Gene-Specific Recommendations (14 questions)
- Support and Resources (12 questions)
regenerate_csvs.py: Convert existing CSV files to current format with comprehensive metrics- Answer cleaning utilities: Remove formatting artifacts and improve quality
- Re-evaluation capabilities: Assess impact of data cleaning on metrics
- Batch processing: Handle large question sets efficiently
- Comprehensive Statistical Analysis: Professional reports with publication-ready formatting
- Visual Dashboard Generation: Interactive 5-panel dashboard with statistical visualizations
- Statistical comparison: Paired t-tests and effect size calculations between approaches
- Category-specific analysis: Performance metrics by question domain
- Integrated Output Location: All analysis outputs (reports, CSVs, and dashboards) saved to
analysis/directory
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Question ββββββΆβ Phi-4 ββββββΆβ Answer β
βββββββββββββββ β Model β βββββββββββββββ
βββββββββββββββ β
βΌ
βββββββββββββββββββ
β Comprehensive β
β Evaluation β
β β’ BERTScore β
β β’ Readability β
β β’ Categorizationβ
βββββββββββββββββββ
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Question ββββββΆβ Vector ββββββΆβ Medical β
β +Category β β Search β β Guidelines β
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β
βΌ
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Answer βββββββ Phi-4 βββββββ Context β
βββββββββββββββ β Model β β Prompt β
β βββββββββββββββ βββββββββββββββ
β
βΌ
βββββββββββββββββββ βββββββββββββββ βββββββββββββββ
β Comprehensive ββββββΆβ Statistical ββββββΆβ Dashboard β
β Evaluation β β Analysis β β & Reports β
β β’ BERTScore β β β’ t-tests β β β’ Category β
β β’ Gold Standard β β β’ Effect β β β’ Comparisonβ
β β’ 16 Metrics β β Size β β β’ Visuals β
βββββββββββββββββββ βββββββββββββββ βββββββββββββββ
- RAG Context: Retrieves relevant medical guidelines from PDF documents
- LLM Knowledge: Falls back to model's general medical knowledge if needed
- Quality Assurance: Comprehensive evaluation with 16 metrics including BERTScore
- Gold Standard Comparison: Evaluates against authoritative medical answers
torch>=1.9.0
transformers>=4.20.0
langchain>=0.0.200
langchain-community>=0.0.20
sentence-transformers>=2.2.0
chromadb>=0.3.0
textstat>=0.7.0
textblob>=0.17.0
bert-score>=0.3.13
PyPDF2>=3.0.0
pandas>=1.5.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
scipy>=1.9.0
plotly>=5.0.0
- Clone the repository:
git clone [repository-url]
cd CGT_FDR_Prototype- Install dependencies:
pip install -r requirements.txt- Set up directory structure:
CGT_FDR_Prototype/
βββ guidelines/ # PDF medical guidelines (NCCN, UpToDate, etc.)
βββ qa_outputs/ # Generated results with comprehensive metrics
β βββ questions_answers_raw.csv # Raw chatbot results
β βββ questions_answers_rag.csv # RAG chatbot results
β βββ questions_answers_comparison.csv # Merged comparison for dashboard
βββ analysis/ # Analysis outputs and reports
β βββ comprehensive_analysis_report.txt # Detailed text report
β βββ summary_statistics.csv # Overall performance metrics CSV
β βββ category_analysis.csv # Category-specific analysis CSV
β βββ cgt_analysis_dashboard.png # Visual dashboard (high-resolution)
β βββ cgt_analysis_dashboard.pdf # Visual dashboard (publication-ready)
βββ config/ # Configuration files
β βββ question_categories.json # Question categories definition
βββ helper/ # Helper functions and utilities
β βββ clean_results.py # Answer cleaning functionality
βββ chroma_db/ # Vector database for RAG
βββ archive/ # Archived older versions
βββ questions.txt # Input questions (49 total)
βββ questions_answers.txt # Gold standard answers for BERTScore
βββ chatbot_raw.py # Raw chatbot implementation
βββ chatbot_rag.py # RAG chatbot implementation (recommended)
βββ answer_evaluator.py # Comprehensive evaluation metrics
βββ analysis.py # Comprehensive statistical analysis and reporting
βββ dashboard.py # Visual dashboard generation with statistical plots
βββ README.md
RAG Chatbot (Recommended):
python chatbot_rag.py- Generates
qa_outputs/questions_answers_rag.csv - Uses PDF guidelines for context-aware responses
- Includes 16 comprehensive evaluation metrics
Raw Chatbot (Baseline):
python chatbot_raw.py- Generates
qa_outputs/questions_answers_raw.csv - Direct model responses without guidelines
- Same comprehensive evaluation metrics for comparison
Statistical Analysis & Report Generation:
python analysis.py- Creates comprehensive statistical analysis comparing RAG vs Raw performance
- Generates publication-ready text report with detailed findings and recommendations
- Outputs to
analysis/directory:comprehensive_analysis_report.txt: Complete analysis with executive summary, statistics, and category breakdownssummary_statistics.csv: Overall performance metrics with statistical significance testingcategory_analysis.csv: Category-specific performance analysis with effect sizes
Analysis Features:
- Statistical Testing: Paired t-tests, Wilcoxon tests, and effect size calculations
- Category Analysis: Performance breakdown by genetic counseling domain
- BERTScore Integration: Evaluation against gold standard medical answers
- Publication Quality: Professional formatting suitable for grant reports and academic publications
Interactive Dashboard Creation:
python dashboard.py- Creates comprehensive visual dashboard comparing RAG vs Raw performance
- Generates publication-ready visualizations with statistical significance testing
- Outputs to
analysis/directory:cgt_analysis_dashboard.png: High-resolution dashboard image (300 DPI)cgt_analysis_dashboard.pdf: Publication-ready PDF version
Dashboard Features:
- 5-Panel Layout: Overview metrics, category performance, BERTScore comparisons, performance matrix, and summary table
- Statistical Visualization: Statistical significance markers, effect sizes, and confidence intervals
- BERTScore Integration: Visual comparison against gold standard medical answers
- Category Analysis: Performance breakdown by genetic counseling domain with insights
Upgrade existing CSV files to current format:
python regenerate_csvs.py- Converts old CSV format to current format with 16 metrics
- Adds BERTScore evaluation against gold standard
- Creates comparison CSV for dashboard use
Chatbot Results:
| File | Description | Metrics |
|---|---|---|
questions_answers_raw.csv |
Raw chatbot results | 16 comprehensive metrics |
questions_answers_rag.csv |
RAG chatbot results | 16 comprehensive metrics |
questions_answers_comparison.csv |
Side-by-side comparison | Raw vs RAG with _raw/_rag suffixes |
Analysis Reports:
| File | Description | Content |
|---|---|---|
comprehensive_analysis_report.txt |
Detailed statistical analysis | Executive summary, statistical comparisons, category analysis, recommendations |
summary_statistics.csv |
Overall performance metrics | Statistical significance, effect sizes, improvement percentages |
category_analysis.csv |
Category-specific breakdowns | Performance by genetic counseling domain with statistical tests |
cgt_analysis_dashboard.png |
Visual dashboard (high-res) | 5-panel dashboard with charts, statistics, and category analysis |
cgt_analysis_dashboard.pdf |
Visual dashboard (publication) | Professional PDF version suitable for reports and presentations |
- Semantic Similarity: Relevance between question and answer (0-1)
- Answer Length: Character count analysis
- Word Count: Token-level analysis
- Readability: Flesch reading ease and Flesch-Kincaid grade level
- Sentence Analysis: Count and average sentence length
- Sentiment: Polarity (-1 to 1) and subjectivity (0-1)
- BERTScore vs Gold Standard: Precision, recall, and F1 scores
- Gold Standard Coverage: Boolean flag for availability of authoritative answers
- Uses
distilbert-base-uncasedmodel for semantic similarity - Compares generated answers against gold standard medical answers from
questions_answers.txt - Provides precision, recall, and F1 scores for comprehensive evaluation
- Critical for grant reporting and publication requirements
All 49 questions are automatically categorized into:
- Genetic Variant Interpretation: Understanding test results and implications
- Inheritance Patterns: Heredity, family transmission, reproductive options
- Family Risk Assessment: Testing recommendations for relatives
- Gene-Specific Recommendations: Cancer risks and screening by gene (BRCA1/2, Lynch syndrome genes)
- Support and Resources: Insurance, emotional support, finding care
- Optimized token generation limits (200-300 tokens)
- Memory clearing after each question
- Efficient model loading with
device_map="auto" - CPU-based embeddings to preserve GPU memory
- Advanced prompting strategies for medical context
- Minimum answer length requirements (30+ characters)
- Improved context retrieval (5 relevant document chunks)
- Two-step generation with fallback mechanisms
- BERTScore integration for semantic similarity vs gold standard
- Comprehensive readability analysis for patient communication
- Category-specific performance tracking
- Statistical significance testing capabilities
Based on comprehensive testing, the RAG approach typically shows:
- Improved BERTScore vs gold standard medical answers
- Better semantic similarity (improved relevance to questions)
- Longer, more detailed answers with medical context
- Better use of authoritative medical guidelines
- More consistent quality across all 5 question categories
- Higher statistical significance in paired comparisons
MODEL_NAME = "microsoft/phi-4"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
BERTSCORE_MODEL = "distilbert-base-uncased"
MAX_NEW_TOKENS = 300 # Increased for detailed answers
MIN_NEW_TOKENS = 50 # Ensures comprehensive responses
TEMPERATURE = 0.7
TOP_P = 0.9CHUNK_SIZE = 1000 # Increased for better context
CHUNK_OVERLAP = 200 # Improved coherence
RETRIEVAL_K = 5 # More comprehensive document retrievalQuestion categories are stored in config/question_categories.json and can be modified to:
- Add new question categories
- Update existing questions in categories
- Modify category names
The config file structure:
{
"question_categories": {
"Category Name": [
"Question 1",
"Question 2",
...
]
}
}Current categories:
- Genetic Variant Interpretation (5 questions)
- Inheritance Patterns (7 questions)
- Family Risk Assessment (11 questions)
- Gene-Specific Recommendations (14 questions)
- Support and Resources (12 questions)
FIELDNAMES = [
"Question", "Answer", "Category",
"semantic_similarity", "answer_length", "word_count", "char_count",
"flesch_reading_ease", "flesch_kincaid_grade", "sentence_count", "avg_sentence_length",
"sentiment_polarity", "sentiment_subjectivity",
"bertscore_precision", "bertscore_recall", "bertscore_f1", "has_gold_standard"
]- Reduce
max_new_tokensin generation settings - Use CPU for embeddings:
model_kwargs={'device': 'cpu'} - Clear cache regularly:
torch.cuda.empty_cache()ortorch.mps.empty_cache()
- Ensure
questions_answers.txtexists with gold standard answers - Verify BERTScore model download:
bert-scorepackage - Check GPU/CPU availability for BERTScore computation
- Verify PDF documents in
guidelines/directory - Check question format in
questions.txt(49 questions expected) - Review gold standard answers in
questions_answers.txt - Validate CSV format with 16 metrics
- Run
regenerate_csvs.pyto convert old format CSVs - Check for
_enhancedsuffix files (these are legacy versions) - Verify
qa_outputs/directory exists and is writable - Analysis outputs are saved with simple names (no timestamps) - new runs overwrite previous results
questions.txt: 49 genetic counseling questionsquestions_answers.txt: Gold standard answers for BERTScore evaluationguidelines/*.pdf: Medical guidelines (NCCN, UpToDate, etc.)
qa_outputs/questions_answers_raw.csv: Raw chatbot results (16 metrics)qa_outputs/questions_answers_rag.csv: RAG chatbot results (16 metrics)qa_outputs/questions_answers_comparison.csv: Merged comparison with _raw/_rag suffixes
archive/: Contains older versions of scripts that have been updated- See
archive/README.mdfor details on what was archived and why
- Fork the repository
- Create a feature branch
- Make your changes
- Ensure all 16 evaluation metrics are properly implemented
- Test with both raw and RAG approaches
- Submit a pull request
[Add your license information here]
For questions or issues:
- Check the troubleshooting section
- Review the evaluation metrics documentation
- Verify BERTScore and gold standard setup
- Open an issue with detailed error information and system specifications
Note: This system is designed for research and development purposes in genetic counseling and testing. The evaluation metrics, including BERTScore comparison against gold standard medical answers, are designed to meet grant reporting and publication requirements. Always consult with healthcare professionals for actual medical advice.