AI-Powered Thesis Review Tool
VeritaScribe is an intelligent document analysis system that automatically reviews PDF thesis documents for quality issues including grammar errors, content plausibility problems, and citation format inconsistencies. Built with DSPy for LLM orchestration, Pydantic for structured data modeling, and PyMuPDF for PDF processing.
- Multi-Provider LLM Support: Works with OpenAI, OpenRouter, Anthropic Claude, and custom endpoints.
- Multi-Language Analysis: Automatic language detection with native support for English and German documents, including language-specific grammar rules and cultural context.
- Comprehensive Analysis: Grammar checking, content validation, and citation format verification.
- DSPy Prompt Optimization: Advanced few-shot learning system with bilingual training data to fine-tune analysis prompts for maximum accuracy.
- Smart PDF Processing: Extracts text while preserving layout context and location information.
- Structured Reporting: Generates detailed reports with error locations, severities, and suggestions.
- Visual Analytics & Annotated PDFs: Creates charts, visualizations, and annotated PDFs to highlight error patterns.
- Flexible Configuration: Customizable analysis parameters and provider selection.
- CLI Interface: Easy-to-use command-line interface with multiple analysis modes.
- Cost Optimization: Choose from 100+ models across providers for optimal cost/performance.
- Python 3.13 or higher
- API key for your chosen LLM provider:
- OpenAI: Get API key
- OpenRouter: Get API key (access to 100+ models)
- Anthropic: Get API key (for Claude models)
- Custom: Any OpenAI-compatible endpoint (local models, Azure, etc.)
-
Clone the repository:
git clone <repository-url> cd VeritaScribe
-
Install dependencies using uv:
uv sync
-
Configure your LLM provider:
# Copy the example environment file cp .env.example .env # Edit .env and configure your preferred provider # See "LLM Providers" section below for examples
-
Try the demo (creates and analyzes a sample document):
uv run python -m veritascribe demo
-
Analyze your thesis:
uv run python -m veritascribe analyze your_thesis.pdf
-
Generate an annotated PDF with highlighted errors and detailed annotations:
uv run python -m veritascribe analyze your_thesis.pdf --annotate
This creates an interactive PDF with:
- Color-coded error highlights (red/orange/yellow by severity)
- Sticky note annotations with suggestions and explanations
- Perfect for sharing with supervisors or collaborators
Performs comprehensive analysis of a thesis document.
uv run python -m veritascribe analyze [OPTIONS] PDF_PATH
Options:
--output, -o
: Output directory for reports (default:./analysis_output
)--citation-style, -c
: Expected citation style (default:APA
)--quick, -q
: Quick analysis mode (first 10 blocks only)--no-viz
: Skip generating visualization charts--annotate
: Generate an annotated PDF with highlighted errors--verbose, -v
: Enable verbose logging
Examples:
# Standard analysis with annotated PDF
uv run python -m veritascribe analyze thesis.pdf --output ./results --annotate
# Quick analysis with annotation for iterative writing
uv run python -m veritascribe analyze draft.pdf --quick --annotate
# Full analysis with specific citation style and annotations
uv run python -m veritascribe analyze thesis.pdf --citation-style MLA --annotate --verbose
Analyzes a subset of the document for quick feedback.
uv run python -m veritascribe quick [OPTIONS] PDF_PATH
Options:
--blocks, -b
: Number of text blocks to analyze (default: 5)
Creates and analyzes a demo thesis document.
uv run python -m veritascribe demo
Displays current configuration settings.
uv run python -m veritascribe config
Shows all supported LLM providers, models, and configuration examples.
uv run python -m veritascribe providers
Fine-tunes the analysis prompts using few-shot learning with bilingual training data for improved accuracy.
uv run python -m veritascribe optimize-prompts
This command:
- Creates optimized modules for English and German analysis
- Uses curated training examples for grammar, content, and citation checking
- Improves analysis accuracy through DSPy's BootstrapFewShot compilation
- Takes several minutes to complete but significantly enhances analysis quality
Verifies that all components are working correctly.
uv run python -m veritascribe test
VeritaScribe supports multiple LLM providers, giving you flexibility in model choice, cost optimization, and feature access.
- Models: GPT-4, GPT-4 Turbo, GPT-4o, GPT-3.5 Turbo
- Best for: High quality, reliability
- Setup:
LLM_PROVIDER=openai OPENAI_API_KEY=sk-your-key-here DEFAULT_MODEL=gpt-4
- Models: 100+ models including Claude, Llama, Mistral, Gemini
- Best for: Cost optimization, model variety
- Setup:
LLM_PROVIDER=openrouter OPENROUTER_API_KEY=sk-or-your-key-here DEFAULT_MODEL=anthropic/claude-3.5-sonnet
- Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
- Best for: Advanced reasoning, long context
- Setup:
LLM_PROVIDER=anthropic ANTHROPIC_API_KEY=sk-ant-your-key-here DEFAULT_MODEL=claude-3-5-sonnet-20241022
- Models: Any OpenAI-compatible API (Ollama, Azure, local models)
- Best for: Privacy, local deployment, custom setups
- Examples:
# Local Ollama LLM_PROVIDER=custom OPENAI_API_KEY=ollama OPENAI_BASE_URL=http://localhost:11434/v1 DEFAULT_MODEL=llama3.1:8b # Azure OpenAI LLM_PROVIDER=custom OPENAI_API_KEY=your-azure-key OPENAI_BASE_URL=https://your-resource.openai.azure.com/ DEFAULT_MODEL=gpt-4
Provider | Cost | Speed | Quality | Models | Privacy |
---|---|---|---|---|---|
OpenAI | High | Fast | Excellent | 6+ | Cloud |
OpenRouter | Variable | Variable | Excellent | 100+ | Cloud |
Anthropic | Medium | Fast | Excellent | 5+ | Cloud |
Custom | Variable | Variable | Variable | Unlimited | Configurable |
VeritaScribe uses environment variables for configuration. Copy .env.example
to .env
and customize.
VeritaScribe automatically detects document language and applies language-specific analysis:
- English: Full grammar checking, content validation, and citation verification
- German: Native grammar rules, cultural context awareness, and German academic conventions
The system automatically detects the primary language of your document and:
- Applies appropriate grammar rules and linguistic patterns
- Uses language-specific training data for enhanced accuracy
- Provides culturally-aware content validation
- Supports mixed-language documents with intelligent switching
# English thesis
uv run python -m veritascribe analyze english_thesis.pdf
# German thesis (automatically detected)
uv run python -m veritascribe analyze german_thesis.pdf
# Optimize prompts for both languages
uv run python -m veritascribe optimize-prompts
VeritaScribe generates several types of output:
A comprehensive Markdown report containing:
- Document summary and statistics
- Detailed error listings with locations
- Severity breakdown and recommendations
Structured data in JSON format for programmatic access:
- All detected errors with metadata
- Text block information
- Analysis statistics
Charts and graphs showing:
- Error distribution by type
- Error density per page
- Severity breakdown
An interactive PDF with errors visually highlighted and detailed annotations. Generated using --annotate
flag.
Features:
- Color-coded highlights directly on problematic text:
- π΄ Red: High-severity errors (critical issues requiring immediate attention)
- π Orange: Medium-severity errors (important improvements needed)
- π‘ Yellow: Low-severity errors (minor suggestions)
- Sticky note annotations with comprehensive error details:
- Error type and severity level
- Original problematic text excerpt
- Specific suggested corrections
- Detailed explanations of the issue
- AI confidence scores
- Smart positioning to prevent overlapping annotations
- Preserves original formatting while adding review layers
Perfect for: Collaborative review, supervisor feedback, iterative editing, and visual error identification.
The console output and Markdown report include details on token usage and the estimated cost of the analysis.
Run into issues? Check the Troubleshooting Guide for help.
VeritaScribe is built on a modular and extensible pipeline architecture, designed to process PDF documents, analyze their content using Large Language Models (LLMs), and generate comprehensive reports. The system leverages modern Python libraries like DSPy for LLM orchestration, Pydantic for data integrity, and PyMuPDF for efficient PDF handling.
The system is composed of several key components that work together in a coordinated workflow:
-
CLI (Command Line Interface) (
main.py
):- Built with Typer, providing a user-friendly command-line interface.
- Handles user commands (
analyze
,quick
,demo
, etc.), parses arguments, and initiates the analysis pipeline.
-
Configuration Management (
config.py
):- Uses Pydantic-Settings to manage configuration from environment variables and
.env
files. - Provides type-safe, centralized settings for LLM providers, API keys, analysis parameters, and processing options.
- Dynamically configures the DSPy environment based on the selected LLM provider.
- Uses Pydantic-Settings to manage configuration from environment variables and
-
PDF Processor (
pdf_processor.py
):- Utilizes PyMuPDF (fitz) for high-performance PDF parsing.
- Extracts text content along with its layout and location information (bounding boxes), which is crucial for accurate error reporting and annotation.
- Cleans and preprocesses text to handle common PDF artifacts and formatting issues.
-
LLM Analysis Modules (
llm_modules.py
):- The core of the analysis engine, built with DSPy (Declarative Self-improving Language Programs).
- Consists of specialized modules for different analysis types:
LinguisticAnalyzer
: Checks for grammar, spelling, and style errors.ContentValidator
: Assesses logical consistency and content plausibility.CitationChecker
: Verifies citation formats against specified styles (e.g., APA, MLA).
- Each module uses strongly-typed DSPy Signatures to ensure structured and predictable interactions with LLMs.
- Supports multi-language analysis by leveraging language detection and language-specific prompts and training data.
-
Data Models (
data_models.py
):- Employs Pydantic models to define clear, validated data structures for the entire application.
- Key models include
TextBlock
,BaseError
(with subclasses for different error types), andThesisAnalysisReport
. - Ensures data integrity and provides a consistent data flow between components.
-
Analysis Pipeline (
pipeline.py
):- Orchestrates the end-to-end analysis workflow.
- Coordinates the PDF processor, analysis modules, and report generator.
- Manages the flow of data from raw PDF to the final analysis report.
- Supports both sequential and parallel processing of text blocks for performance optimization, using Python's
concurrent.futures
.
-
Report Generator (
report_generator.py
):- Generates multiple output formats from the final
ThesisAnalysisReport
. - Creates detailed Markdown reports for human-readable summaries.
- Exports structured JSON data for programmatic use.
- Produces visualizations (e.g., error distribution charts) using Matplotlib.
- Generates annotated PDFs with highlighted errors and comments.
- Generates multiple output formats from the final
The typical workflow is as follows:
graph TD
A[User runs CLI command] --> B{Analysis Pipeline};
B --> C[PDF Processor: Extract TextBlocks];
C --> D{Analysis Orchestrator};
D --> E[Linguistic Analyzer];
D --> F[Content Validator];
D --> G[Citation Checker];
E --> H[LLM API Call];
F --> H;
G --> H;
H --> I[Parse & Validate LLM Response];
I --> J(AnalysisResult);
J --> K[Aggregate Results];
K --> L{ThesisAnalysisReport};
L --> M[Report Generator];
M --> N[Markdown Report];
M --> O[JSON Data];
M --> P[Visualizations];
M --> Q[Annotated PDF];
This modular design allows for easy extension, such as adding new analysis modules, supporting more output formats, or integrating different LLM providers. For more details, see the Architecture Guide.