A toolkit for extracting, analyzing, and querying financial data from PDFs, spreadsheets, and scanned documents, with built-in market data analysis and SEC filing retrieval.
This system consists of three main components:
- Financial Data Extractor: Extracts structured financial data from PDFs, spreadsheets, and scanned documents.
- Financial Chatbot: An AI-powered assistant that allows users to query the extracted financial data.
- Market Analysis Engine: Fetches real-time market data and SEC filings for comprehensive financial analysis.
- Multi-format Support: Process PDF files (both text-based and scanned) and spreadsheets (CSV, Excel)
- Advanced Table Detection: Identifies and extracts tables from documents
- OCR Capability: Extracts text from scanned documents
- Standardized Output: Normalizes financial terms and metrics
- Contextual Extraction: Captures both structured tables and relevant contextual information
- Notes & Footnotes: Extracts important notes and footnotes for comprehensive understanding
- AI-Powered Queries: Allows natural language questions about financial data
- Financial Analysis: Provides insights on financial metrics and ratios
- Source Attribution: Cites the parts of the document each answer came from
- Vector Search: Uses semantic search to find the most relevant information
- Integrated Market Data: Automatically enriches responses with market comparisons
- Visualizations: Generates charts and graphs in response to relevant queries
- Report Generation: Build and download comprehensive financial analysis reports in PDF or Markdown
- Automatic Company Detection: Identifies the company from financial documents
- Real-time Market Data: Fetches current financial metrics and stock performance
- SEC Filing Download: Retrieves official SEC filings (10-K, 10-Q, 8-K, etc.)
- Peer Comparison: Compares company performance against industry peers
- Interactive Charts: Visual representations of financial metrics and performance
- Comprehensive Analysis: Combines document data, market information, and SEC filings
# Clone the repository
git clone https://github.com/yourusername/financial-document-analysis.git
cd financial-document-analysis
# Install dependencies
pip install -r requirements.txt
# Install additional system dependencies
# For macOS:
brew install ghostscript tesseract
# For Ubuntu/Debian:
apt-get install ghostscript tesseract-ocr
# For Windows:
# Download and install Ghostscript from: https://ghostscript.com/releases/
# Download and install Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki

from enhanced_financial_data_extractor import extract_financial_data_rag
# Extract data from a document
result = extract_financial_data_rag("path/to/financial_report.pdf")
# Save the extracted data to JSON
from enhanced_financial_data_extractor import extract_and_save_financial_data_rag
output_path = extract_and_save_financial_data_rag("path/to/financial_report.pdf")
print(f"Extracted data saved to: {output_path}")

# Run the Streamlit web app
streamlit run bot.py

- Document Upload: Upload financial PDFs directly through the web interface
- AI Chat: Ask natural language questions about the financial document
- Source Citations: See which parts of the document the AI used to answer
- Report Generation: Compile important insights into a structured report
- Industry Comparison: Compare company performance with industry peers
- SEC Filing Integration: Download and analyze official SEC filings
- Interactive Visualizations: View charts and graphs on demand
- Data Export: Export extracted financial data as JSON
Access official SEC filings directly within the application:
- Supported Filing Types: 10-K, 10-Q, 8-K, S-1, DEF 14A, and more
- Automatic Processing: Extracts key sections from filings (Business Description, Risk Factors, etc.)
- Integrated Analysis: Combines SEC filings with document analysis for comprehensive insights
- Historical Filing Access: Retrieve multiple historical filings for trend analysis
- Seamless Knowledge Base Integration: SEC data is incorporated into the AI's knowledge base
To use the SEC functionality:
- After uploading a financial document, the system automatically detects the company ticker
- In the sidebar, select the filing type and number of filings to download
- Click "Download SEC Filings" to retrieve and process the data
- Ask questions that incorporate both your document and SEC filing information
The system allows you to build comprehensive financial analysis reports:
- Add important insights to the report with one click
- Automatically categorizes financial information by topic
- Preview the report structure directly in the UI
- Download the report as a PDF document with proper formatting
- Download the report as Markdown for easy editing
- Include source citations and page references
The financial data extractor produces a JSON structure with the following sections:
- metadata: Information about the source file and extraction process
- financial_data: Extracted tables with standardized column names
- contextual_text: Relevant text sections from the document
- notes: Footnotes and additional context from the document
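
As a rough illustration of working with this output, the sketch below loads an extraction file and checks that the four top-level sections are present. Only the section names come from the list above; the helper `load_extraction` and the contents of the sample payload are hypothetical, and real extractor output may be shaped differently.

```python
import json
import tempfile
from pathlib import Path

# Section names taken from the list above; everything inside them is assumed.
EXPECTED_SECTIONS = ("metadata", "financial_data", "contextual_text", "notes")


def load_extraction(path):
    """Load an extraction JSON file and list any expected sections it lacks."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    missing = [name for name in EXPECTED_SECTIONS if name not in data]
    return data, missing


# Hypothetical sample payload, written to a temporary file for the demo.
sample = {
    "metadata": {"source_file": "financial_report.pdf"},
    "financial_data": [{"Revenue": 1200, "Net Income": 150}],
    "contextual_text": ["Revenue grew during the fiscal year."],
    "notes": ["Figures are in thousands."],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "extracted.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    data, missing = load_extraction(sample_path)

print("sections found:", sorted(data))
print("sections missing:", missing)
```

Checking for missing sections up front keeps downstream code from failing with a `KeyError` when a document yields no tables or no footnotes.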
The system seamlessly incorporates market data:
- Automatic Company Detection: Identifies the target company from your document
- Industry Peer Mapping: Suggests appropriate peer companies for comparison
- Live Financial Data: Fetches real-time financial metrics and ratios
- Visual Comparisons: Creates charts to visualize relative performance
- Stock Performance Analysis: Analyzes stock price trends against industry benchmarks
- Custom Peer Selection: Manually adjust company ticker and peer companies
- Python 3.8+
- Streamlit 1.20+
- See requirements.txt for detailed dependencies
MIT License
This API serves as an interface to the D2K financial analysis backend.
- Install dependencies: pip install -r requirements.txt
- Run the server: python app.py
The API includes Swagger documentation for easy exploration and testing:
- Swagger UI: Access interactive API documentation at /api/docs when the server is running
- Test Endpoints: Try out API endpoints directly from the Swagger UI
- Model Schemas: View request/response schemas for all endpoints
- API Descriptions: Get detailed information about each endpoint's functionality
- URL: /api/health
- Method: GET
- Response: Status of the API
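
A minimal sketch of calling the health endpoint with Python's standard library follows. The base URL `http://localhost:5000` is an assumption (the host and port are not stated here), and the helper names are hypothetical. Because the response format is not documented in this section, the sketch returns the raw body text rather than parsing it.

```python
from urllib.request import Request, urlopen

# Assumption: the server listens locally on port 5000; adjust as needed.
BASE_URL = "http://localhost:5000"


def build_health_request(base_url=BASE_URL):
    """Build (but do not send) a GET request for /api/health."""
    return Request(f"{base_url.rstrip('/')}/api/health", method="GET")


def check_health(base_url=BASE_URL, timeout=5):
    """Send the request and return (HTTP status code, raw body text)."""
    with urlopen(build_health_request(base_url), timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Usage, with the server running:
#   status, body = check_health()
#   print(status, body)
```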
- URL: /api/analyze-document
- Method: POST
- Form Data:
  - file: PDF file to analyze
  - query (optional): Query to run against the document
- Response: Analysis results
- URL: /api/company-data/<ticker>
- Method: GET
- URL Parameters:
  - ticker: Company stock symbol
- Response: Company information and market data
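
Since the ticker is part of the URL path, it helps to percent-encode it before building the request. The sketch below shows one way to do that; the function name `company_data_url` and the default base URL are assumptions, not part of the API itself.

```python
from urllib.parse import quote


def company_data_url(ticker, base_url="http://localhost:5000"):
    """Return the company-data URL for a ticker, percent-encoding the symbol.

    The base URL default is an assumption; the path comes from the endpoint
    description above.
    """
    symbol = ticker.strip()
    if not symbol:
        raise ValueError("ticker must not be empty")
    # safe="" also encodes "/" so the symbol cannot add extra path segments.
    return f"{base_url.rstrip('/')}/api/company-data/{quote(symbol, safe='')}"


print(company_data_url("AAPL"))
```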
- URL: /api/document-query
- Method: POST
- Form Data:
  - file: PDF file to analyze
  - query: Query to run against the document
  - ticker (optional): Company ticker symbol for additional context
- Response: Query results based on document content
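
The form fields above can be sent as a multipart/form-data request. The following sketch builds such a request using only the standard library. The field names `file`, `query`, and `ticker` come from the list above; the helper names, the default base URL, and the default filename are hypothetical.

```python
import uuid
from urllib.request import Request


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one PDF under the 'file' field as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
        ).encode("utf-8"))
    parts.append((
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'
    ).encode("utf-8"))
    parts.append(file_bytes)
    parts.append(f'\r\n--{boundary}--\r\n'.encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_document_query_request(pdf_bytes, query, ticker=None,
                                 filename="report.pdf",
                                 base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /api/document-query."""
    fields = {"query": query}
    if ticker:  # ticker is optional, so it is only sent when provided
        fields["ticker"] = ticker
    body, content_type = encode_multipart(fields, filename, pdf_bytes)
    return Request(
        f"{base_url.rstrip('/')}/api/document-query",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Usage, with the server running:
#   from urllib.request import urlopen
#   with open("path/to/financial_report.pdf", "rb") as pdf:
#       request = build_document_query_request(pdf.read(), "What was net income?")
#   with urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

A third-party HTTP client would shorten this considerably; the standard-library version is used here only to keep the sketch free of extra dependencies.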
- URL: /api/generate-report
- Method: POST
- Query Parameters:
  - format (optional): Output format (markdown or pdf, default: markdown)
- Request Body: JSON with report data
- Response: Generated report in markdown or PDF format
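
A request to this endpoint combines a query parameter with a JSON body. The sketch below builds one with the standard library. The schema of the report data is not specified here, so the payload in the usage comment is a made-up placeholder; the function name and default base URL are likewise assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ALLOWED_FORMATS = ("markdown", "pdf")  # the two formats named above


def build_report_request(report_data, output_format="markdown",
                         base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /api/generate-report."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    query_string = urlencode({"format": output_format})
    return Request(
        f"{base_url.rstrip('/')}/api/generate-report?{query_string}",
        data=json.dumps(report_data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, with the server running (payload keys are hypothetical):
#   from urllib.request import urlopen
#   request = build_report_request({"title": "Q1 review"}, output_format="pdf")
#   with urlopen(request) as response:
#       report_bytes = response.read()
```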
- URL: /api/extract-tickers
- Method: POST
- Request Body: JSON with a query field
- Response: Extracted ticker symbols from query text