Conversation
Uses pdf-extract for text extraction, which handles CJK text, ToUnicode CMaps, font encodings, and other complex PDF text scenarios more reliably than the previous lopdf-based approach. Falls back gracefully to basic metadata extraction when lopdf parsing fails.

BREAKING CHANGE: Changes the internal PDF parsing mechanism from lopdf to pdf-extract while maintaining the same public API.

feat(toc-processor): add multi-mode extraction with automatic degradation

Introduces a three-mode TOC extraction pipeline with automatic fallback:

1. TocWithPageNumbers - when a TOC with page numbers is available
2. TocWithoutPageNumbers - when a TOC exists but lacks page numbers
3. NoToc - direct structure extraction from content using an LLM

Each mode degrades to the next when accuracy thresholds aren't met.

feat(structure-extractor): add LLM-powered structure extraction for no-TOC docs

Implements document structure extraction from page content when no TOC is available. Groups pages by token count and uses LLM analysis to identify hierarchical sections. Adds support for continuation across page groups with overlap handling.

feat(toc-processor): add refinement for oversized TOC entries

Adds the capability to recursively split large TOC entries that span too many pages or exceed token limits. Uses the same structure-extraction approach to identify sub-sections within oversized entries, improving the granularity of the document structure.
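The three-mode pipeline with automatic degradation can be sketched as a simple fallback chain. The mode names come from the commit message; the `Outline` type, the accuracy field, and the 0.8 threshold are hypothetical stand-ins for whatever the real crate uses:

```rust
// Hypothetical sketch of the three-mode TOC extraction fallback.
// Mode names are from the commit message; Outline and the
// accuracy threshold are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TocMode {
    TocWithPageNumbers,
    TocWithoutPageNumbers,
    NoToc,
}

struct Outline {
    mode: TocMode,
    /// Fraction of entries verified against page content (assumed metric).
    accuracy: f64,
}

const ACCURACY_THRESHOLD: f64 = 0.8;

/// Try each mode in order, degrading to the next whenever the
/// extracted outline is missing or fails the accuracy threshold.
fn extract_outline(try_mode: impl Fn(TocMode) -> Option<Outline>) -> Option<Outline> {
    for mode in [
        TocMode::TocWithPageNumbers,
        TocMode::TocWithoutPageNumbers,
        TocMode::NoToc,
    ] {
        if let Some(outline) = try_mode(mode) {
            if outline.accuracy >= ACCURACY_THRESHOLD {
                return Some(outline);
            }
        }
    }
    None
}
```

The key design point is that every mode shares one result type, so callers never need to know which extraction strategy actually succeeded.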
- Replace sequential LLM calls with concurrent processing using futures::join_all for better performance
- Add concurrent page-assignment verification in PageAssigner
- Implement concurrent TOC entry verification in IndexVerifier
- Add concurrent index repair functionality in IndexRepairer
- Refactor methods to static versions for concurrent use
- Improve performance of oversized-entry refinement in TocProcessor
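The commit replaces sequential calls with `futures::future::join_all` over async LLM requests. A minimal synchronous analogue using scoped threads shows the same fan-out/collect shape; `verify_entry` is a hypothetical stand-in for one TOC-entry verification call:

```rust
use std::thread;

/// Hypothetical stand-in for verifying one TOC entry (the real
/// code makes an async LLM call here).
fn verify_entry(entry: &str) -> bool {
    !entry.is_empty()
}

/// Fan out all verifications concurrently and collect results in
/// input order, mirroring what futures::future::join_all does for
/// a Vec of async futures.
fn verify_all(entries: &[&str]) -> Vec<bool> {
    thread::scope(|s| {
        let handles: Vec<_> = entries
            .iter()
            .map(|e| s.spawn(move || verify_entry(e)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

As with `join_all`, results come back in the same order as the inputs, so callers can zip them against the original entries regardless of completion order.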
- Create index_pdf.rs example demonstrating PDF indexing capabilities
- Implement automatic PDF format detection and hierarchical document parsing
- Add support for environment-variable configuration of LLM settings
- Include detailed usage instructions with command-line examples
- Integrate error handling and process exit codes for invalid inputs
- Provide comprehensive metrics output including timing and processing stats
- Add automatic workspace cleanup after indexing operations
Add support for configuring LLM settings through environment variables (LLM_API_KEY, LLM_MODEL, LLM_ENDPOINT) that override config-file values. All examples are updated to demonstrate both the environment-variable and default config-file approaches, with documentation updated to match. The changes touch every example file so that configuration methods stay consistent and runtime configuration is possible without modifying source code or configuration files. Fixes for workspace cleanup and metric display formatting are also included as part of the refactoring.
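The override order described above (environment variable first, then config-file value) can be sketched as below. The `LlmConfig` struct is illustrative; only the three variable names come from the commit message. The lookup is injected as a closure so the precedence logic is testable without touching the process environment:

```rust
/// Illustrative config-file values; the real struct lives in the crate's config module.
struct LlmConfig {
    api_key: String,
    model: String,
    endpoint: String,
}

/// Environment variables take precedence over config-file values.
/// `get_env` abstracts the lookup; in production, pass
/// `|name| std::env::var(name).ok()`.
fn resolve_llm_config(
    file_config: LlmConfig,
    get_env: impl Fn(&str) -> Option<String>,
) -> LlmConfig {
    LlmConfig {
        api_key: get_env("LLM_API_KEY").unwrap_or(file_config.api_key),
        model: get_env("LLM_MODEL").unwrap_or(file_config.model),
        endpoint: get_env("LLM_ENDPOINT").unwrap_or(file_config.endpoint),
    }
}
```

Resolving per field (rather than all-or-nothing) lets a user override just the model while keeping the API key and endpoint from the config file.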
- Add tracing_subscriber::fmt::init() to all examples for debug output
- Modify parse functions to accept an optional LLM client for enhanced PDF processing
- Update the PDF parser to use an external LLM client for TOC extraction and structure analysis
- Add with_llm_client constructors to TOC processing components
- Improve error handling in the event example by removing redundant error mapping
- Update examples to use cleaner output formatting and better documentation
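The optional-client pattern above (a plain constructor plus a `with_llm_client` variant) typically stores the client as an `Option`. A minimal sketch, in which `LlmClient` and the processor internals are placeholders and only the constructor name comes from the commit message:

```rust
/// Placeholder for the external LLM client the examples pass in.
struct LlmClient {
    endpoint: String,
}

/// TOC-processing component that works with or without an LLM client.
struct TocProcessor {
    llm: Option<LlmClient>,
}

impl TocProcessor {
    /// Default constructor: no client, basic extraction only.
    fn new() -> Self {
        Self { llm: None }
    }

    /// Enhanced constructor named in the commit: the client enables
    /// LLM-backed TOC extraction and structure analysis.
    fn with_llm_client(client: LlmClient) -> Self {
        Self { llm: Some(client) }
    }

    /// Whether LLM-enhanced processing is available.
    fn is_enhanced(&self) -> bool {
        self.llm.is_some()
    }
}
```

Keeping the client optional means every example can share one processor type, with `parse` functions branching on `is_enhanced()` rather than requiring a configured LLM up front.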