
Dev#58

Merged
zTgx merged 5 commits into main from dev
Apr 13, 2026


@zTgx
Contributor

@zTgx zTgx commented Apr 13, 2026

No description provided.

zTgx added 5 commits April 13, 2026 13:08
Uses pdf-extract for text extraction which handles CJK, ToUnicode CMap,
font encoding, and other complex PDF text scenarios more reliably than
the previous lopdf-based approach. Falls back gracefully to basic
metadata extraction when lopdf parsing fails.

BREAKING CHANGE: Changes internal PDF parsing mechanism from lopdf to
pdf-extract while maintaining the same public API.

feat(toc-processor): add multi-mode extraction with automatic degradation

Introduces a three-mode TOC extraction pipeline with automatic fallback:
1. TocWithPageNumbers - when TOC with page numbers is available
2. TocWithoutPageNumbers - when TOC exists but lacks page numbers
3. NoToc - direct structure extraction from content using LLM

Each mode degrades to the next when accuracy thresholds aren't met.
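
The fallback chain above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the three mode names come from the commit message, while `Attempt`, `extract_with_degradation`, the `threshold` parameter, and the choice to return the last mode's result even when it misses the threshold are assumptions.

```rust
/// Extraction modes named in the commit message, ordered from most to least
/// reliant on an existing table of contents.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TocMode {
    TocWithPageNumbers,
    TocWithoutPageNumbers,
    NoToc,
}

impl TocMode {
    /// The mode to degrade to, or `None` when no fallback is left.
    pub fn next(self) -> Option<TocMode> {
        match self {
            TocMode::TocWithPageNumbers => Some(TocMode::TocWithoutPageNumbers),
            TocMode::TocWithoutPageNumbers => Some(TocMode::NoToc),
            TocMode::NoToc => None,
        }
    }
}

/// Hypothetical result of one attempt: extracted titles plus an accuracy
/// score in the range 0.0 to 1.0.
pub struct Attempt {
    pub entries: Vec<String>,
    pub accuracy: f64,
}

/// Runs `extract` starting at `start` and degrades to the next mode while the
/// accuracy stays below `threshold`. The last mode's result is returned even
/// if it misses the threshold (an assumption of this sketch).
pub fn extract_with_degradation<F>(
    start: TocMode,
    threshold: f64,
    mut extract: F,
) -> (TocMode, Attempt)
where
    F: FnMut(TocMode) -> Attempt,
{
    let mut mode = start;
    loop {
        let attempt = extract(mode);
        if attempt.accuracy >= threshold {
            return (mode, attempt);
        }
        match mode.next() {
            Some(next) => mode = next,
            None => return (mode, attempt),
        }
    }
}
```

Passing the extractor as a closure keeps the degradation logic independent of how each mode is implemented, so the chain can be exercised without a PDF or an LLM.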

feat(structure-extractor): add LLM-powered structure extraction for no-TOC docs

Implements document structure extraction from page content when no TOC
is available. Groups pages by token count and uses LLM analysis to
identify hierarchical sections. Adds support for continuation across
page groups with overlap handling.
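
As an illustration of grouping pages by token count with overlap, here is a small sketch. The function name `group_pages`, its parameters, and the exact overlap policy (repeat the last `overlap` pages of one group at the start of the next) are assumptions; the commit message does not spell out these details.

```rust
/// Groups consecutive pages so that each group stays within `max_tokens`,
/// repeating the last `overlap` pages of a group at the start of the next.
/// `page_tokens[i]` is the precomputed token count of page `i`.
/// Returns page-index ranges as `(start, end)` with `end` exclusive.
pub fn group_pages(
    page_tokens: &[usize],
    max_tokens: usize,
    overlap: usize,
) -> Vec<(usize, usize)> {
    let mut groups = Vec::new();
    let mut start = 0;
    while start < page_tokens.len() {
        let mut end = start;
        let mut total = 0;
        // Always take at least one page, even if it alone exceeds the budget.
        while end < page_tokens.len()
            && (end == start || total + page_tokens[end] <= max_tokens)
        {
            total += page_tokens[end];
            end += 1;
        }
        groups.push((start, end));
        if end == page_tokens.len() {
            break;
        }
        // Step back by `overlap` pages, but always make forward progress.
        start = end.saturating_sub(overlap).max(start + 1);
    }
    groups
}
```

With five pages of 400 tokens each, a budget of 1200, and an overlap of 1, this yields the ranges `(0, 3)` and `(2, 5)`: page 2 appears in both groups, which gives the second LLM call the context needed to continue a section that started in the first.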

feat(toc-processor): add refinement for oversized TOC entries

Adds capability to recursively split large TOC entries that span too
many pages or exceed token limits. Uses the same structure extraction
approach to identify sub-sections within oversized entries, improving
granularity of document structure.
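
A rough sketch of the recursive refinement idea follows. The `Entry` type, the `refine` function, and the `max_depth` guard are hypothetical; the `split` closure stands in for the LLM-backed structure extraction, and only the page-span check is shown (a token-limit check would follow the same shape).

```rust
/// Hypothetical TOC entry covering pages `start_page..end_page` (end exclusive).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub title: String,
    pub start_page: usize,
    pub end_page: usize,
    pub children: Vec<Entry>,
}

/// Recursively splits entries that span more than `max_pages` pages.
/// `split` returns the sub-sections found inside an entry; `max_depth`
/// bounds the recursion so a misbehaving splitter cannot loop forever.
pub fn refine<F>(entry: &mut Entry, max_pages: usize, max_depth: usize, split: &F)
where
    F: Fn(&Entry) -> Vec<Entry>,
{
    let span = entry.end_page.saturating_sub(entry.start_page);
    if span <= max_pages || max_depth == 0 {
        return;
    }
    let subs = split(entry);
    // Keep the entry unchanged unless the split produced at least two parts.
    if subs.len() < 2 {
        return;
    }
    entry.children = subs;
    for child in entry.children.iter_mut() {
        refine(child, max_pages, max_depth - 1, split);
    }
}
```

Requiring at least two sub-sections before accepting a split is a design choice of this sketch: a single child covering the same pages would add a level of nesting without improving granularity.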
- Replace sequential LLM calls with concurrent processing using
  futures::join_all for better performance
- Add concurrent page assignment verification in PageAssigner
- Implement concurrent TOC entry verification in IndexVerifier
- Add concurrent index repair functionality in IndexRepairer
- Refactor methods to static versions for concurrent use
- Improve performance of oversized entry refinement in TocProcessor
- Create index_pdf.rs example demonstrating PDF indexing capabilities
- Implement automatic PDF format detection and hierarchical document parsing
- Add support for environment variable configuration for LLM settings
- Include detailed usage instructions with command-line examples
- Integrate error handling and process exit codes for invalid inputs
- Provide comprehensive metrics output including timing and processing stats
- Add automatic workspace cleanup after indexing operations
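
To illustrate the move from sequential to concurrent verification mentioned earlier in this list, here is a dependency-free sketch of the same fan-out and collect-in-order shape. Note the substitution: the commits use `futures::join_all` over async LLM calls, whereas this sketch uses `std::thread::scope` so that it compiles without external crates. `verify_entry` and `verify_all` are hypothetical names, and the check itself is a placeholder for an LLM-backed verification.

```rust
use std::thread;

/// Hypothetical stand-in for one LLM-backed check of a TOC entry.
/// A free function with no `self` state, so it can be called from any thread.
fn verify_entry(title: &str, page: usize) -> bool {
    !title.is_empty() && page > 0
}

/// Runs every check concurrently and returns the results in input order.
pub fn verify_all(entries: &[(String, usize)]) -> Vec<bool> {
    thread::scope(|scope| {
        // Fan out: one scoped thread per entry, borrowing from `entries`.
        let handles: Vec<_> = entries
            .iter()
            .map(|(title, page)| scope.spawn(move || verify_entry(title, *page)))
            .collect();
        // Fan in: joining in spawn order preserves the input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("verification thread panicked"))
            .collect()
    })
}
```

Keeping the per-entry check free of shared mutable state is what makes this kind of fan-out straightforward, which is presumably the motivation behind refactoring the methods to static versions.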
Add support for configuring LLM settings through environment
variables (LLM_API_KEY, LLM_MODEL, LLM_ENDPOINT) that override
config file values. Update all examples to demonstrate both
environment variable usage and default config file approaches
with updated documentation.
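
A minimal sketch of how such an override could work is shown below. Only the variable names `LLM_API_KEY`, `LLM_MODEL`, and `LLM_ENDPOINT` come from the commit message; the `LlmConfig` struct, its field names, and the decision to ignore empty variables are assumptions made for illustration.

```rust
use std::env;

/// Hypothetical LLM settings as loaded from a config file.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmConfig {
    pub api_key: String,
    pub model: String,
    pub endpoint: String,
}

impl LlmConfig {
    /// Overrides fields with values returned by `lookup`.
    /// Unset or blank variables leave the config-file value in place
    /// (treating blank as unset is an assumption of this sketch).
    pub fn with_overrides<F>(mut self, lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let pick = |name: &str| lookup(name).filter(|v| !v.trim().is_empty());
        if let Some(v) = pick("LLM_API_KEY") {
            self.api_key = v;
        }
        if let Some(v) = pick("LLM_MODEL") {
            self.model = v;
        }
        if let Some(v) = pick("LLM_ENDPOINT") {
            self.endpoint = v;
        }
        self
    }

    /// Applies overrides from the process environment.
    pub fn with_env_overrides(self) -> Self {
        self.with_overrides(|name| env::var(name).ok())
    }
}
```

Routing the lookup through a closure keeps the override logic testable without mutating the real process environment; `with_env_overrides` is the thin wrapper an example program would call after loading its config file.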

The changes affect all example files to provide consistent
configuration methods and improve usability by allowing
runtime configuration without modifying source code or
configuration files.

Fixes related to workspace cleanup and metric display formatting
are also included as part of the refactoring.
- Add tracing_subscriber::fmt::init() to all examples for debug output
- Modify parse functions to accept optional LLM client for enhanced PDF processing
- Update PDF parser to use external LLM client for TOC extraction and structure analysis
- Add with_llm_client constructors to TOC processing components
- Improve error handling in event example by removing redundant error mapping
- Update examples to use cleaner output formatting and better documentation
@vercel

vercel bot commented Apr 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vectorless | Ready | Preview, Comment | Apr 13, 2026 7:59am |

@zTgx zTgx merged commit 16b7f62 into main Apr 13, 2026
2 checks passed