Dev by zTgx · Pull Request #58 · vectorlessflow/vectorless

zTgx · 2026-04-13T07:59:38Z

No description provided.

Uses pdf-extract for text extraction which handles CJK, ToUnicode CMap, font encoding, and other complex PDF text scenarios more reliably than the previous lopdf-based approach. Falls back gracefully to basic metadata extraction when lopdf parsing fails.

BREAKING CHANGE: Changes internal PDF parsing mechanism from lopdf to pdf-extract while maintaining the same public API.

feat(toc-processor): add multi-mode extraction with automatic degradation

Introduces a three-mode TOC extraction pipeline with automatic fallback:

1. TocWithPageNumbers - when TOC with page numbers is available
2. TocWithoutPageNumbers - when TOC exists but lacks page numbers
3. NoToc - direct structure extraction from content using LLM

Each mode degrades to the next when accuracy thresholds aren't met.

feat(structure-extractor): add LLM-powered structure extraction for no-TOC docs

Implements document structure extraction from page content when no TOC is available. Groups pages by token count and uses LLM analysis to identify hierarchical sections. Adds support for continuation across page groups with overlap handling.

feat(toc-processor): add refinement for oversized TOC entries

Adds capability to recursively split large TOC entries that span too many pages or exceed token limits. Uses the same structure extraction approach to identify sub-sections within oversized entries, improving granularity of document structure.
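The three-mode fallback in the toc-processor commit can be sketched roughly as follows. Only the mode names come from the commit message; `degrade`, `Attempt`, `extract_with_fallback`, and the rule that the last mode's result is accepted regardless of accuracy are assumptions made for illustration, not the PR's actual code.

```rust
/// Extraction modes, in the order the pipeline tries them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TocMode {
    TocWithPageNumbers,
    TocWithoutPageNumbers,
    NoToc,
}

impl TocMode {
    /// The next, less demanding mode; `None` once `NoToc` has been reached.
    /// (Hypothetical helper, not taken from the PR.)
    pub fn degrade(self) -> Option<TocMode> {
        match self {
            TocMode::TocWithPageNumbers => Some(TocMode::TocWithoutPageNumbers),
            TocMode::TocWithoutPageNumbers => Some(TocMode::NoToc),
            TocMode::NoToc => None,
        }
    }
}

/// Hypothetical result of one extraction attempt.
pub struct Attempt {
    pub entries: Vec<String>,
    pub accuracy: f32,
}

/// Run `run` for `start`, degrading to the next mode while the reported
/// accuracy stays below `threshold`. Assumption: the final mode's result is
/// returned even if it is below the threshold, since nothing is left to try.
pub fn extract_with_fallback<F>(
    start: TocMode,
    threshold: f32,
    mut run: F,
) -> (TocMode, Vec<String>)
where
    F: FnMut(TocMode) -> Attempt,
{
    let mut mode = start;
    loop {
        let attempt = run(mode);
        if attempt.accuracy >= threshold {
            return (mode, attempt.entries);
        }
        match mode.degrade() {
            Some(next) => mode = next,
            None => return (mode, attempt.entries),
        }
    }
}
```

Passing the extraction step as a closure keeps the fallback logic independent of how each mode talks to the LLM, which also makes the degradation order easy to test in isolation.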

- Replace sequential LLM calls with concurrent processing using futures::join_all for better performance
- Add concurrent page assignment verification in PageAssigner
- Implement concurrent TOC entry verification in IndexVerifier
- Add concurrent index repair functionality in IndexRepairer
- Refactor methods to static versions for concurrent use
- Improve performance of oversized entry refinement in TocProcessor
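The fan-out pattern behind these changes can be illustrated with a small sketch. The commit uses futures::join_all on async LLM calls; to stay free of external crates, this version swaps in scoped threads from the standard library, which give the same "start everything, then collect results in input order" shape. The `verify_entry` body is a toy substring check standing in for an LLM call, and both method names are hypothetical.

```rust
use std::thread;

/// Stand-in for the verifier named in the commit; the methods are illustrative.
pub struct IndexVerifier;

impl IndexVerifier {
    /// Associated ("static") function: it takes no `&self`, so many workers
    /// can call it at once without sharing a verifier instance.
    /// Toy check: does the page text mention the title, ignoring case?
    pub fn verify_entry(title: &str, page_text: &str) -> bool {
        page_text.to_lowercase().contains(&title.to_lowercase())
    }

    /// Verify every (title, page_text) pair concurrently.
    /// Results come back in the same order as `entries`.
    pub fn verify_all(entries: &[(String, String)]) -> Vec<bool> {
        thread::scope(|s| {
            // Spawn all workers first; collecting the handles forces the
            // lazy iterator so every thread is running before any join.
            let handles: Vec<_> = entries
                .iter()
                .map(|(title, text)| s.spawn(move || Self::verify_entry(title, text)))
                .collect();

            handles
                .into_iter()
                .map(|h| h.join().expect("verification thread panicked"))
                .collect::<Vec<bool>>()
        })
    }
}
```

With real LLM requests the work is I/O-bound, so an async join such as futures::join_all avoids dedicating a thread to each request; the ordering guarantee is the same in both approaches.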

- Create index_pdf.rs example demonstrating PDF indexing capabilities
- Implement automatic PDF format detection and hierarchical document parsing
- Add support for environment variable configuration for LLM settings
- Include detailed usage instructions with command-line examples
- Integrate error handling and process exit codes for invalid inputs
- Provide comprehensive metrics output including timing and processing stats
- Add automatic workspace cleanup after indexing operations
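As a rough illustration of the format detection and input validation mentioned in this list, the sketch below checks for the `%PDF-` signature that PDF files begin with and falls back to the file extension. How index_pdf.rs actually detects the format is not shown in the commit message; `DocFormat`, `detect_format`, and `check_input` are hypothetical names.

```rust
use std::path::Path;

/// Hypothetical format tag for the example's input file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Pdf,
    Unknown,
}

/// Guess the format from the first bytes of the file, then from the extension.
/// `head` is whatever prefix of the file the caller has already read.
pub fn detect_format(path: &Path, head: &[u8]) -> DocFormat {
    // PDF files start with the "%PDF-" signature followed by a version.
    if head.starts_with(b"%PDF-") {
        return DocFormat::Pdf;
    }
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") => DocFormat::Pdf,
        _ => DocFormat::Unknown,
    }
}

/// Validate the input before indexing. An example binary could map `Err`
/// to a non-zero process exit code (assumption about the example's design).
pub fn check_input(path: &Path, head: &[u8]) -> Result<DocFormat, String> {
    match detect_format(path, head) {
        DocFormat::Unknown => Err(format!("unsupported input: {}", path.display())),
        format => Ok(format),
    }
}
```

Checking the signature before the extension means a renamed PDF is still recognized, while a missing or wrong extension alone does not cause a rejection.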

Add support for configuring LLM settings through environment variables (LLM_API_KEY, LLM_MODEL, LLM_ENDPOINT) that override config file values. Update all examples to demonstrate both environment variable usage and default config file approaches with updated documentation. The changes affect all example files to provide consistent configuration methods and improve usability by allowing runtime configuration without modifying source code or configuration files. Fixes related to workspace cleanup and metric display formatting are also included as part of the refactoring.
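A minimal sketch of the override order described here, assuming a simple config struct. The three variable names come from the commit message; `LlmConfig`, its fields, `with_overrides`, and the choice to ignore empty values are assumptions for illustration.

```rust
use std::env;

/// Hypothetical shape of the LLM section of the config file.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmConfig {
    pub api_key: String,
    pub model: String,
    pub endpoint: String,
}

impl LlmConfig {
    /// Apply overrides from any lookup function. Values that are missing or
    /// empty leave the config-file value in place (an assumption of this sketch).
    pub fn with_overrides<F>(mut self, lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |name: &str| lookup(name).filter(|v| !v.is_empty());
        if let Some(v) = get("LLM_API_KEY") {
            self.api_key = v;
        }
        if let Some(v) = get("LLM_MODEL") {
            self.model = v;
        }
        if let Some(v) = get("LLM_ENDPOINT") {
            self.endpoint = v;
        }
        self
    }

    /// Apply overrides from the process environment.
    pub fn with_env_overrides(self) -> Self {
        self.with_overrides(|name| env::var(name).ok())
    }
}
```

Routing the environment access through a lookup closure keeps the precedence rule testable without mutating the real process environment.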

- Add tracing_subscriber::fmt::init() to all examples for debug output
- Modify parse functions to accept optional LLM client for enhanced PDF processing
- Update PDF parser to use external LLM client for TOC extraction and structure analysis
- Add with_llm_client constructors to TOC processing components
- Improve error handling in event example by removing redundant error mapping
- Update examples to use cleaner output formatting and better documentation
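The optional-client pattern in this list might look something like the sketch below. Only the `with_llm_client` constructor name and the TocProcessor component name appear in the PR; the `LlmClient` trait, the `FixedClient` test double, and `extract_titles` are hypothetical, and the non-LLM path is a deliberately naive heuristic.

```rust
/// Hypothetical minimal interface for an LLM client.
pub trait LlmClient {
    fn complete(&self, prompt: &str) -> String;
}

/// Toy client that always returns the same reply; useful for demos and tests.
pub struct FixedClient(pub String);

impl LlmClient for FixedClient {
    fn complete(&self, _prompt: &str) -> String {
        self.0.clone()
    }
}

/// TOC component that works with or without an externally supplied client.
pub struct TocProcessor<'a> {
    llm: Option<&'a dyn LlmClient>,
}

impl<'a> TocProcessor<'a> {
    /// No client: only the basic heuristic path is available.
    pub fn new() -> Self {
        Self { llm: None }
    }

    /// Borrow a client owned by the caller, so one client can be shared
    /// across several processing components.
    pub fn with_llm_client(client: &'a dyn LlmClient) -> Self {
        Self { llm: Some(client) }
    }

    /// With a client, treat each non-empty line of its reply as a title.
    /// Without one, fall back to the first line of each page.
    pub fn extract_titles(&self, pages: &[&str]) -> Vec<String> {
        match self.llm {
            Some(client) => {
                let reply = client.complete(&pages.join("\n"));
                reply
                    .lines()
                    .map(str::trim)
                    .filter(|line| !line.is_empty())
                    .map(String::from)
                    .collect()
            }
            None => pages
                .iter()
                .filter_map(|page| page.lines().next())
                .map(|line| line.trim().to_string())
                .filter(|line| !line.is_empty())
                .collect(),
        }
    }
}
```

Keeping the client optional lets callers without LLM credentials still parse documents, at the cost of a coarser structure.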

vercel · 2026-04-13T07:59:43Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
vectorless	Ready	Preview, Comment	Apr 13, 2026 7:59am

zTgx added 5 commits April 13, 2026 13:08

zTgx merged commit 16b7f62 into main Apr 13, 2026
2 checks passed

zTgx merged 5 commits into main from dev
