BREAKING CHANGE: Remove support for DOCX and HTML document formats.

This change removes the following dependencies and functionality:
- DOCX parsing via zip and roxmltree crates
- HTML parsing via scraper crate
- Related parser modules and configurations
- Format support from index context and client engine

The supported formats are now limited to Markdown and PDF only.

…kpoint support
- Introduce merge_content option in ThinningConfig to control whether child content should be absorbed into parent nodes during thinning operations
- Add SplitConfig for configuring large node splitting with max_tokens_per_node and pattern-based splitting options
- Implement SplitStage to break oversized leaf nodes into smaller children using natural split points (headings, paragraphs)
- Add checkpoint functionality with PipelineCheckpoint, CheckpointManager, and CheckpointContextData for resumable indexing pipelines
- Update BuildStage to support content merging when removing small nodes
- Modify pipeline execution order to include validate and split stages
- Add checkpoint_dir option to PipelineOptions for persistent state management
- Include comprehensive test coverage for new splitting and checkpoint features

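The splitting behavior described above can be illustrated with a minimal sketch. This is not the PR's `SplitStage`. The `SplitConfig` here is a hypothetical stand-in that keeps only the `max_tokens_per_node` field named in the commit message. The whitespace-based `estimate_tokens` helper and the `split_leaf` function are assumptions made for illustration, and paragraphs (blank lines) are the only split point considered.

```rust
/// Hypothetical stand-in for the PR's SplitConfig; only
/// `max_tokens_per_node` is named in the commit message.
#[derive(Debug, Clone)]
pub struct SplitConfig {
    pub max_tokens_per_node: usize,
}

/// Crude token estimate: whitespace-separated words.
/// This is an assumption, not the tokenizer the PR uses.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Split oversized content at blank-line paragraph boundaries,
/// greedily packing paragraphs into chunks under the token limit.
/// A single paragraph larger than the limit is kept whole.
pub fn split_leaf(content: &str, config: &SplitConfig) -> Vec<String> {
    // Content that already fits is returned unchanged as one chunk.
    if estimate_tokens(content) <= config.max_tokens_per_node {
        return vec![content.to_string()];
    }

    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0usize;

    let paragraphs = content
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty());

    for para in paragraphs {
        let para_tokens = estimate_tokens(para);
        // Close the current chunk if adding this paragraph would overflow it.
        if !current.is_empty() && current_tokens + para_tokens > config.max_tokens_per_node {
            chunks.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
        current_tokens += para_tokens;
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a limit of 4 tokens, the input `"a b c\n\nd e f\n\ng h"` yields three chunks, one per paragraph. A real stage would presumably also try heading boundaries before falling back to paragraphs.
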
- Add `is_leaf` field to PendingNode struct to track node type
- Implement shortcut mechanism that uses original content as summary for nodes below token threshold to save API costs
- Add `shortcut_threshold` configuration option with default of 200 tokens
- Modify summary generation to consider leaf vs non-leaf node types
- Update LLM prompts to provide different context for leaf (content-focused) vs branch (navigation-focused) nodes
- Add metrics tracking for shortcut usage
- Include caching support for node-type-aware summaries

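The shortcut rule in this commit can be sketched roughly as follows. Only `PendingNode`, `is_leaf`, `shortcut_threshold`, and the 200-token default come from the commit message. The `SummaryConfig` and `SummaryPlan` types, the `plan_summary` function, the prompt wording, and the word-count token estimate are all hypothetical names and choices made for illustration.

```rust
/// Hypothetical config holder; the commit only names `shortcut_threshold`
/// and its default of 200 tokens.
#[derive(Debug, Clone)]
pub struct SummaryConfig {
    pub shortcut_threshold: usize,
}

impl Default for SummaryConfig {
    fn default() -> Self {
        Self { shortcut_threshold: 200 }
    }
}

/// Reduced stand-in for the PR's PendingNode; the real struct
/// presumably carries more fields.
pub struct PendingNode {
    pub content: String,
    pub is_leaf: bool,
}

/// What to do for a node: reuse its content, or build an LLM prompt.
#[derive(Debug, PartialEq)]
pub enum SummaryPlan {
    Shortcut(String),
    CallLlm { prompt: String },
}

/// Decide how to summarize a node. Tokens are approximated by
/// whitespace-separated words (an assumption, not the PR's tokenizer).
pub fn plan_summary(node: &PendingNode, config: &SummaryConfig) -> SummaryPlan {
    let tokens = node.content.split_whitespace().count();

    // Small nodes: use the original content as the summary, no API call.
    if tokens < config.shortcut_threshold {
        return SummaryPlan::Shortcut(node.content.clone());
    }

    // Larger nodes: pick a prompt focus by node type (illustrative wording).
    let focus = if node.is_leaf {
        "Summarize the content of this section."
    } else {
        "Describe what this section's children cover, to aid navigation."
    };
    SummaryPlan::CallLlm {
        prompt: format!("{}\n\n{}", focus, node.content),
    }
}
```

Returning a plan rather than calling the model directly keeps the decision testable without network access, and it gives a natural place to count shortcut hits for metrics.
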
BREAKING CHANGE: Parser modules have been relocated from `crate::parser` to `crate::index::parse`.

Update imports accordingly:
- `crate::parser::DocumentFormat` -> `crate::index::parse::DocumentFormat`
- `crate::parser::RawNode` -> `crate::index::parse::RawNode`
- `crate::parser::DocumentParser` trait removed from index pipeline

The parsing functionality now lives under the index module to better reflect its purpose in the indexing pipeline. The parser module remains as a re-export shim for backward compatibility.

…example
- Move DocumentFormat and RawNode imports from parser to index::parse
- Remove parser module declaration from lib.rs as it's now consolidated
- Delete parser mod.rs re-export shim that was moved to index::parse
- Remove unused cli_tool example from Cargo.toml
- Keep other examples like advanced, custom_config for testing purposes

Removed the test_split_leaf function from the split.rs test module as it was no longer needed and contained assertions that were not relevant to the current implementation.

- Remove unused Fingerprint import in engine.rs
- Remove unused QueryResultItem from types import in engine.rs
- Remove unused PathBuf in indexer.rs and IndexContext in indexer.rs
- Remove unused RetrieveEvent in retriever.rs and SufficiencyLevel
- Remove unused warn macro in workspace.rs
- Remove unused ConfigError and various config types from config module
- Remove unused LLM-related types from config/types module
- Remove unused AsyncEventHandler and EventHandler traits from events
- Remove unused ChangeDetectorState and NodeChange from incremental module
- Remove unused pipeline types and incremental re-exports from index module
- Remove unused MarkdownConfig from markdown parser
- Remove unused PdfParserConfig and PdfMetadata from PDF parser
- Remove TOC module documentation and unused types/components exports
- Remove unused Path and SummaryCache from pipeline checkpoint
- Remove unused AccessPattern from orchestrator
- Remove unused DocumentTree and TreeNode from enhance stage
- Remove unused imports and modules documentation from lib.rs
- Remove unused imports and re-exports across multiple modules

…ules
- Remove unused imports including HashMap, Duration, Instant, NonZeroUsize, Utc, warn tracing macro, CandidateNode, Codec traits, and various other unused types across multiple files
- Clean up exported items by removing unused public re-exports from memo, metrics, retrieval, storage, and utils modules
- Remove unused test modules and test functions that were not being executed
- Delete commented out architecture diagrams and documentation sections
- Remove unused dependencies like MemoryBackend, PathCache, PartialUpdater, and various unused configuration types

Remove direct exports of FullStrategy, LazyStrategy, and SelectiveStrategy in favor of consolidated exports from the strategy module.

- Format long method chains with proper indentation across multiple lines
- Break down complex expressions into readable multi-line statements
- Maintain consistent code formatting throughout the codebase
- Improve readability by proper alignment of function parameters and calls