Dev#41

Merged
zTgx merged 11 commits into main from dev
Apr 11, 2026

zTgx commented Apr 11, 2026

No description provided.

zTgx added 11 commits April 11, 2026 17:06
BREAKING CHANGE: Remove support for DOCX and HTML document formats.
This change removes the following dependencies and functionality:
- DOCX parsing via zip and roxmltree crates
- HTML parsing via scraper crate
- Related parser modules and configurations
- Format support from index context and client engine

Supported formats are now limited to Markdown and PDF.
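
To illustrate what the narrowed format support means for callers, here is a minimal sketch of extension-based format detection with only two variants. The enum body and the `from_path` helper are assumptions for illustration, not the crate's actual API; only the name `DocumentFormat` and the Markdown/PDF restriction come from the commit message.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's `DocumentFormat`; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocumentFormat {
    Markdown,
    Pdf,
}

impl DocumentFormat {
    /// Map a file extension to a supported format.
    /// Assumed helper, not taken from the PR.
    pub fn from_path(path: &Path) -> Option<Self> {
        let ext = path.extension()?.to_str()?.to_ascii_lowercase();
        match ext.as_str() {
            "md" | "markdown" => Some(Self::Markdown),
            "pdf" => Some(Self::Pdf),
            // DOCX and HTML fall through here once their parsers are gone.
            _ => None,
        }
    }
}

fn main() {
    for name in ["notes.md", "paper.PDF", "report.docx", "page.html"] {
        println!("{name}: {:?}", DocumentFormat::from_path(Path::new(name)));
    }
}
```

Returning `Option` rather than panicking lets a caller report an unsupported file cleanly instead of failing inside the pipeline.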
…kpoint support

- Introduce merge_content option in ThinningConfig to control whether child
  content should be absorbed into parent nodes during thinning operations
- Add SplitConfig for configuring large node splitting with max_tokens_per_node
  and pattern-based splitting options
- Implement SplitStage to break oversized leaf nodes into smaller children
  using natural split points (headings, paragraphs)
- Add checkpoint functionality with PipelineCheckpoint, CheckpointManager,
  and CheckpointContextData for resumable indexing pipelines
- Update BuildStage to support content merging when removing small nodes
- Modify pipeline execution order to include validate and split stages
- Add checkpoint_dir option to PipelineOptions for persistent state management
- Include comprehensive test coverage for new splitting and checkpoint features
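
The splitting behaviour described above can be sketched roughly as follows. Only `SplitConfig` and `max_tokens_per_node` are named in the commit message; the `split_leaf` function, the whitespace-based token estimate, and the greedy packing strategy are assumptions made for illustration and may not match the actual `SplitStage`.

```rust
/// Hypothetical mirror of the PR's `SplitConfig`; only this field is named in the source.
pub struct SplitConfig {
    pub max_tokens_per_node: usize,
}

/// Crude token estimate: whitespace-separated words.
/// This is an assumption; the real pipeline likely uses a proper tokenizer.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Split oversized content at paragraph boundaries, greedily packing
/// paragraphs into chunks that stay within the token budget.
/// A single paragraph larger than the budget is kept whole.
pub fn split_leaf(content: &str, cfg: &SplitConfig) -> Vec<String> {
    if estimate_tokens(content) <= cfg.max_tokens_per_node {
        return vec![content.to_string()];
    }
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0;
    for para in content.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
        let tokens = estimate_tokens(para);
        // Close the current chunk if this paragraph would exceed the budget.
        if !current.is_empty() && current_tokens + tokens > cfg.max_tokens_per_node {
            chunks.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
        current_tokens += tokens;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let cfg = SplitConfig { max_tokens_per_node: 4 };
    let text = "one two three\n\nfour five\n\nsix seven eight nine";
    for (i, child) in split_leaf(text, &cfg).iter().enumerate() {
        println!("child {i}: {child:?}");
    }
}
```

Splitting only at paragraph breaks keeps each child node readable on its own, at the cost of occasionally leaving a chunk over budget when one paragraph is very long.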
- Add `is_leaf` field to PendingNode struct to track node type
- Implement shortcut mechanism that uses original content as summary
  for nodes below token threshold to save API costs
- Add `shortcut_threshold` configuration option with default of 200 tokens
- Modify summary generation to consider leaf vs non-leaf node types
- Update LLM prompts to provide different context for leaf (content-focused)
  vs branch (navigation-focused) nodes
- Add metrics tracking for shortcut usage
- Include caching support for node-type-aware summaries
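
A minimal sketch of the shortcut decision described above, assuming hypothetical types. Only `PendingNode`, `is_leaf`, `shortcut_threshold`, and the default of 200 tokens come from the commit message; `SummaryConfig`, `SummaryPlan`, `plan_summary`, the token estimate, and the prompt wording are illustrative, not the crate's actual code.

```rust
/// Hypothetical shape; the real `PendingNode` has more fields.
pub struct PendingNode {
    pub content: String,
    pub is_leaf: bool,
}

/// Assumed config holder for the `shortcut_threshold` option.
pub struct SummaryConfig {
    pub shortcut_threshold: usize,
}

impl Default for SummaryConfig {
    fn default() -> Self {
        // Default of 200 tokens, as stated in the commit message.
        Self { shortcut_threshold: 200 }
    }
}

/// What the summarizer should do for a node (illustrative type).
#[derive(Debug, PartialEq)]
pub enum SummaryPlan {
    /// Node is small: reuse its content as the summary, no API call.
    UseContent(String),
    /// Node is large: call the LLM with a node-type-aware prompt.
    CallLlm { prompt: String },
}

/// Whitespace word count as a stand-in for a real tokenizer (assumption).
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

pub fn plan_summary(node: &PendingNode, cfg: &SummaryConfig) -> SummaryPlan {
    if estimate_tokens(&node.content) < cfg.shortcut_threshold {
        return SummaryPlan::UseContent(node.content.clone());
    }
    // Placeholder prompt wording: leaves focus on content, branches on navigation.
    let focus = if node.is_leaf {
        "Summarize the content of this section."
    } else {
        "Describe what a reader can find under this section."
    };
    SummaryPlan::CallLlm { prompt: format!("{focus}\n\n{}", node.content) }
}

fn main() {
    let cfg = SummaryConfig::default();
    let short = PendingNode { content: "A short note.".into(), is_leaf: true };
    let long = PendingNode { content: "word ".repeat(300), is_leaf: false };
    println!("{:?}", plan_summary(&short, &cfg));
    if let SummaryPlan::CallLlm { prompt } = plan_summary(&long, &cfg) {
        println!("LLM call, prompt starts: {}", &prompt[..40]);
    }
}
```

Modelling the decision as an enum also makes it straightforward to count shortcut hits for the metrics mentioned above.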
BREAKING CHANGE: Parser modules have been relocated from `crate::parser`
to `crate::index::parse`. Update imports accordingly:

- `crate::parser::DocumentFormat` -> `crate::index::parse::DocumentFormat`
- `crate::parser::RawNode` -> `crate::index::parse::RawNode`
- `crate::parser::DocumentParser` trait removed from index pipeline

The parsing functionality now lives under the index module to better
reflect its purpose in the indexing pipeline. The parser module remains
as a re-export shim for backward compatibility.
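
The re-export shim mentioned in this commit message can be pictured with a small self-contained module tree. The item bodies below are placeholders, and the sketch uses `super::` paths so it compiles standalone; in the real crate the paths would be `crate::index::parse::...` as listed above.

```rust
// Stand-in module tree; the real crate's items are richer than this.
mod index {
    pub mod parse {
        #[derive(Debug, PartialEq)]
        pub enum DocumentFormat {
            Markdown,
            Pdf,
        }

        #[derive(Debug)]
        pub struct RawNode {
            pub title: String,
            pub content: String,
        }
    }
}

// Re-export shim: the old `parser::...` paths keep resolving
// to the relocated items under `index::parse`.
mod parser {
    pub use super::index::parse::{DocumentFormat, RawNode};
}

fn main() {
    // Both paths name the same type, so values compare directly.
    let old_path = parser::DocumentFormat::Markdown;
    let new_path = index::parse::DocumentFormat::Markdown;
    assert_eq!(old_path, new_path);

    let node = parser::RawNode {
        title: "Intro".into(),
        content: "Hello".into(),
    };
    println!("{node:?}");
}
```

Because `pub use` re-exports the same item rather than a copy, downstream code can migrate its imports gradually without type mismatches.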
…example

- Move DocumentFormat and RawNode imports from parser to index::parse
- Remove parser module declaration from lib.rs as it's now consolidated
- Delete parser mod.rs re-export shim that was moved to index::parse
- Remove unused cli_tool example from Cargo.toml
- Keep other examples like advanced, custom_config for testing purposes

Removed the test_split_leaf function from the split.rs test module
as it was no longer needed and contained assertions that were not
relevant to the current implementation.

- Remove unused Fingerprint import in engine.rs
- Remove unused QueryResultItem from types import in engine.rs
- Remove unused PathBuf and IndexContext in indexer.rs
- Remove unused RetrieveEvent in retriever.rs and SufficiencyLevel
- Remove unused warn macro in workspace.rs
- Remove unused ConfigError and various config types from config module
- Remove unused LLM-related types from config/types module
- Remove unused AsyncEventHandler and EventHandler traits from events
- Remove unused ChangeDetectorState and NodeChange from incremental module
- Remove unused pipeline types and incremental re-exports from index module
- Remove unused MarkdownConfig from markdown parser
- Remove unused PdfParserConfig and PdfMetadata from PDF parser
- Remove TOC module documentation and unused types/components exports
- Remove unused Path and SummaryCache from pipeline checkpoint
- Remove unused AccessPattern from orchestrator
- Remove unused DocumentTree and TreeNode from enhance stage
- Remove unused imports and modules documentation from lib.rs
- Remove unused imports and re-exports across multiple modules

…ules

- Remove unused imports including HashMap, Duration, Instant, NonZeroUsize,
  Utc, warn tracing macro, CandidateNode, Codec traits, and various other
  unused types across multiple files
- Clean up exported items by removing unused public re-exports from memo,
  metrics, retrieval, storage, and utils modules
- Remove unused test modules and test functions that were not being executed
- Delete commented out architecture diagrams and documentation sections
- Remove unused dependencies like MemoryBackend, PathCache, PartialUpdater,
  and various unused configuration types

Remove direct exports of FullStrategy, LazyStrategy, and SelectiveStrategy
in favor of consolidated exports from the strategy module.

- Format long method chains with proper indentation across multiple lines
- Break down complex expressions into readable multi-line statements
- Maintain consistent code formatting throughout the codebase
- Improve readability by proper alignment of function parameters and calls

zTgx merged commit 43d57f1 into main Apr 11, 2026