Dev #67 (Merged)

docs/docs/architecture.mdx

---
sidebar_position: 3
---

# Architecture

Vectorless transforms documents into hierarchical semantic trees and uses LLM-powered reasoning to navigate them. This page describes the end-to-end pipeline.

## High-Level Flow

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │────▶│    Index     │────▶│   Storage    │
│   (PDF/MD)   │     │   Pipeline   │     │    (Disk)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                     ┌──────────────┐     ┌──────▼───────┐
                     │    Result    │◀────│  Retrieval   │
                     │   (Answer)   │     │   Pipeline   │
                     └──────────────┘     └──────────────┘
```

## Index Pipeline

The indexing pipeline processes documents through ordered stages:

| Stage | Priority | Description |
|-------|----------|-------------|
| **Parse** | 10 | Parse document into raw nodes (Markdown headings, PDF pages) |
| **Build** | 20 | Construct arena-based tree with thinning and content merge |
| **Validate** | 22 | Tree integrity checks |
| **Split** | 25 | Split oversized leaf nodes (>4000 tokens) |
| **Enhance** | 30 | Generate LLM summaries (Full, Selective, or Lazy strategy) |
| **Enrich** | 40 | Calculate metadata, page ranges, resolve cross-references |
| **Reasoning Index** | 45 | Build keyword-to-node mappings, synonym expansion, summary shortcuts |
| **Optimize** | 60 | Final tree optimization |

Each stage is independently configurable. The pipeline supports incremental re-indexing via content fingerprinting.
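
The priority numbers above determine execution order. A minimal sketch of how such a priority-ordered registry behaves (stage names and priorities are taken from the table; the registry structure itself is illustrative, not the Vectorless API):

```python
# Illustrative sketch: stages run in ascending priority order.
# Names and priorities come from the table above; this registry
# is a stand-in, not the library's internal representation.
STAGES = [
    ("enhance", 30),
    ("parse", 10),
    ("optimize", 60),
    ("build", 20),
    ("validate", 22),
    ("reasoning_index", 45),
    ("split", 25),
    ("enrich", 40),
]

def execution_order(stages):
    """Return stage names sorted by ascending priority."""
    return [name for name, _ in sorted(stages, key=lambda s: s[1])]

print(execution_order(STAGES))
# ['parse', 'build', 'validate', 'split', 'enhance', 'enrich',
#  'reasoning_index', 'optimize']
```

Because ordering is driven by the priority value rather than registration order, a custom stage can be slotted between existing ones by picking an intermediate priority.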

## Tree Structure

Each node in the tree contains:

```text
TreeNode
├── title — Section heading
├── content — Raw text (leaf nodes)
├── summary — LLM-generated summary
├── structure — Hierarchical index (e.g., "1.2.3")
├── depth — Tree depth (root = 0)
├── references[] — Resolved cross-references ("see Section 2.1" → NodeId)
├── token_count — Estimated token count
└── page_range — Start/end page (PDF)
```
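
As a reference point, the node layout above can be mirrored as a plain Python dataclass (field names follow the diagram; the class itself is illustrative, not the library's actual type):

```python
# Illustrative dataclass mirroring the TreeNode layout above.
# Field names match the diagram; this is not the Vectorless type.
from dataclasses import dataclass, field
from typing import Optional, Tuple, List

@dataclass
class TreeNode:
    title: str                                 # section heading
    structure: str                             # hierarchical index, e.g. "1.2.3"
    depth: int                                 # tree depth (root = 0)
    token_count: int                           # estimated token count
    content: Optional[str] = None              # raw text (leaf nodes only)
    summary: Optional[str] = None              # LLM-generated summary
    references: List[str] = field(default_factory=list)  # resolved node ids
    page_range: Optional[Tuple[int, int]] = None         # (start, end) for PDFs

node = TreeNode(title="Revenue", structure="2.1", depth=2, token_count=512)
print(node.structure, node.depth)
```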

## Retrieval Pipeline

The retrieval pipeline consists of four phases:

1. **Analyze** — Detect query complexity, extract keywords, decompose complex queries
2. **Plan** — Select retrieval strategy and search algorithm
3. **Search** — Execute tree traversal with Pilot guidance
4. **Evaluate** — Score, deduplicate, and aggregate results
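
The four phases above can be sketched as a plain function chain. Everything inside each phase here is a toy heuristic standing in for the LLM-driven implementations; only the phase boundaries come from the list above:

```python
# Toy four-phase chain: Analyze -> Plan -> Search -> Evaluate.
# Phase internals are stand-ins, not the real LLM-driven logic.
def analyze(query):
    return {"query": query, "keywords": set(query.lower().split())}

def plan(analysis):
    # Hypothetical heuristic: short queries -> keyword, longer -> hybrid.
    analysis["strategy"] = "keyword" if len(analysis["keywords"]) <= 2 else "hybrid"
    return analysis

def search(analysis, corpus):
    # Keyword-overlap stand-in for the Pilot-guided tree traversal.
    return [d for d in corpus if analysis["keywords"] & d["keywords"]]

def evaluate(hits):
    # Deduplicate by id, preserving order.
    seen, out = set(), []
    for h in hits:
        if h["id"] not in seen:
            seen.add(h["id"])
            out.append(h)
    return out

corpus = [{"id": 1, "keywords": {"revenue"}}, {"id": 2, "keywords": {"risk"}}]
hits = evaluate(search(plan(analyze("quarterly revenue trends")), corpus))
print([h["id"] for h in hits])  # → [1]
```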

### Pilot

The Pilot is the core intelligence component. It provides LLM-guided navigation at key decision points:

- **Fork points** — When multiple children exist, Pilot evaluates which path to follow
- **Backtracking** — When a path yields insufficient results, Pilot suggests alternatives
- **Binary pruning** — Quick relevance filter for nodes with many children

### Search Algorithms

| Algorithm | Description | Use Case |
|-----------|-------------|----------|
| **Beam Search** | Explores multiple paths with backtracking | General purpose (recommended) |
| **MCTS** | Monte Carlo Tree Search with UCT selection | Complex multi-hop queries |
| **Pure Pilot** | Greedy single-path, Pilot at every level | High-accuracy, higher token cost |
| **ToC Navigator** | Table-of-contents based navigation | Broad queries ("what is this about?") |
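
To make the recommended default concrete, here is a minimal beam-search sketch over a section tree. The scoring function is a keyword-overlap stub standing in for the LLM-guided Pilot, and the tree is a hypothetical example; the real algorithms are selected through the engine configuration:

```python
# Minimal beam-search sketch over a section tree. The score()
# stub replaces the LLM-guided Pilot; the tree is hypothetical.
def beam_search(tree, query_terms, beam_width=2, max_depth=3):
    """Keep the beam_width best nodes per level, descend, collect leaves."""
    def score(node):
        words = set(node["title"].lower().split())
        return len(words & query_terms)

    frontier, results = [tree], []
    for _ in range(max_depth):
        children = [c for n in frontier for c in n.get("children", [])]
        if not children:
            break
        children.sort(key=score, reverse=True)
        frontier = children[:beam_width]
        results.extend(n for n in frontier if not n.get("children"))
    return [n["title"] for n in results]

doc = {"title": "report", "children": [
    {"title": "Revenue Overview", "children": [
        {"title": "Quarterly Revenue"}, {"title": "Annual Revenue"}]},
    {"title": "Risk Factors", "children": [{"title": "Market Risk"}]},
]}
print(beam_search(doc, {"revenue"}))  # → ['Quarterly Revenue', 'Annual Revenue']
```

Keeping several candidates per level is what lets beam search recover from a locally weak branch, which a purely greedy descent cannot.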

## Cross-Document Graph

When multiple documents are indexed, Vectorless automatically builds a relationship graph based on shared keywords and Jaccard similarity. This graph enables cross-document retrieval with score boosting.
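
Jaccard similarity over two documents' keyword sets is the standard intersection-over-union measure; a minimal sketch (the keyword sets are made-up examples):

```python
# Jaccard similarity between two keyword sets: |A ∩ B| / |A ∪ B|.
# Example keyword sets are illustrative, not real index output.
def jaccard(a, b):
    """Return intersection-over-union of two sets (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

q1 = {"revenue", "growth", "forecast", "risk"}
q2 = {"revenue", "risk", "headcount", "forecast"}
print(round(jaccard(q1, q2), 2))  # → 0.6
```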

## Zero Infrastructure

The entire system requires only an LLM API key. No vector database, no embedding models, no additional infrastructure. Trees and metadata are persisted to the local filesystem in the workspace directory.

docs/docs/examples/batch-indexing.mdx

---
sidebar_position: 3
---

# Batch Indexing

Index multiple documents in one call, with per-document metrics and error handling.

## Python

```python
import asyncio
from vectorless import Engine, IndexContext

async def main():
    engine = Engine(
        workspace="./workspace",
        api_key="sk-...",
        model="gpt-4o",
    )

    # Index a directory of documents
    result = await engine.index(
        IndexContext.from_dir("./documents/")
    )

    print(f"Indexed {len(result.items)} documents")
    print(f"Failures: {len(result.failed)}")

    for item in result.items:
        print(f"  ✓ {item.name} ({item.format}) → {item.doc_id}")
        if item.metrics:
            m = item.metrics
            print(f"    Nodes: {m.nodes_processed}, "
                  f"Summaries: {m.summaries_generated}, "
                  f"Time: {m.total_time_ms}ms")

    for fail in result.failed:
        print(f"  ✗ {fail.source}: {fail.error}")

    # List all indexed documents
    docs = await engine.list()
    print(f"\nTotal indexed: {len(docs)} documents")

asyncio.run(main())
```

## Rust

```rust
use vectorless::client::{Engine, EngineBuilder, IndexContext};

#[tokio::main]
async fn main() -> vectorless::Result<()> {
    let engine = EngineBuilder::new()
        .with_workspace("./workspace")
        .with_key("sk-...")
        .with_model("gpt-4o")
        .build()
        .await?;

    // Index a directory
    let result = engine.index(IndexContext::from_dir("./documents/")).await?;

    println!("Indexed {} documents", result.items.len());
    println!("Failures: {}", result.failed.len());

    for item in &result.items {
        println!("  ✓ {} ({:?}) → {}", item.name, item.format, item.doc_id);
    }

    // List all documents
    let docs = engine.list().await?;
    println!("Total indexed: {} documents", docs.len());

    Ok(())
}
```

## Error Handling

Each input document ends up either in `result.items` (indexed successfully) or in `result.failed`. A failure doesn't prevent the remaining documents from being indexed:

```python
result = await engine.index(IndexContext.from_paths(mixed_paths))

# Successful items
for item in result.items:
    process(item)

# Failed items — handle gracefully
for fail in result.failed:
    print(f"Failed: {fail.source} — {fail.error}")
```

docs/docs/examples/multi-document.mdx

---
sidebar_position: 2
---

# Multi-Document Retrieval

Query across multiple indexed documents using the cross-document strategy with graph-based score boosting.

## Python

```python
import asyncio
from vectorless import (
    Engine, IndexContext, QueryContext,
    StrategyPreference,
)

async def main():
    engine = Engine(
        workspace="./workspace",
        api_key="sk-...",
        model="gpt-4o",
    )

    # Index multiple documents
    docs = ["./report-q1.pdf", "./report-q2.pdf", "./report-q3.pdf"]
    doc_ids = []

    for path in docs:
        result = await engine.index(IndexContext.from_path(path))
        doc_ids.append(result.doc_id)
        print(f"Indexed: {path} → {result.doc_id}")

    # Check the cross-document graph
    graph = await engine.get_graph()
    if graph:
        print(f"\nGraph: {graph.node_count()} docs, {graph.edge_count()} edges")
        for doc_id in doc_ids:
            neighbors = graph.get_neighbors(doc_id)
            for edge in neighbors:
                print(f"  {doc_id[:8]}... → {edge.target_doc_id[:8]}... ({edge.weight:.2f})")

    # Query across all documents
    result = await engine.query(
        QueryContext("Compare quarterly revenue trends")
        .with_doc_ids(doc_ids)
        .with_strategy(StrategyPreference.CROSS_DOCUMENT)
    )

    for item in result.items:
        print(f"\n[{item.doc_id[:8]}...] Score: {item.score:.2f}")
        print(item.content[:300])

    # Or query the entire workspace
    result = await engine.query(
        QueryContext("What documents discuss risk factors?")
        .with_workspace()
    )

    print(f"\nFound in {len(result.items)} document(s)")

    # Cleanup
    for doc_id in doc_ids:
        await engine.remove(doc_id)

asyncio.run(main())
```

## Key Concepts

### Document Graph

After indexing, documents are connected in a graph based on shared keywords. The graph enables:

- **Score boosting** — High-confidence results in one document boost neighbor documents
- **Relationship discovery** — Automatically find related documents
- **Cross-referencing** — Results from connected documents are surfaced together
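
The score-boosting idea can be illustrated with a tiny helper: a result's score rises when a strongly connected neighbor document also scored well. The boost formula below is a made-up example for intuition, not the library's actual scoring:

```python
# Illustrative graph boost: raise a result's score in proportion to
# the edge weight and the neighbor document's best score. The
# formula and factor are hypothetical, not Vectorless internals.
def boost(score, neighbor_best, edge_weight, factor=0.2):
    """Clamp the boosted score to 1.0."""
    return min(1.0, score + factor * edge_weight * neighbor_best)

print(round(boost(0.70, neighbor_best=0.90, edge_weight=0.6), 3))  # → 0.808
```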

### Merge Strategies

The cross-document strategy supports multiple merge modes:

| Strategy | Description |
|----------|-------------|
| **TopK** | Return top-K results across all documents |
| **BestPerDocument** | Best result from each document |
| **WeightedByRelevance** | Weight by each document's best score |
| **GraphBoosted** | Use graph connections to boost scores |
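
The first two modes from the table can be sketched over plain `(doc_id, score, content)` tuples (the merge logic here is illustrative; the real merge runs inside the cross-document strategy):

```python
# Illustrative TopK and BestPerDocument merges over plain tuples.
# The real merge modes run inside the cross-document strategy.
def merge_topk(results, k=3):
    """Top-K results across all documents, by descending score."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]

def merge_best_per_document(results):
    """Best-scoring result from each document."""
    best = {}
    for doc_id, score, content in results:
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = (doc_id, score, content)
    return list(best.values())

hits = [("q1", 0.9, "revenue up"), ("q1", 0.4, "footnote"),
        ("q2", 0.7, "revenue flat"), ("q3", 0.5, "revenue down")]
print([h[0] for h in merge_topk(hits, k=2)])  # → ['q1', 'q2']
print(len(merge_best_per_document(hits)))     # → 3
```
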

docs/docs/examples/quick-query.mdx

---
sidebar_position: 1
---

# Quick Query Example

This example demonstrates the basic index-and-query workflow with both Python and Rust.

## Python

```python
import asyncio
from vectorless import Engine, IndexContext, QueryContext, StrategyPreference

async def main():
    # 1. Create engine
    engine = Engine(
        workspace="./data",
        api_key="sk-...",
        model="gpt-4o",
    )

    # 2. Index a document
    result = await engine.index(IndexContext.from_path("./report.pdf"))
    doc_id = result.doc_id
    print(f"Indexed document: {doc_id}")

    # 3. Simple keyword query
    answer = await engine.query(
        QueryContext("revenue")
        .with_doc_id(doc_id)
        .with_strategy(StrategyPreference.KEYWORD)
    )
    print(f"Keyword result: {answer.single().content[:200]}")

    # 4. Complex reasoning query
    answer = await engine.query(
        QueryContext("What are the main factors affecting performance?")
        .with_doc_id(doc_id)
        .with_strategy(StrategyPreference.HYBRID)
    )
    print(f"Score: {answer.single().score:.2f}")
    print(f"Hybrid result: {answer.single().content[:200]}")

    # 5. Cleanup
    await engine.remove(doc_id)

asyncio.run(main())
```

## Rust

```rust
use vectorless::client::{Engine, EngineBuilder, IndexContext, QueryContext};
use vectorless::StrategyPreference;

#[tokio::main]
async fn main() -> vectorless::Result<()> {
    // 1. Create engine
    let engine = EngineBuilder::new()
        .with_workspace("./data")
        .with_key("sk-...")
        .with_model("gpt-4o")
        .build()
        .await?;

    // 2. Index a document
    let result = engine.index(IndexContext::from_path("./report.pdf")).await?;
    let doc_id = result.doc_id().unwrap().to_string();
    println!("Indexed document: {}", doc_id);

    // 3. Query with the hybrid strategy
    let answer = engine.query(
        QueryContext::new("What are the main factors affecting performance?")
            .with_doc_id(&doc_id)
            .with_strategy(StrategyPreference::Hybrid)
    ).await?;

    if let Some(item) = answer.single() {
        println!("Score: {:.2}", item.score);
        println!("{}", item.content);
    }

    // 4. Cleanup
    engine.remove(&doc_id).await?;

    Ok(())
}
```