Document Processing

Holger Imbery edited this page Feb 23, 2026 · 2 revisions

The document processing pipeline allows you to upload knowledge base documents and use them as the source for AI-generated test cases.


Supported Formats

| Format | Extraction Method |
|--------|-------------------|
| `.pdf` | Text layer extracted via UglyToad.PdfPig (not OCR; scanned PDFs without a text layer are not supported) |
| `.txt` | Plain text, read as-is |
| `.md`  | Markdown, read as plain text |

File size limit: 10 MB per file


Processing Pipeline

Upload file
    │
    ▼
Extract text (PDF → PdfPig, TXT/MD → direct read)
    │
    ▼
Normalize (strip extra whitespace, fix paragraph breaks)
    │
    ▼
Chunk (sliding window with configurable size and overlap)
    │
    ▼
Estimate token count per chunk
    │
    ▼
Compute content hash (SHA-256 for deduplication)
    │
    ▼
Index chunks in Lucene.NET (BM25)
    │
    ▼
Store Document + Chunk records in SQLite
    │
    ▼
Ready for question generation
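The stages above can be sketched end to end. This is a minimal Python illustration of the technique, not the application's actual (C#/.NET) code; function names are invented, and chunk sizes are measured in characters here for simplicity:

```python
import hashlib

def process_document(file_name: str, raw_bytes: bytes) -> dict:
    """Illustrative sketch of the pipeline stages (names are hypothetical)."""
    # 1. Extract text (assume plain text here; PDFs need a text-layer extractor)
    text = raw_bytes.decode("utf-8", errors="replace")
    # 2. Normalize whitespace
    text = " ".join(text.split())
    # 3. Chunk with a sliding window; the overlap preserves border context
    size, overlap = 200, 50
    step = size - overlap
    chunks = [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
    # 4. Estimate tokens per chunk (rough heuristic: ~4 characters per token)
    token_counts = [max(1, len(c) // 4) for c in chunks]
    # 5. Content hash over the raw bytes, for deduplication
    content_hash = hashlib.sha256(raw_bytes).hexdigest()
    return {"name": file_name, "chunks": chunks,
            "tokens": token_counts, "hash": content_hash}

doc = process_document("notes.txt", b"hello world " * 50)
```

Indexing and storage (Lucene.NET, SQLite) would then consume the `chunks` and `hash` fields; they are omitted from the sketch.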

Chunking Strategy

The application uses a sliding-window approach:

  • Chunks are split at paragraph or sentence boundaries where possible
  • An overlap between consecutive chunks preserves context at chunk borders
  • Each chunk stores an estimated token count for LLM budget planning

This ensures that information spanning a paragraph boundary is not lost when chunks are processed independently.
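A boundary-preferring sliding window can be sketched as follows. This is an illustration only; the application's actual boundary rules and size units (characters vs. tokens) may differ:

```python
def chunk_text(text: str, max_len: int = 100, overlap: int = 20) -> list[str]:
    """Sliding-window chunker that prefers to cut at sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Prefer the last sentence boundary inside the window
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 1  # keep the period with the chunk
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back for overlap, always advance
    return chunks

text = "Alpha beta gamma. Delta epsilon zeta. Eta theta iota. Kappa lambda mu."
parts = chunk_text(text, max_len=40, overlap=10)
```

Note how `start` rewinds by `overlap` characters after each chunk, so text near a cut point appears in both neighboring chunks.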


Uploading Documents

Via the Setup Wizard

Step 2 of the Setup Wizard includes an integrated document upload step.

Via the Web UI (Documents Page)

  1. Navigate to Documents in the sidebar
  2. Click Upload Document
  3. Select a PDF, TXT, or MD file
  4. The file is processed automatically in the background
  5. The documents list shows file name, type, size, chunk count, and upload date

Via the REST API

POST /api/documents
Content-Type: multipart/form-data

file=<binary>
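For example, with curl (the host and port are placeholders; substitute your deployment's base URL):

```shell
# Upload a local PDF as a multipart/form-data request.
# http://localhost:5000 is an assumed base URL, not documented by the app.
curl -X POST http://localhost:5000/api/documents \
  -F "file=@handbook.pdf"
```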

Via HTTP/HTTPS URL Import (Web UI)

  1. Navigate to Documents in the sidebar
  2. Paste a public URL into the Import from URL field (below the file upload button)
  3. Click Import from URL
  4. The content is downloaded, chunked, and stored as a TXT document

Restrictions:

  • URL must use http:// or https:// scheme
  • Loopback addresses (localhost, 127.0.0.1, ::1) are blocked (SSRF protection)
  • 30-second download timeout
  • Content is treated as plain text regardless of MIME type
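The scheme and loopback restrictions amount to a validation step like the following. This is a Python sketch of the checks listed above, not the application's actual implementation, which may apply additional SSRF protections (e.g. checks after DNS resolution):

```python
from urllib.parse import urlparse
import ipaddress

def is_allowed_url(url: str) -> bool:
    """Reject non-HTTP(S) schemes and loopback targets (illustrative)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    if host.lower() == "localhost":
        return False
    try:
        # Blocks IP literals such as 127.0.0.1 and ::1
        if ipaddress.ip_address(host).is_loopback:
            return False
    except ValueError:
        pass  # not an IP literal; a hostname other than localhost
    return True
```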

Via the CLI (generate)

The CLI generate command reads a local document file directly — no upload step required. See CLI Reference.


Managing Documents

| Action | Steps |
|--------|-------|
| View all documents | Documents page → table listing |
| Delete a document | Documents page → Delete button (Admin only) |
| View chunk count | Shown in the documents table |

Deleting a document removes the file, database records, and associated Lucene index entries. Test cases generated from the document are not automatically deleted.


Deduplication

When a file is uploaded, SHA-256 is computed over the file content. If an identical hash already exists in the database, the upload is rejected with a "duplicate document" error. This prevents bloating the index and database with repeated uploads of the same content.
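The check itself is straightforward; a minimal sketch (variable names are illustrative):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 over the raw file bytes, hex-encoded (64 characters)."""
    return hashlib.sha256(data).hexdigest()

# Identical bytes always map to the same hash, so a re-upload is detected
# even if the file name differs; a changed file produces a new hash.
h1 = content_hash(b"knowledge base article")
h2 = content_hash(b"knowledge base article")
h3 = content_hash(b"knowledge base article v2")
```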


BM25 Full-Text Search

Lucene.NET provides BM25 (Okapi BM25) ranked retrieval over all chunks. This is used internally to:

  • Retrieve the most relevant chunks for a given question during generation
  • Surface relevant context snippets in the UI (future feature)
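To make the ranking behavior concrete, here is a minimal Okapi BM25 scorer over tokenized chunks. It is an illustration of the scoring scheme only; Lucene.NET's `BM25Similarity` differs in details such as its IDF formula and length normalization:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()  # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this doc
        s = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["chunking", "strategy", "overlap"],
        ["token", "estimate", "budget"],
        ["chunk", "overlap", "context", "overlap"]]
scores = bm25_scores(["overlap"], docs)
```

Chunks that repeat a query term score higher (with diminishing returns controlled by `k1`), while longer chunks are penalized via the `b`-weighted length normalization.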

Limitations

| Limitation | Detail |
|------------|--------|
| No OCR | Scanned/image-only PDFs are not supported |
| Text layer required | PDFs must have an embedded text layer |
| No DOCX/XLSX | Only PDF, TXT, and MD are supported in v1 |
| No vector search | Semantic similarity search is a planned enhancement |
| Single-language | Best results with English text |
