Document Processing

Holger Imbery edited this page Feb 23, 2026 · 2 revisions

The document processing pipeline allows you to upload knowledge base documents and use them as the source for AI-generated test cases.


Supported Formats

| Format | Extraction Method |
|--------|-------------------|
| `.pdf` | Text layer extracted via UglyToad.PdfPig (not OCR; scanned PDFs without a text layer are not supported) |
| `.txt` | Plain text, read as-is |
| `.md`  | Markdown, read as plain text |

File size limit: 10 MB per file


Processing Pipeline

Upload file
    │
    ▼
Extract text (PDF → PdfPig, TXT/MD → direct read)
    │
    ▼
Normalize (strip extra whitespace, fix paragraph breaks)
    │
    ▼
Chunk (sliding window with configurable size and overlap)
    │
    ▼
Estimate token count per chunk
    │
    ▼
Compute content hash (SHA-256 for deduplication)
    │
    ▼
Index chunks in Lucene.NET (BM25)
    │
    ▼
Store Document + Chunk records in SQLite
    │
    ▼
Ready for question generation
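The stages above can be sketched end to end. This is a minimal Python illustration of the technique, not the application's actual (C#/.NET) code; function names are invented, and chunk sizes are measured in characters here for simplicity:

```python
import hashlib

def process_document(file_name: str, raw_bytes: bytes) -> dict:
    """Illustrative sketch of the pipeline stages (names are hypothetical)."""
    # 1. Extract text (assume plain text here; PDFs need a text-layer extractor)
    text = raw_bytes.decode("utf-8", errors="replace")
    # 2. Normalize whitespace
    text = " ".join(text.split())
    # 3. Chunk with a sliding window; the overlap preserves border context
    size, overlap = 200, 50
    step = size - overlap
    chunks = [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
    # 4. Estimate tokens per chunk (rough heuristic: ~4 characters per token)
    token_counts = [max(1, len(c) // 4) for c in chunks]
    # 5. Content hash over the raw bytes, for deduplication
    content_hash = hashlib.sha256(raw_bytes).hexdigest()
    return {"name": file_name, "chunks": chunks,
            "tokens": token_counts, "hash": content_hash}

doc = process_document("notes.txt", b"hello world " * 50)
```

Indexing and storage (Lucene.NET, SQLite) would then consume the `chunks` and `hash` fields; they are omitted from the sketch.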

Chunking Strategy

The application uses a sliding-window approach:

  • Chunks are split at paragraph or sentence boundaries where possible
  • An overlap between consecutive chunks preserves context at chunk borders
  • Each chunk stores an estimated token count for LLM budget planning

This ensures that information spanning a paragraph boundary is not lost when chunks are processed independently.
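A boundary-preferring sliding window can be sketched as follows. This is an illustration only; the application's actual boundary rules and size units (characters vs. tokens) may differ:

```python
def chunk_text(text: str, max_len: int = 100, overlap: int = 20) -> list[str]:
    """Sliding-window chunker that prefers to cut at sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Prefer the last sentence boundary inside the window
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 1  # keep the period with the chunk
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back for overlap, always advance
    return chunks

text = "Alpha beta gamma. Delta epsilon zeta. Eta theta iota. Kappa lambda mu."
parts = chunk_text(text, max_len=40, overlap=10)
```

Note how `start` rewinds by `overlap` characters after each chunk, so text near a cut point appears in both neighboring chunks.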


Uploading Documents

Via the Setup Wizard

Step 2 of the Setup Wizard includes an integrated document upload step.

Via the Web UI (Documents Page)

  1. Navigate to Documents in the sidebar
  2. Click Upload Document
  3. Select a PDF, TXT, or MD file
  4. The file is processed automatically in the background
  5. The documents list shows file name, type, size, chunk count, and upload date

Via the REST API

POST /api/documents
Content-Type: multipart/form-data

file=<binary>
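For example, with curl (the host and port are placeholders; substitute your deployment's base URL):

```shell
# Upload a local PDF as a multipart/form-data request.
# http://localhost:5000 is an assumed base URL, not documented by the app.
curl -X POST http://localhost:5000/api/documents \
  -F "file=@handbook.pdf"
```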

Via HTTP/HTTPS URL Import (Web UI)

  1. Navigate to Documents in the sidebar
  2. Paste a public URL into the Import from URL field (below the file upload button)
  3. Click Import from URL
  4. The content is downloaded, chunked, and stored as a TXT document

Restrictions:

  • URL must use http:// or https:// scheme
  • Loopback addresses (localhost, 127.0.0.1, ::1) are blocked (SSRF protection)
  • 30-second download timeout
  • Content is treated as plain text regardless of MIME type
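The scheme and loopback restrictions amount to a validation step like the following. This is a Python sketch of the checks listed above, not the application's actual implementation, which may apply additional SSRF protections (e.g. checks after DNS resolution):

```python
from urllib.parse import urlparse
import ipaddress

def is_allowed_url(url: str) -> bool:
    """Reject non-HTTP(S) schemes and loopback targets (illustrative)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    if host.lower() == "localhost":
        return False
    try:
        # Blocks IP literals such as 127.0.0.1 and ::1
        if ipaddress.ip_address(host).is_loopback:
            return False
    except ValueError:
        pass  # not an IP literal; a hostname other than localhost
    return True
```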

Via the CLI (generate)

The CLI generate command reads a local document file directly — no upload step required. See CLI Reference.


Managing Documents

| Action | Steps |
|--------|-------|
| View all documents | Documents page → table listing |
| Delete a document | Documents page → Delete button (Admin only) |
| View chunk count | Shown in the documents table |

Deleting a document removes the file, database records, and associated Lucene index entries. Test cases generated from the document are not automatically deleted.


Deduplication

When a file is uploaded, SHA-256 is computed over the file content. If an identical hash already exists in the database, the upload is rejected with a "duplicate document" error. This prevents bloating the index and database with repeated uploads of the same content.
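The check itself is straightforward; a minimal sketch (variable names are illustrative):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 over the raw file bytes, hex-encoded (64 characters)."""
    return hashlib.sha256(data).hexdigest()

# Identical bytes always map to the same hash, so a re-upload is detected
# even if the file name differs; a changed file produces a new hash.
h1 = content_hash(b"knowledge base article")
h2 = content_hash(b"knowledge base article")
h3 = content_hash(b"knowledge base article v2")
```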


BM25 Full-Text Search

Lucene.NET provides BM25 (Okapi BM25) ranked retrieval over all chunks. This is used internally to:

  • Retrieve the most relevant chunks for a given question during generation
  • Surface relevant context snippets in the UI (future feature)
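To make the ranking behavior concrete, here is a minimal Okapi BM25 scorer over tokenized chunks. It is an illustration of the scoring scheme only; Lucene.NET's `BM25Similarity` differs in details such as its IDF formula and length normalization:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()  # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this doc
        s = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["chunking", "strategy", "overlap"],
        ["token", "estimate", "budget"],
        ["chunk", "overlap", "context", "overlap"]]
scores = bm25_scores(["overlap"], docs)
```

Chunks that repeat a query term score higher (with diminishing returns controlled by `k1`), while longer chunks are penalized via the `b`-weighted length normalization.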

Limitations

| Limitation | Detail |
|------------|--------|
| No OCR | Scanned/image-only PDFs are not supported |
| Text layer required | PDFs must have an embedded text layer |
| No DOCX/XLSX | Only PDF, TXT, and MD are supported in v1 |
| No vector search | Semantic similarity search is a planned enhancement |
| Single-language | Best results with English text |
