Document Processing
The document processing pipeline allows you to upload knowledge base documents and use them as the source for AI-generated test cases.
| Format | Extraction Method |
|---|---|
| `.pdf` | Text layer extracted via UglyToad.PdfPig (not OCR; scanned PDFs without a text layer are not supported) |
| `.txt` | Plain text, read as-is |
| `.md` | Markdown, read as plain text |
File size limit: 10 MB per file
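The format and size rules above can be expressed as a small pre-upload check. This is a minimal sketch, not the application's own validation: `check_upload`, `SUPPORTED_EXTENSIONS`, and `MAX_FILE_BYTES` are hypothetical names, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions listed in the format table above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md"}
# Assumption: "10 MB" is interpreted as 10 MiB.
MAX_FILE_BYTES = 10 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the file looks acceptable."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {extension or '(none)'}"
    if size_bytes > MAX_FILE_BYTES:
        return "file exceeds the 10 MB limit"
    return None
```

A check like this only inspects the name and size; it cannot tell whether a PDF actually contains a text layer.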
Upload file
│
▼
Extract text (PDF → PdfPig, TXT/MD → direct read)
│
▼
Normalize (strip extra whitespace, fix paragraph breaks)
│
▼
Chunk (sliding window with configurable size and overlap)
│
▼
Estimate token count per chunk
│
▼
Compute content hash (SHA-256 for deduplication)
│
▼
Index chunks in Lucene.NET (BM25)
│
▼
Store Document + Chunk records in SQLite
│
▼
Ready for question generation
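The normalization step in the pipeline is described only as stripping extra whitespace and fixing paragraph breaks. One possible reading of that step is sketched below; the exact rules the application applies are not documented here, so `normalize_text` and its regular expressions are assumptions for illustration.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse redundant whitespace while keeping paragraph breaks."""
    # Unify line endings so the rules below only deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a single leftover space on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Treat two or more consecutive line breaks as one paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()
```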
The application uses a sliding window approach:
- Chunks are split at paragraph or sentence boundaries where possible
- An overlap between consecutive chunks preserves context at chunk borders
- Each chunk stores an estimated token count for LLM budget planning
The overlap ensures that information spanning a chunk boundary is not lost when chunks are processed independently.
Step 2 of the Setup Wizard includes integrated document upload.
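The sliding window can be sketched as follows. This is an illustrative approximation under stated assumptions, not the application's implementation: sizes are measured in characters, the token estimate uses a rough four-characters-per-token heuristic, and `chunk_text`, `split_units`, `estimate_tokens`, `max_chars`, and `overlap_chars` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    estimated_tokens: int


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def split_units(text: str) -> list[str]:
    # Paragraph breaks and sentence-ending punctuation both act as boundaries.
    units = re.split(r"\n\s*\n|(?<=[.!?])\s+", text)
    return [u.strip() for u in units if u and u.strip()]


def chunk_text(text: str, max_chars: int = 1000, overlap_chars: int = 200) -> list[Chunk]:
    units = split_units(text)
    chunks: list[Chunk] = []
    start = 0
    while start < len(units):
        # Grow the window unit by unit until the size budget is reached.
        # The first unit is always taken, so an oversized unit becomes its own chunk.
        end, size = start, 0
        while end < len(units) and (end == start or size + len(units[end]) + 1 <= max_chars):
            size += len(units[end]) + (1 if end > start else 0)
            end += 1
        body = " ".join(units[start:end])
        chunks.append(Chunk(body, estimate_tokens(body)))
        if end >= len(units):
            break
        # Step back over trailing units to form the overlap, keeping enough
        # room for the next new unit so every chunk adds new content.
        budget = min(overlap_chars, max_chars - len(units[end]))
        next_start, carried = end, 0
        while next_start - 1 > start:
            cost = len(units[next_start - 1]) + 1
            if carried + cost > budget:
                break
            next_start -= 1
            carried += cost
        start = next_start
    return chunks
```

For example, `chunk_text("Alpha one. Beta two. Gamma three. Delta four.", max_chars=25, overlap_chars=12)` repeats the sentence "Beta two." at the end of the first chunk and the start of the second, which is the context-preserving overlap described above.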
- Navigate to Documents in the sidebar
- Click Upload Document
- Select a PDF, TXT, or MD file
- The file is processed automatically in the background
- The documents list shows file name, type, size, chunk count, and upload date
POST /api/documents
Content-Type: multipart/form-data
file=<binary>

- Navigate to Documents in the sidebar
- Paste a public URL into the Import from URL field (below the file upload button)
- Click Import from URL
- The content is downloaded, chunked, and stored as a TXT document
Restrictions:
- URL must use the `http://` or `https://` scheme
- Loopback addresses (`localhost`, `127.0.0.1`, `::1`) are blocked (SSRF protection)
- 30-second download timeout
- Content is treated as plain text regardless of MIME type
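The scheme and loopback restrictions could be checked along the following lines. This is a hedged sketch rather than the application's actual validation: `is_allowed_import_url` and `DOWNLOAD_TIMEOUT_SECONDS` are hypothetical names, and the check inspects only the literal host in the URL without resolving DNS.

```python
import ipaddress
from urllib.parse import urlparse

# Mirrors the documented 30-second limit; a caller would pass this to its HTTP client.
DOWNLOAD_TIMEOUT_SECONDS = 30


def is_allowed_import_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is not a loopback address."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # Malformed URLs (for example an unclosed IPv6 bracket) are rejected.
        return False
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost":
        return False
    try:
        # IP literals: block the whole loopback range, including ::1.
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: treated as an ordinary host name.
        # Note: a host name that resolves to loopback is NOT caught here.
        return True
```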
The CLI generate command reads a local document file directly — no upload step required. See CLI Reference.
| Action | Steps |
|---|---|
| View all documents | Documents page → table listing |
| Delete a document | Documents page → Delete button (Admin only) |
| View chunk count | Shown in the documents table |
Deleting a document removes the file, database records, and associated Lucene index entries. Test cases generated from the document are not automatically deleted.
When a file is uploaded, SHA-256 is computed over the file content. If an identical hash already exists in the database, the upload is rejected with a "duplicate document" error. This prevents bloating the index and database with repeated uploads of the same content.
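The duplicate check amounts to hashing the raw bytes and looking the digest up before storing anything. A minimal sketch, assuming an in-memory set stands in for the database lookup; `register_upload` and `DuplicateDocumentError` are hypothetical names.

```python
import hashlib


class DuplicateDocumentError(Exception):
    """Raised when identical content has already been uploaded."""


def content_hash(data: bytes) -> str:
    # SHA-256 over the raw file bytes, as a lowercase hex string.
    return hashlib.sha256(data).hexdigest()


def register_upload(data: bytes, known_hashes: set[str]) -> str:
    digest = content_hash(data)
    if digest in known_hashes:
        raise DuplicateDocumentError("duplicate document")
    # Stand-in for persisting the hash alongside the document record.
    known_hashes.add(digest)
    return digest
```

Because the hash covers file content only, renaming a file does not bypass the check, while changing a single byte produces a different hash and is treated as a new document.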
Lucene.NET provides BM25 (Okapi BM25) ranked retrieval over all chunks. This is used internally to:
- Retrieve the most relevant chunks for a given question during generation
- Surface relevant context snippets in the UI (future feature)
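To give a feel for how BM25 ranks chunks, here is a small self-contained scorer using the textbook Okapi BM25 formula. It is not Lucene.NET's implementation (Lucene handles analysis, indexing, and length encoding differently); `bm25_scores` is a hypothetical helper, and `k1 = 1.2`, `b = 0.75` are commonly used defaults assumed here.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized chunk in `docs` against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by chunk length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Chunks that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns, which is why BM25 works well for picking a handful of relevant chunks per question.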
| Limitation | Detail |
|---|---|
| No OCR | Scanned/image-only PDFs are not supported |
| Text layer required | PDFs must have an embedded text layer |
| No DOCX/XLSX | Only PDF, TXT, MD are supported in v1 |
| No vector search | Semantic similarity search is a planned enhancement |
| Single-language | Best results with English text |
- Question Generation — using chunks to create test cases
- Setup Wizard — integrated upload in the wizard