feat: add AI commands (ocr, search, ai-scrape) to browse CLI and Skills#930
Open
yoeven wants to merge 2 commits into garrytan:main from
Conversation
Adds three new read commands powered by Interfaze AI that fill gaps in the browse CLI's data extraction capabilities:

- `$B ocr` — extract text from screenshots/images with per-word bounding boxes and confidence scores via Interfaze's specialized OCR model
- `$B search` — web search with structured citations and full precontext metadata, no external service (SerpAPI etc.) needed
- `$B ai-scrape` — structured data extraction from any URL using a JSON schema, including sites behind auth walls and bot protection that Playwright can't reach (e.g. LinkedIn returns empty via headless Chromium, but Interfaze's server-side browser engine extracts full data)

Also adds `$B interfaze-setup` for API key configuration, following the existing `~/.gstack/` config pattern.

Uses the Vercel AI SDK (`@ai-sdk/openai-compatible`) for structured output support and precontext metadata extraction. All commands degrade gracefully with setup instructions when no API key is present. Zero breaking changes — all existing commands are unchanged.

Tested: 659 unit/integration tests pass; all three commands verified end-to-end against the live Interfaze API (OCR with bounding boxes, search with citations, ai-scrape on an auth-walled LinkedIn page).
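To illustrate the graceful-degradation path, here is a minimal sketch of API key resolution following the `~/.gstack/` pattern. The env var name `INTERFAZE_API_KEY` and the file name `interfaze_api_key` are assumptions for illustration; the PR only states that keys live under `~/.gstack/`.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical key lookup: env var takes precedence, then the config
// file under ~/.gstack/. Callers print setup instructions ("run
// $B interfaze-setup") when null is returned.
function resolveInterfazeKey(
  env: Record<string, string | undefined> = process.env,
): string | null {
  if (env.INTERFAZE_API_KEY) return env.INTERFAZE_API_KEY;
  const keyPath = join(homedir(), ".gstack", "interfaze_api_key");
  if (existsSync(keyPath)) {
    const key = readFileSync(keyPath, "utf8").trim();
    if (key) return key;
  }
  return null; // no key: command degrades with setup instructions
}

const key = resolveInterfazeKey({ INTERFAZE_API_KEY: "test-key-123" });
```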
feat: add Interfaze AI commands (ocr, search, ai-scrape) to browse CLI and Skills
Summary
The browse CLI can screenshot pages and extract DOM text, but it can't read text from images, search the web, or reach sites behind auth walls/bot protection. This PR adds three new commands powered by Interfaze AI that fill those gaps:
- `$B ocr` — extract text from screenshots, image files, PDF documents, or page elements with per-word bounding boxes and confidence scores
- `$B search` — web search with structured citations and raw search metadata in precontext
- `$B ai-scrape` — structured data extraction from any URL using a JSON schema, including auth-walled and bot-protected sites

Before vs After
1. Auth-walled sites (LinkedIn, Glassdoor, etc.)
Before — Playwright headless gets blocked:
After — Interfaze's server-side browser engine gets through:
```json
{
  "name": "Y Combinator",
  "description": "Y Combinator is a startup accelerator that launches roughly 400 companies twice a year.",
  "industry": "Venture Capital and Private Equity",
  "headquarters": "San Francisco, US"
}
```

Plus tens of thousands of characters of raw scraped content in `precontext` — posts, follower counts, employee info — all behind LinkedIn's auth wall.
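For context on how the `--schema` shorthand might be consumed: a dependency-free sketch (not the PR's actual implementation, which reportedly builds a zod schema for the AI SDK) that turns a shorthand string like `'{"name":"string","description":"string"}'` into a runtime shape check for the extracted object.

```typescript
// Supported field kinds in this sketch: flat string/number/boolean only.
type FieldKind = "string" | "number" | "boolean";

// Parse the shorthand --schema string into a validator function.
function makeValidator(raw: string): (obj: unknown) => boolean {
  const spec = JSON.parse(raw) as Record<string, FieldKind>;
  return (obj: unknown): boolean => {
    if (typeof obj !== "object" || obj === null) return false;
    const rec = obj as Record<string, unknown>;
    return Object.entries(spec).every(([key, kind]) => typeof rec[key] === kind);
  };
}

const isCompany = makeValidator('{"name":"string","description":"string"}');
const ok = isCompany({ name: "Y Combinator", description: "Startup accelerator" });
const bad = isCompany({ name: 42 }); // wrong type, fails the shape check
```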
2. OCR — images and PDF documents with spatial metadata
Before: Take screenshot → paste into Claude/GPT → burn ~$3–15/MTok on vision tokens → get plain text back, no coordinates. PDFs? No built-in support at all.
After — one command for images, screenshots, and PDFs:
```json
{
  "text": "Hacker News\nnew | past | comments | ask | show | jobs | submit...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "sections": [{
        "lines": [{
          "text": "Hacker News",
          "bounds": { "top_left": {"x": 18, "y": 4}, "width": 126, "height": 18 },
          "average_confidence": 0.95,
          "words": [{ "text": "Hacker", "confidence": 0.97, "bounds": {"..."} }]
        }]
      }]
    }
  }]
}
```

PDFs work the same way — multi-page documents with per-word bounding boxes:
```json
{
  "text": "Agentic Context Engineering\nSmall Language Models are the Future of Agentic AI...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "extracted_text": "Agentic Context Engineering...",
      "sections": [{
        "lines": [{ "text": "Agentic Context Engineering", "average_confidence": 0.99 }]
      }],
      "width": 1190,
      "height": 1684,
      "total_pages": 12
    }
  }]
}
```

Per-word bounding boxes + confidence scores — useful for layout verification in QA, document processing, and invoice/receipt extraction.
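A sketch of consuming the `--json` precontext in a QA script. The field names are taken from the example output above; treat the exact shape as an assumption rather than a documented schema. It collects every word above a confidence threshold along with its bounding box.

```typescript
// Types mirroring the ocr precontext shape shown in the examples.
interface Bounds { top_left: { x: number; y: number }; width: number; height: number }
interface OcrWord { text: string; confidence: number; bounds: Bounds }
interface OcrLine { text: string; average_confidence: number; words?: OcrWord[] }
interface OcrSection { lines: OcrLine[] }
interface OcrResult { sections: OcrSection[] }

// Flatten sections -> lines -> words and keep only confident words.
function confidentWords(result: OcrResult, minConfidence = 0.9): OcrWord[] {
  return result.sections
    .flatMap((s) => s.lines)
    .flatMap((l) => l.words ?? [])
    .filter((w) => w.confidence >= minConfidence);
}

const sample: OcrResult = {
  sections: [{
    lines: [{
      text: "Hacker News",
      average_confidence: 0.95,
      words: [
        { text: "Hacker", confidence: 0.97, bounds: { top_left: { x: 18, y: 4 }, width: 60, height: 18 } },
        { text: "News", confidence: 0.8, bounds: { top_left: { x: 84, y: 4 }, width: 44, height: 18 } },
      ],
    }],
  }],
};
const words = confidentWords(sample); // only "Hacker" clears the 0.9 threshold
```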
3. Web search
Before: Navigate to Google in headless → get blocked by CAPTCHAs. Or pay ~$50/mo for SerpAPI. Or ask the LLM (which hallucinates URLs).
After — real URLs, structured citations, no extra service:
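As a hypothetical illustration of working with the structured citations (the `title`/`url`/`snippet` field names here are assumptions, not the command's documented schema), a small helper that formats results as a numbered reference list and drops duplicate URLs:

```typescript
// Assumed citation shape for `$B search --json` output.
interface SearchCitation { title: string; url: string; snippet: string }

// Build a numbered reference list, skipping repeated URLs.
function formatCitations(results: SearchCitation[]): string[] {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const r of results) {
    if (seen.has(r.url)) continue;
    seen.add(r.url);
    lines.push(`[${lines.length + 1}] ${r.title} - ${r.url}`);
  }
  return lines;
}

const out = formatCitations([
  { title: "Y Combinator", url: "https://ycombinator.com", snippet: "Startup accelerator" },
  { title: "YC (duplicate)", url: "https://ycombinator.com", snippet: "" },
  { title: "Hacker News", url: "https://news.ycombinator.com", snippet: "Social news" },
]); // two entries: the duplicate URL is dropped
```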
4. Cost comparison
Interfaze includes caching, browser engine, and sandbox at no extra cost.
What changed
- `browse/src/interfaze-auth.ts`
- `browse/src/interfaze-client.ts`
- `browse/src/interfaze-schema.ts`
- `browse/src/read-commands.ts` — `ocr`, `search`, `ai-scrape` command handlers
- `browse/src/meta-commands.ts` — `interfaze-setup` interactive key configuration
- `browse/src/commands.ts`
- `browse/src/server.ts` — wires `browserManager` to `handleReadCommand`
- `browse/test/interfaze-schema.test.ts`
- `browse/test/*.test.ts` — updated `handleReadCommand` signature
- `scripts/resolvers/utility.ts` — `$B ocr` in QA methodology
- `scripts/resolvers/design.ts` — `$B ocr` for typography extraction
- `**/SKILL.md.tmpl`
- `package.json` — `ai`, `@ai-sdk/openai-compatible`, `zod` dependencies

Design decisions
- `metadataExtractor` for Interfaze's `precontext` field

Test plan
- 659 unit/integration tests pass (`bun test`)
- `$B ocr /tmp/page.png` — extracts text with bounding boxes from live Interfaze API
- `$B ocr /tmp/page.png --json` — returns precontext with per-word coordinates + confidence
- `$B ocr /tmp/doc.pdf --json` — extracts text from PDF with bounding boxes and page count
- `$B search <query>` — returns 5 structured results with URLs and snippets
- `$B ai-scrape <url> --schema '{"name":"string","description":"string"}'` — returns structured JSON