feat: add AI commands (ocr, search, ai-scrape) to browse CLI and Skills #930

Open

yoeven wants to merge 2 commits into garrytan:main from JigsawStack:main

Conversation


@yoeven yoeven commented Apr 9, 2026

feat: add Interfaze AI commands (ocr, search, ai-scrape) to browse CLI and Skills

Summary

The browse CLI can screenshot pages and extract DOM text, but it can't read text from images, search the web, or reach sites behind auth walls/bot protection. This PR adds three new commands powered by Interfaze AI that fill those gaps:

  • $B ocr — extract text from screenshots, image files, PDF documents, or page elements with per-word bounding boxes and confidence scores
  • $B search — web search with structured citations and raw search metadata in precontext
  • $B ai-scrape — structured data extraction from any URL using a JSON schema, including auth-walled and bot-protected sites

Before vs After

1. Auth-walled sites (LinkedIn, Glassdoor, etc.)

Before — Playwright headless gets blocked:

$ browse goto https://www.linkedin.com/company/y-combinator
→ Redirected to /authwall (HTTP 999)
→ Empty page. Zero data extracted.

After — Interfaze's server-side browser engine gets through:

$ browse ai-scrape https://www.linkedin.com/company/y-combinator \
    --schema '{"company_name":"string","description":"string","industry":"string","headquarters":"string"}'
{
  "company_name": "Y Combinator",
  "description": "Y Combinator is a startup accelerator that launches roughly 400 companies twice a year.",
  "industry": "Venture Capital and Private Equity",
  "headquarters": "San Francisco, US"
}

Plus tens of thousands of characters of raw scraped content in precontext — posts, follower counts, employee info — all behind LinkedIn's auth wall.

2. OCR — images and PDF documents with spatial metadata

Before: Take screenshot → paste into Claude/GPT → burn ~$3–15/MTok on vision tokens → get plain text back, no coordinates. PDFs? No built-in support at all.

After — one command for images, screenshots, and PDFs:

$ browse ocr /tmp/page.png --json
{
  "text": "Hacker News\nnew | past | comments | ask | show | jobs | submit...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "sections": [{
        "lines": [{
          "text": "Hacker News",
          "bounds": { "top_left": {"x": 18, "y": 4}, "width": 126, "height": 18 },
          "average_confidence": 0.95,
          "words": [{ "text": "Hacker", "confidence": 0.97, "bounds": {"..."} }]
        }]
      }]
    }
  }]
}

PDFs work the same way — multi-page documents with per-word bounding boxes:

$ curl -sL https://arxiv.org/pdf/2602.04101 -o /tmp/paper.pdf
$ browse ocr /tmp/paper.pdf --json
{
  "text": "Agentic Context Engineering\nSmall Language Models are the Future of Agentic AI...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "extracted_text": "Agentic Context Engineering...",
      "sections": [{ "lines": [{ "text": "Agentic Context Engineering", "average_confidence": 0.99 }] }],
      "width": 1190, "height": 1684, "total_pages": 12
    }
  }]
}

Per-word bounding boxes + confidence scores — useful for layout verification in QA, document processing, and invoice/receipt extraction.
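To illustrate how that per-word metadata can be consumed, here is a small sketch that flags low-confidence words for QA review. The types mirror the JSON fields shown above; the helper itself is hypothetical and not part of this PR:

```typescript
// Hypothetical consumer of the OCR precontext shown above. The type
// definitions mirror the example JSON fields; they are an assumption,
// not the PR's actual type declarations.
interface Bounds { top_left: { x: number; y: number }; width: number; height: number; }
interface Word { text: string; confidence: number; bounds?: Bounds; }
interface Line { text: string; bounds?: Bounds; average_confidence: number; words: Word[]; }
interface OcrResult { sections: { lines: Line[] }[]; }

// Collect words whose OCR confidence falls below a threshold — useful
// for flagging uncertain regions during layout verification.
function lowConfidenceWords(result: OcrResult, threshold = 0.9): Word[] {
  return result.sections
    .flatMap((s) => s.lines)
    .flatMap((l) => l.words)
    .filter((w) => w.confidence < threshold);
}

const sample: OcrResult = {
  sections: [{
    lines: [{
      text: "Hacker News",
      average_confidence: 0.95,
      words: [
        { text: "Hacker", confidence: 0.97 },
        { text: "News", confidence: 0.85 },
      ],
    }],
  }],
};

console.log(lowConfidenceWords(sample).map((w) => w.text)); // [ "News" ]
```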

3. Web search

Before: Navigate to Google in headless → get blocked by CAPTCHAs. Or pay ~$50/mo for SerpAPI. Or ask the LLM (which hallucinates URLs).

After — real URLs, structured citations, no extra service:

$ browse search "Y Combinator top companies 2025"
*   Y Combinator Top Companies List
    URL: https://www.ycombinator.com/topcompanies
    Snippet: Y Combinator has funded over 5,000 startups since 2005, including
    Airbnb, Stripe, DoorDash, Coinbase, Instacart, and Dropbox...

*   The top YC companies by valuation — 2025 update
    URL: https://www.linkedin.com/pulse/top-yc-companies-valuation-2025
    Snippet: ...

--- precontext (raw search metadata) ---
[{ "name": "search", "result": [{ "title": "...", "url": "...", "content": "..." }] }]

4. Cost comparison

| Model | Input | Output |
| --- | --- | --- |
| Interfaze | $1.50/MTok | $3.50/MTok |
| Claude Sonnet 4 (vision) | $3/MTok | $15/MTok |
| GPT-4.1 | $2/MTok | $8/MTok |
| GPT-4V (vision/OCR) | $10/MTok | $30/MTok |
| SerpAPI (search only) | $50/mo (5K searches) | — |

Interfaze includes caching, browser engine, and sandbox at no extra cost.

What changed

| File | Change |
| --- | --- |
| browse/src/interfaze-auth.ts | New — API key resolution (~/.gstack/interfaze.json, env var, setup hint) |
| browse/src/interfaze-client.ts | New — Vercel AI SDK provider with precontext metadata extractor |
| browse/src/interfaze-schema.ts | New — Dynamic Zod schema builder from JSON type hints |
| browse/src/read-commands.ts | Add ocr, search, ai-scrape command handlers |
| browse/src/meta-commands.ts | Add interfaze-setup interactive key configuration |
| browse/src/commands.ts | Register new commands in registry + descriptions |
| browse/src/server.ts | Pass browserManager to handleReadCommand |
| browse/test/interfaze-schema.test.ts | New — Unit tests for schema parsing + auth hints |
| browse/test/*.test.ts | Update existing test call sites for new handleReadCommand signature |
| scripts/resolvers/utility.ts | Mention $B ocr in QA methodology |
| scripts/resolvers/design.ts | Mention $B ocr for typography extraction |
| **/SKILL.md.tmpl | Document new commands in browse, root, office-hours, investigate templates |
| package.json | Add ai, @ai-sdk/openai-compatible, zod dependencies |
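The idea behind the dynamic schema builder can be sketched as follows. The real interfaze-schema.ts builds a Zod schema for structured output; this dependency-free version only illustrates the mapping from `--schema` type hints to a validator, and its function names are illustrative assumptions:

```typescript
// Simplified sketch of the idea behind interfaze-schema.ts: turn the
// --schema JSON type-hint string into a validator. The actual file
// builds a Zod schema; this version drops that dependency to stay
// self-contained, and is not the PR's real implementation.
type Hint = "string" | "number" | "boolean";

function parseSchemaHints(raw: string): Record<string, Hint> {
  const parsed = JSON.parse(raw) as Record<string, string>;
  const out: Record<string, Hint> = {};
  for (const [key, hint] of Object.entries(parsed)) {
    if (hint !== "string" && hint !== "number" && hint !== "boolean") {
      throw new Error(`Unsupported type hint for "${key}": ${hint}`);
    }
    out[key] = hint;
  }
  return out;
}

// Check a scraped object against the parsed hints.
function matchesHints(
  obj: Record<string, unknown>,
  hints: Record<string, Hint>,
): boolean {
  return Object.entries(hints).every(([key, hint]) => typeof obj[key] === hint);
}

const hints = parseSchemaHints('{"company_name":"string","description":"string"}');
console.log(matchesHints({ company_name: "Y Combinator", description: "Accelerator" }, hints)); // true
```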

Design decisions

  • Vercel AI SDK over raw OpenAI SDK — structured output via Zod, metadataExtractor for Interfaze's precontext field
  • Graceful degradation — all commands return a clear setup hint when no API key is present, never crash
  • Zero breaking changes — existing commands untouched, new commands are additive read-only operations
  • ~400 lines of new code, all behind an optional API key
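The graceful-degradation path can be sketched as a key-resolution chain: env var first, then ~/.gstack/interfaze.json, then a setup hint instead of a crash. The function name, env-var name, and config field below are illustrative assumptions, not the PR's actual exports:

```typescript
// Sketch of the key-resolution order described above: env var, then
// ~/.gstack/interfaze.json, then a setup hint rather than a crash.
// `resolveApiKey`, `INTERFAZE_API_KEY`, and `apiKey` are assumed names
// for illustration only.
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

const SETUP_HINT =
  "No Interfaze API key found. Run `$B interfaze-setup` to configure one.";

function resolveApiKey(): { key: string } | { hint: string } {
  const fromEnv = process.env.INTERFAZE_API_KEY;
  if (fromEnv) return { key: fromEnv };
  try {
    const config = JSON.parse(
      readFileSync(join(homedir(), ".gstack", "interfaze.json"), "utf8"),
    );
    if (typeof config.apiKey === "string") return { key: config.apiKey };
  } catch {
    // Missing or unreadable config file: fall through to the hint.
  }
  return { hint: SETUP_HINT };
}
```

Returning a hint object instead of throwing lets each command print actionable setup instructions and exit cleanly.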

Test plan

  • All 659 existing unit/integration tests pass (bun test)
  • Skill validation + gen-skill-docs freshness checks pass
  • $B ocr /tmp/page.png — extracts text with bounding boxes from live Interfaze API
  • $B ocr /tmp/page.png --json — returns precontext with per-word coordinates + confidence
  • $B ocr /tmp/doc.pdf --json — extracts text from PDF with bounding boxes and page count
  • $B search <query> — returns 5 structured results with URLs and snippets
  • $B ai-scrape <url> --schema '{"name":"string","description":"string"}' — returns structured JSON
  • All three commands show setup instructions when no API key is configured
  • LinkedIn scrape verified via direct API call (67K chars extracted behind auth wall)

yoeven added 2 commits April 8, 2026 19:37
Adds three new read commands powered by Interfaze AI that fill gaps in
the browse CLI's data extraction capabilities:

- `$B ocr` — extract text from screenshots/images with per-word bounding
  boxes and confidence scores via Interfaze's specialized OCR model
- `$B search` — web search with structured citations and full precontext
  metadata, no external service (SerpAPI etc.) needed
- `$B ai-scrape` — structured data extraction from any URL using a JSON
  schema, including sites behind auth walls and bot protection that
  Playwright can't reach (e.g. LinkedIn returns empty via headless
  Chromium but Interfaze's server-side browser engine extracts full data)

Also adds `$B interfaze-setup` for API key configuration, following the
existing ~/.gstack/ config pattern.

Uses the Vercel AI SDK (@ai-sdk/openai-compatible) for structured output
support and precontext metadata extraction. All commands degrade
gracefully with setup instructions when no API key is present.

Zero breaking changes — all existing commands are unchanged.

Tested: 659 unit/integration tests pass, all three commands verified
end-to-end against the live Interfaze API (OCR with bounding boxes,
search with citations, ai-scrape on auth-walled LinkedIn page).