feat: add AI commands (ocr, search, ai-scrape) to browse CLI and Skills #930

Open

yoeven wants to merge 2 commits into garrytan:main from JigsawStack:main

Conversation


@yoeven yoeven commented Apr 9, 2026

feat: add Interfaze AI commands (ocr, search, ai-scrape) to browse CLI and Skills

Summary

The browse CLI can screenshot pages and extract DOM text, but it can't read text from images, search the web, or reach sites behind auth walls/bot protection. This PR adds three new commands powered by Interfaze AI that fill those gaps:

  • $B ocr — extract text from screenshots, image files, PDF documents, or page elements with per-word bounding boxes and confidence scores
  • $B search — web search with structured citations and raw search metadata in precontext
  • $B ai-scrape — structured data extraction from any URL using a JSON schema, including auth-walled and bot-protected sites

Before vs After

1. Auth-walled sites (LinkedIn, Glassdoor, etc.)

Before — Playwright headless gets blocked:

$ browse goto https://www.linkedin.com/company/y-combinator
→ Redirected to /authwall (HTTP 999)
→ Empty page. Zero data extracted.

After — Interfaze's server-side browser engine gets through:

$ browse ai-scrape https://www.linkedin.com/company/y-combinator \
    --schema '{"company_name":"string","description":"string","industry":"string","headquarters":"string"}'
{
  "company_name": "Y Combinator",
  "description": "Y Combinator is a startup accelerator that launches roughly 400 companies twice a year.",
  "industry": "Venture Capital and Private Equity",
  "headquarters": "San Francisco, US"
}

Plus tens of thousands of characters of raw scraped content in precontext — posts, follower counts, employee info — all behind LinkedIn's auth wall.

2. OCR — images and PDF documents with spatial metadata

Before: Take screenshot → paste into Claude/GPT → burn ~$3–15/MTok on vision tokens → get plain text back, no coordinates. PDFs? No built-in support at all.

After — one command for images, screenshots, and PDFs:

$ browse ocr /tmp/page.png --json
{
  "text": "Hacker News\nnew | past | comments | ask | show | jobs | submit...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "sections": [{
        "lines": [{
          "text": "Hacker News",
          "bounds": { "top_left": {"x": 18, "y": 4}, "width": 126, "height": 18 },
          "average_confidence": 0.95,
          "words": [{ "text": "Hacker", "confidence": 0.97, "bounds": {"..."} }]
        }]
      }]
    }
  }]
}

PDFs work the same way — multi-page documents with per-word bounding boxes:

$ curl -sL https://arxiv.org/pdf/2602.04101 -o /tmp/paper.pdf
$ browse ocr /tmp/paper.pdf --json
{
  "text": "Agentic Context Engineering\nSmall Language Models are the Future of Agentic AI...",
  "precontext": [{
    "name": "ocr",
    "result": {
      "extracted_text": "Agentic Context Engineering...",
      "sections": [{ "lines": [{ "text": "Agentic Context Engineering", "average_confidence": 0.99 }] }],
      "width": 1190, "height": 1684, "total_pages": 12
    }
  }]
}

Per-word bounding boxes + confidence scores — useful for layout verification in QA, document processing, and invoice/receipt extraction.
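To illustrate how that per-word metadata can be consumed, here is a small sketch that flags low-confidence words for QA review. The types mirror the JSON fields shown above; the helper itself is hypothetical and not part of this PR:

```typescript
// Hypothetical consumer of the OCR precontext shown above. The type
// definitions mirror the example JSON fields; they are an assumption,
// not the PR's actual type declarations.
interface Bounds { top_left: { x: number; y: number }; width: number; height: number; }
interface Word { text: string; confidence: number; bounds?: Bounds; }
interface Line { text: string; bounds?: Bounds; average_confidence: number; words: Word[]; }
interface OcrResult { sections: { lines: Line[] }[]; }

// Collect words whose OCR confidence falls below a threshold — useful
// for flagging uncertain regions during layout verification.
function lowConfidenceWords(result: OcrResult, threshold = 0.9): Word[] {
  return result.sections
    .flatMap((s) => s.lines)
    .flatMap((l) => l.words)
    .filter((w) => w.confidence < threshold);
}

const sample: OcrResult = {
  sections: [{
    lines: [{
      text: "Hacker News",
      average_confidence: 0.95,
      words: [
        { text: "Hacker", confidence: 0.97 },
        { text: "News", confidence: 0.85 },
      ],
    }],
  }],
};

console.log(lowConfidenceWords(sample).map((w) => w.text)); // [ "News" ]
```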

3. Web search

Before: Navigate to Google in headless → get blocked by CAPTCHAs. Or pay ~$50/mo for SerpAPI. Or ask the LLM (which hallucinates URLs).

After — real URLs, structured citations, no extra service:

$ browse search "Y Combinator top companies 2025"
*   Y Combinator Top Companies List
    URL: https://www.ycombinator.com/topcompanies
    Snippet: Y Combinator has funded over 5,000 startups since 2005, including
    Airbnb, Stripe, DoorDash, Coinbase, Instacart, and Dropbox...

*   The top YC companies by valuation — 2025 update
    URL: https://www.linkedin.com/pulse/top-yc-companies-valuation-2025
    Snippet: ...

--- precontext (raw search metadata) ---
[{ "name": "search", "result": [{ "title": "...", "url": "...", "content": "..." }] }]

4. Cost comparison

| Model | Input | Output |
| --- | --- | --- |
| Interfaze | $1.50/MTok | $3.50/MTok |
| Claude Sonnet 4 (vision) | $3/MTok | $15/MTok |
| GPT-4.1 | $2/MTok | $8/MTok |
| GPT-4V (vision/OCR) | $10/MTok | $30/MTok |
| SerpAPI (search only) | $50/mo (5K searches) | — |

Interfaze includes caching, browser engine, and sandbox at no extra cost.

What changed

| File | Change |
| --- | --- |
| browse/src/interfaze-auth.ts | New — API key resolution (~/.gstack/interfaze.json, env var, setup hint) |
| browse/src/interfaze-client.ts | New — Vercel AI SDK provider with precontext metadata extractor |
| browse/src/interfaze-schema.ts | New — Dynamic Zod schema builder from JSON type hints |
| browse/src/read-commands.ts | Add ocr, search, ai-scrape command handlers |
| browse/src/meta-commands.ts | Add interfaze-setup interactive key configuration |
| browse/src/commands.ts | Register new commands in registry + descriptions |
| browse/src/server.ts | Pass browserManager to handleReadCommand |
| browse/test/interfaze-schema.test.ts | New — Unit tests for schema parsing + auth hints |
| browse/test/*.test.ts | Update existing test call sites for new handleReadCommand signature |
| scripts/resolvers/utility.ts | Mention $B ocr in QA methodology |
| scripts/resolvers/design.ts | Mention $B ocr for typography extraction |
| **/SKILL.md.tmpl | Document new commands in browse, root, office-hours, investigate templates |
| package.json | Add ai, @ai-sdk/openai-compatible, zod dependencies |
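The idea behind the dynamic schema builder can be sketched as follows. The real interfaze-schema.ts builds a Zod schema for structured output; this dependency-free version only illustrates the mapping from `--schema` type hints to a validator, and its function names are illustrative assumptions:

```typescript
// Simplified sketch of the idea behind interfaze-schema.ts: turn the
// --schema JSON type-hint string into a validator. The actual file
// builds a Zod schema; this version drops that dependency to stay
// self-contained, and is not the PR's real implementation.
type Hint = "string" | "number" | "boolean";

function parseSchemaHints(raw: string): Record<string, Hint> {
  const parsed = JSON.parse(raw) as Record<string, string>;
  const out: Record<string, Hint> = {};
  for (const [key, hint] of Object.entries(parsed)) {
    if (hint !== "string" && hint !== "number" && hint !== "boolean") {
      throw new Error(`Unsupported type hint for "${key}": ${hint}`);
    }
    out[key] = hint;
  }
  return out;
}

// Check a scraped object against the parsed hints.
function matchesHints(
  obj: Record<string, unknown>,
  hints: Record<string, Hint>,
): boolean {
  return Object.entries(hints).every(([key, hint]) => typeof obj[key] === hint);
}

const hints = parseSchemaHints('{"company_name":"string","description":"string"}');
console.log(matchesHints({ company_name: "Y Combinator", description: "Accelerator" }, hints)); // true
```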

Design decisions

  • Vercel AI SDK over raw OpenAI SDK — structured output via Zod, metadataExtractor for Interfaze's precontext field
  • Graceful degradation — all commands return a clear setup hint when no API key is present, never crash
  • Zero breaking changes — existing commands untouched, new commands are additive read-only operations
  • ~400 lines of new code, all behind an optional API key
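The graceful-degradation path can be sketched as a key-resolution chain: env var first, then ~/.gstack/interfaze.json, then a setup hint instead of a crash. The function name, env-var name, and config field below are illustrative assumptions, not the PR's actual exports:

```typescript
// Sketch of the key-resolution order described above: env var, then
// ~/.gstack/interfaze.json, then a setup hint rather than a crash.
// `resolveApiKey`, `INTERFAZE_API_KEY`, and `apiKey` are assumed names
// for illustration only.
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

const SETUP_HINT =
  "No Interfaze API key found. Run `$B interfaze-setup` to configure one.";

function resolveApiKey(): { key: string } | { hint: string } {
  const fromEnv = process.env.INTERFAZE_API_KEY;
  if (fromEnv) return { key: fromEnv };
  try {
    const config = JSON.parse(
      readFileSync(join(homedir(), ".gstack", "interfaze.json"), "utf8"),
    );
    if (typeof config.apiKey === "string") return { key: config.apiKey };
  } catch {
    // Missing or unreadable config file: fall through to the hint.
  }
  return { hint: SETUP_HINT };
}
```

Returning a hint object instead of throwing lets each command print actionable setup instructions and exit cleanly.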

Test plan

  • All 659 existing unit/integration tests pass (bun test)
  • Skill validation + gen-skill-docs freshness checks pass
  • $B ocr /tmp/page.png — extracts text with bounding boxes from live Interfaze API
  • $B ocr /tmp/page.png --json — returns precontext with per-word coordinates + confidence
  • $B ocr /tmp/doc.pdf --json — extracts text from PDF with bounding boxes and page count
  • $B search <query> — returns 5 structured results with URLs and snippets
  • $B ai-scrape <url> --schema '{"name":"string","description":"string"}' — returns structured JSON
  • All three commands show setup instructions when no API key is configured
  • LinkedIn scrape verified via direct API call (67K chars extracted behind auth wall)

yoeven added 2 commits April 8, 2026 19:37
Adds three new read commands powered by Interfaze AI that fill gaps in
the browse CLI's data extraction capabilities:

- `$B ocr` — extract text from screenshots/images with per-word bounding
  boxes and confidence scores via Interfaze's specialized OCR model
- `$B search` — web search with structured citations and full precontext
  metadata, no external service (SerpAPI etc.) needed
- `$B ai-scrape` — structured data extraction from any URL using a JSON
  schema, including sites behind auth walls and bot protection that
  Playwright can't reach (e.g. LinkedIn returns empty via headless
  Chromium but Interfaze's server-side browser engine extracts full data)

Also adds `$B interfaze-setup` for API key configuration, following the
existing ~/.gstack/ config pattern.

Uses the Vercel AI SDK (@ai-sdk/openai-compatible) for structured output
support and precontext metadata extraction. All commands degrade
gracefully with setup instructions when no API key is present.

Zero breaking changes — all existing commands are unchanged.

Tested: 659 unit/integration tests pass, all three commands verified
end-to-end against the live Interfaze API (OCR with bounding boxes,
search with citations, ai-scrape on auth-walled LinkedIn page).