
Transform any web interface (SPAs, dynamic dashboards, or complex content layers) into semantically structured, LLM-optimized Markdown with human-level intelligence. Outperforms FireCrawl, Jina Reader, and other paid solutions while running entirely on your local machine.
A modern Bun-powered workspace with a comprehensive scraper, a Hono API server, and MCP integration.

| Feature | SniffHunt | FireCrawl (/extract) | Jina Reader | Others |
|---|---|---|---|---|
| Cost | Free & open source | $99-799/month | Usage-based | $50-1000/month |
| Privacy | 100% local | Cloud-based | Cloud-based | Cloud-based |
| AI Intelligence | Cognitive DOM modeling | Basic extraction | Text-only | Limited |
| Interactive Content | Full UI interaction | Static only | Static only | Limited |
| LLM Optimization | Purpose-built | Generic output | Generic output | Basic |
While FireCrawl, Jina Reader, and other tools use basic text extraction, SniffHunt employs cognitive modeling to understand context and semantics.
Handles complex SPAs and dynamic interfaces that cause other tools to fail completely.
Navigates tabs, modals, and dropdowns like a human user, not just scraping static HTML.
Generates markdown specifically formatted for optimal LLM consumption and context understanding.
Runs entirely in your local environment, unlike cloud-based tools that process your data externally.
What You'll Learn: How to install and configure SniffHunt, start the API server and web interface, set up MCP integration for AI tools, and see basic usage examples for each component.
Before we begin, make sure you have:
- Bun >= 1.2.15 (Install Bun)
- Google Gemini API Key (Get free key from Google AI Studio)
API Key Required: You'll need a Google Gemini API key for the AI-powered content extraction. The free tier is generous and perfect for getting started.
git clone https://github.com/mpmeetpatel/sniffhunt-scraper.git
cd sniffhunt-scraper
bun install
This installs all dependencies for the entire workspace including all apps.
cp .env.example .env
Edit the `.env` file and add your Gemini API key:
# Required
GOOGLE_GEMINI_KEY=your_actual_api_key_here
# Optional (You can provide multiple keys here to avoid rate limiting & load balancing)
GOOGLE_GEMINI_KEY1=your_alternative_key_1
GOOGLE_GEMINI_KEY2=your_alternative_key_2
GOOGLE_GEMINI_KEY3=your_alternative_key_3
# Optional (defaults shown)
PORT=8080
MAX_RETRY_COUNT=2
RETRY_DELAY=1000
PAGE_TIMEOUT=10000
CORS_ORIGIN=*
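For illustration, here is a minimal TypeScript sketch of how a Bun process could read these variables, including gathering the optional numbered Gemini keys into a pool for rotation. This is only a sketch of the pattern the variable names suggest, not SniffHunt's actual configuration code.

```typescript
// config.ts — illustrative only; SniffHunt's internal config loading may differ.
const env = process.env;

// Collect the primary key plus any numbered fallback keys so requests can be
// rotated across them to spread out rate limits.
export const geminiKeys = [
  env.GOOGLE_GEMINI_KEY,
  env.GOOGLE_GEMINI_KEY1,
  env.GOOGLE_GEMINI_KEY2,
  env.GOOGLE_GEMINI_KEY3,
].filter((key): key is string => Boolean(key));

// Optional settings fall back to the documented defaults.
export const config = {
  port: Number(env.PORT ?? 8080),
  maxRetryCount: Number(env.MAX_RETRY_COUNT ?? 2),
  retryDelay: Number(env.RETRY_DELAY ?? 1000),
  pageTimeout: Number(env.PAGE_TIMEOUT ?? 10000),
  corsOrigin: env.CORS_ORIGIN ?? "*",
};
```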
Choose your preferred way to use SniffHunt:
Perfect for interactive use and web application integration.
bun run dev:server
The server will start on http://localhost:8080
# In a new terminal
bun run dev:web
Open http://localhost:6001 in your browser for the web interface.
# Test the API
curl -X POST http://localhost:8080/scrape-sync \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "mode": "normal"}'
Integrate SniffHunt directly with Claude Desktop, Cursor, or other MCP-compatible AI tools.
bun run setup:mcp
This builds the MCP server and makes it globally available.
Add this to your MCP client configuration (e.g., Cursor, Windsurf, VSCode, Claude Desktop):
{
"mcpServers": {
"sniffhunt-scraper": {
"command": "npx",
"args": ["-y", "sniffhunt-scraper-mcp-server"],
"env": {
"GOOGLE_GEMINI_KEY": "your-api-key-here"
}
}
}
}
Restart your AI client and try asking:
Scrape https://anu-vue.netlify.app/guide/components/alert.html & grab the 'Outlined Alert Code snippets'
The AI will automatically use SniffHunt to extract the content!
React-based web interface with modern UI and real-time scraping capabilities.
- Complete Setup: Follow the Quick Start Guide for initial setup
- API Server Running: The web interface requires the API server
- Environment Configured: Ensure the `.env` file is properly set up
# Step 1: Start the API server (Terminal 1)
bun run dev:server
# Step 2: Start the web interface (Terminal 2)
bun run dev:web
Access Points:
- Web Interface: http://localhost:6001
- API Server: http://localhost:8080
Perfect for automation, scripting, and one-off extractions.
# Scrape any website
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html
# Output saved as:
# scraped.raw.md or scraped.md (filename auto-generated based on mode and query)
# scraped.html
# Use normal mode for static sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode normal
# Use beast mode for complex sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets" --mode beast
# Add a semantic query for focused extraction and a custom output filename
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets" --output my-content
Model Context Protocol server for AI integrations.
Model Context Protocol (MCP) is a standardized way for AI applications to access external tools and data sources. SniffHunt's MCP server allows AI models to scrape and extract web content as part of their reasoning process.
MCP Benefits:
- Direct AI Integration: Use scraping within AI conversations
- Tool Calling: AI models can scrape websites automatically
- Context Enrichment: Provide real-time web data to AI models
- Standardized Interface: Works with any MCP-compatible AI client
Before setting up MCP integration, ensure you have:
- SniffHunt Installed: Complete the Quick Start Guide first
- API Key Configured: Google Gemini API key in your `.env` file
- MCP Client: Claude Desktop, Cursor, or another MCP-compatible AI tool
# Build and setup MCP server from root directory
bun run setup:mcp
This command:
- Builds the MCP server with all scraping capabilities
- Publishes globally via `npx` for any AI client to use (locally only; it will not publish to npm or any other package registry)
- Creates the binary that MCP clients can execute
What happens internally:
- Compiles the MCP server from `apps/mcp/src/`
- Builds dependencies and scraper functionality
- Makes `sniffhunt-scraper-mcp-server` available globally
Add this configuration to your MCP client:
{
"mcpServers": {
"sniffhunt-scraper": {
"command": "npx",
"args": ["-y", "sniffhunt-scraper-mcp-server"],
"env": {
"GOOGLE_GEMINI_KEY": "your_actual_api_key_here"
}
}
}
}
Important Notes:
- Replace `your_actual_api_key_here` with your real Google Gemini API key
- Environment variables are passed directly to the MCP server process
After adding the configuration:
- Close your AI client completely
- Restart the application
- Verify the MCP server is loaded (look for SniffHunt tools in your AI client)
Your AI client should now have access to SniffHunt scraping capabilities. Test by asking:
Try These Examples:
- Can you scrape https://news.ycombinator.com and summarize the top stories?
- Can you scrape https://anu-vue.netlify.app/guide/components/alert.html and grab code snippets for outlined alerts?
The AI will automatically use SniffHunt to fetch and process the content!
Scrape and extract content from any website.
Parameters:
- `url` (required): Target URL to scrape
- `mode` (optional): `normal` or `beast` (default: `beast`)
- `userQuery` (optional): Natural language description of desired content
Example Usage in AI Chat:
User: "Can you scrape https://news.ycombinator.com and get the top 5 stories?"
AI: I'll scrape Hacker News for you and extract the top stories.
[Uses scrape_website tool with url="https://news.ycombinator.com" and userQuery="top 5 stories"]
The MCP tool returns data in the standard MCP format. The actual response structure:
{
"content": [
{
"type": "text",
"text": {
"success": true,
"url": "https://example.com",
"mode": "beast",
"processingTime": 2.34,
"markdownLength": 12450,
"htmlLength": 45230,
"hasEnhancedError": false,
"enhancedErrorMessage": null,
"markdown": "# Page Title\\n\\nExtracted content in markdown format...",
"html": "<html>Raw HTML content...</html>"
}
}
]
}
Response Fields:
- `success`: Boolean indicating if scraping was successful
- `url`: The scraped URL
- `mode`: Scraping mode used (`normal` or `beast`)
- `processingTime`: Time taken for scraping in seconds
- `markdownLength`: Length of extracted markdown content
- `htmlLength`: Length of raw HTML content
- `hasEnhancedError`: Boolean indicating if enhanced error info is available
- `enhancedErrorMessage`: Human-readable error message (if any)
- `markdown`: Cleaned, structured content in markdown format
- `html`: Raw HTML content from the page
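For reference, these fields can be modeled with a TypeScript interface such as the one below. This is an illustrative typing derived from the structure shown above, not a type exported by the project.

```typescript
// Illustrative typing of the scrape_website MCP tool result described above.
interface ScrapeWebsiteResult {
  success: boolean;                     // whether scraping succeeded
  url: string;                          // the scraped URL
  mode: "normal" | "beast";             // scraping mode used
  processingTime: number;               // seconds taken to scrape
  markdownLength: number;               // length of the extracted markdown
  htmlLength: number;                   // length of the raw HTML
  hasEnhancedError: boolean;            // whether enhanced error info is available
  enhancedErrorMessage: string | null;  // human-readable error, if any
  markdown: string;                     // cleaned, structured markdown content
  html: string;                         // raw HTML from the page
}
```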
Hono-based API server with streaming and sync endpoints.
Before starting the server, ensure you have:
- Environment Setup: A `.env` file in the root directory with your Google Gemini API key
- Dependencies Installed: Run `bun install` from the root directory
# Start from root directory (automatically loads .env)
bun run dev:server
Benefits:
- Automatically loads environment variables from the root `.env`
- Consistent with other workspace commands
- No need to navigate to subdirectories
# Alternative: Start from server directory
cd apps/server
bun dev
The server will start on http://localhost:8080 by default.
# Health check
curl http://localhost:8080/health
# Expected response:
{
"status": "healthy",
"service": "SniffHunt Scraper API",
"version": "1.0.0",
"timestamp": "xxxxx"
}
Returns API health status and configuration validation.
Response:
{
"status": "healthy",
"service": "SniffHunt Scraper API",
"version": "1.0.0",
"timestamp": "xxxxx"
}
Real-time streaming extraction with progress updates.
Request Body:
{
"url": "https://anu-vue.netlify.app/guide/components/alert.html",
"mode": "normal" | "beast",
"query": "natural language content description"
}
Example:
curl -N http://localhost:8080/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://anu-vue.netlify.app/guide/components/alert.html", "mode": "beast"}'
Response: Server-Sent Events (SSE) stream with real-time updates.
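If you are consuming the stream from application code rather than curl, the sketch below shows one way to read it in TypeScript using fetch and a stream reader. The exact event names and payload shapes emitted by the server are not documented here, so the parsing is an assumption; adjust it to whatever events you observe.

```typescript
// stream-scrape.ts — illustrative SSE consumer; event payload shape is assumed.
const response = await fetch("http://localhost:8080/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://anu-vue.netlify.app/guide/components/alert.html",
    mode: "beast",
  }),
});

// Decode the byte stream to text and read it incrementally.
const reader = response.body!.pipeThrough(new TextDecoderStream()).getReader();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += value;

  // SSE messages are separated by a blank line; "data:" lines carry the payload.
  const messages = buffer.split("\n\n");
  buffer = messages.pop() ?? "";
  for (const message of messages) {
    for (const line of message.split("\n")) {
      if (line.startsWith("data:")) {
        console.log("update:", line.slice(5).trim());
      }
    }
  }
}
```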
Standard synchronous extraction for simple integrations.
Request Body:
{
"url": "https://anu-vue.netlify.app/guide/components/alert.html",
"mode": "normal" | "beast",
"query": "natural language content description"
}
Parameters:
- `url` (required): Target URL for content extraction
- `mode` (optional): Extraction strategy
  - `normal`: Standard content extraction (default)
  - `beast`: Interactive interface handling with AI intelligence
- `query` (optional): Natural language description for semantic filtering
Response Format:
{
"success": true,
"content": "# Extracted Content\n\nMarkdown-formatted content here...",
"metadata": {
"title": "Page Title",
"url": "https://anu-vue.netlify.app/guide/components/alert.html",
"mode": "beast",
"extractionTime": 3.2,
"contentLength": 15420
}
}
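From application code, the same endpoint can be called with fetch. The sketch below types the call against the response format shown above; the interface is illustrative and not exported by the project.

```typescript
// Illustrative client for POST /scrape-sync, typed against the documented response.
interface ScrapeSyncResponse {
  success: boolean;
  content: string; // markdown-formatted content
  metadata: {
    title: string;
    url: string;
    mode: "normal" | "beast";
    extractionTime: number; // seconds
    contentLength: number;
  };
}

const res = await fetch("http://localhost:8080/scrape-sync", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://anu-vue.netlify.app/guide/components/alert.html",
    mode: "beast",
    query: "Grab the Outlined Alert Code snippets",
  }),
});

const data = (await res.json()) as ScrapeSyncResponse;
console.log(data.metadata.title, data.content.slice(0, 200));
```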
Normal Mode:
- Best for: Static content, blogs, documentation
- Performance: Fast extraction
- Capabilities: Basic content extraction (still better than paid services even in normal mode)
Beast Mode:
- Best for: SPAs, dynamic dashboards, interactive interfaces
- Performance: Intelligent extraction with AI processing
- Capabilities:
  - UI interaction (clicks, scrolls, navigation)
  - Modal and popup handling
  - Dynamic content loading
  - Semantic content understanding
Mode Selection: Use `normal` mode for standard websites and `beast` mode for complex web applications that require interaction or have dynamic runtime content.
Use the `query` parameter to extract specific content, for example:
# Extract Avatar Code snippets
curl -X POST http://localhost:8080/scrape-sync \
-H "Content-Type: application/json" \
-d '{
"url": "https://anu-vue.netlify.app/guide/components/avatar.html",
"mode": "beast",
"query": "Grab the Avatar Code snippets"
}'
# Extract API reference and code examples
curl -X POST http://localhost:8080/scrape-sync \
-H "Content-Type: application/json" \
-d '{
"url": "https://anu-vue.netlify.app/guide/components/alert.html",
"mode": "normal",
"query": "Grab API reference and code examples"
}'
Command-line interface for direct web scraping operations.
Before starting the CLI scraper, ensure you have:
- Environment Setup: A `.env` file in the root directory with your Google Gemini API key
- Dependencies Installed: Run `bun install` from the root directory
# Recommended: Run from root directory
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html
# Alternative: Run from scraper directory
cd apps/scraper
bun cli.js https://anu-vue.netlify.app/guide/components/alert.html
# Scrape a basic website
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html
# Output will be saved to a markdown file
# Output: example-com-20240115-143022.md
# Scrape with beast mode for complex sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast
# Scrape with custom query for semantic filtering
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"
# Combine options
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast --query "Grab the Outlined Alert Code snippets" --output custom-name
The target URL to scrape. Must be the first argument.
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html
Choose the scraping strategy:
- `normal` (default): Fast extraction for static content
- `beast`: AI-powered extraction for interactive content
# Normal mode (default)
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode normal
# Beast mode for SPAs and dynamic content
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast
Natural language description of desired content for semantic filtering.
# Extract specific content
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"
Specify custom output filename.
# Custom filename
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --output my-content
Display help information and available options.
bun run cli:scraper --help
# Full syntax
bun run cli:scraper <URL> [OPTIONS]
# Options:
# -m, --mode <mode> Scraping mode: normal|beast (default: normal)
# -q, --query <query> Natural language content filter
# -o, --output <file> Output filename (default: auto-generated)
# -h, --help Show help information
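To drive the CLI from a script rather than typing commands by hand, a small TypeScript sketch using Bun's subprocess API could look like the following. The command and flags mirror the help output above; everything else (the URL and output name) is just an example.

```typescript
// run-scrape.ts — illustrative wrapper around the CLI using Bun's subprocess API.
// Run from the repository root so "bun run cli:scraper" resolves.
const url = "https://anu-vue.netlify.app/guide/components/alert.html";

const proc = Bun.spawn(
  ["bun", "run", "cli:scraper", url, "--mode", "beast", "--output", "alert-docs"],
  { stdout: "inherit", stderr: "inherit" }, // stream CLI output to this terminal
);

const exitCode = await proc.exited;
if (exitCode !== 0) {
  console.error(`Scrape failed with exit code ${exitCode}`);
}
```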
React-based web interface with modern UI and real-time scraping.
Before starting the web interface:
- Complete Setup: Follow the Quick Start Guide for initial setup
- API Server Running: The web interface requires the API server
- Environment Configured: Ensure the `.env` file is properly set up
# Step 1: Start the API server (Terminal 1)
bun run dev:server
# Step 2: Start the web interface (Terminal 2)
bun run dev:web
Access Points:
- Web Interface: http://localhost:6001
- API Server: http://localhost:8080
Benefits:
- Consistent workspace environment
- Automatic environment variable loading
- Coordinated development setup
# Alternative: Start from web directory
cd apps/web
bun dev
Note: This method also loads the root `.env` file automatically.
- API Server Health: Visit http://localhost:8080/health
- Web Interface: Open http://localhost:6001 in your browser
- Test Scraping: Try scraping https://example.com with normal mode
Type or paste the website URL you want to scrape (for example):
https://anu-vue.netlify.app/guide/components/alert.html
Normal Mode - For standard websites
Beast Mode - For complex applications
Use natural language to specify what content you want (for example):
- "Grab code snippets & API Reference"
Click the "Extract Content" button and watch real-time progress:
- Connecting: Establishing a connection to the target site
- Loading: Page loading and rendering
- Analyzing: AI-powered content understanding (Beast mode)
- Extracting: Converting to markdown format
- Complete: Content ready for use
# Check if API server is running
curl http://localhost:8080/health
# Should return:
{
"status": "healthy",
"service": "SniffHunt Scraper API",
"version": "1.0.0"
}
curl -X POST http://localhost:8080/scrape-sync \
-H "Content-Type: application/json" \
-d '{
"url": "https://anu-vue.netlify.app/guide/components/alert.html",
"mode": "normal",
"query": "Grab the Outlined Alert Code snippets"
}'
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"
- Open http://localhost:6001
- Enter URL: https://anu-vue.netlify.app/guide/components/alert.html
- Select mode: "Normal"
- Add query: "Grab the Outlined Alert Code snippets"
- Click "Extract Content"
Error: "API key not configured"
Solution: Ensure `GOOGLE_GEMINI_KEY` is set in your `.env` file with a valid Gemini API key.
Error: "Port 8080 already in use"
Solution: Close any other process that is using port 8080, or change the port in your `.env` file to one that is free:
PORT=8081
Error: "Browser not found"
Solution: Install Playwright browser dependencies (run this command in the root directory of the project):
cd apps/scraper && bunx playwright-core install --with-deps --only-shell chromium
- Content Research: Extract structured data from any website
- AI Workflows: Provide real-time web content to LLM applications
- Data Mining: Automated content extraction for analysis
- Documentation: Convert web content to markdown for documentation
- API Integration: RESTful endpoints for programmatic access
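As a small end-to-end illustration of the documentation and API integration use cases above, the sketch below calls the sync endpoint and writes the returned markdown to a file. Field names follow the API reference earlier in this document; the output filename is arbitrary.

```typescript
// save-docs.ts — illustrative: scrape a page and save its markdown for documentation.
const res = await fetch("http://localhost:8080/scrape-sync", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://anu-vue.netlify.app/guide/components/alert.html",
    mode: "normal",
  }),
});

const { success, content } = (await res.json()) as { success: boolean; content: string };
if (success) {
  await Bun.write("alert.md", content); // write the extracted markdown to disk
  console.log("Saved alert.md");
}
```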
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Direct Support: For enterprise integrations and custom requirements
- Star the Repository: GitHub
- Upvote on Peerlist
- Support Development: Buy Me Coffee
- Personal Use: Free
- Commercial Use: Contact for licensing
- Code redistribution/reselling: Not permitted
- License: See the LICENSE file for details.
Privacy & Compliance: SniffHunt is a true privacy-first solution that runs entirely on your infrastructure, ensuring your data never leaves your control.