DeepQuery is a web scraping and search tool built with Node.js (ES Modules) that provides multi-engine search capabilities (Bing, DuckDuckGo), image/news/video search, and screenshot capture functionality.
- Web Search: Multi-engine search with automatic failover
- Image Search: Bing image extraction
- News Search: News collection with metadata (title, source, date)
- Video Search: Video extraction with thumbnails
- Screenshots: Puppeteer-based screenshot capture (single, batch, element-specific)
- LRU Cache: In-memory caching with configurable TTL
- Proxy Rotation: Multi-proxy support with automatic blacklisting
DeepQuery/
├── src/
│   ├── index.mjs               # Main entry point - exports all public APIs
│   ├── config.mjs              # Centralized configuration
│   ├── screenshot.mjs          # Legacy screenshot script (standalone)
│   ├── utils/
│   │   ├── engineManager.mjs   # Search engine orchestration
│   │   ├── parser.mjs          # HTML parsing for all engines
│   │   ├── cache.mjs           # LRU cache implementation
│   │   ├── httpClient.mjs      # Axios client factory
│   │   ├── proxyManager.mjs    # Proxy rotation & blacklisting
│   │   ├── helpers.mjs         # Utility functions (retry, isBlocked)
│   │   └── screenshot.mjs      # Screenshot utilities (Puppeteer)
│   └── tests/
│       ├── data-search.mjs
│       ├── image-search.mjs
│       ├── news-search.mjs
│       ├── video-search.mjs
│       ├── screenshot-simples.mjs
│       ├── screenshot-batch.mjs
│       └── screenshot-element.mjs
├── assets/
│   └── agents.json             # User-Agent strings for rotation
└── package.json
Purpose: Public API exports and cache handling wrapper
Exports:
- dataSearch(query, options) - Web search
- imageSearch(query, options) - Image search
- newsSearch(query, options) - News search
- videoSearch(query, options) - Video search
- screenshot(url, options) - Single screenshot
- screenshotBatch(urls, options) - Batch screenshots
- screenshotElement(url, selector, options) - Element screenshot
Cache Pattern:
function buildKey(type, query, lang, limit) {
return `${type}:${query}:${lang}:${limit}`;
}
async function handleCache(key, fn, options) {
const { cache, ttl, namespace } = options;
if (cache) {
const cached = getCache(key, namespace);
if (cached) return cached;
}
const result = await fn();
if (cache) {
setCache(key, result, { ttl, namespace });
}
return result;
}
Options Pattern (all search functions):
{
lang: "pt-BR", // Default from CONFIG.search.defaultLang
limit: 10, // Default from CONFIG.search.defaultLimit
cache: true, // Enable/disable caching
ttl: 60000 // Cache TTL in ms (default: CONFIG.cache.defaultTTL)
}
Purpose: Centralized configuration for all modules
Structure:
export const CONFIG = {
engines: {
list: ["bing", "duckduckgo"],
endpoints: {
bing: {
search: (query, lang) => `https://www.bing.com/search?q=${query}&setlang=${lang}`,
images: (query, lang) => `https://www.bing.com/images/search?q=${query}&setlang=${lang}`,
news: (query, lang) => `https://www.bing.com/news/search?q=${query}&setlang=${lang}`,
video: (query, lang) => `https://www.bing.com/videos/search?q=${query}&setlang=${lang}`
},
duckduckgo: {
search: (query) => `https://html.duckduckgo.com/html/?q=${query}`
}
}
},
screenshot: {
executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser",
defaultWidth: 1280,
defaultHeight: 800,
defaultTimeout: 60000,
defaultDelay: 3000
},
http: {
timeout: 8000,
retries: 3,
baseDelay: 500
},
cache: {
defaultTTL: 60000,
maxItems: 500
},
proxy: {
blockTTL: 60000
},
search: {
defaultLang: "pt-BR",
defaultLimit: 10
}
};
Purpose: Manages search engine selection, failover, and result aggregation
Key Functions:
- runSearch(query, lang) - Web search with engine failover
- runImageSearch(query, lang) - Bing image search
- runNewsSearch(query, lang) - Bing news search
- runVideoSearch(query, lang) - Bing video search
Engine Failover Logic:
export async function runSearch(query, lang) {
const engines = _.shuffle(ENGINES); // Randomize engine order
for (const engine of engines) {
try {
const html = await retry(async (attempt) => {
const client = createClient(lang);
// ... fetch from engine endpoint
if (isBlocked(data, status)) {
reportBadProxy(proxyUsed);
throw new Error(`blocked:${engine}:attempt${attempt}`);
}
return data;
});
const results = engine === "bing" ? parseBing(html) : parseDuck(html);
if (results.length === 0) {
throw new Error(`empty:${engine}`);
}
return results;
} catch (err) {
console.warn(`[engine fail] ${engine} -> ${err.message}`);
}
}
return []; // All engines failed
}
Purpose: Extract structured data from search engine HTML responses
Parsers:
- parseBing(html) - Extracts {title, link, description, favicon}
- parseDuck(html) - Extracts {title, link, description}
- parseImages(html) - Extracts {image, title}
- parseNews(html) - Extracts {title, link, source, time}
- parseVideos(html) - Extracts {title, link, thumbnail}
Example - Bing Parser:
export function parseBing(html) {
const $ = cheerio.load(html);
const results = [];
$(".b_algo").each((i, el) => {
const title = $(el).find("h2").text().trim();
const link = $(el).find("h2 a").attr("href");
const description = $(el).find(".b_caption p").text().trim();
if (title && link) {
results.push({
title,
link,
description,
favicon: `https://www.google.com/s2/favicons?domain=${link}`
});
}
});
return results;
}
Purpose: In-memory caching with TTL and LRU eviction
Key Features:
- Namespace support (data:, image:, news:, video:)
- TTL-based expiration
- LRU eviction when maxItems exceeded
- Auto-cleanup every 30 seconds
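The features listed above can be illustrated with a small Map-based store. This is a minimal sketch, not the project's cache.mjs: the `fullKey` helper, the `MAX_ITEMS` constant, and the extra `now` parameter (added only to make expiry easy to exercise) are assumptions, and the periodic cleanup timer is left out.

```javascript
// Hypothetical sketch of a namespaced TTL + LRU store; not the actual cache.mjs.
const MAX_ITEMS = 500; // mirrors CONFIG.cache.maxItems
const store = new Map(); // a Map iterates in insertion order, oldest key first

// Assumption: namespaces are implemented as a key prefix.
function fullKey(key, namespace = "default") {
  return `${namespace}:${key}`;
}

function getCache(key, namespace = "default", now = Date.now()) {
  const k = fullKey(key, namespace);
  const entry = store.get(k);
  if (!entry) return null;
  if (entry.expiresAt !== null && entry.expiresAt <= now) {
    store.delete(k); // expired: drop the entry and report a miss
    return null;
  }
  // Re-insert so this key becomes the most recently used one.
  store.delete(k);
  store.set(k, entry);
  return entry.value;
}

function setCache(key, value, { ttl = null, namespace = "default" } = {}, now = Date.now()) {
  const k = fullKey(key, namespace);
  store.delete(k);
  store.set(k, { value, expiresAt: ttl ? now + ttl : null });
  // Evict least recently used entries (front of the Map) once over the limit.
  while (store.size > MAX_ITEMS) {
    store.delete(store.keys().next().value);
  }
}

function clearCache(namespace = null) {
  if (namespace === null) {
    store.clear();
    return;
  }
  for (const k of [...store.keys()]) {
    if (k.startsWith(`${namespace}:`)) store.delete(k);
  }
}
```

Relying on Map insertion order keeps both the recency update and the eviction cheap, at the cost of one delete plus one set on every hit.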
API:
getCache(key, namespace = "default") // Returns cached value or null
setCache(key, value, { ttl, namespace }) // Sets cached value
clearCache(namespace = null) // Clears cache (all or by namespace)
Internal Structure:
const store = new Map();
// Entry format: { value, expiresAt: timestamp | null }
Purpose: Creates configured Axios instances with proxy and User-Agent rotation
Features:
- Automatic proxy injection from proxyManager
- Random User-Agent from assets/agents.json
- Configurable timeout and headers
- Does not validate status (allows error handling in caller)
Usage:
const client = createClient(lang);
const response = await client.get(url);
// response.data contains HTML
Headers:
- User-Agent: Random from agents.json
- Accept-Language: Based on lang parameter
- Accept: text/html,application/xhtml+xml
- Connection: keep-alive
- Cache-Control: no-cache
- Pragma: no-cache
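As a rough illustration of how those headers and the "does not validate status" behavior could be assembled into an Axios request config, consider the sketch below. `buildClientConfig` is a hypothetical name, passing `lang` straight through as Accept-Language is an assumption, and proxy injection is omitted. `timeout`, `headers`, and `validateStatus` are standard Axios config options.

```javascript
// Hypothetical helper: the real httpClient.mjs may assemble its config differently.
function buildClientConfig(lang, agents, random = Math.random) {
  // Pick one User-Agent at random (the project uses lodash's _.sample for this).
  const userAgent = agents[Math.floor(random() * agents.length)];
  return {
    timeout: 8000, // mirrors CONFIG.http.timeout
    // Accept every status code so the caller can run its own block detection.
    validateStatus: () => true,
    headers: {
      "User-Agent": userAgent,
      "Accept-Language": lang, // assumption: the lang option is sent unchanged
      "Accept": "text/html,application/xhtml+xml",
      "Connection": "keep-alive",
      "Cache-Control": "no-cache",
      "Pragma": "no-cache",
    },
  };
}
```

An object like this could be handed to `axios.create()`. Because every status is accepted, a 403 or 429 response resolves instead of throwing, which is what lets the caller inspect it.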
Purpose: Proxy rotation with automatic blacklisting
Current State: Proxy list is empty by default (manual configuration required)
API:
getProxy() // Returns next available proxy URL
reportBadProxy(proxy) // Blacklists proxy for blockTTL duration
Blacklist Logic:
- Proxies are blacklisted for CONFIG.proxy.blockTTL (60s default)
- isBlocked(proxy) checks if proxy is currently blacklisted
- getNextProxy() cycles through proxies, skipping blocked ones
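The blacklist logic above could be sketched as a round-robin cursor plus a map of expiry timestamps. This is an illustration under assumptions rather than the contents of proxyManager.mjs: the helper is named `isProxyBlocked` here to keep it distinct from the HTML-level isBlocked() in helpers.mjs, the `now` parameter exists only for testing, and returning `null` when nothing is available is a guess.

```javascript
// Hypothetical sketch of round-robin rotation with a TTL blacklist.
const BLOCK_TTL = 60000; // mirrors CONFIG.proxy.blockTTL
const proxies = []; // empty by default, as in the project; fill in manually
const blockedUntil = new Map(); // proxy URL -> timestamp at which the block expires
let cursor = 0;

function isProxyBlocked(proxy, now = Date.now()) {
  const until = blockedUntil.get(proxy);
  if (until === undefined) return false;
  if (until <= now) {
    blockedUntil.delete(proxy); // block expired, proxy is usable again
    return false;
  }
  return true;
}

function getProxy(now = Date.now()) {
  // Try each proxy at most once per call, starting from the cursor.
  for (let i = 0; i < proxies.length; i++) {
    const proxy = proxies[cursor];
    cursor = (cursor + 1) % proxies.length;
    if (!isProxyBlocked(proxy, now)) return proxy;
  }
  return null; // none configured, or all currently blocked
}

function reportBadProxy(proxy, now = Date.now()) {
  if (proxy) blockedUntil.set(proxy, now + BLOCK_TTL);
}
```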
Purpose: Common helper functions for retry logic and block detection
Functions:
isBlocked(html, status) - Detects CAPTCHA/blocks:
const patterns = [
"captcha", "verify you are human", "access denied",
"temporarily blocked", "unusual traffic", "/sorry/index",
"cf-challenge", "attention required"
];
// Returns true if status >= 400, html.length < 800, or patterns match
retry(fn, options) - Exponential backoff retry:
// Retries fn() up to `retries` times
// Delay: baseDelay * 2^attempt + random(300ms)
// Default: 3 retries, 500ms base delay
Purpose: Puppeteer-based screenshot capture
Functions:
screenshot(url, options):
{
outputPath: "./screenshot-{timestamp}.png",
width: 1280,
height: 800,
fullPage: true,
waitUntil: "domcontentloaded",
timeout: 60000,
delay: 3000,
userAgent: "Mozilla/5.0...",
executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser"
}
screenshotBatch(urls, options):
- Iterates through URLs sequentially
- Adds batchDelay between captures
- Returns array of results with success/error status
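The sequential behavior described for screenshotBatch can be sketched independently of Puppeteer. In the example below, `runBatch` is a hypothetical name and `capture` stands in for the project's screenshot(url, options). The shape of the error entries is an assumption.

```javascript
// Hypothetical sketch; `capture` stands in for the project's screenshot(url, options).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runBatch(urls, capture, { batchDelay = 0, ...options } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    try {
      // One capture at a time, so only a single page is open at any moment.
      results.push(await capture(urls[i], options));
    } catch (err) {
      // A failed capture is recorded instead of aborting the whole batch.
      results.push({ success: false, url: urls[i], error: err.message });
    }
    // Pause between captures, but not after the last one.
    if (batchDelay > 0 && i < urls.length - 1) await sleep(batchDelay);
  }
  return results;
}
```

Running captures one after another is slower than running them in parallel, but it keeps memory use predictable on constrained devices such as a phone running Termux.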
screenshotElement(url, selector, options):
- Waits for selector to be available
- Captures only the matched element
User calls dataSearch(query, options)
↓
index.mjs: buildKey() creates cache key
↓
index.mjs: handleCache() checks cache
↓ (cache miss)
engineManager.mjs: runSearch()
↓
httpClient.mjs: createClient(lang) → Axios instance with proxy + UA
↓
helpers.mjs: retry() wrapper
↓
Engine endpoint (Bing/DuckDuckGo)
↓
helpers.mjs: isBlocked() check
↓ (if blocked)
proxyManager.mjs: reportBadProxy()
↓ (if success)
parser.mjs: parseBing() or parseDuck()
↓
index.mjs: setCache() stores result
↓
Returns results array
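The retry() step in the flow above is described as exponential backoff with a delay of baseDelay * 2^attempt plus up to 300ms of random jitter. A minimal sketch of such a wrapper follows. It assumes that `retries` counts total attempts, that attempts are numbered from 0, and that the `jitter` option exists; none of this is confirmed by helpers.mjs.

```javascript
// Hypothetical sketch of retry(fn, options); the real helpers.mjs may differ.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retry(fn, { retries = 3, baseDelay = 500, jitter = 300 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      // The attempt number is passed through so callers can log it.
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries - 1) break; // out of attempts, no point waiting
      // Exponential backoff: baseDelay * 2^attempt, plus up to `jitter` ms of noise.
      const delay = baseDelay * 2 ** attempt + Math.floor(Math.random() * jitter);
      await wait(delay);
    }
  }
  throw lastError;
}
```

With the defaults, a call that keeps failing waits roughly 500ms and then 1000ms before the third and final attempt. The jitter keeps parallel callers from retrying in lockstep.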
User calls screenshot(url, options)
↓
Puppeteer.launch() with Chromium
↓
browser.newPage() → setViewport() → setUserAgent()
↓
page.goto(url, { waitUntil, timeout })
↓
Optional delay
↓
page.screenshot({ path, fullPage })
↓
browser.close()
↓
Returns { success, path, url, timestamp }
- Use ES Modules (import/export)
- Named exports preferred over default exports
- Import order: Node.js built-in → npm packages → local modules
- Catch errors at engine level, continue to next engine
- Log warnings with console.warn() for non-critical failures
- Throw errors for critical failures (screenshot, cache operations)
- Files: camelCase for modules (engineManager.mjs), kebab-case for tests (data-search.mjs)
- Functions: camelCase (runSearch, parseBing)
- Constants: UPPER_SNAKE_CASE (CONFIG, DEFAULT_OPTIONS)
- Classes: PascalCase (none currently used)
- JSDoc comments for public functions
- Inline comments for complex logic
- Console logs with emoji prefixes (📸, ✅, ❌, ⚠️)
import { functionName } from "../index.mjs";
const res = await functionName("query", { options });
console.log(res);
# Individual tests
node src/tests/data-search.mjs
node src/tests/image-search.mjs
node src/tests/news-search.mjs
node src/tests/video-search.mjs
node src/tests/screenshot-simples.mjs
node src/tests/screenshot-batch.mjs
node src/tests/screenshot-element.mjs
# All tests (via npm)
npm test
- Agents loaded from assets/agents.json
- Random selection per request via _.sample()
- Prevents simple bot detection
- Automatic blacklisting on block detection
- TTL-based unblocking (60s default)
- Prevents repeated requests through blocked proxies
- Pattern matching for CAPTCHA/block pages
- Status code checking (>= 400)
- Response size validation (< 800 chars = suspicious)
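The three block-detection checks above (status code, response size, pattern matching) combine into a short predicate. The sketch below reuses the documented pattern list. Case-insensitive matching and treating a non-string body as blocked are assumptions.

```javascript
// Hypothetical sketch of isBlocked(html, status), built from the documented rules.
const BLOCK_PATTERNS = [
  "captcha", "verify you are human", "access denied",
  "temporarily blocked", "unusual traffic", "/sorry/index",
  "cf-challenge", "attention required",
];

function isBlocked(html, status) {
  if (status >= 400) return true; // HTTP error status
  // A real results page is much longer than 800 characters.
  if (typeof html !== "string" || html.length < 800) return true;
  const lower = html.toLowerCase(); // assumption: matching ignores case
  return BLOCK_PATTERNS.some((pattern) => lower.includes(pattern));
}
```

Substring matching is deliberately blunt: a legitimate page that mentions "captcha" in its results would also be flagged, so a false positive costs one retry.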
| Package | Purpose |
|---|---|
| axios | HTTP client for web requests |
| cheerio | HTML parsing (jQuery-like syntax) |
| lodash | Utility functions (_.shuffle, _.sample) |
| playwright | Listed but not currently used |
| puppeteer-core | Screenshot capture (headless Chromium) |
Required for screenshot functionality:
Termux (Android):
pkg install chromium
Ubuntu/Debian:
sudo apt install chromium-browser
macOS:
brew install chromium
PUPPETEER_EXECUTABLE_PATH: Override Chromium path
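One plausible way for the override to interact with the per-call option and the configured default is shown below. `resolveExecutablePath` is a hypothetical helper, and the precedence order (explicit option, then environment variable, then the Termux default) is an assumption rather than documented behavior.

```javascript
// Hypothetical helper; the precedence order is an assumption, not confirmed by the project.
const DEFAULT_CHROMIUM = "/data/data/com.termux/files/usr/bin/chromium-browser";

function resolveExecutablePath(options = {}, env = process.env) {
  return (
    options.executablePath ||        // explicit per-call option wins
    env.PUPPETEER_EXECUTABLE_PATH || // then the environment override
    DEFAULT_CHROMIUM                 // finally the Termux default from CONFIG.screenshot
  );
}
```

On a desktop Linux machine this would allow running `PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser node src/tests/screenshot-simples.mjs` without editing config.mjs, provided the project reads the variable in a comparable way.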
import { dataSearch } from "./src/index.mjs";
const results = await dataSearch("openai", {
lang: "en-US",
limit: 5,
cache: true,
ttl: 120000
});
import { dataSearch, screenshot } from "./src/index.mjs";
async function searchAndCapture(query) {
const results = await dataSearch(query, { limit: 5 });
if (results.length > 0) {
const shot = await screenshot(results[0].link, {
outputPath: `./result-${Date.now()}.png`,
fullPage: true,
delay: 3000
});
return { search: results, screenshot: shot };
}
}
import { dataSearch, screenshotBatch } from "./src/index.mjs";
async function captureTopResults(query, limit = 3) {
const results = await dataSearch(query, { limit });
const urls = results.map(r => r.link);
const screenshots = await screenshotBatch(urls, {
batchDelay: 3000,
fullPage: false
});
return results.map((r, i) => ({
...r,
screenshot: screenshots[i]?.path || null
}));
}
- Check ./AGENTS.md first - All core rules and architecture are centralized here
- Follow existing patterns - Match coding style, naming conventions, and module structure
- Update tests - Add/modify tests in src/tests/ for new functionality
- Update config - Add new configuration options to src/config.mjs if needed
- Document changes - Add JSDoc comments for public functions
- Search engines: Add endpoint to config.mjs, parser to parser.mjs, orchestration to engineManager.mjs
- Cache namespaces: Use consistent naming (data:, image:, news:, video:, etc.)
- New utilities: Create module in src/utils/, export from src/index.mjs
- Configuration: Never hardcode values - use CONFIG from config.mjs
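To make the "add endpoint to config.mjs" step concrete, the sketch below registers a placeholder engine in a CONFIG-shaped object and builds a search URL from it. The engine name `examplesearch`, its URL, and the `buildSearchUrl` helper are all hypothetical. Wrapping the query in `encodeURIComponent` is a suggestion for new endpoints, so that spaces and characters such as `&` do not break the URL.

```javascript
// Hypothetical sketch: "examplesearch" and its URL are placeholders, not a real engine.
const CONFIG = {
  engines: {
    list: ["bing", "duckduckgo", "examplesearch"],
    endpoints: {
      examplesearch: {
        // encodeURIComponent keeps spaces, "&" and "#" in the query from breaking the URL.
        search: (query, lang) =>
          `https://search.example/?q=${encodeURIComponent(query)}&lang=${encodeURIComponent(lang)}`,
      },
    },
  },
};

// Looks up the endpoint builder for an engine and fails loudly on unknown names.
function buildSearchUrl(engine, query, lang) {
  const endpoint = CONFIG.engines.endpoints[engine];
  if (!endpoint) throw new Error(`unknown engine: ${engine}`);
  return endpoint.search(query, lang);
}
```

A matching parser in parser.mjs and a branch in engineManager.mjs would still be needed before the new engine could return results.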
- Check logs - Look for [engine fail], ❌ Erro, ⚠️ warnings
- Verify cache - Test with cache: false to rule out cache issues
- Proxy issues - Check proxyManager.mjs blacklist status
- Block detection - Review helpers.mjs:isBlocked() patterns
- @CLAUDE.md - Claude-specific instruction (points to AGENTS.md)
- @GEMINI.md - Gemini-specific instruction (points to AGENTS.md)
- @README.md - User-facing documentation with usage examples
- @package.json - Project metadata and dependencies
- @assets/agents.json - User-Agent strings for rotation
- AGENTS.md is the source of truth - CLAUDE.md and GEMINI.md both reference this file
- Do not duplicate context - If AGENTS.md is loaded, avoid redundant documentation
- Proxy list is empty - Must be manually configured in proxyManager.mjs
- Playwright unused - Listed in dependencies but not currently used
- Termux-optimized - Default Chromium path configured for Termux environment
Version: 1.0.0
Author: Alisson (mrx_dev)
License: GPL-3.0