AGENTS.md - DeepQuery Project Documentation

📌 Project Overview

DeepQuery is a web scraping and search tool built with Node.js (ES Modules). It provides multi-engine web search (Bing, DuckDuckGo), Bing image/news/video search, and Puppeteer-based screenshot capture.

Core Features

Web Search: Multi-engine search with automatic failover
Image Search: Bing image extraction
News Search: News collection with metadata (title, source, date)
Video Search: Video extraction with thumbnails
Screenshots: Puppeteer-based screenshot capture (single, batch, element-specific)
LRU Cache: In-memory caching with configurable TTL
Proxy Rotation: Multi-proxy support with automatic blacklisting

🏗️ Project Architecture

DeepQuery/
├── src/
│   ├── index.mjs           # Main entry point - exports all public APIs
│   ├── config.mjs          # Centralized configuration
│   ├── screenshot.mjs      # Legacy screenshot script (standalone)
│   ├── utils/
│   │   ├── engineManager.mjs   # Search engine orchestration
│   │   ├── parser.mjs          # HTML parsing for all engines
│   │   ├── cache.mjs           # LRU cache implementation
│   │   ├── httpClient.mjs      # Axios client factory
│   │   ├── proxyManager.mjs    # Proxy rotation & blacklisting
│   │   ├── helpers.mjs         # Utility functions (retry, isBlocked)
│   │   └── screenshot.mjs      # Screenshot utilities (Puppeteer)
│   └── tests/
│       ├── data-search.mjs
│       ├── image-search.mjs
│       ├── news-search.mjs
│       ├── video-search.mjs
│       ├── screenshot-simples.mjs
│       ├── screenshot-batch.mjs
│       └── screenshot-element.mjs
├── assets/
│   └── agents.json         # User-Agent strings for rotation
└── package.json

📦 Module Responsibilities

`src/index.mjs` - Main Entry Point

Purpose: Public API exports and cache handling wrapper

Exports:

dataSearch(query, options) - Web search
imageSearch(query, options) - Image search
newsSearch(query, options) - News search
videoSearch(query, options) - Video search
screenshot(url, options) - Single screenshot
screenshotBatch(urls, options) - Batch screenshots
screenshotElement(url, selector, options) - Element screenshot

Cache Pattern:

function buildKey(type, query, lang, limit) {
  return `${type}:${query}:${lang}:${limit}`;
}

async function handleCache(key, fn, options) {
  const { cache, ttl, namespace } = options;
  if (cache) {
    const cached = getCache(key, namespace);
    if (cached) return cached;
  }
  const result = await fn();
  if (cache) {
    setCache(key, result, { ttl, namespace });
  }
  return result;
}
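
A minimal sketch of the pattern in use, assuming a Map-backed stand-in for cache.mjs. The stubs, fakeSearch, cachedSearch, and the explicit null check on the cached value are illustrative assumptions, not the project's code:

```javascript
// Stand-ins for cache.mjs (assumption: the real module also handles TTL and eviction).
const store = new Map();
const getCache = (key, namespace = "default") =>
  store.has(`${namespace}:${key}`) ? store.get(`${namespace}:${key}`) : null;
const setCache = (key, value, { namespace = "default" } = {}) => {
  store.set(`${namespace}:${key}`, value);
};

function buildKey(type, query, lang, limit) {
  return `${type}:${query}:${lang}:${limit}`;
}

async function handleCache(key, fn, { cache = true, ttl, namespace } = {}) {
  if (cache) {
    const cached = getCache(key, namespace);
    if (cached !== null) return cached; // cache hit: skip the fetch
  }
  const result = await fn();
  if (cache) setCache(key, result, { ttl, namespace });
  return result;
}

// Hypothetical search function that counts how often it really runs.
let fetchCount = 0;
async function fakeSearch(query) {
  fetchCount += 1;
  return [{ title: `result for ${query}` }];
}

async function cachedSearch(query, options = {}) {
  const key = buildKey("data", query, "en-US", 5);
  return handleCache(key, () => fakeSearch(query), { namespace: "data", ...options });
}
```

Calling cachedSearch twice with the same query should run fakeSearch once; passing { cache: false } bypasses both the lookup and the store.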

Options Pattern (all search functions):

{
  lang: "pt-BR",           // Default from CONFIG.search.defaultLang
  limit: 10,               // Default from CONFIG.search.defaultLimit
  cache: true,             // Enable/disable caching
  ttl: 60000               // Cache TTL in ms (default: CONFIG.cache.defaultTTL)
}
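
One way such defaults could be applied is sketched below. DEFAULTS and resolveOptions are hypothetical names; the values simply mirror the documented CONFIG defaults:

```javascript
// Assumed defaults, mirroring the values documented for CONFIG.
const DEFAULTS = { lang: "pt-BR", limit: 10, cache: true, ttl: 60000 };

// Merge caller options over the defaults, ignoring keys set to undefined
// so that { lang: undefined } does not wipe out the default language.
function resolveOptions(options = {}) {
  const defined = Object.fromEntries(
    Object.entries(options).filter(([, value]) => value !== undefined)
  );
  return { ...DEFAULTS, ...defined };
}
```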

`src/config.mjs` - Configuration Hub

Purpose: Centralized configuration for all modules

Structure:

export const CONFIG = {
  engines: {
    list: ["bing", "duckduckgo"],
    endpoints: {
      bing: {
        search: (query, lang) => `https://www.bing.com/search?q=${query}&setlang=${lang}`,
        images: (query, lang) => `https://www.bing.com/images/search?q=${query}&setlang=${lang}`,
        news: (query, lang) => `https://www.bing.com/news/search?q=${query}&setlang=${lang}`,
        video: (query, lang) => `https://www.bing.com/videos/search?q=${query}&setlang=${lang}`
      },
      duckduckgo: {
        search: (query) => `https://html.duckduckgo.com/html/?q=${query}`
      }
    }
  },
  screenshot: {
    executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser",
    defaultWidth: 1280,
    defaultHeight: 800,
    defaultTimeout: 60000,
    defaultDelay: 3000
  },
  http: {
    timeout: 8000,
    retries: 3,
    baseDelay: 500
  },
  cache: {
    defaultTTL: 60000,
    maxItems: 500
  },
  proxy: {
    blockTTL: 60000
  },
  search: {
    defaultLang: "pt-BR",
    defaultLimit: 10
  }
};
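
The endpoint builders interpolate query as-is, so the caller presumably has to URL-encode it first (where the project does this is not shown in this document). A small sketch, with buildUrl as a hypothetical helper and a trimmed copy of the endpoint map:

```javascript
// Trimmed copy of the documented endpoint map, for illustration only.
const endpoints = {
  bing: {
    search: (query, lang) => `https://www.bing.com/search?q=${query}&setlang=${lang}`
  }
};

// Hypothetical helper: look up the builder and pass it an encoded query.
function buildUrl(engine, type, query, lang) {
  const build = endpoints[engine]?.[type];
  if (!build) throw new Error(`no endpoint for ${engine}/${type}`);
  return build(encodeURIComponent(query), lang);
}
```

With this helper, a query such as "node.js & deno" keeps its ampersand inside the q parameter instead of starting a new one.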

`src/utils/engineManager.mjs` - Search Orchestration

Purpose: Manages search engine selection, failover, and result aggregation

Key Functions:

runSearch(query, lang) - Web search with engine failover
runImageSearch(query, lang) - Bing image search
runNewsSearch(query, lang) - Bing news search
runVideoSearch(query, lang) - Bing video search

Engine Failover Logic:

export async function runSearch(query, lang) {
  const engines = _.shuffle(ENGINES); // Randomize engine order
  
  for (const engine of engines) {
    try {
      const html = await retry(async (attempt) => {
        const client = createClient(lang);
        // ... fetch from engine endpoint
        if (isBlocked(data, status)) {
          reportBadProxy(proxyUsed);
          throw new Error(`blocked:${engine}:attempt${attempt}`);
        }
        return data;
      });
      
      const results = engine === "bing" ? parseBing(html) : parseDuck(html);
      
      if (results.length === 0) {
        throw new Error(`empty:${engine}`);
      }
      return results;
    } catch (err) {
      console.warn(`[engine fail] ${engine} -> ${err.message}`);
    }
  }
  return []; // All engines failed
}
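
The failover loop can be exercised without network access by injecting the engines as functions. The sketch below assumes that shape; searchWithFailover and the local shuffle (standing in for _.shuffle) are illustrative names, not part of engineManager.mjs:

```javascript
// Fisher-Yates shuffle, standing in for lodash's _.shuffle.
function shuffle(items) {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Try each engine in random order; return the first non-empty result set.
// `engines` maps an engine name to an async function (query) => results[].
async function searchWithFailover(engines, query) {
  for (const [name, search] of shuffle(Object.entries(engines))) {
    try {
      const results = await search(query);
      if (results.length === 0) throw new Error(`empty:${name}`);
      return { engine: name, results };
    } catch (err) {
      console.warn(`[engine fail] ${name} -> ${err.message}`);
    }
  }
  return { engine: null, results: [] }; // all engines failed
}
```

Treating an empty result list as a failure matters here: a blocked or changed page often parses to zero results rather than raising an error.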

`src/utils/parser.mjs` - HTML Parsing

Purpose: Extract structured data from search engine HTML responses

Parsers:

parseBing(html) - Extracts {title, link, description, favicon}
parseDuck(html) - Extracts {title, link, description}
parseImages(html) - Extracts {image, title}
parseNews(html) - Extracts {title, link, source, time}
parseVideos(html) - Extracts {title, link, thumbnail}

Example - Bing Parser:

export function parseBing(html) {
  const $ = cheerio.load(html);
  const results = [];
  
  $(".b_algo").each((i, el) => {
    const title = $(el).find("h2").text().trim();
    const link = $(el).find("h2 a").attr("href");
    const description = $(el).find(".b_caption p").text().trim();
    
    if (title && link) {
      results.push({
        title,
        link,
        description,
        favicon: `https://www.google.com/s2/favicons?domain=${link}`
      });
    }
  });
  
  return results;
}
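
Note that the parser passes the full result URL as the domain parameter. If only the hostname is wanted there, it could be derived with the built-in URL class. faviconFor is a hypothetical helper, not something parser.mjs is known to contain:

```javascript
// Hypothetical helper: reduce a result link to its hostname for the favicon URL.
function faviconFor(link) {
  try {
    const { hostname } = new URL(link);
    return `https://www.google.com/s2/favicons?domain=${hostname}`;
  } catch {
    return null; // relative or malformed link: no favicon
  }
}
```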

`src/utils/cache.mjs` - LRU Cache

Purpose: In-memory caching with TTL and LRU eviction

Key Features:

Namespace support (data:, image:, news:, video:)
TTL-based expiration
LRU eviction when maxItems exceeded
Auto-cleanup every 30 seconds

API:

getCache(key, namespace = "default")  // Returns cached value or null
setCache(key, value, { ttl, namespace })  // Sets cached value
clearCache(namespace = null)  // Clears cache (all or by namespace)

Internal Structure:

const store = new Map();
// Entry format: { value, expiresAt: timestamp | null }
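
A Map keeps insertion order, which is enough to get LRU behaviour: deleting and re-inserting a key moves it to the newest end, and the first key is always the eviction candidate. The sketch below shows that idea together with the documented entry format. It leaves out namespaces and the periodic cleanup, takes `now` as a parameter for testability, and is an assumption about the approach rather than the contents of cache.mjs:

```javascript
const MAX_ITEMS = 3; // small for illustration; the documented default is 500
const store = new Map();

function setCache(key, value, { ttl = null, now = Date.now() } = {}) {
  store.delete(key); // re-inserting moves the key to the newest end
  store.set(key, { value, expiresAt: ttl ? now + ttl : null });
  while (store.size > MAX_ITEMS) {
    store.delete(store.keys().next().value); // evict the least recently used key
  }
}

function getCache(key, now = Date.now()) {
  const entry = store.get(key);
  if (!entry) return null;
  if (entry.expiresAt !== null && entry.expiresAt <= now) {
    store.delete(key); // expired entry
    return null;
  }
  store.delete(key); // refresh recency on read
  store.set(key, entry);
  return entry.value;
}
```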

`src/utils/httpClient.mjs` - HTTP Client Factory

Purpose: Creates configured Axios instances with proxy and User-Agent rotation

Features:

Automatic proxy injection from proxyManager
Random User-Agent from assets/agents.json
Configurable timeout and headers
Does not validate status (allows error handling in caller)

Usage:

const client = createClient(lang);
const response = await client.get(url);
// response.data contains HTML

Headers:

User-Agent: Random from agents.json
Accept-Language: Based on lang parameter
Accept: text/html,application/xhtml+xml
Connection: keep-alive
Cache-Control: no-cache
Pragma: no-cache
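
As an illustration, the header set above could be assembled like this. buildHeaders, the placeholder agent strings, and the exact Accept-Language format are assumptions; the document only says that the value is based on lang:

```javascript
// Placeholder stand-in for the contents of assets/agents.json.
const AGENTS = [
  "ExampleAgent/1.0 (placeholder A)",
  "ExampleAgent/1.0 (placeholder B)"
];

// `random` is injectable so the User-Agent choice can be made deterministic.
function buildHeaders(lang, agents = AGENTS, random = Math.random) {
  const userAgent = agents[Math.floor(random() * agents.length)];
  const base = lang.split("-")[0]; // "pt-BR" -> "pt"
  return {
    "User-Agent": userAgent,
    "Accept-Language": base === lang ? lang : `${lang},${base};q=0.9`,
    Accept: "text/html,application/xhtml+xml",
    Connection: "keep-alive",
    "Cache-Control": "no-cache",
    Pragma: "no-cache"
  };
}
```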

`src/utils/proxyManager.mjs` - Proxy Management

Purpose: Proxy rotation with automatic blacklisting

Current State: Proxy list is empty by default (manual configuration required)

API:

getProxy()           // Returns next available proxy URL
reportBadProxy(proxy)  // Blacklists proxy for blockTTL duration

Blacklist Logic:

Proxies are blacklisted for CONFIG.proxy.blockTTL (60s default)
isBlocked(proxy) checks if proxy is currently blacklisted (a proxy-level check, distinct from isBlocked(html, status) in helpers.mjs)
getNextProxy() cycles through proxies, skipping blocked ones
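
The blacklist logic above can be sketched as a round-robin cursor plus a Map of unblock times. The proxy URLs are placeholders, isProxyBlocked is an illustrative name, and `now` is passed in only to make the behaviour testable; none of this is taken from proxyManager.mjs:

```javascript
const BLOCK_TTL = 60000; // mirrors CONFIG.proxy.blockTTL
const proxies = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]; // placeholders
const blockedUntil = new Map(); // proxy URL -> timestamp when the block ends
let cursor = 0;

function reportBadProxy(proxy, now = Date.now()) {
  if (proxy) blockedUntil.set(proxy, now + BLOCK_TTL);
}

function isProxyBlocked(proxy, now = Date.now()) {
  const until = blockedUntil.get(proxy);
  if (until === undefined) return false;
  if (until <= now) {
    blockedUntil.delete(proxy); // TTL elapsed: unblock
    return false;
  }
  return true;
}

// Round-robin over the list, skipping blacklisted entries.
function getProxy(now = Date.now()) {
  for (let i = 0; i < proxies.length; i++) {
    const proxy = proxies[cursor % proxies.length];
    cursor += 1;
    if (!isProxyBlocked(proxy, now)) return proxy;
  }
  return null; // empty list or everything blocked
}
```

Returning null when nothing is available lets the caller decide whether to connect directly or give up.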

`src/utils/helpers.mjs` - Utility Functions

Purpose: Common helper functions for retry logic and block detection

Functions:

isBlocked(html, status) - Detects CAPTCHA/blocks:

const patterns = [
  "captcha", "verify you are human", "access denied",
  "temporarily blocked", "unusual traffic", "/sorry/index",
  "cf-challenge", "attention required"
];
// Returns true if status >= 400, html.length < 800, or patterns match
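
Put together, the three checks could look like the following sketch. Case-insensitive matching and treating non-string bodies as blocked are assumptions; the pattern list and thresholds are the ones documented above:

```javascript
const BLOCK_PATTERNS = [
  "captcha", "verify you are human", "access denied",
  "temporarily blocked", "unusual traffic", "/sorry/index",
  "cf-challenge", "attention required"
];

function isBlocked(html, status) {
  if (status >= 400) return true; // HTTP error
  if (typeof html !== "string" || html.length < 800) return true; // suspiciously small body
  const lower = html.toLowerCase();
  return BLOCK_PATTERNS.some((pattern) => lower.includes(pattern));
}
```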

retry(fn, options) - Exponential backoff retry:

// Retries fn() up to `retries` times
// Delay: baseDelay * 2^attempt + random(300ms)
// Default: 3 retries, 500ms base delay
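
A minimal sketch of such a helper, assuming a zero-based attempt counter and a jitter option (both assumptions; helpers.mjs may number attempts or name options differently):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn(attempt) up to `retries` times, waiting
// baseDelay * 2^attempt plus up to `jitter` ms between failed attempts.
async function retry(fn, { retries = 3, baseDelay = 500, jitter = 300 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries - 1) {
        await sleep(baseDelay * 2 ** attempt + Math.random() * jitter);
      }
    }
  }
  throw lastError; // every attempt failed
}
```

With the documented defaults, the waits between the three attempts are roughly 500 ms and 1000 ms, each plus up to 300 ms of jitter.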

`src/utils/screenshot.mjs` - Screenshot Utilities

Purpose: Puppeteer-based screenshot capture

Functions:

screenshot(url, options):

{
  outputPath: "./screenshot-{timestamp}.png",
  width: 1280,
  height: 800,
  fullPage: true,
  waitUntil: "domcontentloaded",
  timeout: 60000,
  delay: 3000,
  userAgent: "Mozilla/5.0...",
  executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser"
}

screenshotBatch(urls, options):

Iterates through URLs sequentially
Adds batchDelay between captures
Returns array of results with success/error status
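
The sequential loop can be sketched with the capture step injected as a function, so it runs without a browser. captureBatch and its result shape are assumptions modelled on the description above:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `capture` is any async function (url) => { path, ... }; in the real module
// this role would be played by the Puppeteer-based screenshot function.
async function captureBatch(urls, capture, { batchDelay = 0 } = {}) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    try {
      results.push({ success: true, url, ...(await capture(url)) });
    } catch (err) {
      results.push({ success: false, url, error: err.message }); // keep going
    }
    if (batchDelay > 0 && index < urls.length - 1) await sleep(batchDelay);
  }
  return results;
}
```

Because a failure is recorded instead of thrown, the results array stays index-aligned with the input URLs.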

screenshotElement(url, selector, options):

Waits for selector to be available
Captures only the matched element

🔄 Data Flow

Search Request Flow

User calls dataSearch(query, options)
    ↓
index.mjs: buildKey() creates cache key
    ↓
index.mjs: handleCache() checks cache
    ↓ (cache miss)
engineManager.mjs: runSearch()
    ↓
httpClient.mjs: createClient(lang) → Axios instance with proxy + UA
    ↓
helpers.mjs: retry() wrapper
    ↓
Engine endpoint (Bing/DuckDuckGo)
    ↓
helpers.mjs: isBlocked() check
    ↓ (if blocked)
proxyManager.mjs: reportBadProxy()
    ↓ (if success)
parser.mjs: parseBing() or parseDuck()
    ↓
index.mjs: setCache() stores result
    ↓
Returns results array

Screenshot Flow

User calls screenshot(url, options)
    ↓
puppeteer.launch() with Chromium
    ↓
browser.newPage() → page.setViewport() → page.setUserAgent()
    ↓
page.goto(url, { waitUntil, timeout })
    ↓
Optional delay
    ↓
page.screenshot({ path, fullPage })
    ↓
browser.close()
    ↓
Returns { success, path, url, timestamp }

📋 Coding Conventions

Module Pattern

Use ES Modules (import/export)
Named exports preferred over default exports
Import order: Node.js built-in → npm packages → local modules

Error Handling

Catch errors at engine level, continue to next engine
Log warnings with console.warn() for non-critical failures
Throw errors for critical failures (screenshot, cache operations)

Naming Conventions

Files: camelCase for modules (engineManager.mjs), kebab-case for test scripts (data-search.mjs)
Functions: camelCase (runSearch, parseBing)
Constants: UPPER_SNAKE_CASE (CONFIG, DEFAULT_OPTIONS)
Classes: PascalCase (none currently used)

Documentation

JSDoc comments for public functions
Inline comments for complex logic
Console logs with emoji prefixes (📸, ✅, ❌, ⚠️)

🧪 Testing Patterns

Test File Structure

import { functionName } from "../index.mjs";

const res = await functionName("query", { limit: 5 }); // any options from the Options Pattern
console.log(res);

Running Tests

# Individual tests
node src/tests/data-search.mjs
node src/tests/image-search.mjs
node src/tests/news-search.mjs
node src/tests/video-search.mjs
node src/tests/screenshot-simples.mjs
node src/tests/screenshot-batch.mjs
node src/tests/screenshot-element.mjs

# All tests (via npm)
npm test

🔐 Security Considerations

User-Agent Rotation

Agents loaded from assets/agents.json
Random selection per request via _.sample()
Makes simple User-Agent-based bot detection less likely

Proxy Blacklisting

Automatic blacklisting on block detection
TTL-based unblocking (60s default)
Prevents repeated requests through blocked proxies

Block Detection

Pattern matching for CAPTCHA/block pages
Status code checking (>= 400)
Response size validation (< 800 chars = suspicious)

🛠️ Dependencies

| Package | Purpose |
| --- | --- |
| `axios` | HTTP client for web requests |
| `cheerio` | HTML parsing (jQuery-like syntax) |
| `lodash` | Utility functions (`_.shuffle`, `_.sample`) |
| `playwright` | Listed but not currently used |
| `puppeteer-core` | Screenshot capture (headless Chromium) |

📝 Environment Requirements

Chromium/Chrome

Required for screenshot functionality:

Termux (Android):

pkg install chromium

Ubuntu/Debian:

sudo apt install chromium-browser

macOS:

brew install chromium

Environment Variables

PUPPETEER_EXECUTABLE_PATH: Override Chromium path
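
One plausible resolution order is an explicit option first, then the environment variable, then the Termux default. That precedence and the resolveExecutablePath name are assumptions, not confirmed project behaviour:

```javascript
const TERMUX_CHROMIUM = "/data/data/com.termux/files/usr/bin/chromium-browser";

// Explicit option wins, then PUPPETEER_EXECUTABLE_PATH, then the Termux default.
function resolveExecutablePath(options = {}, env = process.env) {
  return options.executablePath || env.PUPPETEER_EXECUTABLE_PATH || TERMUX_CHROMIUM;
}
```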

🚀 Usage Examples

Basic Search

import { dataSearch } from "./src/index.mjs";

const results = await dataSearch("openai", {
  lang: "en-US",
  limit: 5,
  cache: true,
  ttl: 120000
});

Search + Screenshot

import { dataSearch, screenshot } from "./src/index.mjs";

async function searchAndCapture(query) {
  const results = await dataSearch(query, { limit: 5 });
  
  if (results.length === 0) {
    return { search: results, screenshot: null }; // nothing to capture
  }

  const shot = await screenshot(results[0].link, {
    outputPath: `./result-${Date.now()}.png`,
    fullPage: true,
    delay: 3000
  });
  return { search: results, screenshot: shot };
}

Batch Operations

import { dataSearch, screenshotBatch } from "./src/index.mjs";

async function captureTopResults(query, limit = 3) {
  const results = await dataSearch(query, { limit });
  const urls = results.map(r => r.link);
  
  const screenshots = await screenshotBatch(urls, {
    batchDelay: 3000,
    fullPage: false
  });
  
  return results.map((r, i) => ({
    ...r,
    screenshot: screenshots[i]?.path || null
  }));
}

🎯 Agent Instructions

When Modifying Code

Check ./AGENTS.md first - All core rules and architecture are centralized here
Follow existing patterns - Match coding style, naming conventions, and module structure
Update tests - Add/modify tests in src/tests/ for new functionality
Update config - Add new configuration options to src/config.mjs if needed
Document changes - Add JSDoc comments for public functions

When Adding Features

Search engines: Add endpoint to config.mjs, parser to parser.mjs, orchestration to engineManager.mjs
Cache namespaces: Use consistent naming (data:, image:, news:, video:, etc.)
New utilities: Create module in src/utils/, export from src/index.mjs
Configuration: Never hardcode values - use CONFIG from config.mjs

When Debugging

Check logs - Look for [engine fail], ❌ Erro, ⚠️ warnings
Verify cache - Test with cache: false to rule out cache issues
Proxy issues - Check proxyManager.mjs blacklist status
Block detection - Review helpers.mjs:isBlocked() patterns

📚 Related Files

@CLAUDE.md - Claude-specific instruction (points to AGENTS.md)
@GEMINI.md - Gemini-specific instruction (points to AGENTS.md)
@README.md - User-facing documentation with usage examples
@package.json - Project metadata and dependencies
@assets/agents.json - User-Agent strings for rotation

⚠️ Important Notes

AGENTS.md is the source of truth - CLAUDE.md and GEMINI.md both reference this file
Do not duplicate context - If AGENTS.md is loaded, avoid redundant documentation
Proxy list is empty - Must be manually configured in proxyManager.mjs
Playwright unused - Listed in dependencies but not currently used
Termux-optimized - Default Chromium path configured for Termux environment

Version: 1.0.0
Author: Alisson (mrx_dev)
License: GPL-3.0
