AGENTS.md - DeepQuery Project Documentation

📌 Project Overview

DeepQuery is a web scraping and search tool built with Node.js (ES Modules) that provides multi-engine search capabilities (Bing, DuckDuckGo), image/news/video search, and screenshot capture functionality.

Core Features

  • Web Search: Multi-engine search with automatic failover
  • Image Search: Bing image extraction
  • News Search: News collection with metadata (title, source, date)
  • Video Search: Video extraction with thumbnails
  • Screenshots: Puppeteer-based screenshot capture (single, batch, element-specific)
  • LRU Cache: In-memory caching with configurable TTL
  • Proxy Rotation: Multi-proxy support with automatic blacklisting

πŸ—οΈ Project Architecture

DeepQuery/
├── src/
│   ├── index.mjs           # Main entry point - exports all public APIs
│   ├── config.mjs          # Centralized configuration
│   ├── screenshot.mjs      # Legacy screenshot script (standalone)
│   ├── utils/
│   │   ├── engineManager.mjs   # Search engine orchestration
│   │   ├── parser.mjs          # HTML parsing for all engines
│   │   ├── cache.mjs           # LRU cache implementation
│   │   ├── httpClient.mjs      # Axios client factory
│   │   ├── proxyManager.mjs    # Proxy rotation & blacklisting
│   │   ├── helpers.mjs         # Utility functions (retry, isBlocked)
│   │   └── screenshot.mjs      # Screenshot utilities (Puppeteer)
│   └── tests/
│       ├── data-search.mjs
│       ├── image-search.mjs
│       ├── news-search.mjs
│       ├── video-search.mjs
│       ├── screenshot-simples.mjs
│       ├── screenshot-batch.mjs
│       └── screenshot-element.mjs
├── assets/
│   └── agents.json         # User-Agent strings for rotation
└── package.json

📦 Module Responsibilities

src/index.mjs - Main Entry Point

Purpose: Public API exports and cache handling wrapper

Exports:

  • dataSearch(query, options) - Web search
  • imageSearch(query, options) - Image search
  • newsSearch(query, options) - News search
  • videoSearch(query, options) - Video search
  • screenshot(url, options) - Single screenshot
  • screenshotBatch(urls, options) - Batch screenshots
  • screenshotElement(url, selector, options) - Element screenshot

Cache Pattern:

function buildKey(type, query, lang, limit) {
  return `${type}:${query}:${lang}:${limit}`;
}

async function handleCache(key, fn, options) {
  const { cache, ttl, namespace } = options;
  if (cache) {
    const cached = getCache(key, namespace);
    if (cached) return cached;
  }
  const result = await fn();
  if (cache) {
    setCache(key, result, { ttl, namespace });
  }
  return result;
}

Options Pattern (all search functions):

{
  lang: "pt-BR",           // Default from CONFIG.search.defaultLang
  limit: 10,               // Default from CONFIG.search.defaultLimit
  cache: true,             // Enable/disable caching
  ttl: 60000               // Cache TTL in ms (default: CONFIG.cache.defaultTTL)
}
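
How the defaults combine with caller options is not shown here. A minimal sketch, assuming the defaults are merged shallowly and that `undefined` values fall back to the default (`DEFAULTS` and `resolveOptions` are hypothetical names; `buildKey` is copied from the cache pattern above):

```javascript
// Hypothetical defaults mirroring the documented CONFIG values.
const DEFAULTS = { lang: "pt-BR", limit: 10, cache: true, ttl: 60000 };

// Merge caller options over the defaults; undefined values keep the default.
function resolveOptions(options = {}) {
  const merged = { ...DEFAULTS };
  for (const [key, value] of Object.entries(options)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Same key format as the documented buildKey().
function buildKey(type, query, lang, limit) {
  return `${type}:${query}:${lang}:${limit}`;
}

const opts = resolveOptions({ lang: "en-US", limit: 5 });
console.log(buildKey("data", "openai", opts.lang, opts.limit)); // data:openai:en-US:5
```

Because `lang` and `limit` are part of the key, the same query with a different language or limit is cached separately.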

src/config.mjs - Configuration Hub

Purpose: Centralized configuration for all modules

Structure:

export const CONFIG = {
  engines: {
    list: ["bing", "duckduckgo"],
    endpoints: {
      bing: {
        search: (query, lang) => `https://www.bing.com/search?q=${query}&setlang=${lang}`,
        images: (query, lang) => `https://www.bing.com/images/search?q=${query}&setlang=${lang}`,
        news: (query, lang) => `https://www.bing.com/news/search?q=${query}&setlang=${lang}`,
        video: (query, lang) => `https://www.bing.com/videos/search?q=${query}&setlang=${lang}`
      },
      duckduckgo: {
        search: (query) => `https://html.duckduckgo.com/html/?q=${query}`
      }
    }
  },
  screenshot: {
    executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser",
    defaultWidth: 1280,
    defaultHeight: 800,
    defaultTimeout: 60000,
    defaultDelay: 3000
  },
  http: {
    timeout: 8000,
    retries: 3,
    baseDelay: 500
  },
  cache: {
    defaultTTL: 60000,
    maxItems: 500
  },
  proxy: {
    blockTTL: 60000
  },
  search: {
    defaultLang: "pt-BR",
    defaultLimit: 10
  }
};
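
The endpoint builders above interpolate `query` directly into the URL. Whether the query is escaped elsewhere before reaching them is not documented here. If it is not, queries containing spaces, `&`, or `#` would produce malformed URLs. A hedged sketch of an escaping variant (`bingSearchUrl` is a hypothetical name, not part of the project):

```javascript
// Hypothetical variant of the Bing search endpoint that escapes its inputs.
const bingSearchUrl = (query, lang) =>
  `https://www.bing.com/search?q=${encodeURIComponent(query)}&setlang=${encodeURIComponent(lang)}`;

console.log(bingSearchUrl("node.js & puppeteer", "pt-BR"));
// https://www.bing.com/search?q=node.js%20%26%20puppeteer&setlang=pt-BR
```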

src/utils/engineManager.mjs - Search Orchestration

Purpose: Manages search engine selection, failover, and result aggregation

Key Functions:

  • runSearch(query, lang) - Web search with engine failover
  • runImageSearch(query, lang) - Bing image search
  • runNewsSearch(query, lang) - Bing news search
  • runVideoSearch(query, lang) - Bing video search

Engine Failover Logic:

export async function runSearch(query, lang) {
  const engines = _.shuffle(ENGINES); // Randomize engine order
  
  for (const engine of engines) {
    try {
      const html = await retry(async (attempt) => {
        const client = createClient(lang);
        // ... fetch from engine endpoint
        if (isBlocked(data, status)) {
          reportBadProxy(proxyUsed);
          throw new Error(`blocked:${engine}:attempt${attempt}`);
        }
        return data;
      });
      
      const results = engine === "bing" ? parseBing(html) : parseDuck(html);
      
      if (results.length === 0) {
        throw new Error(`empty:${engine}`);
      }
      return results;
    } catch (err) {
      console.warn(`[engine fail] ${engine} -> ${err.message}`);
    }
  }
  return []; // All engines failed
}

src/utils/parser.mjs - HTML Parsing

Purpose: Extract structured data from search engine HTML responses

Parsers:

  • parseBing(html) - Extracts {title, link, description, favicon}
  • parseDuck(html) - Extracts {title, link, description}
  • parseImages(html) - Extracts {image, title}
  • parseNews(html) - Extracts {title, link, source, time}
  • parseVideos(html) - Extracts {title, link, thumbnail}

Example - Bing Parser:

export function parseBing(html) {
  const $ = cheerio.load(html);
  const results = [];
  
  $(".b_algo").each((i, el) => {
    const title = $(el).find("h2").text().trim();
    const link = $(el).find("h2 a").attr("href");
    const description = $(el).find(".b_caption p").text().trim();
    
    if (title && link) {
      results.push({
        title,
        link,
        description,
        favicon: `https://www.google.com/s2/favicons?domain=${link}`
      });
    }
  });
  
  return results;
}
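
Note that the parser above passes the full result URL as the `domain` parameter. A possible refinement, shown only as a sketch, is to pass the hostname instead (`faviconFor` is a hypothetical helper, not part of parser.mjs):

```javascript
// Hypothetical helper: build the favicon URL from the hostname of a result link.
function faviconFor(link) {
  try {
    const { hostname } = new URL(link); // throws on relative or malformed URLs
    return `https://www.google.com/s2/favicons?domain=${hostname}`;
  } catch {
    return null; // no favicon for links that are not absolute URLs
  }
}

console.log(faviconFor("https://example.com/docs/page?x=1"));
// https://www.google.com/s2/favicons?domain=example.com
```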

src/utils/cache.mjs - LRU Cache

Purpose: In-memory caching with TTL and LRU eviction

Key Features:

  • Namespace support (data:, image:, news:, video:)
  • TTL-based expiration
  • LRU eviction when maxItems exceeded
  • Auto-cleanup every 30 seconds

API:

getCache(key, namespace = "default")  // Returns cached value or null
setCache(key, value, { ttl, namespace })  // Sets cached value
clearCache(namespace = null)  // Clears cache (all or by namespace)

Internal Structure:

const store = new Map();
// Entry format: { value, expiresAt: timestamp | null }
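
The features above can be sketched with a single `Map`, whose insertion order doubles as recency order. This is an illustration under assumptions, not the actual cache.mjs: the `createCache` factory, the `namespace:key` composite key, and the injectable `now` argument are hypothetical, and the 30-second auto-cleanup timer is omitted.

```javascript
// Sketch of a Map-backed LRU cache with TTL (hypothetical factory form).
function createCache(maxItems = 500) {
  const store = new Map(); // oldest entry first, most recently used last
  const fullKey = (key, namespace) => `${namespace}:${key}`;

  function getCache(key, namespace = "default", now = Date.now()) {
    const k = fullKey(key, namespace);
    const entry = store.get(k);
    if (!entry) return null;
    if (entry.expiresAt !== null && entry.expiresAt <= now) {
      store.delete(k); // expired: drop it and report a miss
      return null;
    }
    store.delete(k); // re-insert to mark as most recently used
    store.set(k, entry);
    return entry.value;
  }

  function setCache(key, value, { ttl = null, namespace = "default" } = {}, now = Date.now()) {
    const k = fullKey(key, namespace);
    store.delete(k);
    store.set(k, { value, expiresAt: ttl ? now + ttl : null });
    while (store.size > maxItems) {
      store.delete(store.keys().next().value); // evict least recently used
    }
  }

  function clearCache(namespace = null) {
    if (namespace === null) {
      store.clear();
      return;
    }
    for (const k of [...store.keys()]) {
      if (k.startsWith(`${namespace}:`)) store.delete(k);
    }
  }

  return { getCache, setCache, clearCache, size: () => store.size };
}

const cache = createCache(2);
cache.setCache("openai", ["result"], { ttl: 60000, namespace: "data" });
console.log(cache.getCache("openai", "data")); // [ 'result' ]
```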

src/utils/httpClient.mjs - HTTP Client Factory

Purpose: Creates configured Axios instances with proxy and User-Agent rotation

Features:

  • Automatic proxy injection from proxyManager
  • Random User-Agent from assets/agents.json
  • Configurable timeout and headers
  • Does not reject on non-2xx status codes; the caller inspects the response status itself (for example via isBlocked)

Usage:

const client = createClient(lang);
const response = await client.get(url);
// response.data contains HTML

Headers:

  • User-Agent: Random from agents.json
  • Accept-Language: Based on lang parameter
  • Accept: text/html,application/xhtml+xml
  • Connection: keep-alive
  • Cache-Control: no-cache
  • Pragma: no-cache

src/utils/proxyManager.mjs - Proxy Management

Purpose: Proxy rotation with automatic blacklisting

Current State: Proxy list is empty by default (manual configuration required)

API:

getProxy()           // Returns next available proxy URL
reportBadProxy(proxy)  // Blacklists proxy for blockTTL duration

Blacklist Logic:

  • Proxies are blacklisted for CONFIG.proxy.blockTTL (60s default)
  • isBlocked(proxy) checks if proxy is currently blacklisted
  • getNextProxy() cycles through proxies, skipping blocked ones
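
The blacklist logic above can be sketched as a round-robin pool with per-proxy expiry timestamps. This is an illustration under assumptions, not the actual proxyManager.mjs: `createProxyPool`, `isBlacklisted`, and the injectable `now` argument are hypothetical, and returning `null` for an empty or fully blacklisted list is a guess at the behaviour.

```javascript
// Sketch of round-robin proxy rotation with TTL-based blacklisting.
function createProxyPool(proxies, blockTTL = 60000) {
  const blockedUntil = new Map(); // proxy URL -> timestamp when the block ends
  let index = 0;

  function isBlacklisted(proxy, now = Date.now()) {
    const until = blockedUntil.get(proxy);
    if (until === undefined) return false;
    if (until <= now) {
      blockedUntil.delete(proxy); // block expired
      return false;
    }
    return true;
  }

  function getProxy(now = Date.now()) {
    // Try each proxy at most once, starting from the rotation cursor.
    for (let i = 0; i < proxies.length; i++) {
      const proxy = proxies[index];
      index = (index + 1) % proxies.length;
      if (!isBlacklisted(proxy, now)) return proxy;
    }
    return null; // empty list, or every proxy is currently blacklisted
  }

  function reportBadProxy(proxy, now = Date.now()) {
    if (proxy) blockedUntil.set(proxy, now + blockTTL);
  }

  return { getProxy, reportBadProxy };
}

const pool = createProxyPool(["http://p1:8080", "http://p2:8080"]);
pool.reportBadProxy("http://p2:8080");
console.log(pool.getProxy()); // http://p1:8080
```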

src/utils/helpers.mjs - Utility Functions

Purpose: Common helper functions for retry logic and block detection

Functions:

isBlocked(html, status) - Detects CAPTCHA/blocks:

const patterns = [
  "captcha", "verify you are human", "access denied",
  "temporarily blocked", "unusual traffic", "/sorry/index",
  "cf-challenge", "attention required"
];
// Returns true if status >= 400, html.length < 800, or patterns match
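
A minimal sketch of the rule described in that comment. Case-insensitive matching and the handling of non-string bodies are assumptions; the actual helpers.mjs may differ.

```javascript
// Patterns copied from the list above.
const BLOCK_PATTERNS = [
  "captcha", "verify you are human", "access denied",
  "temporarily blocked", "unusual traffic", "/sorry/index",
  "cf-challenge", "attention required"
];

// Returns true when the response looks like a block or CAPTCHA page.
function isBlocked(html, status) {
  if (status >= 400) return true; // HTTP error
  if (typeof html !== "string" || html.length < 800) return true; // suspiciously short body
  const lower = html.toLowerCase(); // assumption: match case-insensitively
  return BLOCK_PATTERNS.some((pattern) => lower.includes(pattern));
}

console.log(isBlocked("<html>" + "x".repeat(1000) + "</html>", 200)); // false
console.log(isBlocked("Access Denied", 403)); // true
```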

retry(fn, options) - Exponential backoff retry:

// Retries fn() up to `retries` times
// Delay: baseDelay * 2^attempt + random(300ms)
// Default: 3 retries, 500ms base delay
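
The backoff described above can be sketched as follows. The assumptions: `retries` is the total number of attempts, `attempt` is zero-based, and the `jitter` option name is hypothetical. The actual helpers.mjs may count attempts differently.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn(attempt) until it resolves or the attempts are used up.
async function retry(fn, { retries = 3, baseDelay = 500, jitter = 300 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries - 1) {
        // Delay: baseDelay * 2^attempt plus up to `jitter` ms of random noise.
        await sleep(baseDelay * 2 ** attempt + Math.random() * jitter);
      }
    }
  }
  throw lastError; // every attempt failed
}
```

With the defaults, a call that fails twice waits roughly 500 ms and then 1000 ms (plus jitter) before the third attempt.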

src/utils/screenshot.mjs - Screenshot Utilities

Purpose: Puppeteer-based screenshot capture

Functions:

screenshot(url, options):

{
  outputPath: "./screenshot-{timestamp}.png",
  width: 1280,
  height: 800,
  fullPage: true,
  waitUntil: "domcontentloaded",
  timeout: 60000,
  delay: 3000,
  userAgent: "Mozilla/5.0...",
  executablePath: "/data/data/com.termux/files/usr/bin/chromium-browser"
}

screenshotBatch(urls, options):

  • Iterates through URLs sequentially
  • Adds batchDelay between captures
  • Returns array of results with success/error status
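
The sequential behaviour listed above can be sketched independently of Puppeteer by injecting the capture function. `runBatch`, `capture`, and the exact result shape are hypothetical; only `batchDelay` and the success/error status come from this document.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Captures URLs one at a time; a failure is recorded and the batch continues.
async function runBatch(urls, capture, { batchDelay = 0 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    try {
      const captured = await capture(url); // e.g. resolves to { path }
      results.push({ ...captured, success: true, url });
    } catch (err) {
      results.push({ success: false, url, error: err.message });
    }
    // Wait between captures, but not after the last one.
    if (batchDelay > 0 && i < urls.length - 1) await pause(batchDelay);
  }
  return results;
}
```

Because every URL yields exactly one entry, `results[i]` always corresponds to `urls[i]`, which is what the Batch Operations usage example relies on.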

screenshotElement(url, selector, options):

  • Waits for selector to be available
  • Captures only the matched element

🔄 Data Flow

Search Request Flow

User calls dataSearch(query, options)
    ↓
index.mjs: buildKey() creates cache key
    ↓
index.mjs: handleCache() checks cache
    ↓ (cache miss)
engineManager.mjs: runSearch()
    ↓
httpClient.mjs: createClient(lang) → Axios instance with proxy + UA
    ↓
helpers.mjs: retry() wrapper
    ↓
Engine endpoint (Bing/DuckDuckGo)
    ↓
helpers.mjs: isBlocked() check
    ↓ (if blocked)
proxyManager.mjs: reportBadProxy()
    ↓ (if success)
parser.mjs: parseBing() or parseDuck()
    ↓
index.mjs: setCache() stores result
    ↓
Returns results array

Screenshot Flow

User calls screenshot(url, options)
    ↓
Puppeteer.launch() with Chromium
    ↓
browser.newPage() → page.setViewport() → page.setUserAgent()
    ↓
page.goto(url, { waitUntil, timeout })
    ↓
Optional delay
    ↓
page.screenshot({ path, fullPage })
    ↓
browser.close()
    ↓
Returns { success, path, url, timestamp }

📋 Coding Conventions

Module Pattern

  • Use ES Modules (import/export)
  • Named exports preferred over default exports
  • Import order: Node.js built-in → npm packages → local modules

Error Handling

  • Catch errors at engine level, continue to next engine
  • Log warnings with console.warn() for non-critical failures
  • Throw errors for critical failures (screenshot, cache operations)

Naming Conventions

  • Files: camelCase for modules (engineManager.mjs), kebab-case for test files (data-search.mjs)
  • Functions: camelCase (runSearch, parseBing)
  • Constants: UPPER_SNAKE_CASE (CONFIG, DEFAULT_OPTIONS)
  • Classes: PascalCase (none currently used)

Documentation

  • JSDoc comments for public functions
  • Inline comments for complex logic
  • Console logs with emoji prefixes (📸, ✅, ❌, ⚠️)

🧪 Testing Patterns

Test File Structure

import { functionName } from "../index.mjs";

const res = await functionName("query", { /* options */ });
console.log(res);

Running Tests

# Individual tests
node src/tests/data-search.mjs
node src/tests/image-search.mjs
node src/tests/news-search.mjs
node src/tests/video-search.mjs
node src/tests/screenshot-simples.mjs
node src/tests/screenshot-batch.mjs
node src/tests/screenshot-element.mjs

# All tests (via npm)
npm test

🔐 Security Considerations

User-Agent Rotation

  • Agents loaded from assets/agents.json
  • Random selection per request via _.sample()
  • Prevents simple bot detection

Proxy Blacklisting

  • Automatic blacklisting on block detection
  • TTL-based unblocking (60s default)
  • Prevents repeated requests through blocked proxies

Block Detection

  • Pattern matching for CAPTCHA/block pages
  • Status code checking (>= 400)
  • Response size validation (< 800 chars = suspicious)

🛠️ Dependencies

| Package        | Purpose                                   |
|----------------|-------------------------------------------|
| axios          | HTTP client for web requests              |
| cheerio        | HTML parsing (jQuery-like syntax)         |
| lodash         | Utility functions (_.shuffle, _.sample)   |
| playwright     | Listed but not currently used             |
| puppeteer-core | Screenshot capture (headless Chromium)    |

📝 Environment Requirements

Chromium/Chrome

Required for screenshot functionality:

Termux (Android):

pkg install chromium

Ubuntu/Debian:

sudo apt install chromium-browser

macOS:

brew install chromium

Environment Variables

  • PUPPETEER_EXECUTABLE_PATH: Override Chromium path

🚀 Usage Examples

Basic Search

import { dataSearch } from "./src/index.mjs";

const results = await dataSearch("openai", {
  lang: "en-US",
  limit: 5,
  cache: true,
  ttl: 120000
});

Search + Screenshot

import { dataSearch, screenshot } from "./src/index.mjs";

async function searchAndCapture(query) {
  const results = await dataSearch(query, { limit: 5 });
  
  if (results.length > 0) {
    const shot = await screenshot(results[0].link, {
      outputPath: `./result-${Date.now()}.png`,
      fullPage: true,
      delay: 3000
    });
    return { search: results, screenshot: shot };
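  }
  // No results: return the same shape so callers do not receive undefined.
  return { search: results, screenshot: null };
}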

Batch Operations

import { dataSearch, screenshotBatch } from "./src/index.mjs";

async function captureTopResults(query, limit = 3) {
  const results = await dataSearch(query, { limit });
  const urls = results.map(r => r.link);
  
  const screenshots = await screenshotBatch(urls, {
    batchDelay: 3000,
    fullPage: false
  });
  
  return results.map((r, i) => ({
    ...r,
    screenshot: screenshots[i]?.path || null
  }));
}

🎯 Agent Instructions

When Modifying Code

  1. Check ./AGENTS.md first - All core rules and architecture are centralized here
  2. Follow existing patterns - Match coding style, naming conventions, and module structure
  3. Update tests - Add/modify tests in src/tests/ for new functionality
  4. Update config - Add new configuration options to src/config.mjs if needed
  5. Document changes - Add JSDoc comments for public functions

When Adding Features

  1. Search engines: Add endpoint to config.mjs, parser to parser.mjs, orchestration to engineManager.mjs
  2. Cache namespaces: Use consistent naming (data:, image:, news:, video:, etc.)
  3. New utilities: Create module in src/utils/, export from src/index.mjs
  4. Configuration: Never hardcode values - use CONFIG from config.mjs

When Debugging

  1. Check logs - Look for [engine fail], ❌ Erro, ⚠️ warnings
  2. Verify cache - Test with cache: false to rule out cache issues
  3. Proxy issues - Check proxyManager.mjs blacklist status
  4. Block detection - Review helpers.mjs:isBlocked() patterns

📚 Related Files

  • @CLAUDE.md - Claude-specific instructions (points to AGENTS.md)
  • @GEMINI.md - Gemini-specific instructions (points to AGENTS.md)
  • @README.md - User-facing documentation with usage examples
  • @package.json - Project metadata and dependencies
  • @assets/agents.json - User-Agent strings for rotation

⚠️ Important Notes

  1. AGENTS.md is the source of truth - CLAUDE.md and GEMINI.md both reference this file
  2. Do not duplicate context - If AGENTS.md is loaded, avoid redundant documentation
  3. Proxy list is empty - Must be manually configured in proxyManager.mjs
  4. Playwright unused - Listed in dependencies but not currently used
  5. Termux-optimized - Default Chromium path configured for Termux environment

Version: 1.0.0
Author: Alisson (mrx_dev)
License: GPL-3.0