Advanced Web Scraper API

A robust and flexible API for web scraping with advanced capabilities for handling complex scenarios such as CAPTCHA solving, human behavior emulation, proxy management, and multi-step navigation.

Project Status

Current Phase: Initial Implementation (40% Complete)

The project has progressed from planning to initial implementation. Core components have been implemented, including browser automation, human behavior emulation, CAPTCHA detection, proxy management, the navigation engine, and data extraction. The API foundation is in place with functional endpoints for scraping.

Next steps include implementing storage adapters for different destinations, enhancing CAPTCHA solving strategies, adding comprehensive testing, and expanding API documentation.

Features

  • CAPTCHA Solving

    • reCAPTCHA v2/v3 detection and solving
    • hCaptcha detection and solving
    • Image CAPTCHA handling
    • Cloudflare challenge bypass
  • Human Behavior Emulation

    • Natural mouse movements using Bezier curves
    • Variable typing speeds with occasional typos
    • Realistic scrolling behavior
    • Configurable behavior profiles (fast, average, careful)
  • Proxy Management

    • Support for HTTP, HTTPS, SOCKS4, and SOCKS5 proxies
    • Intelligent rotation with session persistence
    • Health checking and performance monitoring
    • Automatic fallback mechanisms
  • Website Navigation & Crawling

    • Multi-step navigation flows with state management
    • Form filling with human-like input
    • Conditional logic and branching
    • Pagination handling
    • Deep crawling with filters
  • Data Extraction

    • Multiple selector strategies (CSS, XPath, regex, function)
    • Structured data extraction with nested fields
    • Data cleaning and transformation pipeline
    • Schema-based validation and type conversion
    • Declarative configuration approach
  • LLM Integration

    • OpenRouter adapter for advanced language model capabilities
    • Support for multiple model providers through OpenRouter
    • Configurable model selection and parameters
    • Automatic API key management

Configuration Templates

A library of example JSON configurations demonstrating various scraping scenarios and API features is available in the config-templates/ directory. These templates serve as practical examples and starting points for your own configurations.

Project Structure

/src
  /api                 # API layer
    /middleware        # Express middleware
    /routes            # API route definitions
    /controllers       # Route controllers (coming soon)
    /validators        # Request validation (coming soon)
  
  /core                # Core functionality
    /browser           # Browser automation
    /captcha           # CAPTCHA detection and solving
    /human             # Human behavior emulation
    /proxy             # Proxy management
  
  /navigation          # Navigation engine
  /extraction          # Data extraction engine
    /selectors         # Selector strategies
  /storage             # Data storage
    /adapters          # Storage adapters for different destinations
  
  /utils               # Utility functions
  /config              # Configuration management
  /types               # TypeScript type definitions
  
  app.ts               # Express application setup
  server.ts            # Server entry point

Tech Stack

  • Node.js with TypeScript (ESM modules)
  • Express.js for API endpoints
  • Playwright for browser automation
  • MongoDB for data storage
  • Redis for caching and job queues

Core Components

Browser Automation

The browser automation module uses Playwright to control browsers with a pool management system for efficient resource usage:

// Example: Getting a browser from the pool
const browserPool = BrowserPool.getInstance();
const browser = await browserPool.getBrowser({
  browserType: 'chromium',
  headless: true,
});

// Create a context with specific options
const context = await browserPool.createContext(browser, {
  viewport: { width: 1920, height: 1080 },
  userAgent: 'Mozilla/5.0 ...',
});

// Release the browser when done
browserPool.releaseBrowser(browser);

Human Behavior Emulation

The human behavior emulation module provides realistic human-like interactions:

// Example: Using the behavior emulator
const behaviorEmulator = new BehaviorEmulator(page, {
  profile: 'average', // 'fast', 'average', or 'careful'
});

// Move mouse to an element with natural motion
await behaviorEmulator.moveMouseToElement('#search-button');

// Click with realistic timing
await behaviorEmulator.clickElement('#search-button');

// Type text with variable speed and occasional typos
await behaviorEmulator.typeText('search query', {
  mistakes: true,
  variableSpeed: true,
});

// Scroll naturally
await behaviorEmulator.scroll('down', 500);

// Add a "thinking" pause
await behaviorEmulator.think();

CAPTCHA Detection and Solving

The CAPTCHA module detects and solves various types of CAPTCHAs:

// Example: Detecting and solving CAPTCHAs
const captchaDetector = new CaptchaDetector();
const captchaSolver = new CaptchaSolver(page);

// Detect CAPTCHA
const detection = await captchaDetector.detect(page);

if (detection.detected) {
  console.log(`Detected ${detection.type} CAPTCHA`);
  
  // Solve the CAPTCHA
  const result = await captchaSolver.solve({
    useExternalService: true,
    timeout: 60000,
  });
  
  if (result.success) {
    console.log('CAPTCHA solved successfully');
  }
}

Proxy Management

The proxy management module handles proxy rotation, health checking, and performance monitoring:

// Example: Using the proxy manager
const proxyManager = ProxyManager.getInstance();

// Get a proxy with specific options
const proxy = await proxyManager.getProxy({
  country: 'US',
  type: 'http',
  session: 'session1',
});

// Report proxy result
proxyManager.reportProxyResult(proxy, true, 250); // success, response time

// Get proxy statistics
const stats = proxyManager.getProxyStats();
console.log(`Healthy proxies: ${stats.healthy}/${stats.total}`);

Navigation Engine

The navigation engine executes multi-step navigation flows. It has recently been refactored to use a handler-based architecture (Strategy/Factory pattern), where each step type (goto, click, input, etc.) has its own dedicated handler class. This makes the engine more modular and extensible.

// Example: Using the navigation engine
const navigationEngine = new NavigationEngine(page, {
  timeout: 30000,
  solveCaptcha: true,
});

// Execute a navigation flow
const result = await navigationEngine.executeFlow(
  'https://example.com/search',
  [
    {
      type: 'input',
      selector: '#search-input',
      value: 'search term',
      humanInput: true,
    },
    {
      type: 'click',
      selector: '#search-button',
      waitFor: '.results-container',
    },
    {
      type: 'extract',
      name: 'searchResults',
      selector: '.result-item',
      fields: {
        title: '.result-title',
        url: '.result-link@href',
      },
    },
  ],
  { keywords: 'initial context' }
);

console.log(`Navigation completed with ${result.stepsExecuted} steps`);
console.log('Extracted data:', result.result);
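
The handler-based refactor mentioned above is not visible in the example, but its general shape would look roughly like the sketch below. The interface and class names here are illustrative assumptions, not the project's actual identifiers:

// Illustrative sketch of a Strategy/Factory step-handler design (hypothetical names)
interface NavigationStep {
  type: string;
  [key: string]: unknown;
}

interface StepHandler {
  // True if this handler is responsible for the given step type
  canHandle(step: NavigationStep): boolean;
  // Execute the step against the current page and the shared flow context
  execute(step: NavigationStep, context: Record<string, unknown>): Promise<void>;
}

class ClickStepHandler implements StepHandler {
  canHandle(step: NavigationStep): boolean {
    return step.type === 'click';
  }
  async execute(step: NavigationStep, context: Record<string, unknown>): Promise<void> {
    // Locate the element, click it with human-like timing, then honour waitFor/timeout if present.
  }
}

// A factory keeps a registry of handlers, so supporting a new step type only requires adding a new handler class.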

Getting Started

Prerequisites

  • Node.js (v18.x or later)
  • MongoDB (coming soon)
  • Redis (coming soon)

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/adv-web-scraper-api.git
    cd adv-web-scraper-api
  2. Install dependencies:

    npm install
  3. Create a .env file based on the provided .env.example:

    cp .env.example .env
  4. Update the .env file with your configuration settings.

Development

Start the development server with hot reloading:

npm run dev

Building for Production

Build the TypeScript code:

npm run build

Start the production server:

npm start

API Documentation

Health Check Endpoint

# Check API health
curl -X GET http://localhost:3000/health
# Check API v1 health
curl -X GET http://localhost:3000/api/v1/health

Scraping Endpoints

# Scrape data from a URL
curl -X POST http://localhost:3000/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://loyalleads.co.uk",
    "fields": {
      "title": {
        "type": "css",
        "name": "title",
        "selector": "title"
      },
      "products": {
        "selector": ".product-item",
        "type": "css",
        "multiple": true,
        "fields": {
          "name": {
            "type": "css",
            "name": "name",
            "selector": ".product-name"
          },
          "price": {
            "type": "css",
            "name": "price",
            "selector": ".product-price"
          },
          "description": {
            "type": "css",
            "name": "description",
            "selector": ".product-description"
          }
        }
      }
    },
    "options": {
      "javascript": true,
      "proxy": true,
      "humanEmulation": true
    }
  }'

# Example response:
# {
#    "success": true,
#    "message": "Extraction completed successfully",
#    "data": {
#        "id": "extract_6ca58c6f-f199-4b3e-bb7b-32dbad81c076",
#        "url": "https://loyalleads.co.uk",
#        "status": "completed",
#        "data": {
#            "title": "Loyalleads is a premier software development agency based in Cardiff, renowned for its expertise in creating sophisticated web applications, bespoke websites, and effective online marketing flows. With a strong focus on both B2B and B2C markets, Loyalleads has established a reputation for delivering high-quality, tailored solutions that meet the unique needs of each client.",
#            "products": []
#        },
#        "stats": {
#            "startTime": "2025-04-05T18:58:10.675Z",
#            "endTime": "2025-04-05T18:58:17.220Z",
#            "duration": 6545,
#            "pagesProcessed": 1,
#            "itemsExtracted": 1
#        },
#        "timestamp": "2025-04-05T18:58:17.220Z"
#    },
#    "timestamp": "2025-04-05T18:58:17.220Z"
#}
# Get the result of a scraping job
curl -X GET http://localhost:3000/api/v1/scrape/extract_6ca58c6f-f199-4b3e-bb7b-32dbad81c076

# Example response:
# {
#    "success": true,
#    "message": "Extraction completed successfully",
#    "data": {
#        "id": "extract_6ca58c6f-f199-4b3e-bb7b-32dbad81c076",
#        "url": "https://loyalleads.co.uk",
#        "status": "completed",
#        "data": {
#            "title": "Loyalleads is a premier software development agency based in Cardiff, renowned for its expertise in creating sophisticated web applications, bespoke websites, and effective online marketing flows. With a strong focus on both B2B and B2C markets, Loyalleads has established a reputation for delivering high-quality, tailored solutions that meet the unique needs of each client.",
#            "products": []
#        },
#        "stats": {
#            "startTime": "2025-04-05T18:58:10.675Z",
#            "endTime": "2025-04-05T18:58:17.220Z",
#            "duration": 6545,
#            "pagesProcessed": 1,
#            "itemsExtracted": 1
#        },
#        "timestamp": "2025-04-05T18:58:17.220Z"
#    },
#    "timestamp": "2025-04-05T18:58:17.220Z"
#}

CAPTCHA Endpoints

# Solve a CAPTCHA challenge
curl -X POST http://localhost:3000/api/v1/captcha/solve \
  -H "Content-Type: application/json" \
  -d '{
    "captchaType": "recaptcha_v2",
    "url": "https://example.com/login",
    "siteKey": "6LeIxAcTAAAAAJcZVRqyHh71UMIEGNQ_MXjiZKhI",
    "options": {
      "useExternalService": true,
      "service": "twocaptcha"
    }
  }'

# Example response:
# {
#   "success": true,
#   "message": "CAPTCHA solved successfully",
#   "data": {
#     "captchaType": "recaptcha_v2",
#     "url": "https://example.com/login",
#     "solved": true,
#     "token": "03AGdBq24PBCbwiDRgC3...rqAHSjxEc3wNkJqVNCjnk",
#     "timestamp": "2025-05-04T17:31:56.789Z"
#   }
# }
# Get available CAPTCHA solving services
curl -X GET http://localhost:3000/api/v1/captcha/services

# Example response:
# {
#   "success": true,
#   "message": "CAPTCHA services retrieved",
#   "data": {
#     "services": [
#       {
#         "name": "internal",
#         "enabled": true,
#         "types": ["recaptcha", "hcaptcha", "image"]
#       },
#       {
#         "name": "2captcha",
#         "enabled": false,
#         "types": ["recaptcha", "hcaptcha", "image", "funcaptcha"]
#       }
#     ]
#   }
# }

Proxy Endpoints

# Get proxy status and statistics
curl -X GET http://localhost:3000/api/v1/proxy
# Test a specific proxy
curl -X POST http://localhost:3000/api/v1/proxy/test \
  -H "Content-Type: application/json" \
  -d '{
    "host": "203.0.113.1",
    "port": 8080,
    "type": "http",
    "username": "user",
    "password": "pass"
  }'

# Example response:
# {
#   "success": true,
#   "message": "Proxy test successful",
#   "data": {
#     "host": "203.0.113.1",
#     "port": 8080,
#     "type": "http",
#     "working": true,
#     "responseTime": 250,
#     "ip": "203.0.113.1",
#     "country": "US",
#     "timestamp": "2025-05-04T17:32:56.789Z"
#   }
# }
# Rotate to a new proxy
curl -X POST http://localhost:3000/api/v1/proxy/rotate \
  -H "Content-Type: application/json" \
  -d '{
    "country": "US",
    "type": "http",
    "session": "session1"
  }'

# Example response:
# {
#   "success": true,
#   "message": "Proxy rotation successful",
#   "data": {
#     "host": "203.0.113.2",
#     "port": 8080,
#     "type": "http",
#     "country": "US",
#     "session": "session1",
#     "timestamp": "2025-05-04T17:33:56.789Z"
#   }
# }

Navigation Endpoints

Navigation Step Types

The navigation engine supports the following step types for multi-step navigation flows:

  • goto: Navigate to a URL. Required: value (URL). Optional: waitFor, timeout
  • click: Click on an element. Required: selector. Optional: waitFor, timeout
  • input: Enter text into an input field. Required: selector, value. Optional: clearInput, humanInput, waitFor, timeout
  • select: Select an option from a dropdown. Required: selector, value. Optional: waitFor, timeout
  • wait: Wait for a condition. Required: value (selector or time in ms) or waitFor. Optional: timeout
  • extract: Extract data from the page. Required: selector, name. Optional: fields, list, attribute
  • condition: Conditional branching. Required: condition. Optional: thenSteps, elseSteps
  • paginate: Handle pagination. Required: selector. Optional: maxPages, extractSteps, waitFor, timeout

Navigation Step Parameters

  • selector (string): CSS selector to locate elements
  • value (string or number): Value to use (URL, text input, wait time, etc.)
  • waitFor (string or number): Selector to wait for, time in ms, or special values like navigation or networkidle
  • timeout (number): Timeout in milliseconds for the step
  • humanInput (boolean): Whether to simulate human typing (with variable speed and occasional typos)
  • clearInput (boolean): Whether to clear the input field before typing
  • name (string): Name to use for extracted data in the result context
  • fields (object): Field definitions for structured data extraction
  • list (boolean): Whether to extract a list of items
  • attribute (string): Element attribute to extract (e.g., href, src)
  • condition (string or function): Condition to evaluate (selector or function)
  • thenSteps (array): Steps to execute if the condition is true
  • elseSteps (array): Steps to execute if the condition is false
  • maxPages (number): Maximum number of pages to paginate through
  • extractSteps (array): Steps to execute on each page during pagination
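
The condition step type does not appear in the examples elsewhere in this README, so here is a minimal, hypothetical sketch built only from the parameters listed above (the selectors are illustrative):

// Example: a conditional step that accepts a cookie banner only if one is present
const conditionStep = {
  type: 'condition',
  condition: '.cookie-banner',          // a selector: the condition is whether it matches
  thenSteps: [
    { type: 'click', selector: '.cookie-banner .accept-button' },
  ],
  elseSteps: [
    { type: 'wait', value: 500 },       // otherwise just pause briefly
  ],
};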

Template Variables

You can use template variables in your navigation steps by enclosing them in double curly braces: {{variableName}}. These variables will be replaced with values from the variables object provided in the request.
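
For instance, a step value of {{keywords}} is filled in from variables.keywords before the step runs (the values below are illustrative):

// Example: template variable substitution in a navigation step
const steps = [
  {
    type: 'input',
    selector: '#search-input',
    value: '{{keywords}}',   // replaced with variables.keywords when the flow executes
    humanInput: true,
  },
];
const variables = { keywords: 'example search term' };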

For more detailed examples of navigation configurations, see the Navigation Examples documentation.

# Execute a multi-step navigation flow
curl -X POST http://localhost:3000/api/v1/navigate \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://example.com/search",
    "steps": [
      {
        "type": "input",
        "selector": "#ta_searchInput",
        "value": "{{keyword}}",
        "humanInput": true
      },
      {
        "type": "click",
        "selector": ".dsrm_button",
        "waitFor": "#Search_propertySearchCriteria__Wn7r_"
      },
      {
        "type": "extract",
        "name": "searchResults",
        "selector": ".Search_titleContainer__HQ9QA",
        "fields": {
          "title": ".result-title",
          "url": ".result-link@href"
        }
      }
    ],
    "variables": {
      "keywords": "CF40 2RS"
    },
    "options": {
      "solveCaptcha": true,
      "proxyRegion": "US",
      "timeout": 60000
    }
  }'

Explanation of the Navigation Steps

This example demonstrates a multi-step navigation flow that:

  1. Starts the navigation: Opens a browser and navigates to "https://example.com/search"

  2. Enters search text:

    • Finds the search input field with ID "ta_searchInput"
    • Types "CF40 2RS" (from the variables object) into the field
    • Uses human-like typing with variable speed and occasional typos (because humanInput is true)
  3. Clicks the search button:

    • Finds and clicks on an element with class "dsrm_button" (likely the search button)
    • Waits for an element with ID "Search_propertySearchCriteria__Wn7r_" to appear before proceeding
    • This ensures the search results have loaded before moving to the next step
  4. Extracts search results:

    • Finds elements matching the selector ".Search_titleContainer__HQ9QA"
    • For each matching element, extracts:
      • The title (text content of elements matching ".result-title")
      • The URL (href attribute of elements matching ".result-link")
    • Stores the extracted data in the result context under the name "searchResults"
  5. Additional options:

    • Automatically solves any CAPTCHAs encountered during navigation
    • Uses a proxy server from the US region
    • Sets a timeout of 60 seconds for each step

The navigation engine handles all the browser automation, human-like behavior, and CAPTCHA solving behind the scenes, making it easy to automate complex web interactions with a simple declarative configuration.

# Example response:
# {
#   "success": true,
#   "message": "Navigation job started",
#   "data": {
#     "id": "nav_1683213456789",
#     "startUrl": "https://example.com/search",
#     "stepsExecuted": 3,
#     "status": "completed",
#     "timestamp": "2025-05-04T17:34:56.789Z"
#   }
# }
# Get the result of a navigation job
curl -X GET http://localhost:3000/api/v1/navigate/nav_1234567890

# Example response:
# {
#   "success": true,
#   "message": "Navigation result retrieved",
#   "data": {
#     "id": "nav_1234567890",
#     "status": "completed",
#     "result": {
#       "pagesVisited": 3,
#       "dataExtracted": {
#         "title": "Example Page Title",
#         "items": [
#           { "name": "Item 1", "price": "$10.99" },
#           { "name": "Item 2", "price": "$24.99" },
#           { "name": "Item 3", "price": "$5.99" }
#         ]
#       }
#     },
#     "timestamp": "2025-05-04T17:35:56.789Z"
#   }
# }
# Start a crawling job
curl -X POST http://localhost:3000/api/v1/navigate/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://example.com/products",
    "maxPages": 5,
    "selectors": {
      "links": ".product-link@href",
      "nextPage": ".pagination .next@href"
    },
    "filters": {
      "urlPattern": "product-detail",
      "maxDepth": 2
    }
  }'

# Example response:
# {
#   "success": true,
#   "message": "Crawling job started",
#   "data": {
#     "id": "crawl_1683213456789",
#     "startUrl": "https://example.com/products",
#     "maxPages": 5,
#     "status": "started",
#     "timestamp": "2025-05-04T17:36:56.789Z"
#   }
# }

Data Extraction

The API supports multiple extraction strategies:

  • CSS selectors
  • XPath selectors
  • Regex patterns
  • Function-based selectors

For detailed regex extraction documentation, see Regex Extraction Guide

Data Extraction Configuration

Field Configuration Options

The data extraction engine supports a variety of field configuration options:

  • type (string): Selector type: css, xpath, regex, or function
  • name (string): Name of the field
  • selector (string): CSS or XPath selector to locate elements
  • attribute (string, optional): Element attribute to extract (e.g., href, src). If not provided, innerText is used
  • multiple (boolean, optional): If true, returns an array of matches. Default: false
  • transform (string or function, optional): Transformation to apply to the extracted value (e.g., trim, toLowerCase)
  • dataType (string, optional): Data type conversion: string, number, boolean, date, array, object
  • required (boolean, optional): If true, extraction fails if the field is not found. Default: false
  • defaultValue (any, optional): Value to use if extraction fails
  • pattern (string, regex type): Regular expression pattern to match
  • flags (string, regex type, optional): Regular expression flags (e.g., g, i, m)
  • group (number, regex type, optional): Capture group to extract. Default: 0 (entire match)
  • source (string, regex type, optional): Source to apply the regex to (e.g., html, text, or a CSS/XPath selector)
  • function (string or function, function type): Function to execute for extraction
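
As a sketch of how these options combine, the snippet below mixes css, regex, and function fields. The selectors, pattern, and function body are illustrative placeholders, not taken from a real configuration:

// Example: field configurations using different selector types (illustrative values)
const fields = {
  title: {
    type: 'css',
    name: 'title',
    selector: 'h1.product-title',
    transform: 'trim',
  },
  price: {
    type: 'regex',
    name: 'price',
    pattern: '([0-9]+\\.[0-9]{2})',
    group: 1,                   // extract capture group 1 rather than the whole match
    source: '.product-price',   // apply the regex to this element's text
    dataType: 'number',
  },
  inStock: {
    type: 'function',
    name: 'inStock',
    function: "() => !document.querySelector('.out-of-stock')",
    dataType: 'boolean',
    defaultValue: false,
  },
};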

Extraction Options

The extraction engine supports the following options:

  • timeout (number, optional): Page load timeout in milliseconds. Default: 30000
  • waitForSelector (string, optional): Selector to wait for before extraction
  • waitForTimeout (number, optional): Timeout for waitForSelector in milliseconds. Default: 5000
  • javascript (boolean, optional): Whether to enable JavaScript. Default: true
  • proxy (boolean, optional): Whether to use a proxy. Default: false
  • humanEmulation (boolean, optional): Whether to emulate human behavior. Default: false
  • solveCaptcha (boolean, optional): Whether to attempt to solve CAPTCHAs. Default: false
  • headers (object, optional): Custom HTTP headers to send with requests
  • cookies (object, optional): Custom cookies to set before navigation
  • browser (object, optional): Browser configuration options
  • browser.type (string, optional): Browser type: chromium, firefox, or webkit. Default: chromium
  • browser.headless (boolean, optional): Whether to run in headless mode. Default: true
  • browser.args (string[], optional): Additional browser arguments
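
A hypothetical options object using the less common settings (headers, cookies, and the browser block) might look like the following; the values are placeholders:

// Example: extraction options with explicit browser configuration (illustrative values)
const options = {
  timeout: 30000,
  waitForSelector: '.content',
  javascript: true,
  humanEmulation: true,
  headers: { 'Accept-Language': 'en-GB' },
  cookies: { cookieConsent: 'accepted' },
  browser: {
    type: 'chromium',
    headless: true,
    args: ['--disable-gpu'],
  },
};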

Example Usage

Simple Scraping

// POST /api/v1/scrape
{
  "url": "https://example.com/products",
  "selectors": {
    "products": {
      "selector": ".product-item",
      "type": "list",
      "fields": {
        "name": ".product-name",
        "price": ".product-price",
        "description": ".product-description"
      }
    }
  },
  "options": {
    "javascript": true,
    "proxy": true,
    "humanEmulation": true
  }
}

Multi-step Navigation

// POST /api/v1/navigate
{
  "startUrl": "https://example.com/search",
  "steps": [
    {
      "type": "input",
      "selector": "#search-input",
      "value": "{{keywords}}",
      "humanInput": true
    },
    {
      "type": "click",
      "selector": "#search-button",
      "waitFor": ".results-container"
    },
    {
      "type": "extract",
      "name": "searchResults",
      "selector": ".result-item",
      "fields": {
        "title": ".result-title",
        "url": ".result-link@href"
      }
    },
    {
      "type": "paginate",
      "selector": ".next-page",
      "maxPages": 3
    }
  ],
  "variables": {
    "keywords": "example search term"
  },
  "options": {
    "solveCaptcha": true,
    "proxyRegion": "US",
    "timeout": 60000
  }
}

Storage System

The storage system provides a flexible and extensible way to store, retrieve, update, and delete extraction results. It follows the Strategy pattern, allowing the application to switch between different storage implementations without changing the client code.
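
Under the Strategy pattern, each adapter implements a common contract. A minimal sketch of what that contract might look like is shown below; the method names mirror the Storage Service operations in the example further down, but the exact interface in the codebase may differ:

// Hypothetical storage adapter contract (method names are assumptions)
interface ExtractionResult {
  id?: string;
  [key: string]: unknown;
}

interface StorageAdapter {
  initialize(): Promise<void>;
  store(result: ExtractionResult): Promise<string>;                  // returns the stored result's id
  retrieve(id: string): Promise<ExtractionResult | null>;
  update(id: string, changes: Partial<ExtractionResult>): Promise<boolean>;
  list(options?: { limit?: number; offset?: number }): Promise<ExtractionResult[]>;
  delete(id: string): Promise<boolean>;
}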

Storage Adapters

The following storage adapters are available:

  • Memory Storage: Stores extraction results in memory using a Map. Useful for development and testing.
  • File Storage: Stores extraction results as JSON files in the file system. Provides persistence between server restarts.
  • MongoDB Storage: Stores extraction results in a MongoDB database. Provides scalable and queryable storage.
  • Redis Storage: Stores extraction results in Redis. Provides high-performance caching and storage with optional expiration.
  • API Storage: Stores extraction results in an external API. Provides integration with external systems.

Storage Factory and Service

The storage system includes a Storage Factory that creates and manages storage adapters, and a Storage Service that provides a unified interface for storage operations with support for primary and backup storage.

// Example: Using the Storage Service
import { StorageService } from './storage/index.js';

// Get the storage service instance
const storageService = StorageService.getInstance({
  primaryAdapter: 'mongodb',
  primaryAdapterOptions: {
    uri: 'mongodb://localhost:27017/web-scraper',
  },
  backupAdapter: 'file',
  backupAdapterOptions: {
    directory: './data/backup',
  },
  useBackup: true,
});

// Initialize the storage service
await storageService.initialize();

// Store an extraction result
const id = await storageService.store(extractionResult);

// Retrieve an extraction result
const result = await storageService.retrieve(id);

// Update an extraction result
const success = await storageService.update(id, { status: 'updated' });

// List extraction results
const results = await storageService.list({ limit: 10, offset: 0 });

// Delete an extraction result
const deleted = await storageService.delete(id);

Session Manager Module

  • Created a new SessionManager class that handles storing and retrieving browser sessions
  • Implemented methods for saving, retrieving, and applying sessions
  • Added session expiration and cleanup mechanisms
  • Integrated with the existing storage system for persistence
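
A rough sketch of how the SessionManager might be used is shown below. The method names and the singleton accessor are assumptions made for illustration; consult the source for the real API:

// Hypothetical usage of the SessionManager (names are illustrative)
const sessionManager = SessionManager.getInstance();

// After a successful navigation, persist the browser context's cookies and localStorage for the site
await sessionManager.saveSession('example.com', context);

// On a later run, reuse the stored session if one exists and has not expired
const session = await sessionManager.getSession('example.com');
if (session) {
  await sessionManager.applySession(context, session);
}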

CAPTCHA Solver Integration

  • Updated the CaptchaSolver to check for existing sessions before attempting to solve CAPTCHAs
  • Added functionality to save sessions after successfully solving CAPTCHAs
  • Implemented a method to mark sessions as having solved CAPTCHAs

Navigation Engine Integration

  • Modified the NavigationEngine to apply existing sessions before navigation
  • Added options to control session usage in navigation flows
  • Implemented automatic session saving after successful navigation

API Endpoints for Session Management

  • Created new API routes for managing sessions
  • Implemented endpoints to list, delete, and clear sessions
  • Added proper error handling and validation

Configuration Options

  • Updated the configuration system to include session-related settings
  • Added environment variables for controlling session behavior
  • Implemented session TTL (time-to-live) configuration

This implementation provides several benefits:

  • Significantly reduces the need for CAPTCHA solving when repeatedly scraping the same website
  • Preserves cookies and localStorage data between scraping sessions
  • Provides a clean API for managing sessions
  • Integrates seamlessly with the existing architecture
  • Follows the established patterns and practices of the project

For more details, see the Storage Module Documentation.

Current Limitations

  • External CAPTCHA solving service integration is placeholder only
  • Unit and integration tests are not yet implemented
  • Documentation is still in progress

Roadmap

  1. Short Term (1-2 weeks)

    • Enhance CAPTCHA solving with external service integration
    • Add unit tests for core components
    • Implement request validation with JSON schema
  2. Medium Term (2-4 weeks)

    • Implement MongoDB and Redis integration
    • Create comprehensive API documentation
  3. Long Term (1-2 months)

    • Add performance optimization
    • Implement distributed scraping capabilities
    • Create advanced examples and tutorials

License

This project is licensed under the MIT License - see the LICENSE file for details.

Ethical Use

This tool is intended for legitimate web scraping purposes such as data analysis, research, and integration. Always:

  1. Respect robots.txt directives
  2. Implement reasonable rate limiting
  3. Identify your scraper appropriately
  4. Comply with websites' terms of service
  5. Respect copyright and data privacy laws

The developers of this tool are not responsible for any misuse or legal consequences resulting from improper use.

Environment Variables

The following environment variables are used to configure the API:

# Browser Configuration
BROWSER_HEADLESS=true
BROWSER_TIMEOUT=30000

# Proxy Configuration
PROXY_ENABLED=false
PROXY_API_KEY=your_proxy_api_key

# LLM Configuration
OPENROUTER_API_KEY=your_openrouter_api_key
OPENROUTER_MODEL=openai/gpt-3.5-turbo  # Default model, can be overridden per request

LLM Configuration

The OpenRouter adapter supports various models through the OpenRouter API. You can configure the default model and other parameters in your environment variables or override them per request:

// Example: Using the OpenRouter adapter
const llmAdapter = new OpenRouterAdapter({
  apiKey: process.env.OPENROUTER_API_KEY,
  model: 'openai/gpt-3.5-turbo',  // Optional, defaults to env var
  temperature: 0.7,              // Optional
  maxTokens: 1000,               // Optional
});

// Make a completion request
const response = await llmAdapter.complete({
  prompt: 'Your prompt here',
  model: 'anthropic/claude-2',   // Optional, overrides default
  temperature: 0.8,              // Optional, overrides default
});

Supported models include:

  • OpenAI models (gpt-3.5-turbo, gpt-4, etc.)
  • Anthropic models (claude-2, claude-instant)
  • Google models (palm-2)
  • And many others through OpenRouter's unified API
