A high-performance web scraper for extracting car listing data from Redbook.com.au. This tool uses Playwright with stealth plugins, proxy rotation, and CAPTCHA solving to efficiently scrape vehicle information at scale.
- 🚗 Comprehensive Data Extraction: Scrapes vehicle details including title, price, year, make, model, odometer, engine, transmission, fuel type, and VIN
- 🎭 Stealth Browsing: Uses Playwright with stealth plugins to avoid bot detection
- 🔄 Proxy Rotation: Integrates with BrightData residential proxies for IP rotation
- 🤖 CAPTCHA Handling: Automatic detection and solving of DataDome challenges
- ⚡ Concurrent Scraping: Parallel processing with configurable concurrency limits
- 📊 CSV Export: Saves all scraped data to CSV format for easy analysis
- 🔁 Resume Capability: Tracks seen URLs to avoid duplicates and enable resuming
- 📝 Logging: Comprehensive logging with Winston for monitoring and debugging
- Node.js 16+ (ES modules support required)
- npm or yarn
- BrightData Account (for residential proxy access)
- 2Captcha API Key (optional, for CAPTCHA solving)
- Clone the repository:
```bash
git clone <repository-url>
cd redbook-scraper
```
- Install dependencies:
```bash
npm install
```
- Install Playwright browsers:
```bash
npx playwright install chromium
```

Edit `config.js` to customize scraping parameters:
```javascript
export const CONFIG = {
  TARGET_LISTINGS: 125000,          // Target number of listings to scrape
  CONCURRENCY: 8,                   // Number of concurrent requests
  DELAY_MIN: 3000,                  // Minimum delay between requests (ms)
  DELAY_MAX: 7000,                  // Maximum delay between requests (ms)
  MAX_RETRIES: 3,                   // Maximum retry attempts
  OUTPUT_FILE: "data/cars.csv",     // Output CSV file path
  LOG_FILE: "logs/scraper.log",     // Log file path
  PROXY_PROVIDER: "brightdata",     // Proxy provider
  BRIGHTDATA_ZONE: "your-proxy-url",
  BRIGHTDATA_USER: "your-username",
  BRIGHTDATA_PASS: "your-password",
  CAPTCHA_KEY: "",                  // 2Captcha API key (optional)
  MAKES: ["Toyota", "Ford", ...],   // Car makes to scrape
  YEARS_START: [2010, 2015, 2020],  // Year range start points
  YEARS_END: [2014, 2019, 2025],    // Year range end points
};
```

- Sign up for a BrightData account
- Create a residential proxy zone
- Update `BRIGHTDATA_ZONE`, `BRIGHTDATA_USER`, and `BRIGHTDATA_PASS` in `config.js`
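The BrightData settings above map directly onto Playwright's `proxy` launch option. A minimal sketch follows; the helper name and the example endpoint are illustrative, not part of this repo:

```javascript
// Maps the BRIGHTDATA_* settings from config.js onto Playwright's `proxy`
// launch option. The endpoint in the comment is a placeholder; use the
// host:port shown in your BrightData zone dashboard.
function buildProxyOption(config) {
  return {
    server: config.BRIGHTDATA_ZONE, // e.g. "http://brd.superproxy.io:22225"
    username: config.BRIGHTDATA_USER,
    password: config.BRIGHTDATA_PASS,
  };
}

// Usage (sketch):
// const browser = await chromium.launch({ proxy: buildProxyOption(CONFIG) });
```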
For automatic CAPTCHA solving:
- Sign up at 2Captcha.com
- Add your API key to `CAPTCHA_KEY` in `config.js`
Start the scraper:
```bash
npm start
```
Or run it directly:
```bash
node scrape.js
```
Test a single query before running the full scraper:
```bash
node test.js
```
This will:
- Test the query building logic
- Check DataDome detection
- Verify proxy connectivity
- Save a screenshot for debugging
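The DataDome check can be approximated with a simple content heuristic. This is an assumption about how detection might work, not the actual logic in `test.js`; real DataDome interstitials typically reference `captcha-delivery.com`:

```javascript
// Heuristic DataDome detection: challenge pages commonly embed an iframe
// from captcha-delivery.com or mention "datadome" in markup/cookies.
// This mirrors what a detection check might look like, not the repo's code.
function looksLikeDataDome(html) {
  const needles = ["captcha-delivery.com", "datadome"];
  const lower = html.toLowerCase();
  return needles.some((needle) => lower.includes(needle));
}

// Usage after navigation (sketch):
// const blocked = looksLikeDataDome(await page.content());
```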
```
redbook-scraper/
├── scrape.js          # Main scraping logic
├── test.js            # Testing script
├── config.js          # Configuration file
├── package.json       # Dependencies and scripts
├── Dockerfile         # Docker configuration
├── utils/
│   ├── browser.js     # Stealth browser setup
│   ├── proxy.js       # Proxy rotation logic
│   ├── solver.js      # CAPTCHA solving
│   └── logger.js      # Logging configuration
├── data/              # Output directory (auto-created)
│   ├── cars.csv       # Scraped data
│   └── seen.txt       # Tracked URLs
└── logs/              # Log files (auto-created)
    └── scraper.log    # Application logs
```
- Query Generation: Builds Lucene queries for different make/year combinations
- Results Page Scraping: Navigates through paginated results pages
- Detail Extraction: Visits each listing URL and extracts vehicle details
- Data Storage: Writes extracted data to CSV and tracks seen URLs
- Rate Limiting: Implements random delays to avoid detection
- Error Handling: Retries failed requests and logs errors
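The query-generation step pairs each make with each year band defined by `MAKES`, `YEARS_START`, and `YEARS_END` in `config.js`. A sketch of that combination logic (the function name and query shape are assumptions, not taken from `scrape.js`):

```javascript
// Builds one query descriptor per (make, year-range) pair. YEARS_START and
// YEARS_END are parallel arrays, so index i defines one band (e.g. 2010-2014).
function buildQueries(config) {
  const queries = [];
  for (const make of config.MAKES) {
    for (let i = 0; i < config.YEARS_START.length; i++) {
      queries.push({
        make,
        yearFrom: config.YEARS_START[i],
        yearTo: config.YEARS_END[i],
      });
    }
  }
  return queries;
}
```

With the default config (many makes × three year bands), this fans out into enough distinct result sets to approach the 125,000-listing target.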
Build and run with Docker:
```bash
docker build -t redbook-scraper .
docker run redbook-scraper
```
The scraper generates a CSV file (`data/cars.csv`) with the following columns:
- Title: Vehicle title
- Price: Listing price
- Year: Model year
- Make: Car manufacturer
- Model: Car model
- Odometer: Mileage/kilometers
- Engine: Engine size/type
- Transmission: Transmission type
- Fuel Type: Fuel type (petrol, diesel, etc.)
- VIN: Vehicle Identification Number
- URL: Source listing URL
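The columns above correspond to a csv-writer header definition along these lines. The `id` values are guesses at the internal field names; the actual ones live in `scrape.js`:

```javascript
// Header definition for csv-writer matching the documented columns.
const CAR_CSV_HEADER = [
  { id: "title", title: "Title" },
  { id: "price", title: "Price" },
  { id: "year", title: "Year" },
  { id: "make", title: "Make" },
  { id: "model", title: "Model" },
  { id: "odometer", title: "Odometer" },
  { id: "engine", title: "Engine" },
  { id: "transmission", title: "Transmission" },
  { id: "fuelType", title: "Fuel Type" },
  { id: "vin", title: "VIN" },
  { id: "url", title: "URL" },
];

// Usage with csv-writer (sketch):
// const writer = createObjectCsvWriter({
//   path: CONFIG.OUTPUT_FILE,
//   header: CAR_CSV_HEADER,
//   append: true, // keep earlier rows when resuming
// });
```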
If you encounter csv-writer version errors:
- The project uses `csv-writer@^1.6.0`
- Delete `package-lock.json` and `node_modules`, then run `npm install` again
- Verify BrightData credentials are correct
- Check proxy zone is active and has available IPs
- Test proxy connectivity with `test.js`
- If DataDome blocks persist, add a 2Captcha API key
- Increase delays in `config.js` to reduce detection
- Check logs for specific error messages
- Ensure Playwright browsers are installed: `npx playwright install chromium`
- For Docker, browsers are installed automatically
- Concurrency: Default 8 concurrent requests (adjustable in config)
- Rate Limiting: 3-7 second delays between requests
- Target: Designed to scrape ~125,000 listings
- Resume: Automatically resumes from last position using `seen.txt`
- Ensure you comply with Redbook.com.au's Terms of Service
- Respect rate limits and robots.txt
- Use scraped data responsibly
- Consider reaching out to Redbook for official API access