taRantula

taRantula is an R package designed for robust, large-scale web scraping. It combines the flexibility of Selenium with the speed of httr, backed by a persistent DuckDB storage engine to ensure data integrity.


Key Features

  • Hybrid Scraping Engine: Seamlessly switch between Selenium 4 (for JS-heavy sites) and httr (for high-speed static content).
  • Persistent Storage: All results are written directly to a DuckDB backend, allowing for SQL-based querying and zero data loss.
  • Selenium Grid Ready: Optimized for containerized Hub/Node architectures and high-memory environments.
  • Fault Tolerance: Features a snapshotting mechanism to resume interrupted jobs from the last stable state.
  • Parallel Processing: Scales across multiple workers using the future framework (see the sketch after this list).
  • Regex Data Mining: High-performance extraction of emails, VAT/UID numbers, and custom patterns directly from your collected data.
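
As a quick sketch of the parallel workflow, the snippet below distributes a job across local workers. It assumes UrlScraper picks up the active future plan; consult the package documentation for the exact worker settings.

library(taRantula)
library(future)

# Run URLs across four background R sessions
plan(multisession, workers = 4)

cfg <- paramsScraper()
cfg$set("storage$path", "parallel_results.duckdb")

scraper <- UrlScraper$new(config = cfg)
scraper$run(c("https://example.com", "https://r-project.org"))
scraper$stop()

# Return to sequential processing when done
plan(sequential)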

Configuration (params_manager)

The package uses a robust, R6-based configuration system with strict type validation:

  • paramsScraper(): General web crawling and JS rendering settings.
  • paramsGoogleSearch(): Specialized config for Google Search API and rate-limit handling.
  • YAML Support: Easily export or import configurations for reproducible scraping pipelines (see the sketch after this list).
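
As a sketch of the YAML round trip: the method names write_yaml() and read_yaml() below are illustrative placeholders, not confirmed parts of the API; check the params_manager documentation for the actual accessors.

cfg <- paramsScraper()
cfg$set("selenium$port", 4444L)

# Hypothetical method names -- the real ones may differ
cfg$write_yaml("scraper_config.yaml")    # export current settings
cfg2 <- paramsScraper()
cfg2$read_yaml("scraper_config.yaml")    # restore them in a new session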

Compliance and Safety

  • Robots.txt Enforcement: Automated checking with internal caching to respect site owner preferences (see the example after this list).
  • Graceful Termination: Signaling mechanisms ensure workers exit cleanly without corrupting the database.
  • Redirect Detection: Logs and tracks URL changes from request to final browser state.
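
taRantula performs the robots.txt lookup internally; to illustrate what that check amounts to, the standalone robotstxt package (not part of taRantula) implements the equivalent query:

library(robotstxt)

# Returns TRUE if the path may be crawled under the site's robots.txt.
# taRantula additionally caches the parsed file, so repeated checks
# against the same domain hit the network only once.
paths_allowed(paths = "/about", domain = "r-project.org", bot = "*")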

Installation

# Install from GitHub (requires the remotes package)
# install.packages("remotes")
remotes::install_github("statistikat/taRantula")

Quick Start

Below is a basic example of how to initialize a scraping job using the Selenium engine and DuckDB storage.

To run this in a containerized environment, see the intro vignette, Docker-based Selenium Setup.

library(taRantula)

# 1. Setup Configuration
cfg <- paramsScraper()
cfg$set("selenium$host", "localhost")
cfg$set("selenium$port", 4444L)
cfg$set("storage$path", "scraping_results.duckdb")

# 2. Initialize the Scraper
scraper <- UrlScraper$new(config = cfg)

# 3. Define URLs and Run
urls <- c("[https://example.com](https://example.com)", "[https://r-project.org](https://r-project.org)")
scraper$run(urls)

# 4. Extract Data (e.g., Email addresses)
emails <- scraper$regex_extract(pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

# 5. Graceful Stop
scraper$stop()
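
Because results persist in DuckDB, you can query them with plain SQL after the job finishes. A minimal sketch, assuming the results land in a table named results (the actual table name may differ; list the tables first to check):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), "scraping_results.duckdb")
dbListTables(con)  # inspect which tables taRantula created

# "results" is an assumed table name for illustration
pages <- dbGetQuery(con, "SELECT * FROM results LIMIT 10")
dbDisconnect(con, shutdown = TRUE)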

Production Deployment

For production environments, the package includes docker-compose templates to spin up a Selenium Grid alongside your R environment. Detailed instructions are available in the documentation vignettes.
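
As a sketch, pointing the scraper at a grid started from such a template only requires changing the Selenium connection settings. The host name selenium-hub and port 4444 below are common docker-compose defaults, not values mandated by the package:

cfg <- paramsScraper()

# Service name and port as typically defined in a docker-compose Selenium Grid
cfg$set("selenium$host", "selenium-hub")
cfg$set("selenium$port", 4444L)

scraper <- UrlScraper$new(config = cfg)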
