Fast Crawler is a Fast API synchronous endpoint for extracting web content using Crawl4AI, with support for structured data extraction, browser automation, and advanced crawling features.
docker run -d --name fast-crawler -p 8000:8000 -e API_TOKEN=your_secret_token ghcr.io/inem0o/fast-crawler:latest
crawler-api:
image: ghcr.io/inem0o/fast-crawler:latest
ports:
- target: 8000
published: 8000
environment:
API_TOKEN: "your_secret_token"
You can access the swagger documentation here http://localhost:8000/docs
The API uses a static token-based authentication system. All requests must include this token in the headers to be authorized.
The API token must be provided in the X-Token
header for all requests:
curl -X POST "http://localhost:8000/crawl" \
-H "X-Token: your_secret_token" \
-H "Content-Type: application/json" \
-d '{ ... }'
The /crawl
endpoint is a POST endpoint that allows you to extract content from any web page. It provides extensive configuration options for browser behavior, content extraction, and page interactions.
Key features:
- Full browser automation with Chrome/Chromium
- Configurable viewport and network settings
- Advanced content extraction strategies
- Screenshot and PDF generation
- Cache management
- Structured data extraction without LLM
The request must be a POST request with a JSON body containing the following structure:
{
"url": "https://example.com", // must be a valid HTTP/HTTPS URL.
"browser": { // Browser settings (optional)
// All settings have defaults
},
"config": { // Crawler settings (optional)
// All settings have defaults
}
}
X-Token
: Your API authentication tokenContent-Type
: Must beapplication/json
The url
field
Control how the browser behaves during crawling:
{
"browser": {
"viewport_width": 1920,
"viewport_height": 1080,
"proxy": "http://user:pass@proxy:8080",
"proxy_config": {
"server": "proxy:8080",
"username": "user",
"password": "pass"
},
"ignore_https_errors": true,
"java_script_enabled": true,
"use_persistent_context": false,
"user_data_dir": "/path/to/user/data",
"cookies": [
{
"name": "session",
"value": "abc123",
"url": "https://example.com"
}
],
"headers": {
"Accept-Language": "en-US,en;q=0.9",
"DNT": "1"
},
"user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
"light_mode": false,
"text_mode": false,
"use_managed_browser": false,
"extra_args": ["--disable-extensions", "--disable-gpu"]
}
}
Control how the content is extracted and processed:
{
"config": {
"extraction_schema": {
"name": "Product List",
"baseSelector": "div.product-card",
"baseFields": [
{
"name": "product_id",
"type": "attribute",
"attribute": "data-product-id"
}
],
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
}
]
},
"word_count_threshold": 200,
"css_selector": "article.main-content",
"excluded_selector": ".ads, .cookie-banner",
"excluded_tags": ["script", "style", "nav"],
"only_text": false,
"prettiify": false,
"keep_data_attributes": false,
"remove_forms": false,
"wait_until": "networkidle",
"page_timeout": 30000,
"screenshot": true,
"pdf": false,
"cache_mode": "enabled"
}
}
The response is a JSON object with the following structure:
{
"url": "https://example.com",
"result": {
"success": true,
"status_code": 200,
"html": "...",
"cleaned_html": "...",
"markdown": "...",
"text": "...",
"title": "...",
"description": "...",
"media": {
"images": [
{
"src": "...",
"alt": "...",
"width": 800,
"height": 600
}
]
},
"links": {
"internal": [
{
"href": "...",
"text": "..."
}
],
"external": []
},
"metadata": {},
"headers": {},
"screenshot": "base64...", // If requested
"pdf": "base64...", // If requested
"extracted_content": "..." // If extraction_schema provided
}
}
url
: The URL that was crawledresult
: Object containing all extracted datasuccess
: Boolean indicating if the crawl was successfulstatus_code
: HTTP status code of the pagehtml
: Original HTML contentcleaned_html
: HTML after cleaning and processingmarkdown
: Content converted to markdowntext
: Plain text contenttitle
: Page titledescription
: Page meta descriptionmedia
: Object containing found media (images, videos)links
: Object containing found links (internal/external)metadata
: Additional page metadataheaders
: HTTP response headersscreenshot
: Base64 encoded screenshot (if requested)pdf
: Base64 encoded PDF (if requested)extracted_content
: Structured data if extraction_schema was provided
The API uses standard HTTP status codes and returns detailed error messages:
{
"detail": "Invalid API token"
}
{
"detail": [
{
"loc": ["body", "url"],
"msg": "invalid or missing URL scheme",
"type": "value_error.url.scheme"
}
]
}
{
"detail": "Error crawling URL: timeout waiting for selector"
}
The browser settings control how the Chrome/Chromium browser instance behaves during crawling. These settings are all optional and have sensible defaults.
Configure the browser's viewport size:
{
"browser": {
"viewport_width": 1920,
"viewport_height": 1080
}
}
Parameter | Type | Default | Description |
---|---|---|---|
viewport_width |
integer | 1080 | Initial page width in pixels |
viewport_height |
integer | 600 | Initial page height in pixels |
Control how the browser handles network connections and security:
{
"browser": {
"proxy": "http://user:pass@proxy:8080",
"proxy_config": {
"server": "proxy:8080",
"username": "user",
"password": "pass"
},
"ignore_https_errors": true,
"headers": {
"Accept-Language": "en-US,en;q=0.9",
"DNT": "1"
},
"user_agent": "Mozilla/5.0 (X11; Linux x86_64) ..."
}
}
Parameter | Type | Default | Description |
---|---|---|---|
proxy |
string | null | Single-proxy URL for all traffic |
proxy_config |
object | null | Advanced proxy configuration |
ignore_https_errors |
boolean | true | Continue despite invalid SSL certificates |
headers |
object | null | Extra HTTP headers for all requests |
user_agent |
string | null | Custom User-Agent string |
Control how the browser operates and maintains state:
{
"browser": {
"java_script_enabled": true,
"use_persistent_context": false,
"user_data_dir": "/path/to/user/data",
"cookies": [
{
"name": "session",
"value": "abc123",
"url": "https://example.com"
}
]
}
}
Parameter | Type | Default | Description |
---|---|---|---|
java_script_enabled |
boolean | true | Enable/disable JavaScript execution |
use_persistent_context |
boolean | false | Maintain cookies/session between runs |
user_data_dir |
string | null | Directory for persistent data |
cookies |
array | null | Pre-set cookies for the session |
Some browser settings are fixed and cannot be changed:
Parameter | Value | Description |
---|---|---|
headless |
true | Browser runs in headless mode |
browser_type |
"chromium" | Uses Chromium browser engine |
These settings ensure consistent behavior and optimal performance for web crawling.
The crawler settings control how content is extracted, processed, and filtered during crawling. These settings are all optional and have sensible defaults.
Configure how content is extracted and filtered:
{
"config": {
"word_count_threshold": 200,
"css_selector": "article.main-content",
"excluded_selector": ".ads, .cookie-banner",
"excluded_tags": ["script", "style", "nav"],
"only_text": false,
"prettiify": false,
"keep_data_attributes": false,
"remove_forms": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
word_count_threshold |
integer | 200 | Minimum words for text block inclusion |
css_selector |
string | null | CSS selector for content extraction |
excluded_selector |
string | null | CSS selector for content to remove |
excluded_tags |
array | null | HTML tags to remove |
only_text |
boolean | false | Extract text content only |
prettiify |
boolean | false | Beautify HTML output |
keep_data_attributes |
boolean | false | Preserve data-* attributes |
remove_forms |
boolean | false | Remove form elements |
Control page loading and timing:
{
"config": {
"wait_until": "networkidle",
"page_timeout": 30000,
"wait_for": "css:.article-loaded",
"wait_for_images": false,
"delay_before_return_html": 0.5,
"mean_delay": 0.2,
"max_range": 0.5,
"semaphore_count": 5
}
}
Parameter | Type | Default | Description |
---|---|---|---|
wait_until |
string | "domcontentloaded" | Navigation completion condition |
page_timeout |
integer | 60000 | Maximum time (ms) for page load |
wait_for |
string | null | CSS selector or JS condition |
wait_for_images |
boolean | false | Wait for images to load |
delay_before_return_html |
float | 0.1 | Additional delay before capture |
mean_delay |
float | 0.1 | Average delay between crawls |
max_range |
float | 0.3 | Maximum random delay variation |
semaphore_count |
integer | 5 | Maximum concurrent crawls |
Configure page interactions and dynamic content handling:
{
"config": {
"js_code": [
"document.querySelector('.cookie-accept')?.click()",
"window.scrollTo(0, document.body.scrollHeight)"
],
"js_only": false,
"ignore_body_visibility": true,
"scan_full_page": true,
"scroll_delay": 0.5,
"process_iframes": false,
"remove_overlay_elements": false,
"simulate_user": false,
"override_navigator": false,
"magic": false,
"adjust_viewport_to_content": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
js_code |
string/array | null | JavaScript code to execute |
js_only |
boolean | false | Only execute JS without reload |
ignore_body_visibility |
boolean | true | Skip body visibility check |
scan_full_page |
boolean | false | Auto-scroll through page |
scroll_delay |
float | 0.2 | Delay between scroll steps |
process_iframes |
boolean | false | Extract iframe content |
remove_overlay_elements |
boolean | false | Remove modal overlays |
simulate_user |
boolean | false | Simulate user behavior |
override_navigator |
boolean | false | Override navigator properties |
magic |
boolean | false | Auto-handle common obstacles |
adjust_viewport_to_content |
boolean | false | Resize viewport to content |
Configure screenshot, PDF, and image processing:
{
"config": {
"screenshot": false,
"screenshot_wait_for": 1.0,
"screenshot_height_threshold": 20000,
"pdf": false,
"image_description_min_word_threshold": 50,
"image_score_threshold": 3,
"exclude_external_images": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
screenshot |
boolean | false | Capture page screenshot |
screenshot_wait_for |
float | null | Wait time before screenshot |
screenshot_height_threshold |
integer | 20000 | Max screenshot height |
pdf |
boolean | false | Generate PDF version |
image_description_min_word_threshold |
integer | 50 | Min words in image description |
image_score_threshold |
integer | 3 | Min image relevance score |
exclude_external_images |
boolean | false | Exclude external images |
Configure link and domain handling:
{
"config": {
"exclude_social_media_domains": false,
"exclude_external_links": false,
"exclude_social_media_links": false,
"exclude_domains": ["ads.example.com", "analytics.com"]
}
}
Parameter | Type | Default | Description |
---|---|---|---|
exclude_social_media_domains |
boolean | false | Remove social media domains |
exclude_external_links |
boolean | false | Remove external links |
exclude_social_media_links |
boolean | false | Remove social sharing links |
exclude_domains |
array | null | Domains to exclude |
Control caching behavior:
{
"config": {
"cache_mode": "enabled",
"session_id": "session_123",
"bypass_cache": false,
"disable_cache": false,
"no_cache_read": false,
"no_cache_write": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
cache_mode |
string | "enabled" | Cache control mode |
session_id |
string | null | Browser session identifier |
bypass_cache |
boolean | false | Ignore but update cache |
disable_cache |
boolean | false | Disable caching completely |
no_cache_read |
boolean | false | Only write to cache |
no_cache_write |
boolean | false | Only read from cache |
Configure debugging options:
{
"config": {
"verbose": true,
"log_console": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
verbose |
boolean | true | Enable detailed logging |
log_console |
boolean | false | Log browser console output |
The API offers several content extraction strategies to meet different needs.
The API provides three basic formats for content extraction:
{
"config": {
"only_text": false,
"prettiify": false,
"keep_data_attributes": false
}
}
Format | Description | Configuration |
---|---|---|
Full HTML | Complete original HTML | Default parameters |
Clean HTML | Cleaned and formatted HTML | prettiify: true |
Text Only | Raw text without HTML | only_text: true |
Control extracted content with these options:
{
"config": {
"word_count_threshold": 200,
"css_selector": "article.main-content",
"excluded_selector": ".ads, .cookie-banner",
"excluded_tags": ["script", "style", "nav"],
"remove_forms": false
}
}
Parameter | Type | Default | Description |
---|---|---|---|
word_count_threshold |
integer | 200 | Minimum words for text block inclusion |
css_selector |
string | null | CSS selector for content extraction |
excluded_selector |
string | null | CSS selector for content to remove |
excluded_tags |
array | null | HTML tags to remove |
remove_forms |
boolean | false | Remove form elements |
The extraction schema allows structured data extraction without LLM.
{
"config": {
"extraction_schema": {
"name": "Product List",
"baseSelector": "div.product-card",
"baseFields": [
{
"name": "product_id",
"type": "attribute",
"attribute": "data-product-id"
}
],
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
}
]
}
}
}
Type | Description | Example |
---|---|---|
text |
Extract text content | {"type": "text"} |
html |
Extract HTML content | {"type": "html"} |
attribute |
Extract a specific attribute | {"type": "attribute", "attribute": "src"} |
nested |
Extract a nested object | {"type": "nested", "fields": [...]} |
list |
Extract a list of simple values | {"type": "list"} |
nested_list |
Extract a list of complex objects | {"type": "nested_list", "fields": [...]} |
{
"name": "Product List",
"baseSelector": "div.product-card",
"baseFields": [
{
"name": "product_id",
"type": "attribute",
"attribute": "data-product-id"
}
],
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
},
{
"name": "details",
"selector": "div.product-details",
"type": "nested",
"fields": [
{
"name": "brand",
"selector": "span.brand",
"type": "text"
},
{
"name": "model",
"selector": "span.model",
"type": "text"
}
]
},
{
"name": "features",
"selector": "ul.features li",
"type": "list",
"fields": [
{
"name": "feature",
"type": "text"
}
]
}
]
}