Commit da84db1

feat: Add StagehandCrawler with AI-powered browser automation (#1854)
### Description

Adds `StagehandCrawler` - a new browser crawler powered by [Stagehand](https://www.browserbase.com/stagehand) that lets users interact with pages using natural language instead of CSS selectors or XPath. Extends `PlaywrightCrawler` and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.

- `StagehandPage` extends Playwright `Page` with four AI methods: `act()`, `extract()`, `observe()`, and `execute()`.
- `StagehandOptions` configures the AI model, execution environment (`LOCAL` / `BROWSERBASE`), and session parameters.
- `StagehandBrowserPlugin` and `StagehandBrowserController` integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
- Because Stagehand controls the browser launch internally and Playwright connects via CDP, only Chromium is supported, and browser configuration is limited to the subset accepted by Stagehand's `BrowserLaunchOptions`.
- Added a new guide covering basic usage, AI page operations, and Browserbase integration.

### Issues

- Closes: #1738

### Testing

- Added unit tests for the `StagehandBrowserController`, `StagehandBrowserPlugin`, and `StagehandCrawler` with Stagehand mocked out - no real LLM connection required to run the test suite.
1 parent 590ad1c commit da84db1

23 files changed

Lines changed: 1973 additions & 365 deletions

docs/guides/architecture_overview.mdx

Lines changed: 22 additions & 1 deletion
@@ -53,6 +53,8 @@ class PlaywrightCrawler

 class AdaptivePlaywrightCrawler

+class StagehandCrawler
+
 %% ========================
 %% Inheritance arrows
 %% ========================
@@ -63,6 +65,7 @@ BasicCrawler --|> AdaptivePlaywrightCrawler
 AbstractHttpCrawler --|> HttpCrawler
 AbstractHttpCrawler --|> ParselCrawler
 AbstractHttpCrawler --|> BeautifulSoupCrawler
+PlaywrightCrawler --|> StagehandCrawler
```

 ### HTTP crawlers
@@ -79,7 +82,10 @@ You can learn more about HTTP crawlers in the [HTTP crawlers guide](./http-crawl

 ### Browser crawlers

-Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Currently, the only browser crawler is <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, which utilizes the [Playwright](https://playwright.dev/) library. Playwright provides a high-level API for controlling and navigating browsers. You can learn more about <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, its features, and how it internally manages browser instances in the [Playwright crawler guide](./playwright-crawler).
+Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Crawlee provides two browser crawlers:
+
+- <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> utilizes the [Playwright](https://playwright.dev/) library and provides a high-level API for controlling and navigating browsers. You can learn more about it in the [Playwright crawler guide](./playwright-crawler).
+- <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends `PlaywrightCrawler` with AI-powered browser automation via [Stagehand](https://github.com/browserbase/stagehand). It adds natural-language methods (`act`, `extract`, `observe`, `execute`) directly on the page object. You can learn more about it in the [Stagehand crawler guide](./stagehand-crawler).

 ### Adaptive crawler

@@ -122,6 +128,12 @@ class AdaptivePlaywrightPreNavCrawlingContext

 class AdaptivePlaywrightCrawlingContext

+class StagehandPreNavCrawlingContext
+
+class StagehandPostNavCrawlingContext
+
+class StagehandCrawlingContext
+
 %% ========================
 %% Inheritance arrows
 %% ========================
@@ -143,6 +155,12 @@ PlaywrightPreNavCrawlingContext --|> PlaywrightCrawlingContext
 BasicCrawlingContext --|> AdaptivePlaywrightPreNavCrawlingContext

 ParsedHttpCrawlingContext --|> AdaptivePlaywrightCrawlingContext
+
+PlaywrightPreNavCrawlingContext --|> StagehandPreNavCrawlingContext
+
+StagehandPreNavCrawlingContext --|> StagehandPostNavCrawlingContext
+
+StagehandPostNavCrawlingContext --|> StagehandCrawlingContext
```

 They have a similar inheritance structure as the crawlers, with the base class being <ApiLink to="class/BasicCrawlingContext">`BasicCrawlingContext`</ApiLink>. The specific crawling contexts are:
@@ -154,6 +172,9 @@ They have a similar inheritance structure as the crawlers, with the base class b
 - <ApiLink to="class/PlaywrightCrawlingContext">`PlaywrightCrawlingContext`</ApiLink> for Playwright crawlers.
 - <ApiLink to="class/AdaptivePlaywrightPreNavCrawlingContext">`AdaptivePlaywrightPreNavCrawlingContext`</ApiLink> for Adaptive Playwright crawlers before the page is navigated.
 - <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink> for Adaptive Playwright crawlers.
+- <ApiLink to="class/StagehandPreNavCrawlingContext">`StagehandPreNavCrawlingContext`</ApiLink> for Stagehand crawlers before the page is navigated.
+- <ApiLink to="class/StagehandPostNavCrawlingContext">`StagehandPostNavCrawlingContext`</ApiLink> for Stagehand crawlers after the page is navigated.
+- <ApiLink to="class/StagehandCrawlingContext">`StagehandCrawlingContext`</ApiLink> for Stagehand crawlers.

 ## Storages
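
Reviewer sketch: the inheritance arrows added to the crawling-context diagram in `architecture_overview.mdx` can be mirrored with plain Python classes to see which `isinstance` checks a Stagehand context would satisfy. The classes below are empty stand-ins that reuse the names from the diagram. They are assumptions for illustration, not the crawlee implementations, and the real classes carry fields and possibly bases that this sketch omits.

```python
# Stand-in classes mirroring the arrows added to the diagram.
# Assumption: these are empty placeholders, NOT the real crawlee classes.


class PlaywrightPreNavCrawlingContext:
    """Placeholder for the Playwright pre-navigation context."""


class StagehandPreNavCrawlingContext(PlaywrightPreNavCrawlingContext):
    """Placeholder: PlaywrightPreNavCrawlingContext --|> StagehandPreNavCrawlingContext."""


class StagehandPostNavCrawlingContext(StagehandPreNavCrawlingContext):
    """Placeholder: StagehandPreNavCrawlingContext --|> StagehandPostNavCrawlingContext."""


class StagehandCrawlingContext(StagehandPostNavCrawlingContext):
    """Placeholder: StagehandPostNavCrawlingContext --|> StagehandCrawlingContext."""


def ancestry(cls: type) -> list:
    """Return the class names along the method resolution order, excluding `object`."""
    return [base.__name__ for base in cls.__mro__ if base is not object]


chain = ancestry(StagehandCrawlingContext)
print(' -> '.join(chain))
```

In this sketch the chain passes through `PlaywrightPreNavCrawlingContext` and never through `PlaywrightCrawlingContext`, which matches the arrows in the diff. Whether the real classes add further bases is not visible from this commit page.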

docs/guides/code_examples/playwright_crawler_stagehand/__init__.py

Whitespace-only changes.

docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py

Lines changed: 0 additions & 101 deletions
This file was deleted.

docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py

Lines changed: 0 additions & 66 deletions
This file was deleted.

docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py

Lines changed: 0 additions & 57 deletions
This file was deleted.
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+import asyncio
+from typing import cast
+
+from crawlee.browsers import StagehandOptions
+from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext
+
+
+async def main() -> None:
+    crawler = StagehandCrawler(
+        stagehand_options=StagehandOptions(
+            model_api_key='your-openai-api-key',
+            model='openai/gpt-5.4-nano',
+        ),
+        max_requests_per_crawl=5,
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: StagehandCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url} ...')
+
+        # Dismiss overlays or interact with the page using natural language.
+        await context.page.act(input='Click the accept cookies button if present')
+
+        # Extract data from the page using AI.
+        extracted = await context.page.extract(
+            instruction='Get the page title and the main heading text',
+            schema={
+                'type': 'object',
+                'properties': {
+                    'title': {'type': 'string'},
+                    'heading': {'type': 'string'},
+                },
+            },
+        )
+
+        extract_result = extracted.data.result
+
+        if isinstance(extract_result, dict):
+            # Push extracted data to the dataset
+            # Use `cast()` to provide a more specific type hint for the extracted data.
+            await context.push_data(cast('dict[str, str | None]', extract_result))
+
+    await crawler.run(['https://example.com'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
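
Reviewer sketch: the handler in the 47-line example only pushes the result when it is a `dict`. A slightly stricter variant, written with the standard library only, would also keep just the keys declared under the schema's `properties` and map missing or non-string values to `None`, which lines up with the `dict[str, str | None]` cast used there. `narrow_extract_result` is a hypothetical helper, not part of crawlee or Stagehand, and the sample payloads are made up for illustration.

```python
from typing import Any, Dict, Optional

# The same JSON schema that the example passes to `extract()`.
SCHEMA: Dict[str, Any] = {
    'type': 'object',
    'properties': {
        'title': {'type': 'string'},
        'heading': {'type': 'string'},
    },
}


def narrow_extract_result(
    result: Any, schema: Dict[str, Any]
) -> Optional[Dict[str, Optional[str]]]:
    """Hypothetical helper: keep only schema properties, or return None for non-dicts."""
    if not isinstance(result, dict):
        return None
    narrowed: Dict[str, Optional[str]] = {}
    for key in schema.get('properties', {}):
        value = result.get(key)
        # Anything that is not a string (including a missing key) becomes None.
        narrowed[key] = value if isinstance(value, str) else None
    return narrowed


if __name__ == '__main__':
    # Made-up payloads standing in for `extracted.data.result`.
    print(narrow_extract_result({'title': 'Example Domain', 'extra': 1}, SCHEMA))
    print(narrow_extract_result('not a dict', SCHEMA))
```

Inside the handler, the `isinstance` branch could then become a check that the helper's return value is not `None` before calling `context.push_data(...)`.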
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+import asyncio
+from typing import cast
+
+from crawlee.browsers import StagehandOptions
+from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext
+
+
+async def main() -> None:
+    # Use Browserbase cloud browser instead of a local Chromium instance.
+    crawler = StagehandCrawler(
+        stagehand_options=StagehandOptions(
+            env='BROWSERBASE',
+            browserbase_api_key='your-browserbase-api-key',
+            project_id='your-project-id',
+            model_api_key='your-openai-api-key',
+            model='openai/gpt-5.4-nano',
+        ),
+        max_requests_per_crawl=5,
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: StagehandCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url} ...')
+
+        extracted = await context.page.extract(
+            instruction='Get the main content of the page',
+        )
+
+        extract_result = extracted.data.result
+
+        await context.push_data(cast('dict[str, str | None]', extract_result))
+
+    await crawler.run(['https://example.com'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
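
Reviewer sketch: both examples hard-code placeholder credentials. One way to keep keys out of source is to read them from environment variables and build the keyword arguments for `StagehandOptions` from those. The variable names (`STAGEHAND_ENV`, `MODEL_API_KEY`, `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID`) and the helper `stagehand_kwargs_from_env` are assumptions made for this sketch. The keyword names mirror the ones used in the examples, and the rule that `BROWSERBASE` needs both Browserbase values is inferred from the second example rather than confirmed by the library.

```python
import os
from typing import Dict, Mapping, Optional


def stagehand_kwargs_from_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: collect StagehandOptions keyword arguments from env vars."""
    source = os.environ if environ is None else environ

    # Assumption: the execution environment defaults to LOCAL when unset.
    env = source.get('STAGEHAND_ENV', 'LOCAL')
    if env not in ('LOCAL', 'BROWSERBASE'):
        raise ValueError(f'Unsupported STAGEHAND_ENV: {env!r}')

    # Map each option keyword to the (assumed) environment variable that feeds it.
    required = {'model_api_key': 'MODEL_API_KEY'}
    if env == 'BROWSERBASE':
        required['browserbase_api_key'] = 'BROWSERBASE_API_KEY'
        required['project_id'] = 'BROWSERBASE_PROJECT_ID'

    # Fail early with one message listing every missing or empty variable.
    missing = [var for var in required.values() if not source.get(var)]
    if missing:
        raise ValueError('Missing environment variables: ' + ', '.join(missing))

    kwargs = {'env': env}
    kwargs.update({key: source[var] for key, var in required.items()})
    return kwargs


if __name__ == '__main__':
    # A plain dict stands in for os.environ so the sketch runs without real keys.
    print(stagehand_kwargs_from_env({'MODEL_API_KEY': 'dummy-key'}))
```

The result could then be unpacked as `StagehandOptions(model='openai/gpt-5.4-nano', **kwargs)`, assuming the class accepts `env='LOCAL'` explicitly, as the `LOCAL` / `BROWSERBASE` wording in the commit description suggests.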
