Update documents, upload new version of quickstart.
unclecode committed Oct 30, 2024
1 parent 3529c2e commit 9307c19
Showing 10 changed files with 1,471 additions and 789 deletions.
5 changes: 1 addition & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -25,12 +25,9 @@ Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-po
- 💾 Improved caching system for better performance
- ⚡ Optimized batch processing with automatic rate limiting

-Try new features in this colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1L6LJ3KlplhJdUy3Wcry6pstnwRpCJ3yB?usp=sharing)


## Try it Now!

-✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd?usp=sharing)
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)

1,373 changes: 651 additions & 722 deletions docs/examples/quickstart.ipynb

Large diffs are not rendered by default.

735 changes: 735 additions & 0 deletions docs/examples/quickstart_v0.ipynb

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions docs/md_v2/assets/styles.css
@@ -150,4 +150,11 @@ strong,
.tab-content pre {
margin: 0;
max-height: 300px; overflow: auto; border:none;
}

ol li::before {
content: counters(item, ".") ". ";
counter-increment: item;
/* float: left; */
/* padding-right: 5px; */
}
@@ -9,17 +9,19 @@ Here's a condensed outline of the **Installation and Setup** video content:

---

-1. **Introduction to Crawl4AI**:
-   - Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
+1 **Introduction to Crawl4AI**: Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.

-2. **Installation Overview**:
+2 **Installation Overview**:

- **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).

- **Optional Advanced Installs**:
- `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- `pip install crawl4ai[all]` - Installs all features for complete functionality.

-3. **Verifying the Installation**:
+3 **Verifying the Installation**:

- Walk through a simple test script to confirm the setup:
```python
import asyncio
# (lines collapsed in diff view) @@ -34,12 +36,13 @@
```
- Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.

-4. **Important Tips**:
+4 **Important Tips**:

- **Run** `playwright install` **after installation** to set up dependencies.
- **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- If you encounter issues, refer to the documentation or GitHub issues.

-5. **Wrap Up**:
+5 **Wrap Up**:
- Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).

---
24 changes: 16 additions & 8 deletions docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
@@ -11,18 +11,21 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri

### **Overview of Advanced Features**

-1. **Introduction to Advanced Features**:
+1 **Introduction to Advanced Features**:

- Briefly introduce Crawl4AI’s advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.

-2. **Taking Screenshots**:
+2 **Taking Screenshots**:

- Explain the screenshot capability for capturing page state and verifying content.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", screenshot=True)
```
- Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
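Since the screenshot comes back base64-encoded in `result` (per the note above), saving it to disk is a short decode step. A minimal sketch, assuming the attribute is named `result.screenshot`:

```python
import base64

def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a base64-encoded screenshot and write it to disk.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# With a crawl result (attribute name assumed, not confirmed by this diff):
# save_screenshot(result.screenshot, "example_page.png")
```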

-3. **Media and Link Extraction**:
+3 **Media and Link Extraction**:

- Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- **Example**:
```python
# (lines collapsed in diff view) @@ -31,37 +34,42 @@
print("Links:", result.links)
```
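The exact schema of `result.media` and `result.links` isn't shown in this diff; assuming each is a dict of lists keyed by category (a common shape: `images`/`videos` and `internal`/`external`), a quick post-processing helper might look like:

```python
def summarize_extraction(media: dict, links: dict) -> dict:
    """Count extracted media items and links per category."""
    return {
        "media": {kind: len(items) for kind, items in media.items()},
        "links": {kind: len(items) for kind, items in links.items()},
    }

# Hypothetical result shapes, for illustration only:
sample_media = {"images": [{"src": "/logo.png"}, {"src": "/hero.jpg"}], "videos": []}
sample_links = {"internal": ["/about"], "external": ["https://example.org"]}
print(summarize_extraction(sample_media, sample_links))
# → {'media': {'images': 2, 'videos': 0}, 'links': {'internal': 1, 'external': 1}}
```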

-4. **Custom User Agent**:
+4 **Custom User Agent**:

- Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
```
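To simulate several devices across requests, one common pattern is rotating through a pool of user-agent strings and passing the pick as the `user_agent` argument shown above; a sketch (the pool entries are illustrative):

```python
import random

# Illustrative user-agent pool; real pools would use full, current UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
]

def pick_user_agent(rng=random) -> str:
    """Choose a user-agent string for the next request."""
    return rng.choice(USER_AGENTS)

# Hypothetical usage with the API shown above:
# result = await crawler.arun(url="https://www.example.com", user_agent=pick_user_agent())
```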

-5. **Custom Hooks for Enhanced Control**:
+5 **Custom Hooks for Enhanced Control**:

- Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
- **Example**: Setting a custom header with `before_get_url` hook.
```python
async def before_get_url(page):
await page.set_extra_http_headers({"X-Test-Header": "test"})
```

-6. **CSS Selectors for Targeted Extraction**:
+6 **CSS Selectors for Targeted Extraction**:

- Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", css_selector="h2")
print("H2 Tags:", result.extracted_content)
```

-7. **Crawling Inside Iframes**:
+7 **Crawling Inside Iframes**:

- Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", process_iframes=True)
```

-8. **Wrap-Up**:
+8 **Wrap-Up**:

- Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- Tease upcoming videos where each feature will be explored in detail.

@@ -42,7 +42,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)

-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -57,7 +57,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")

-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -73,7 +73,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)

-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -90,7 +90,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")

-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -106,7 +106,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")

-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -122,7 +122,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")

-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -138,7 +138,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")

-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -148,9 +148,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
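Conceptually, `set_hook` stores one callable per event name, and the crawler awaits it at the matching point in the workflow. A toy registry illustrating that mechanism (not Crawl4AI's actual implementation):

```python
import asyncio

class HookRegistry:
    """Toy stand-in for the hook mechanism described above."""

    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        # Register a callable to run at the named point in the crawl.
        self._hooks[name] = fn

    async def dispatch(self, name, *args):
        # Invoke the registered hook, awaiting it if it returned a coroutine.
        fn = self._hooks.get(name)
        if fn is None:
            return None
        result = fn(*args)
        if asyncio.iscoroutine(result):
            result = await result
        return result

async def demo():
    hooks = HookRegistry()
    hooks.set_hook("before_goto", lambda url: f"headers set for {url}")
    return await hooks.dispatch("before_goto", "https://example.com")

print(asyncio.run(demo()))  # → headers set for https://example.com
```

This also shows why hooks may be either sync or async: the dispatcher only awaits when the registered callable actually returns a coroutine.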

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -160,10 +160,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

# Perform the crawl
url = "https://example.com"
34 changes: 18 additions & 16 deletions docs/md_v2/tutorial/tutorial.md
@@ -771,9 +771,11 @@ Here’s a concise outline for the **Custom Headers, Identity Management, and Us
async with AsyncWebCrawler(
headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
-simulate_user=True
) as crawler:
-result = await crawler.arun(url="https://example.com/secure-page")
+result = await crawler.arun(
+url="https://example.com/secure-page",
+simulate_user=True
+)
print(result.markdown[:500]) # Display extracted content
```
- This example enables detailed customization for evading detection and accessing protected pages smoothly.
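Note the change above moves `simulate_user` from the crawler constructor to `arun`: conceptually, constructor arguments set crawler-wide defaults, while per-call options apply to a single request. A toy illustration of that precedence (not Crawl4AI's internals):

```python
def effective_settings(crawler_defaults: dict, per_request: dict) -> dict:
    """Per-request options win over crawler-wide defaults (illustrative only)."""
    merged = dict(crawler_defaults)
    merged.update(per_request)
    return merged

defaults = {"user_agent": "Mozilla/5.0 ... Chrome/91.0", "simulate_user": False}
print(effective_settings(defaults, {"simulate_user": True}))
# → {'user_agent': 'Mozilla/5.0 ... Chrome/91.0', 'simulate_user': True}
```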
@@ -1576,7 +1578,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)

-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -1591,7 +1593,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")

-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -1607,7 +1609,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)

-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -1624,7 +1626,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")

-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -1640,7 +1642,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")

-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -1656,7 +1658,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")

-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -1672,7 +1674,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")

-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -1682,9 +1684,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -1694,10 +1696,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

# Perform the crawl
url = "https://example.com"