Update documents, upload new version of quickstart.
unclecode committed Oct 30, 2024
1 parent 3529c2e commit 9307c19
Showing 10 changed files with 1,471 additions and 789 deletions.
5 changes: 1 addition & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -25,12 +25,9 @@ Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-po
- 💾 Improved caching system for better performance
- ⚡ Optimized batch processing with automatic rate limiting

-Try new features in this colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1L6LJ3KlplhJdUy3Wcry6pstnwRpCJ3yB?usp=sharing)


## Try it Now!

-✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd?usp=sharing)
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)

1,373 changes: 651 additions & 722 deletions docs/examples/quickstart.ipynb

Large diffs are not rendered by default.

735 changes: 735 additions & 0 deletions docs/examples/quickstart_v0.ipynb

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions docs/md_v2/assets/styles.css
@@ -150,4 +150,11 @@ strong,
.tab-content pre {
margin: 0;
max-height: 300px; overflow: auto; border:none;
}

ol li::before {
content: counters(item, ".") ". ";
counter-increment: item;
/* float: left; */
/* padding-right: 5px; */
}
@@ -9,17 +9,19 @@ Here's a condensed outline of the **Installation and Setup** video content:

---

-1. **Introduction to Crawl4AI**:
-   - Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
+1 **Introduction to Crawl4AI**: Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.

-2. **Installation Overview**:
+2 **Installation Overview**:

- **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).

- **Optional Advanced Installs**:
- `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- `pip install crawl4ai[all]` - Installs all features for complete functionality.

-3. **Verifying the Installation**:
+3 **Verifying the Installation**:

- Walk through a simple test script to confirm the setup:
```python
import asyncio
# (lines collapsed in diff view) @@ -34,12 +36,13 @@
```
- Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.

-4. **Important Tips**:
+4 **Important Tips**:

- **Run** `playwright install` **after installation** to set up dependencies.
- **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- If you encounter issues, refer to the documentation or GitHub issues.

-5. **Wrap Up**:
+5 **Wrap Up**:
- Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).

---
24 changes: 16 additions & 8 deletions docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
@@ -11,18 +11,21 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri

### **Overview of Advanced Features**

-1. **Introduction to Advanced Features**:
+1 **Introduction to Advanced Features**:

- Briefly introduce Crawl4AI’s advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.

-2. **Taking Screenshots**:
+2 **Taking Screenshots**:

- Explain the screenshot capability for capturing page state and verifying content.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", screenshot=True)
```
- Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
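Since the screenshot comes back base64-encoded in `result` (per the note above), saving it to disk is a short decode step. A minimal sketch, assuming the attribute is named `result.screenshot`:

```python
import base64

def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a base64-encoded screenshot and write it to disk.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# With a crawl result (attribute name assumed, not confirmed by this diff):
# save_screenshot(result.screenshot, "example_page.png")
```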

-3. **Media and Link Extraction**:
+3 **Media and Link Extraction**:

- Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- **Example**:
```python
# (lines collapsed in diff view) @@ -31,37 +34,42 @@
print("Links:", result.links)
```
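The exact schema of `result.media` and `result.links` isn't shown in this diff; assuming each is a dict of lists keyed by category (a common shape: `images`/`videos` and `internal`/`external`), a quick post-processing helper might look like:

```python
def summarize_extraction(media: dict, links: dict) -> dict:
    """Count extracted media items and links per category."""
    return {
        "media": {kind: len(items) for kind, items in media.items()},
        "links": {kind: len(items) for kind, items in links.items()},
    }

# Hypothetical result shapes, for illustration only:
sample_media = {"images": [{"src": "/logo.png"}, {"src": "/hero.jpg"}], "videos": []}
sample_links = {"internal": ["/about"], "external": ["https://example.org"]}
print(summarize_extraction(sample_media, sample_links))
# → {'media': {'images': 2, 'videos': 0}, 'links': {'internal': 1, 'external': 1}}
```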

-4. **Custom User Agent**:
+4 **Custom User Agent**:

- Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
```
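To simulate several devices across requests, one common pattern is rotating through a pool of user-agent strings and passing the pick as the `user_agent` argument shown above; a sketch (the pool entries are illustrative):

```python
import random

# Illustrative user-agent pool; real pools would use full, current UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
]

def pick_user_agent(rng=random) -> str:
    """Choose a user-agent string for the next request."""
    return rng.choice(USER_AGENTS)

# Hypothetical usage with the API shown above:
# result = await crawler.arun(url="https://www.example.com", user_agent=pick_user_agent())
```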

-5. **Custom Hooks for Enhanced Control**:
+5 **Custom Hooks for Enhanced Control**:

- Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
- **Example**: Setting a custom header with `before_get_url` hook.
```python
async def before_get_url(page):
await page.set_extra_http_headers({"X-Test-Header": "test"})
```

-6. **CSS Selectors for Targeted Extraction**:
+6 **CSS Selectors for Targeted Extraction**:

- Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", css_selector="h2")
print("H2 Tags:", result.extracted_content)
```

-7. **Crawling Inside Iframes**:
+7 **Crawling Inside Iframes**:

- Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", process_iframes=True)
```

-8. **Wrap-Up**:
+8 **Wrap-Up**:

- Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- Tease upcoming videos where each feature will be explored in detail.

@@ -42,7 +42,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)

-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -57,7 +57,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")

-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -73,7 +73,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)

-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -90,7 +90,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")

-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -106,7 +106,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")

-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -122,7 +122,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")

-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -138,7 +138,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")

-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -148,9 +148,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
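Conceptually, `set_hook` stores one callable per event name, and the crawler awaits it at the matching point in the workflow. A toy registry illustrating that mechanism (not Crawl4AI's actual implementation):

```python
import asyncio

class HookRegistry:
    """Toy stand-in for the hook mechanism described above."""

    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        # Register a callable to run at the named point in the crawl.
        self._hooks[name] = fn

    async def dispatch(self, name, *args):
        # Invoke the registered hook, awaiting it if it returned a coroutine.
        fn = self._hooks.get(name)
        if fn is None:
            return None
        result = fn(*args)
        if asyncio.iscoroutine(result):
            result = await result
        return result

async def demo():
    hooks = HookRegistry()
    hooks.set_hook("before_goto", lambda url: f"headers set for {url}")
    return await hooks.dispatch("before_goto", "https://example.com")

print(asyncio.run(demo()))  # → headers set for https://example.com
```

This also shows why hooks may be either sync or async: the dispatcher only awaits when the registered callable actually returns a coroutine.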

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -160,10 +160,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

# Perform the crawl
url = "https://example.com"
34 changes: 18 additions & 16 deletions docs/md_v2/tutorial/tutorial.md
@@ -771,9 +771,11 @@ Here’s a concise outline for the **Custom Headers, Identity Management, and Us
async with AsyncWebCrawler(
headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
-simulate_user=True
) as crawler:
-result = await crawler.arun(url="https://example.com/secure-page")
+result = await crawler.arun(
+url="https://example.com/secure-page",
+simulate_user=True
+)
print(result.markdown[:500]) # Display extracted content
```
- This example enables detailed customization for evading detection and accessing protected pages smoothly.
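Note the change above moves `simulate_user` from the crawler constructor to `arun`: conceptually, constructor arguments set crawler-wide defaults, while per-call options apply to a single request. A toy illustration of that precedence (not Crawl4AI's internals):

```python
def effective_settings(crawler_defaults: dict, per_request: dict) -> dict:
    """Per-request options win over crawler-wide defaults (illustrative only)."""
    merged = dict(crawler_defaults)
    merged.update(per_request)
    return merged

defaults = {"user_agent": "Mozilla/5.0 ... Chrome/91.0", "simulate_user": False}
print(effective_settings(defaults, {"simulate_user": True}))
# → {'user_agent': 'Mozilla/5.0 ... Chrome/91.0', 'simulate_user': True}
```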
@@ -1576,7 +1578,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)

-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -1591,7 +1593,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")

-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -1607,7 +1609,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)

-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -1624,7 +1626,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")

-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -1640,7 +1642,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")

-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -1656,7 +1658,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")

-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -1672,7 +1674,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")

-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -1682,9 +1684,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -1694,10 +1696,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

# Perform the crawl
url = "https://example.com"