Skip to content

Commit

Permalink
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Browse files Browse the repository at this point in the history
  • Loading branch information
unclecode committed Dec 8, 2024
2 parents be37abe + 740214e commit d720013
Show file tree
Hide file tree
Showing 11 changed files with 848 additions and 479 deletions.
86 changes: 86 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,91 @@
# Changelog

## [0.4.1] December 8, 2024

### **File: `crawl4ai/async_crawler_strategy.py`**

#### **New Parameters and Attributes Added**
- **`text_only` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.

#### **Browser Context Adjustments**
- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.

#### **Dynamic Content Handling**
- **Full Page Scan Feature**:
- Scrolls through the entire page while dynamically detecting content changes.
- Ensures scrolling stops when no new dynamic content is loaded.

#### **Session Management**
- Added **`create_session`** method:
- Creates a new browser session and assigns a unique ID.
- Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.

#### **Improved Content Loading and Adjustment**
- **`adjust_viewport_to_content`**:
- Automatically adjusts viewport to match content dimensions.
- Includes scaling via Chrome DevTools Protocol (CDP).
- Enhanced content loading:
- Waits for images to load and ensures network activity is idle before proceeding.

#### **Error Handling and Logging**
- Improved error handling and detailed logging for:
- Viewport adjustment (`adjust_viewport_to_content`).
- Full page scanning (`scan_full_page`).
- Dynamic content loading.

#### **Refactoring and Cleanup**
- Removed hardcoded viewport dimensions in multiple places, replaced with dynamic values (`self.viewport_width`, `self.viewport_height`).
- Removed commented-out and unused code for better readability.
- Added default value for `delay_before_return_html` parameter.

#### **Optimizations**
- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
- Improved compatibility for different browser types (`chrome`, `firefox`, `webkit`).

---

### **File: `docs/examples/quickstart_async.py`**

#### **Schema Adjustment**
- Changed schema reference for `LLMExtractionStrategy`:
- **Old**: `OpenAIModelFee.schema()`
- **New**: `OpenAIModelFee.model_json_schema()`
- This likely ensures better compatibility with the `OpenAIModelFee` class and its JSON schema.

#### **Documentation Comments Updated**
- Improved extraction instruction for schema-based LLM strategies.

---

### **New Features Added**
1. **Text-Only Mode**:
- Focuses on minimal resource usage by disabling non-essential browser features.
2. **Light Mode**:
- Optimizes browser for performance by disabling background tasks and unnecessary services.
3. **Full Page Scanning**:
- Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
4. **Dynamic Viewport Adjustment**:
- Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
5. **Session Management**:
- Simplifies session handling with better support for persistent and non-persistent contexts.

---

### **Bug Fixes**
- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.







## [0.3.75] December 1, 2024

### PruningContentFilter
Expand Down
23 changes: 12 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,9 @@

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

[✨ Check out latest update v0.4.1](#-recent-updates)

🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)

[✨ Check out latest update v0.4.0](#-recent-updates)
🎉 **Version 0.4.x is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)

## 🧐 Why Crawl4AI?

Expand Down Expand Up @@ -80,6 +79,7 @@ if __name__ == "__main__":
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.

</details>

Expand All @@ -95,6 +95,8 @@ if __name__ == "__main__":
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.

</details>

Expand All @@ -121,8 +123,6 @@ if __name__ == "__main__":

</details>



## Try it Now!

✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
Expand Down Expand Up @@ -626,13 +626,14 @@ async def test_news_crawl():

## ✨ Recent Updates

- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
-**Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.

Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).
Read the full details of this release in our [0.4.1 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.1.md).

## 📖 Documentation & Roadmap

Expand Down
2 changes: 1 addition & 1 deletion crawl4ai/__version__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.4.0"
__version__ = "0.4.1"
Loading

0 comments on commit d720013

Please sign in to comment.