diff --git a/README.md b/README.md
index e4f9bad..e42adbd 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
-Official SDKs for the ScrapeGraph AI API - Intelligent web scraping powered by AI. Extract structured data from any webpage with natural language prompts.
+Official SDKs for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. Extract structured data from any webpage or perform AI-powered web searches with natural language prompts.
Get your [API key](https://scrapegraphai.com)!
@@ -17,7 +17,7 @@ Get your [API key](https://scrapegraphai.com)!
- [Python SDK Documentation](scrapegraph-py/README.md)
- [JavaScript SDK Documentation](scrapegraph-js/README.md)
-- [API Documentation](https://docs.scrapegraphai.com)
+- [API Documentation](https://docs.scrapegraphai.com)
- [Website](https://scrapegraphai.com)
## 📦 Installation
@@ -34,7 +34,7 @@ npm install scrapegraph-js
## 🎯 Core Features
-- 🤖 **AI-Powered Extraction**: Use natural language to describe what data you want
+- 🤖 **AI-Powered Extraction & Search**: Use natural language to extract data or search the web
- 📊 **Structured Output**: Get clean, structured data with optional schema validation
- 🔄 **Multiple Formats**: Extract data as JSON, Markdown, or custom schemas
- ⚡ **High Performance**: Concurrent processing and automatic retries
@@ -42,23 +42,23 @@ npm install scrapegraph-js
## 🛠️ Available Endpoints
-### 🔍 SmartScraper
-Extract structured data from any webpage using natural language prompts.
+### 🤖 SmartScraper
+Extract structured data from any webpage or HTML content using AI and natural language prompts.
+
+### 🔍 SearchScraper
+Perform AI-powered web searches with structured results and reference URLs.
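+
+A minimal Python sketch of a search call (see the SDK guides below for full examples):
+
+```python
+from scrapegraph_py import Client
+
+client = Client(api_key="your-api-key-here")
+
+# Ask a question; the response includes the answer and the source URLs
+response = client.searchscraper(user_prompt="What is the latest version of Python?")
+print(response["result"])
+print(response["reference_urls"])
+
+client.close()
+```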
### 📝 Markdownify
Convert any webpage into clean, formatted markdown.
-### 💻 LocalScraper
-Extract information from a local HTML file using AI.
-
-
## 🌟 Key Benefits
- 📝 **Natural Language Queries**: No complex selectors or XPath needed
- 🎯 **Precise Extraction**: AI understands context and structure
-- 🔄 **Adaptive Scraping**: Works with dynamic and static content
+- 🔄 **Adaptive Processing**: Works with both web content and direct HTML
- 📊 **Schema Validation**: Ensure data consistency with Pydantic/TypeScript
- ⚡ **Async Support**: Handle multiple requests efficiently
+- 🔍 **Source Attribution**: Get reference URLs for search results
## 💡 Use Cases
@@ -67,13 +67,14 @@ Extract information from a local HTML file using AI.
- 📰 **Content Aggregation**: Convert articles to structured formats
- 🔍 **Data Mining**: Extract specific information from multiple sources
- 📱 **App Integration**: Feed clean data into your applications
+- 🌐 **Web Research**: Perform AI-powered searches with structured results
## 📖 Documentation
For detailed documentation and examples, visit:
- [Python SDK Guide](scrapegraph-py/README.md)
- [JavaScript SDK Guide](scrapegraph-js/README.md)
-- [API Documentation](https://docs.scrapegraphai.com)
+- [API Documentation](https://docs.scrapegraphai.com)
## 💬 Support & Feedback
diff --git a/scrapegraph-py/CHANGELOG.md b/scrapegraph-py/CHANGELOG.md
index 3ed141b..89c12db 100644
--- a/scrapegraph-py/CHANGELOG.md
+++ b/scrapegraph-py/CHANGELOG.md
@@ -1,3 +1,27 @@
+## [1.11.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.2...v1.11.0-beta.1) (2025-02-03)
+
+
+### Features
+
+* add optional headers to request ([bb851d7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb851d785d121b039d5e968327fb930955a3fd92))
+* merged localscraper into smartscraper ([503dbd1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/503dbd19b8cec4d2ff4575786b0eec25db2e80e6))
+* modified icons ([bcb9b0b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bcb9b0b731b057d242fdf80b43d96879ff7a2764))
+* searchscraper ([2e04e5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e04e5a1bbd207a7ceeea594878bdea542a7a856))
+* updated readmes ([bfdbea0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bfdbea038918d79df2e3e9442e25d5f08bbccbbc))
+
+
+### chore
+
+* refactor examples ([8e00846](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e008465f7280c53e2faab7a92f02871ffc5b867))
+* **tests:** updated tests ([9149ce8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9149ce85a78b503098f80910c20de69831030378))
+
+
+### CI
+
+* **release:** 1.9.0-beta.6 [skip ci] ([c898e99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c898e9917cde8fe291312e3b2dc7d06f6afd3932))
+* **release:** 1.9.0-beta.7 [skip ci] ([2c1875b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2c1875b1413eef5a2335688a7e0baf32ec31dcee))
+
+## [1.9.0-beta.7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.6...v1.9.0-beta.7) (2025-02-03)
## [1.10.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.1...v1.10.2) (2025-01-22)
@@ -18,6 +42,19 @@
### Features
+* add optional headers to request ([bb851d7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb851d785d121b039d5e968327fb930955a3fd92))
+* merged localscraper into smartscraper ([503dbd1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/503dbd19b8cec4d2ff4575786b0eec25db2e80e6))
+* modified icons ([bcb9b0b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bcb9b0b731b057d242fdf80b43d96879ff7a2764))
+* searchscraper ([2e04e5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e04e5a1bbd207a7ceeea594878bdea542a7a856))
+* updated readmes ([bfdbea0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bfdbea038918d79df2e3e9442e25d5f08bbccbbc))
+
+
+### chore
+
+* refactor examples ([8e00846](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e008465f7280c53e2faab7a92f02871ffc5b867))
+* **tests:** updated tests ([9149ce8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9149ce85a78b503098f80910c20de69831030378))
+
+## [1.9.0-beta.6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.5...v1.9.0-beta.6) (2025-01-08)
* add integration for sql ([2543b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2543b5a9b84826de5c583d38fe89cf21aad077e6))
diff --git a/scrapegraph-py/README.md b/scrapegraph-py/README.md
index e530452..bb4c3ce 100644
--- a/scrapegraph-py/README.md
+++ b/scrapegraph-py/README.md
@@ -4,7 +4,7 @@
[](https://pypi.org/project/scrapegraph-py/)
[](https://opensource.org/licenses/MIT)
[](https://github.com/psf/black)
-[](https://docs.scrapegraphai.com)
+[](https://docs.scrapegraphai.com)
@@ -20,7 +20,7 @@ pip install scrapegraph-py
## 🚀 Features
-- 🤖 AI-powered web scraping
+- 🤖 AI-powered web scraping and search
- 🔄 Both sync and async clients
- 📊 Structured output with Pydantic schemas
- 🔍 Detailed logging
@@ -40,21 +40,36 @@ client = Client(api_key="your-api-key-here")
## 📚 Available Endpoints
-### 🔍 SmartScraper
+### 🤖 SmartScraper
-Scrapes any webpage using AI to extract specific information.
+Extract structured data from any webpage or HTML content using AI.
```python
from scrapegraph_py import Client
client = Client(api_key="your-api-key-here")
-# Basic usage
+# Using a URL
response = client.smartscraper(
website_url="https://example.com",
user_prompt="Extract the main heading and description"
)
+# Or using HTML content
+html_content = """
+<html>
+    <body>
+        <h1>Company Name</h1>
+        <p>We are a technology company focused on AI solutions.</p>
+    </body>
+</html>
+"""
+
+response = client.smartscraper(
+ website_html=html_content,
+ user_prompt="Extract the company description"
+)
+
print(response)
```
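+
+Both `smartscraper` and `markdownify` also accept optional HTTP headers (for example a custom user agent or cookies). A short sketch:
+
+```python
+response = client.smartscraper(
+    website_url="https://example.com",
+    user_prompt="Extract the main heading and description",
+    headers={
+        "User-Agent": "Mozilla/5.0",
+        "Cookie": "session=123",
+    },
+)
+```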
@@ -80,46 +95,56 @@ response = client.smartscraper(
-### 📝 Markdownify
+### 🔍 SearchScraper
-Converts any webpage into clean, formatted markdown.
+Perform AI-powered web searches with structured results and reference URLs.
```python
from scrapegraph_py import Client
client = Client(api_key="your-api-key-here")
-response = client.markdownify(
- website_url="https://example.com"
+response = client.searchscraper(
+ user_prompt="What is the latest version of Python and its main features?"
)
-print(response)
+print(f"Answer: {response['result']}")
+print(f"Sources: {response['reference_urls']}")
```
-### 💻 LocalScraper
-
-Extracts information from HTML content using AI.
+<details>
+<summary>Output Schema (Optional)</summary>
```python
+from pydantic import BaseModel, Field
from scrapegraph_py import Client
client = Client(api_key="your-api-key-here")
-html_content = """
-<html>
-    <body>
-        <h1>Company Name</h1>
-        <p>We are a technology company focused on AI solutions.</p>
-    </body>
-</html>
-"""
+class PythonVersionInfo(BaseModel):
+ version: str = Field(description="The latest Python version number")
+ release_date: str = Field(description="When this version was released")
+ major_features: list[str] = Field(description="List of main features")
+
+response = client.searchscraper(
+ user_prompt="What is the latest version of Python and its main features?",
+ output_schema=PythonVersionInfo
+)
+```
+
+</details>
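+
+Every search returns a `request_id`, so a completed result can be fetched again later with `get_searchscraper`. A minimal sketch:
+
+```python
+previous_result = client.get_searchscraper(request_id=response["request_id"])
+print(previous_result["result"])
+```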
-response = client.localscraper(
- user_prompt="Extract the company description",
- website_html=html_content
+### 📝 Markdownify
+
+Convert any webpage into clean, formatted markdown.
+
+```python
+from scrapegraph_py import Client
+
+client = Client(api_key="your-api-key-here")
+
+response = client.markdownify(
+ website_url="https://example.com"
)
print(response)
@@ -177,7 +202,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
## 🔗 Links
- [Website](https://scrapegraphai.com)
-- [Documentation](https://docs.scrapegraphai.com)
+- [Documentation](https://docs.scrapegraphai.com)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)
---
diff --git a/scrapegraph-py/examples/async_markdownify_example.py b/scrapegraph-py/examples/async/async_markdownify_example.py
similarity index 100%
rename from scrapegraph-py/examples/async_markdownify_example.py
rename to scrapegraph-py/examples/async/async_markdownify_example.py
diff --git a/scrapegraph-py/examples/async/async_searchscraper_example.py b/scrapegraph-py/examples/async/async_searchscraper_example.py
new file mode 100644
index 0000000..1aae8f9
--- /dev/null
+++ b/scrapegraph-py/examples/async/async_searchscraper_example.py
@@ -0,0 +1,46 @@
+"""
+Example of using the async searchscraper functionality to search for information concurrently.
+"""
+
+import asyncio
+
+from scrapegraph_py import AsyncClient
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+
+async def main():
+ # Initialize async client
+ sgai_client = AsyncClient(api_key="your-api-key-here")
+
+ # List of search queries
+ queries = [
+ "What is the latest version of Python and what are its main features?",
+ "What are the key differences between Python 2 and Python 3?",
+ "What is Python's GIL and how does it work?",
+ ]
+
+ # Create tasks for concurrent execution
+ tasks = [sgai_client.searchscraper(user_prompt=query) for query in queries]
+
+ # Execute requests concurrently
+ responses = await asyncio.gather(*tasks, return_exceptions=True)
+
+ # Process results
+ for i, response in enumerate(responses):
+ if isinstance(response, Exception):
+ print(f"\nError for query {i+1}: {response}")
+ else:
+ print(f"\nSearch {i+1}:")
+ print(f"Query: {queries[i]}")
+ print(f"Result: {response['result']}")
+ print("Reference URLs:")
+ for url in response["reference_urls"]:
+ print(f"- {url}")
+
+ await sgai_client.close()
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
diff --git a/scrapegraph-py/examples/async/async_searchscraper_schema_example.py b/scrapegraph-py/examples/async/async_searchscraper_schema_example.py
new file mode 100644
index 0000000..753a3e0
--- /dev/null
+++ b/scrapegraph-py/examples/async/async_searchscraper_schema_example.py
@@ -0,0 +1,119 @@
+"""
+Example of using the async searchscraper functionality with output schemas for extraction.
+"""
+
+import asyncio
+from typing import List
+
+from pydantic import BaseModel
+
+from scrapegraph_py import AsyncClient
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+
+# Define schemas for extracting structured data
+class PythonVersionInfo(BaseModel):
+ version: str
+ release_date: str
+ major_features: List[str]
+
+
+class PythonComparison(BaseModel):
+ key_differences: List[str]
+ backward_compatible: bool
+ migration_difficulty: str
+
+
+class GILInfo(BaseModel):
+ definition: str
+ purpose: str
+ limitations: List[str]
+ workarounds: List[str]
+
+
+async def main():
+ # Initialize async client
+ sgai_client = AsyncClient(api_key="your-api-key-here")
+
+ # Define search queries with their corresponding schemas
+ searches = [
+ {
+ "prompt": "What is the latest version of Python? Include the release date and main features.",
+ "schema": PythonVersionInfo,
+ },
+ {
+ "prompt": "Compare Python 2 and Python 3, including backward compatibility and migration difficulty.",
+ "schema": PythonComparison,
+ },
+ {
+ "prompt": "Explain Python's GIL, its purpose, limitations, and possible workarounds.",
+ "schema": GILInfo,
+ },
+ ]
+
+ # Create tasks for concurrent execution
+ tasks = [
+ sgai_client.searchscraper(
+ user_prompt=search["prompt"],
+ output_schema=search["schema"],
+ )
+ for search in searches
+ ]
+
+ # Execute requests concurrently
+ responses = await asyncio.gather(*tasks, return_exceptions=True)
+
+ # Process results
+ for i, response in enumerate(responses):
+ if isinstance(response, Exception):
+ print(f"\nError for search {i+1}: {response}")
+ else:
+ print(f"\nSearch {i+1}:")
+ print(f"Query: {searches[i]['prompt']}")
+ # print(f"Raw Result: {response['result']}")
+
+ try:
+ # Try to extract structured data using the schema
+ result = searches[i]["schema"].model_validate(response["result"])
+
+ # Print extracted structured data
+ if isinstance(result, PythonVersionInfo):
+ print("\nExtracted Data:")
+ print(f"Python Version: {result.version}")
+ print(f"Release Date: {result.release_date}")
+ print("Major Features:")
+ for feature in result.major_features:
+ print(f"- {feature}")
+
+ elif isinstance(result, PythonComparison):
+ print("\nExtracted Data:")
+ print("Key Differences:")
+ for diff in result.key_differences:
+ print(f"- {diff}")
+ print(f"Backward Compatible: {result.backward_compatible}")
+ print(f"Migration Difficulty: {result.migration_difficulty}")
+
+ elif isinstance(result, GILInfo):
+ print("\nExtracted Data:")
+ print(f"Definition: {result.definition}")
+ print(f"Purpose: {result.purpose}")
+ print("Limitations:")
+ for limit in result.limitations:
+ print(f"- {limit}")
+ print("Workarounds:")
+ for workaround in result.workarounds:
+ print(f"- {workaround}")
+ except Exception as e:
+ print(f"\nCould not extract structured data: {e}")
+
+ print("\nReference URLs:")
+ for url in response["reference_urls"]:
+ print(f"- {url}")
+
+ await sgai_client.close()
+
+
+if __name__ == "__main__":
+ asyncio.run(main())
diff --git a/scrapegraph-py/examples/async_smartscraper_example.py b/scrapegraph-py/examples/async/async_smartscraper_example.py
similarity index 100%
rename from scrapegraph-py/examples/async_smartscraper_example.py
rename to scrapegraph-py/examples/async/async_smartscraper_example.py
diff --git a/scrapegraph-py/examples/async_smartscraper_schema_example.py b/scrapegraph-py/examples/async/async_smartscraper_schema_example.py
similarity index 100%
rename from scrapegraph-py/examples/async_smartscraper_schema_example.py
rename to scrapegraph-py/examples/async/async_smartscraper_schema_example.py
diff --git a/scrapegraph-py/examples/localscraper_example.py b/scrapegraph-py/examples/localscraper_example.py
deleted file mode 100644
index 65bbed6..0000000
--- a/scrapegraph-py/examples/localscraper_example.py
+++ /dev/null
@@ -1,31 +0,0 @@
-from scrapegraph_py import Client
-from scrapegraph_py.logger import sgai_logger
-
-sgai_logger.set_logging(level="INFO")
-
-# Initialize the client
-sgai_client = Client(api_key="your-api-key-here")
-
-# Example HTML content
-html_content = """
-<html>
-    <body>
-        <h1>Company Name</h1>
-        <p>We are a technology company focused on AI solutions.</p>
-    </body>
-</html>
-"""
-
-# LocalScraper request
-response = sgai_client.localscraper(
- user_prompt="Extract the company description and contact information",
- website_html=html_content,
-)
-
-# Print the response
-print(f"Request ID: {response['request_id']}")
-print(f"Result: {response['result']}")
diff --git a/scrapegraph-py/examples/get_credits_example.py b/scrapegraph-py/examples/miscellaneous/get_credits_example.py
similarity index 100%
rename from scrapegraph-py/examples/get_credits_example.py
rename to scrapegraph-py/examples/miscellaneous/get_credits_example.py
diff --git a/scrapegraph-py/examples/miscellaneous/optional_headers_example.py b/scrapegraph-py/examples/miscellaneous/optional_headers_example.py
new file mode 100644
index 0000000..7763f8f
--- /dev/null
+++ b/scrapegraph-py/examples/miscellaneous/optional_headers_example.py
@@ -0,0 +1,28 @@
+from scrapegraph_py import Client
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+# Initialize the client with explicit API key
+sgai_client = Client(api_key="your-api-key-here")
+
+# SmartScraper request
+response = sgai_client.smartscraper(
+ website_url="https://example.com",
+ user_prompt="Extract the main heading, description, and summary of the webpage",
+ headers={
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
+ "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
+ "Accept-Language": "en-US,en;q=0.9",
+ "Accept-Encoding": "gzip, deflate, br",
+ "Connection": "keep-alive",
+ "Upgrade-Insecure-Requests": "1",
+ },
+)
+
+
+# Print the response
+print(f"Request ID: {response['request_id']}")
+print(f"Result: {response['result']}")
+
+sgai_client.close()
diff --git a/scrapegraph-py/examples/feedback_example.py b/scrapegraph-py/examples/miscellaneous/send_feedback_example.py
similarity index 96%
rename from scrapegraph-py/examples/feedback_example.py
rename to scrapegraph-py/examples/miscellaneous/send_feedback_example.py
index dc20ae2..4c397ed 100644
--- a/scrapegraph-py/examples/feedback_example.py
+++ b/scrapegraph-py/examples/miscellaneous/send_feedback_example.py
@@ -1,28 +1,28 @@
-from scrapegraph_py import Client
-from scrapegraph_py.logger import sgai_logger
-
-sgai_logger.set_logging(level="INFO")
-
-# Initialize the client
-sgai_client = Client(api_key="your-api-key-here")
-
-# Example request_id (replace with an actual request_id from a previous request)
-request_id = "your-request-id-here"
-
-# Check remaining credits
-credits = sgai_client.get_credits()
-print(f"Credits Info: {credits}")
-
-# Submit feedback for a previous request
-feedback_response = sgai_client.submit_feedback(
- request_id=request_id,
- rating=5, # Rating from 1-5
- feedback_text="The extraction was accurate and exactly what I needed!",
-)
-print(f"\nFeedback Response: {feedback_response}")
-
-# Get previous results using get_smartscraper
-previous_result = sgai_client.get_smartscraper(request_id=request_id)
-print(f"\nRetrieved Previous Result: {previous_result}")
-
-sgai_client.close()
+from scrapegraph_py import Client
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+# Initialize the client
+sgai_client = Client(api_key="your-api-key-here")
+
+# Example request_id (replace with an actual request_id from a previous request)
+request_id = "your-request-id-here"
+
+# Check remaining credits
+credits = sgai_client.get_credits()
+print(f"Credits Info: {credits}")
+
+# Submit feedback for a previous request
+feedback_response = sgai_client.submit_feedback(
+ request_id=request_id,
+ rating=5, # Rating from 1-5
+ feedback_text="The extraction was accurate and exactly what I needed!",
+)
+print(f"\nFeedback Response: {feedback_response}")
+
+# Get previous results using get_smartscraper
+previous_result = sgai_client.get_smartscraper(request_id=request_id)
+print(f"\nRetrieved Previous Result: {previous_result}")
+
+sgai_client.close()
diff --git a/scrapegraph-py/examples/markdownify_example.py b/scrapegraph-py/examples/sync/markdownify_example.py
similarity index 100%
rename from scrapegraph-py/examples/markdownify_example.py
rename to scrapegraph-py/examples/sync/markdownify_example.py
diff --git a/scrapegraph-py/examples/sync/searchscraper_example.py b/scrapegraph-py/examples/sync/searchscraper_example.py
new file mode 100644
index 0000000..2b1903d
--- /dev/null
+++ b/scrapegraph-py/examples/sync/searchscraper_example.py
@@ -0,0 +1,26 @@
+"""
+Example of using the searchscraper functionality to search for information.
+"""
+
+from scrapegraph_py import Client
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+# Initialize the client
+client = Client(api_key="your-api-key-here")
+
+# Send a searchscraper request
+response = client.searchscraper(
+ user_prompt="What is the latest version of Python and what are its main features?"
+)
+
+# Print the results
+print("\nResults:")
+print(f"Answer: {response['result']}")
+print("\nReference URLs:")
+for url in response["reference_urls"]:
+ print(f"- {url}")
+
+# Close the client
+client.close()
diff --git a/scrapegraph-py/examples/sync/searchscraper_schema_example.py b/scrapegraph-py/examples/sync/searchscraper_schema_example.py
new file mode 100644
index 0000000..8c678b2
--- /dev/null
+++ b/scrapegraph-py/examples/sync/searchscraper_schema_example.py
@@ -0,0 +1,41 @@
+"""
+Example of using the searchscraper functionality with a custom output schema.
+"""
+
+from typing import List
+
+from pydantic import BaseModel
+
+from scrapegraph_py import Client
+from scrapegraph_py.logger import sgai_logger
+
+sgai_logger.set_logging(level="INFO")
+
+
+# Define a custom schema for the output
+class PythonVersionInfo(BaseModel):
+ version: str
+ release_date: str
+ major_features: List[str]
+ is_latest: bool
+
+
+# Initialize the client
+client = Client(api_key="your-api-key-here")
+
+# Send a searchscraper request with schema
+response = client.searchscraper(
+ user_prompt="What is the latest version of Python? Include the release date and main features.",
+ output_schema=PythonVersionInfo,
+)
+
+# The result will be structured according to our schema
+print(f"Request ID: {response['request_id']}")
+print(f"Result: {response['result']}")
+
+print("\nReference URLs:")
+for url in response["reference_urls"]:
+ print(f"- {url}")
+
+# Close the client
+client.close()
diff --git a/scrapegraph-py/examples/smartscraper_example.py b/scrapegraph-py/examples/sync/smartscraper_example.py
similarity index 85%
rename from scrapegraph-py/examples/smartscraper_example.py
rename to scrapegraph-py/examples/sync/smartscraper_example.py
index 37e4542..f583804 100644
--- a/scrapegraph-py/examples/smartscraper_example.py
+++ b/scrapegraph-py/examples/sync/smartscraper_example.py
@@ -9,9 +9,11 @@
# SmartScraper request
response = sgai_client.smartscraper(
website_url="https://example.com",
+ # website_html="...", # Optional, if you want to pass in HTML content instead of a URL
user_prompt="Extract the main heading, description, and summary of the webpage",
)
+
# Print the response
print(f"Request ID: {response['request_id']}")
print(f"Result: {response['result']}")
diff --git a/scrapegraph-py/examples/smartscraper_schema_example.py b/scrapegraph-py/examples/sync/smartscraper_schema_example.py
similarity index 89%
rename from scrapegraph-py/examples/smartscraper_schema_example.py
rename to scrapegraph-py/examples/sync/smartscraper_schema_example.py
index 3553a22..5b54bd8 100644
--- a/scrapegraph-py/examples/smartscraper_schema_example.py
+++ b/scrapegraph-py/examples/sync/smartscraper_schema_example.py
@@ -16,6 +16,7 @@ class WebpageSchema(BaseModel):
# SmartScraper request with output schema
response = sgai_client.smartscraper(
website_url="https://example.com",
+ # website_html="...", # Optional, if you want to pass in HTML content instead of a URL
user_prompt="Extract webpage information",
output_schema=WebpageSchema,
)
diff --git a/scrapegraph-py/pyproject.toml b/scrapegraph-py/pyproject.toml
index 90b2f78..77bedda 100644
--- a/scrapegraph-py/pyproject.toml
+++ b/scrapegraph-py/pyproject.toml
@@ -83,7 +83,7 @@ line-length = 88
[tool.ruff.lint]
select = ["F", "E", "W", "C"]
-ignore = ["E203", "E501"] # Ignore conflicts with Black
+ignore = ["E203", "E501", "C901"] # Ignore conflicts with Black and function complexity
[tool.mypy]
python_version = "3.10"
@@ -97,4 +97,4 @@ build-backend = "hatchling.build"
[tool.poe.tasks]
pylint-local = "pylint scrapegraph_py/**/*.py"
-pylint-ci = "pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraph_py/**/*.py"
+pylint-ci = "pylint --disable=C0114,C0115,C0116,C901 --exit-zero scrapegraph_py/**/*.py"
diff --git a/scrapegraph-py/scrapegraph_py/async_client.py b/scrapegraph-py/scrapegraph_py/async_client.py
index ce74b96..99b6212 100644
--- a/scrapegraph-py/scrapegraph_py/async_client.py
+++ b/scrapegraph-py/scrapegraph_py/async_client.py
@@ -9,11 +9,11 @@
from scrapegraph_py.exceptions import APIError
from scrapegraph_py.logger import sgai_logger as logger
from scrapegraph_py.models.feedback import FeedbackRequest
-from scrapegraph_py.models.localscraper import (
- GetLocalScraperRequest,
- LocalScraperRequest,
-)
from scrapegraph_py.models.markdownify import GetMarkdownifyRequest, MarkdownifyRequest
+from scrapegraph_py.models.searchscraper import (
+ GetSearchScraperRequest,
+ SearchScraperRequest,
+)
from scrapegraph_py.models.smartscraper import (
GetSmartScraperRequest,
SmartScraperRequest,
@@ -136,11 +136,15 @@ async def _make_request(self, method: str, url: str, **kwargs) -> Any:
logger.info(f"⏳ Waiting {retry_delay}s before retry {attempt + 2}")
await asyncio.sleep(retry_delay)
- async def markdownify(self, website_url: str):
+ async def markdownify(
+ self, website_url: str, headers: Optional[dict[str, str]] = None
+ ):
"""Send a markdownify request"""
logger.info(f"🔍 Starting markdownify request for {website_url}")
+ if headers:
+ logger.debug("🔧 Using custom headers")
- request = MarkdownifyRequest(website_url=website_url)
+ request = MarkdownifyRequest(website_url=website_url, headers=headers)
logger.debug("✅ Request validation passed")
result = await self._make_request(
@@ -165,16 +169,26 @@ async def get_markdownify(self, request_id: str):
async def smartscraper(
self,
- website_url: str,
user_prompt: str,
+ website_url: Optional[str] = None,
+ website_html: Optional[str] = None,
+ headers: Optional[dict[str, str]] = None,
output_schema: Optional[BaseModel] = None,
):
"""Send a smartscraper request"""
- logger.info(f"🔍 Starting smartscraper request for {website_url}")
+ logger.info("🔍 Starting smartscraper request")
+ if website_url:
+ logger.debug(f"🌐 URL: {website_url}")
+ if website_html:
+ logger.debug("📄 Using provided HTML content")
+ if headers:
+ logger.debug("🔧 Using custom headers")
logger.debug(f"📝 Prompt: {user_prompt}")
request = SmartScraperRequest(
website_url=website_url,
+ website_html=website_html,
+ headers=headers,
user_prompt=user_prompt,
output_schema=output_schema,
)
@@ -200,43 +214,6 @@ async def get_smartscraper(self, request_id: str):
logger.info(f"✨ Successfully retrieved result for request {request_id}")
return result
- async def localscraper(
- self,
- user_prompt: str,
- website_html: str,
- output_schema: Optional[BaseModel] = None,
- ):
- """Send a localscraper request"""
- logger.info("🔍 Starting localscraper request")
- logger.debug(f"📝 Prompt: {user_prompt}")
-
- request = LocalScraperRequest(
- user_prompt=user_prompt,
- website_html=website_html,
- output_schema=output_schema,
- )
- logger.debug("✅ Request validation passed")
-
- result = await self._make_request(
- "POST", f"{API_BASE_URL}/localscraper", json=request.model_dump()
- )
- logger.info("✨ Localscraper request completed successfully")
- return result
-
- async def get_localscraper(self, request_id: str):
- """Get the result of a previous localscraper request"""
- logger.info(f"🔍 Fetching localscraper result for request {request_id}")
-
- # Validate input using Pydantic model
- GetLocalScraperRequest(request_id=request_id)
- logger.debug("✅ Request ID validation passed")
-
- result = await self._make_request(
- "GET", f"{API_BASE_URL}/localscraper/{request_id}"
- )
- logger.info(f"✨ Successfully retrieved result for request {request_id}")
- return result
-
async def submit_feedback(
self, request_id: str, rating: int, feedback_text: Optional[str] = None
):
@@ -268,6 +245,45 @@ async def get_credits(self):
)
return result
+ async def searchscraper(
+ self,
+ user_prompt: str,
+ headers: Optional[dict[str, str]] = None,
+ output_schema: Optional[BaseModel] = None,
+ ):
+ """Send a searchscraper request"""
+ logger.info("🔍 Starting searchscraper request")
+ logger.debug(f"📝 Prompt: {user_prompt}")
+ if headers:
+ logger.debug("🔧 Using custom headers")
+
+ request = SearchScraperRequest(
+ user_prompt=user_prompt,
+ headers=headers,
+ output_schema=output_schema,
+ )
+ logger.debug("✅ Request validation passed")
+
+ result = await self._make_request(
+ "POST", f"{API_BASE_URL}/searchscraper", json=request.model_dump()
+ )
+ logger.info("✨ Searchscraper request completed successfully")
+ return result
+
+ async def get_searchscraper(self, request_id: str):
+ """Get the result of a previous searchscraper request"""
+ logger.info(f"🔍 Fetching searchscraper result for request {request_id}")
+
+ # Validate input using Pydantic model
+ GetSearchScraperRequest(request_id=request_id)
+ logger.debug("✅ Request ID validation passed")
+
+ result = await self._make_request(
+ "GET", f"{API_BASE_URL}/searchscraper/{request_id}"
+ )
+ logger.info(f"✨ Successfully retrieved result for request {request_id}")
+ return result
+
async def close(self):
"""Close the session to free up resources"""
logger.info("🔒 Closing AsyncClient session")
diff --git a/scrapegraph-py/scrapegraph_py/client.py b/scrapegraph-py/scrapegraph_py/client.py
index 5e61a21..1168557 100644
--- a/scrapegraph-py/scrapegraph_py/client.py
+++ b/scrapegraph-py/scrapegraph_py/client.py
@@ -10,11 +10,11 @@
from scrapegraph_py.exceptions import APIError
from scrapegraph_py.logger import sgai_logger as logger
from scrapegraph_py.models.feedback import FeedbackRequest
-from scrapegraph_py.models.localscraper import (
- GetLocalScraperRequest,
- LocalScraperRequest,
-)
from scrapegraph_py.models.markdownify import GetMarkdownifyRequest, MarkdownifyRequest
+from scrapegraph_py.models.searchscraper import (
+ GetSearchScraperRequest,
+ SearchScraperRequest,
+)
from scrapegraph_py.models.smartscraper import (
GetSmartScraperRequest,
SmartScraperRequest,
@@ -148,11 +148,13 @@ def _make_request(self, method: str, url: str, **kwargs) -> Any:
logger.error(f"🔴 Connection Error: {str(e)}")
raise ConnectionError(f"Failed to connect to API: {str(e)}")
- def markdownify(self, website_url: str):
+ def markdownify(self, website_url: str, headers: Optional[dict[str, str]] = None):
"""Send a markdownify request"""
logger.info(f"🔍 Starting markdownify request for {website_url}")
+ if headers:
+ logger.debug("🔧 Using custom headers")
- request = MarkdownifyRequest(website_url=website_url)
+ request = MarkdownifyRequest(website_url=website_url, headers=headers)
logger.debug("✅ Request validation passed")
result = self._make_request(
@@ -175,16 +177,26 @@ def get_markdownify(self, request_id: str):
def smartscraper(
self,
- website_url: str,
user_prompt: str,
+ website_url: Optional[str] = None,
+ website_html: Optional[str] = None,
+ headers: Optional[dict[str, str]] = None,
output_schema: Optional[BaseModel] = None,
):
"""Send a smartscraper request"""
- logger.info(f"🔍 Starting smartscraper request for {website_url}")
+ logger.info("🔍 Starting smartscraper request")
+ if website_url:
+ logger.debug(f"🌐 URL: {website_url}")
+ if website_html:
+ logger.debug("📄 Using provided HTML content")
+ if headers:
+ logger.debug("🔧 Using custom headers")
logger.debug(f"📝 Prompt: {user_prompt}")
request = SmartScraperRequest(
website_url=website_url,
+ website_html=website_html,
+ headers=headers,
user_prompt=user_prompt,
output_schema=output_schema,
)
@@ -208,41 +220,6 @@ def get_smartscraper(self, request_id: str):
logger.info(f"✨ Successfully retrieved result for request {request_id}")
return result
- def localscraper(
- self,
- user_prompt: str,
- website_html: str,
- output_schema: Optional[BaseModel] = None,
- ):
- """Send a localscraper request"""
- logger.info("🔍 Starting localscraper request")
- logger.debug(f"📝 Prompt: {user_prompt}")
-
- request = LocalScraperRequest(
- user_prompt=user_prompt,
- website_html=website_html,
- output_schema=output_schema,
- )
- logger.debug("✅ Request validation passed")
-
- result = self._make_request(
- "POST", f"{API_BASE_URL}/localscraper", json=request.model_dump()
- )
- logger.info("✨ Localscraper request completed successfully")
- return result
-
- def get_localscraper(self, request_id: str):
- """Get the result of a previous localscraper request"""
- logger.info(f"🔍 Fetching localscraper result for request {request_id}")
-
- # Validate input using Pydantic model
- GetLocalScraperRequest(request_id=request_id)
- logger.debug("✅ Request ID validation passed")
-
- result = self._make_request("GET", f"{API_BASE_URL}/localscraper/{request_id}")
- logger.info(f"✨ Successfully retrieved result for request {request_id}")
- return result
-
def submit_feedback(
self, request_id: str, rating: int, feedback_text: Optional[str] = None
):
@@ -274,6 +251,43 @@ def get_credits(self):
)
return result
+ def searchscraper(
+ self,
+ user_prompt: str,
+ headers: Optional[dict[str, str]] = None,
+ output_schema: Optional[BaseModel] = None,
+ ):
+ """Send a searchscraper request"""
+ logger.info("🔍 Starting searchscraper request")
+ logger.debug(f"📝 Prompt: {user_prompt}")
+ if headers:
+ logger.debug("🔧 Using custom headers")
+
+ request = SearchScraperRequest(
+ user_prompt=user_prompt,
+ headers=headers,
+ output_schema=output_schema,
+ )
+ logger.debug("✅ Request validation passed")
+
+ result = self._make_request(
+ "POST", f"{API_BASE_URL}/searchscraper", json=request.model_dump()
+ )
+ logger.info("✨ Searchscraper request completed successfully")
+ return result
+
+ def get_searchscraper(self, request_id: str):
+ """Get the result of a previous searchscraper request"""
+ logger.info(f"🔍 Fetching searchscraper result for request {request_id}")
+
+ # Validate input using Pydantic model
+ GetSearchScraperRequest(request_id=request_id)
+ logger.debug("✅ Request ID validation passed")
+
+ result = self._make_request("GET", f"{API_BASE_URL}/searchscraper/{request_id}")
+ logger.info(f"✨ Successfully retrieved result for request {request_id}")
+ return result
+
def close(self):
"""Close the session to free up resources"""
logger.info("🔒 Closing Client session")
diff --git a/scrapegraph-py/scrapegraph_py/models/localscraper.py b/scrapegraph-py/scrapegraph_py/models/localscraper.py
deleted file mode 100644
index cdcfe37..0000000
--- a/scrapegraph-py/scrapegraph_py/models/localscraper.py
+++ /dev/null
@@ -1,67 +0,0 @@
-# Models for localscraper endpoint
-
-from typing import Optional, Type
-from uuid import UUID
-
-from bs4 import BeautifulSoup
-from pydantic import BaseModel, Field, model_validator
-
-
-class LocalScraperRequest(BaseModel):
- user_prompt: str = Field(
- ...,
- example="Extract info about the company",
- )
- website_html: str = Field(
- ...,
- example="Title
Content
",
- description="HTML content, maximum size 2MB",
- )
- output_schema: Optional[Type[BaseModel]] = None
-
- @model_validator(mode="after")
- def validate_user_prompt(self) -> "LocalScraperRequest":
- if self.user_prompt is None or not self.user_prompt.strip():
- raise ValueError("User prompt cannot be empty")
- if not any(c.isalnum() for c in self.user_prompt):
- raise ValueError("User prompt must contain a valid prompt")
- return self
-
- @model_validator(mode="after")
- def validate_website_html(self) -> "LocalScraperRequest":
- if self.website_html is None or not self.website_html.strip():
- raise ValueError("Website HTML cannot be empty")
-
- if len(self.website_html.encode("utf-8")) > 2 * 1024 * 1024:
- raise ValueError("Website HTML content exceeds maximum size of 2MB")
-
- try:
- soup = BeautifulSoup(self.website_html, "html.parser")
- if not soup.find():
- raise ValueError("Invalid HTML - no parseable content found")
- except Exception as e:
- raise ValueError(f"Invalid HTML structure: {str(e)}")
-
- return self
-
- def model_dump(self, *args, **kwargs) -> dict:
- data = super().model_dump(*args, **kwargs)
- # Convert the Pydantic model schema to dict if present
- if self.output_schema is not None:
- data["output_schema"] = self.output_schema.model_json_schema()
- return data
-
-
-class GetLocalScraperRequest(BaseModel):
- """Request model for get_localscraper endpoint"""
-
- request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000")
-
- @model_validator(mode="after")
- def validate_request_id(self) -> "GetLocalScraperRequest":
- try:
- # Validate the request_id is a valid UUID
- UUID(self.request_id)
- except ValueError:
- raise ValueError("request_id must be a valid UUID")
- return self
diff --git a/scrapegraph-py/scrapegraph_py/models/markdownify.py b/scrapegraph-py/scrapegraph_py/models/markdownify.py
index 5b12aa2..678ee6c 100644
--- a/scrapegraph-py/scrapegraph_py/models/markdownify.py
+++ b/scrapegraph-py/scrapegraph_py/models/markdownify.py
@@ -1,5 +1,6 @@
# Models for markdownify endpoint
+from typing import Optional
from uuid import UUID
from pydantic import BaseModel, Field, model_validator
@@ -7,6 +8,14 @@
class MarkdownifyRequest(BaseModel):
website_url: str = Field(..., example="https://scrapegraphai.com/")
+ headers: Optional[dict[str, str]] = Field(
+ None,
+ example={
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ "Cookie": "cookie1=value1; cookie2=value2",
+ },
+ description="Optional headers to send with the request, including cookies and user agent",
+ )
@model_validator(mode="after")
def validate_url(self) -> "MarkdownifyRequest":
diff --git a/scrapegraph-py/scrapegraph_py/models/searchscraper.py b/scrapegraph-py/scrapegraph_py/models/searchscraper.py
new file mode 100644
index 0000000..997d407
--- /dev/null
+++ b/scrapegraph-py/scrapegraph_py/models/searchscraper.py
@@ -0,0 +1,49 @@
+# Models for searchscraper endpoint
+
+from typing import Optional, Type
+from uuid import UUID
+
+from pydantic import BaseModel, Field, model_validator
+
+
+class SearchScraperRequest(BaseModel):
+ user_prompt: str = Field(..., example="What is the latest version of Python?")
+ headers: Optional[dict[str, str]] = Field(
+ None,
+ example={
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ "Cookie": "cookie1=value1; cookie2=value2",
+ },
+ description="Optional headers to send with the request, including cookies and user agent",
+ )
+ output_schema: Optional[Type[BaseModel]] = None
+
+ @model_validator(mode="after")
+ def validate_user_prompt(self) -> "SearchScraperRequest":
+ if self.user_prompt is None or not self.user_prompt.strip():
+ raise ValueError("User prompt cannot be empty")
+ if not any(c.isalnum() for c in self.user_prompt):
+ raise ValueError("User prompt must contain a valid prompt")
+ return self
+
+ def model_dump(self, *args, **kwargs) -> dict:
+ data = super().model_dump(*args, **kwargs)
+ # Convert the Pydantic model schema to dict if present
+ if self.output_schema is not None:
+ data["output_schema"] = self.output_schema.model_json_schema()
+ return data
+
+
+class GetSearchScraperRequest(BaseModel):
+ """Request model for get_searchscraper endpoint"""
+
+ request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000")
+
+ @model_validator(mode="after")
+ def validate_request_id(self) -> "GetSearchScraperRequest":
+ try:
+ # Validate the request_id is a valid UUID
+ UUID(self.request_id)
+ except ValueError:
+ raise ValueError("request_id must be a valid UUID")
+ return self
diff --git a/scrapegraph-py/scrapegraph_py/models/smartscraper.py b/scrapegraph-py/scrapegraph_py/models/smartscraper.py
index c1fdd9a..21b346e 100644
--- a/scrapegraph-py/scrapegraph_py/models/smartscraper.py
+++ b/scrapegraph-py/scrapegraph_py/models/smartscraper.py
@@ -3,6 +3,7 @@
from typing import Optional, Type
from uuid import UUID
+from bs4 import BeautifulSoup
from pydantic import BaseModel, Field, model_validator
@@ -11,7 +12,22 @@ class SmartScraperRequest(BaseModel):
...,
example="Extract info about the company",
)
- website_url: str = Field(..., example="https://scrapegraphai.com/")
+ website_url: Optional[str] = Field(
+ default=None, example="https://scrapegraphai.com/"
+ )
+ website_html: Optional[str] = Field(
+ default=None,
+ example="Title
Content
",
+ description="HTML content, maximum size 2MB",
+ )
+ headers: Optional[dict[str, str]] = Field(
+ None,
+ example={
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
+ "Cookie": "cookie1=value1; cookie2=value2",
+ },
+ description="Optional headers to send with the request, including cookies and user agent",
+ )
output_schema: Optional[Type[BaseModel]] = None
@model_validator(mode="after")
@@ -23,14 +39,26 @@ def validate_user_prompt(self) -> "SmartScraperRequest":
return self
@model_validator(mode="after")
- def validate_url(self) -> "SmartScraperRequest":
- if self.website_url is None or not self.website_url.strip():
- raise ValueError("Website URL cannot be empty")
- if not (
- self.website_url.startswith("http://")
- or self.website_url.startswith("https://")
- ):
- raise ValueError("Invalid URL")
+ def validate_url_and_html(self) -> "SmartScraperRequest":
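+ # At least one of website_url or website_html is required; website_html is checked first when both are set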
+ if self.website_html is not None:
+ if len(self.website_html.encode("utf-8")) > 2 * 1024 * 1024:
+ raise ValueError("Website HTML content exceeds maximum size of 2MB")
+ try:
+ soup = BeautifulSoup(self.website_html, "html.parser")
+ if not soup.find():
+ raise ValueError("Invalid HTML - no parseable content found")
+ except Exception as e:
+ raise ValueError(f"Invalid HTML structure: {str(e)}")
+ elif self.website_url is not None:
+ if not self.website_url.strip():
+ raise ValueError("Website URL cannot be empty")
+ if not (
+ self.website_url.startswith("http://")
+ or self.website_url.startswith("https://")
+ ):
+ raise ValueError("Invalid URL")
+ else:
+ raise ValueError("Either website_url or website_html must be provided")
return self
def model_dump(self, *args, **kwargs) -> dict:
diff --git a/scrapegraph-py/tests/test_async_client.py b/scrapegraph-py/tests/test_async_client.py
index 78f9717..69c067e 100644
--- a/scrapegraph-py/tests/test_async_client.py
+++ b/scrapegraph-py/tests/test_async_client.py
@@ -19,7 +19,7 @@ def mock_uuid():
@pytest.mark.asyncio
-async def test_smartscraper(mock_api_key):
+async def test_smartscraper_with_url(mock_api_key):
with aioresponses() as mocked:
mocked.post(
"https://api.scrapegraphai.com/v1/smartscraper",
@@ -38,6 +38,54 @@ async def test_smartscraper(mock_api_key):
assert "description" in response["result"]
+@pytest.mark.asyncio
+async def test_smartscraper_with_html(mock_api_key):
+ with aioresponses() as mocked:
+ mocked.post(
+ "https://api.scrapegraphai.com/v1/smartscraper",
+ payload={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"description": "Test content."},
+ },
+ )
+
+ async with AsyncClient(api_key=mock_api_key) as client:
+ response = await client.smartscraper(
+ website_html="Test content
",
+ user_prompt="Extract info",
+ )
+ assert response["status"] == "completed"
+ assert "description" in response["result"]
+
+
+@pytest.mark.asyncio
+async def test_smartscraper_with_headers(mock_api_key):
+ with aioresponses() as mocked:
+ mocked.post(
+ "https://api.scrapegraphai.com/v1/smartscraper",
+ payload={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"description": "Example domain."},
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ async with AsyncClient(api_key=mock_api_key) as client:
+ response = await client.smartscraper(
+ website_url="https://example.com",
+ user_prompt="Describe this page.",
+ headers=headers,
+ )
+ assert response["status"] == "completed"
+ assert "description" in response["result"]
+
+
@pytest.mark.asyncio
async def test_get_credits(mock_api_key):
with aioresponses() as mocked:
@@ -121,6 +169,31 @@ async def test_markdownify(mock_api_key):
assert "# Example Page" in response["result"]
+@pytest.mark.asyncio
+async def test_markdownify_with_headers(mock_api_key):
+ with aioresponses() as mocked:
+ mocked.post(
+ "https://api.scrapegraphai.com/v1/markdownify",
+ payload={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": "# Example Page\n\nThis is markdown content.",
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ async with AsyncClient(api_key=mock_api_key) as client:
+ response = await client.markdownify(
+ website_url="https://example.com", headers=headers
+ )
+ assert response["status"] == "completed"
+ assert "# Example Page" in response["result"]
+
+
@pytest.mark.asyncio
async def test_get_markdownify(mock_api_key, mock_uuid):
with aioresponses() as mocked:
@@ -140,39 +213,74 @@ async def test_get_markdownify(mock_api_key, mock_uuid):
@pytest.mark.asyncio
-async def test_localscraper(mock_api_key):
+async def test_searchscraper(mock_api_key):
with aioresponses() as mocked:
mocked.post(
- "https://api.scrapegraphai.com/v1/localscraper",
+ "https://api.scrapegraphai.com/v1/searchscraper",
payload={
"request_id": str(uuid4()),
"status": "completed",
- "result": {"extracted_info": "Test content"},
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
},
)
async with AsyncClient(api_key=mock_api_key) as client:
- response = await client.localscraper(
- user_prompt="Extract info",
- website_html="Test content
",
+ response = await client.searchscraper(
+ user_prompt="What is the latest version of Python?"
+ )
+ assert response["status"] == "completed"
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
+
+
+@pytest.mark.asyncio
+async def test_searchscraper_with_headers(mock_api_key):
+ with aioresponses() as mocked:
+ mocked.post(
+ "https://api.scrapegraphai.com/v1/searchscraper",
+ payload={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ async with AsyncClient(api_key=mock_api_key) as client:
+ response = await client.searchscraper(
+ user_prompt="What is the latest version of Python?",
+ headers=headers,
)
assert response["status"] == "completed"
- assert "extracted_info" in response["result"]
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
@pytest.mark.asyncio
-async def test_get_localscraper(mock_api_key, mock_uuid):
+async def test_get_searchscraper(mock_api_key, mock_uuid):
with aioresponses() as mocked:
mocked.get(
- f"https://api.scrapegraphai.com/v1/localscraper/{mock_uuid}",
+ f"https://api.scrapegraphai.com/v1/searchscraper/{mock_uuid}",
payload={
"request_id": mock_uuid,
"status": "completed",
- "result": {"extracted_info": "Test content"},
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
},
)
async with AsyncClient(api_key=mock_api_key) as client:
- response = await client.get_localscraper(mock_uuid)
+ response = await client.get_searchscraper(mock_uuid)
assert response["status"] == "completed"
assert response["request_id"] == mock_uuid
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
diff --git a/scrapegraph-py/tests/test_client.py b/scrapegraph-py/tests/test_client.py
index 23c1162..11ef12f 100644
--- a/scrapegraph-py/tests/test_client.py
+++ b/scrapegraph-py/tests/test_client.py
@@ -18,7 +18,7 @@ def mock_uuid():
@responses.activate
-def test_smartscraper(mock_api_key):
+def test_smartscraper_with_url(mock_api_key):
# Mock the API response
responses.add(
responses.POST,
@@ -37,6 +37,54 @@ def test_smartscraper(mock_api_key):
assert response["status"] == "completed"
+@responses.activate
+def test_smartscraper_with_html(mock_api_key):
+ # Mock the API response
+ responses.add(
+ responses.POST,
+ "https://api.scrapegraphai.com/v1/smartscraper",
+ json={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"description": "Test content."},
+ },
+ )
+
+ with Client(api_key=mock_api_key) as client:
+ response = client.smartscraper(
+ website_html="<html><body><p>Test content</p></body></html>",
+ user_prompt="Extract info",
+ )
+ assert response["status"] == "completed"
+
+
+@responses.activate
+def test_smartscraper_with_headers(mock_api_key):
+ # Mock the API response
+ responses.add(
+ responses.POST,
+ "https://api.scrapegraphai.com/v1/smartscraper",
+ json={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"description": "Example domain."},
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ with Client(api_key=mock_api_key) as client:
+ response = client.smartscraper(
+ website_url="https://example.com",
+ user_prompt="Describe this page.",
+ headers=headers,
+ )
+ assert response["status"] == "completed"
+
+
@responses.activate
def test_get_smartscraper(mock_api_key, mock_uuid):
responses.add(
@@ -117,6 +165,31 @@ def test_markdownify(mock_api_key):
assert "# Example Page" in response["result"]
+@responses.activate
+def test_markdownify_with_headers(mock_api_key):
+ responses.add(
+ responses.POST,
+ "https://api.scrapegraphai.com/v1/markdownify",
+ json={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": "# Example Page\n\nThis is markdown content.",
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ with Client(api_key=mock_api_key) as client:
+ response = client.markdownify(
+ website_url="https://example.com", headers=headers
+ )
+ assert response["status"] == "completed"
+ assert "# Example Page" in response["result"]
+
+
@responses.activate
def test_get_markdownify(mock_api_key, mock_uuid):
responses.add(
@@ -136,39 +209,76 @@ def test_get_markdownify(mock_api_key, mock_uuid):
@responses.activate
-def test_localscraper(mock_api_key):
+def test_searchscraper(mock_api_key):
+ # Mock the API response
responses.add(
responses.POST,
- "https://api.scrapegraphai.com/v1/localscraper",
+ "https://api.scrapegraphai.com/v1/searchscraper",
json={
"request_id": str(uuid4()),
"status": "completed",
- "result": {"extracted_info": "Test content"},
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
},
)
with Client(api_key=mock_api_key) as client:
- response = client.localscraper(
- user_prompt="Extract info",
- website_html="<html><body><p>Test content</p></body></html>",
+ response = client.searchscraper(
+ user_prompt="What is the latest version of Python?"
+ )
+ assert response["status"] == "completed"
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
+
+
+@responses.activate
+def test_searchscraper_with_headers(mock_api_key):
+ # Mock the API response
+ responses.add(
+ responses.POST,
+ "https://api.scrapegraphai.com/v1/searchscraper",
+ json={
+ "request_id": str(uuid4()),
+ "status": "completed",
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
+ },
+ )
+
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+
+ with Client(api_key=mock_api_key) as client:
+ response = client.searchscraper(
+ user_prompt="What is the latest version of Python?",
+ headers=headers,
)
assert response["status"] == "completed"
- assert "extracted_info" in response["result"]
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
@responses.activate
-def test_get_localscraper(mock_api_key, mock_uuid):
+def test_get_searchscraper(mock_api_key, mock_uuid):
responses.add(
responses.GET,
- f"https://api.scrapegraphai.com/v1/localscraper/{mock_uuid}",
+ f"https://api.scrapegraphai.com/v1/searchscraper/{mock_uuid}",
json={
"request_id": mock_uuid,
"status": "completed",
- "result": {"extracted_info": "Test content"},
+ "result": {"answer": "Python 3.12 is the latest version."},
+ "reference_urls": ["https://www.python.org/downloads/"],
},
)
with Client(api_key=mock_api_key) as client:
- response = client.get_localscraper(mock_uuid)
+ response = client.get_searchscraper(mock_uuid)
assert response["status"] == "completed"
assert response["request_id"] == mock_uuid
+ assert "answer" in response["result"]
+ assert "reference_urls" in response
+ assert isinstance(response["reference_urls"], list)
diff --git a/scrapegraph-py/tests/test_models.py b/scrapegraph-py/tests/test_models.py
index 70841aa..50c788f 100644
--- a/scrapegraph-py/tests/test_models.py
+++ b/scrapegraph-py/tests/test_models.py
@@ -2,11 +2,11 @@
from pydantic import BaseModel, ValidationError
from scrapegraph_py.models.feedback import FeedbackRequest
-from scrapegraph_py.models.localscraper import (
- GetLocalScraperRequest,
- LocalScraperRequest,
-)
from scrapegraph_py.models.markdownify import GetMarkdownifyRequest, MarkdownifyRequest
+from scrapegraph_py.models.searchscraper import (
+ GetSearchScraperRequest,
+ SearchScraperRequest,
+)
from scrapegraph_py.models.smartscraper import (
GetSmartScraperRequest,
SmartScraperRequest,
@@ -14,17 +14,40 @@
def test_smartscraper_request_validation():
-
class ExampleSchema(BaseModel):
name: str
age: int
- # Valid input
+ # Valid input with website_url
request = SmartScraperRequest(
website_url="https://example.com", user_prompt="Describe this page."
)
assert request.website_url == "https://example.com"
assert request.user_prompt == "Describe this page."
+ assert request.website_html is None
+ assert request.headers is None
+
+ # Valid input with website_html
+ request = SmartScraperRequest(
+ website_html="<html><body><p>Test content</p></body></html>",
+ user_prompt="Extract info",
+ )
+ assert request.website_url is None
+ assert request.website_html == "<html><body><p>Test content</p></body></html>"
+ assert request.user_prompt == "Extract info"
+ assert request.headers is None
+
+ # Valid input with headers
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+ request = SmartScraperRequest(
+ website_url="https://example.com",
+ user_prompt="Describe this page.",
+ headers=headers,
+ )
+ assert request.headers == headers
# Test with output_schema
request = SmartScraperRequest(
@@ -50,6 +73,25 @@ class ExampleSchema(BaseModel):
with pytest.raises(ValidationError):
SmartScraperRequest(website_url="https://example.com", user_prompt="")
+ # Invalid HTML
+ with pytest.raises(ValidationError):
+ SmartScraperRequest(
+ website_html="not valid html",
+ user_prompt="Extract info",
+ )
+
+ # HTML too large (>2MB)
+ large_html = "x" * (2 * 1024 * 1024 + 1)
+ with pytest.raises(ValidationError):
+ SmartScraperRequest(
+ website_html=large_html,
+ user_prompt="Extract info",
+ )
+
+ # Neither URL nor HTML provided
+ with pytest.raises(ValidationError):
+ SmartScraperRequest(user_prompt="Extract info")
+
def test_get_smartscraper_request_validation():
# Valid UUID
@@ -88,9 +130,19 @@ def test_feedback_request_validation():
def test_markdownify_request_validation():
- # Valid input
+ # Valid input without headers
request = MarkdownifyRequest(website_url="https://example.com")
assert request.website_url == "https://example.com"
+ assert request.headers is None
+
+ # Valid input with headers
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+ request = MarkdownifyRequest(website_url="https://example.com", headers=headers)
+ assert request.website_url == "https://example.com"
+ assert request.headers == headers
# Invalid URL
with pytest.raises(ValidationError):
@@ -111,36 +163,55 @@ def test_get_markdownify_request_validation():
GetMarkdownifyRequest(request_id="invalid-uuid")
-def test_localscraper_request_validation():
- # Valid input
- request = LocalScraperRequest(
- user_prompt="Extract info",
- website_html="<html><body><p>Test content</p></body></html>",
+def test_searchscraper_request_validation():
+ class ExampleSchema(BaseModel):
+ name: str
+ age: int
+
+ # Valid input without headers
+ request = SearchScraperRequest(user_prompt="What is the latest version of Python?")
+ assert request.user_prompt == "What is the latest version of Python?"
+ assert request.headers is None
+ assert request.output_schema is None
+
+ # Valid input with headers
+ headers = {
+ "User-Agent": "Mozilla/5.0",
+ "Cookie": "session=123",
+ }
+ request = SearchScraperRequest(
+ user_prompt="What is the latest version of Python?",
+ headers=headers,
)
- assert request.user_prompt == "Extract info"
- assert "Test content
" in request.website_html
+ assert request.headers == headers
- # Empty prompt
- with pytest.raises(ValidationError):
- LocalScraperRequest(
- user_prompt="", website_html="Test content
"
- )
+ # Test with output_schema
+ request = SearchScraperRequest(
+ user_prompt="What is the latest version of Python?",
+ output_schema=ExampleSchema,
+ )
- # Invalid HTML
+ # When we dump the model, the output_schema should be converted to a dict
+ dumped = request.model_dump()
+ assert isinstance(dumped["output_schema"], dict)
+ assert "properties" in dumped["output_schema"]
+ assert "name" in dumped["output_schema"]["properties"]
+ assert "age" in dumped["output_schema"]["properties"]
+
+ # Empty prompt
with pytest.raises(ValidationError):
- LocalScraperRequest(user_prompt="Extract info", website_html="not valid html")
+ SearchScraperRequest(user_prompt="")
- # HTML too large (>2MB)
- large_html = "x" * (2 * 1024 * 1024 + 1)
+ # Invalid prompt (no alphanumeric characters)
with pytest.raises(ValidationError):
- LocalScraperRequest(user_prompt="Extract info", website_html=large_html)
+ SearchScraperRequest(user_prompt="!@#$%^")
-def test_get_localscraper_request_validation():
+def test_get_searchscraper_request_validation():
# Valid UUID
- request = GetLocalScraperRequest(request_id="123e4567-e89b-12d3-a456-426614174000")
+ request = GetSearchScraperRequest(request_id="123e4567-e89b-12d3-a456-426614174000")
assert request.request_id == "123e4567-e89b-12d3-a456-426614174000"
# Invalid UUID
with pytest.raises(ValidationError):
- GetLocalScraperRequest(request_id="invalid-uuid")
+ GetSearchScraperRequest(request_id="invalid-uuid")