Description
crawl4ai version
0.4.248
Expected Behavior
I want to get the text content of every page of the website at a given URL, but I keep running into issues when trying to do so.
Can you provide a simple main.py file that achieves this?
Current Behavior
Currently, I am unable to get the content from the website pages.
Here is my code:
import asyncio
import json
from typing import List, Optional

from crawl4ai import *
from crawl4ai import CrawlerRunConfig

help(CrawlerRunConfig)
help(BrowserConfig)

from crawl4ai.extraction_strategy import ExtractionStrategy

BASE_URL = "https://web.lmarena.ai"  # Change this to your target website


class FullTextExtractionStrategy(ExtractionStrategy):
    """Custom extraction strategy to fetch full text from pages."""

    def __init__(self):
        self.input_format = "html"

    def extract(self, content: str, metadata: Optional[dict] = None, **kwargs) -> List[dict]:
        try:
            return [{"text": content}]
        except Exception as e:
            return [{"error": True, "message": str(e)}]


async def main():
    extraction_strategy = FullTextExtractionStrategy()
    crawl_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        word_count_threshold=1,
        only_text=True,  # Extract only text content
        verbose=True,  # Enable logging
        wait_until="domcontentloaded",  # Ensure full page load
        page_timeout=60000  # Timeout in milliseconds
    )
    browser_cfg = BrowserConfig(
        headless=True,  # Run browser in headless mode
        java_script_enabled=True,  # Ensure JavaScript rendering is enabled
        verbose=True  # Enable logs for debugging
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        try:
            result = await crawler.arun(url=BASE_URL, config=crawl_config)
            if result.success:
                extracted_text = [page['text'] for page in result.extracted_content if 'text' in page]
                print("\nExtracted Text Content:")
                print("\n".join(extracted_text))  # Print extracted text
                # Save to a JSON file
                with open("extracted_text.json", "w", encoding="utf-8") as f:
                    json.dump(extracted_text, f, indent=4, ensure_ascii=False)
                print("\nExtracted text saved to 'extracted_text.json'.")
            else:
                print("\nError:", result.error_message)
        except Exception as e:
            print(f"\nCrawler error: {str(e)}")


if __name__ == "__main__":
    asyncio.run(main())
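One possible cause of the empty output, assuming `result.extracted_content` is a JSON-encoded string rather than a list (an assumption about crawl4ai's `CrawlResult`, not something confirmed in this report): the list comprehension in the script above would then iterate over single characters and collect nothing. The sketch below uses only the standard library and a made-up sample payload to show the difference. The helper name `texts_from_extracted` is hypothetical.

```python
import json
from typing import Any, List, Union


def texts_from_extracted(extracted: Union[str, List[Any], None]) -> List[str]:
    """Collect the "text" values from extracted content.

    Accepts either a JSON-encoded string (assumed here to be what
    result.extracted_content holds) or an already-parsed list.
    """
    if not extracted:
        return []
    # Decode first: iterating a str yields single characters, not dicts.
    items = json.loads(extracted) if isinstance(extracted, str) else extracted
    return [
        item["text"]
        for item in items
        if isinstance(item, dict) and "text" in item
    ]


# Stand-in for result.extracted_content (hypothetical sample data).
sample = json.dumps(
    [{"text": "Hello"}, {"error": True, "message": "boom"}, {"text": "World"}]
)

# The comprehension from the script above, applied to the raw string,
# walks over characters and finds nothing.
naive = [page["text"] for page in sample if "text" in page]
print(naive)                         # []
print(texts_from_extracted(sample))  # ['Hello', 'World']
```

If the assumption holds, replacing the comprehension in `main()` with a call like `texts_from_extracted(result.extracted_content)` would be the smallest change to try first.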
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Windows 11
Python version
3.13.2
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response