
[Bug]: I am unable to get text content for all the pages of the provided Website URL #661

Closed
@jf-harshit

Description

crawl4ai version

0.4.248

Expected Behavior

I want to get the text content of all the pages of the website at the given URL, but I am running into too many issues trying to do so. Can you provide a simple main.py file that achieves this?

Current Behavior

Currently, I am unable to get the content from the website's pages.

Here is my code:

import asyncio
import json
from typing import List, Optional
from crawl4ai import *
from crawl4ai import CrawlerRunConfig

help(CrawlerRunConfig)

help(BrowserConfig)

from crawl4ai.extraction_strategy import ExtractionStrategy

BASE_URL = "https://web.lmarena.ai" # Change this to your target website

class FullTextExtractionStrategy(ExtractionStrategy):
    """Custom extraction strategy to fetch full text from pages."""

    def __init__(self):
        self.input_format = "html"

    def extract(self, content: str, metadata: Optional[dict] = None, **kwargs) -> List[dict]:
        try:
            return [{"text": content}]
        except Exception as e:
            return [{"error": True, "message": str(e)}]

async def main():
    extraction_strategy = FullTextExtractionStrategy()

    crawl_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        word_count_threshold=1,
        only_text=True,  # Extract only text content
        verbose=True,  # Enable logging
        wait_until="domcontentloaded",  # Ensure full page load
        page_timeout=60000  # Timeout in milliseconds
    )

    browser_cfg = BrowserConfig(
        headless=True,  # Run browser in headless mode
        java_script_enabled=True,  # Ensure JavaScript rendering is enabled
        verbose=True  # Enable logs for debugging
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        try:
            result = await crawler.arun(url=BASE_URL, config=crawl_config)

            if result.success:
                extracted_text = [page['text'] for page in result.extracted_content if 'text' in page]

                print("\nExtracted Text Content:")
                print("\n".join(extracted_text))  # Print extracted text

                # Save to a JSON file
                with open("extracted_text.json", "w", encoding="utf-8") as f:
                    json.dump(extracted_text, f, indent=4, ensure_ascii=False)

                print("\nExtracted text saved to 'extracted_text.json'.")

            else:
                print("\nError:", result.error_message)

        except Exception as e:
            print(f"\nCrawler error: {str(e)}")

if __name__ == "__main__":
    asyncio.run(main())
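
For reference, below is a minimal sketch of one way to collect the text of the start page plus the internal pages it links to. It is a starting point, not a confirmed fix, and it rests on a few hedged assumptions about this crawl4ai version: result.extracted_content appears to be returned as a JSON string (so it would need json.loads() before iterating, which may be why the snippet above yields nothing), result.links is assumed to expose same-domain links under an "internal" key whose entries carry an href field, and arun_many is assumed to accept a list of URLs. Since result.markdown already holds the page text, no custom extraction strategy is used here.

import asyncio
import json
from urllib.parse import urlparse

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

BASE_URL = "https://web.lmarena.ai"  # Change this to your target website

async def crawl_site_text():
    browser_cfg = BrowserConfig(headless=True, verbose=True)
    run_cfg = CrawlerRunConfig(
        word_count_threshold=1,
        wait_until="domcontentloaded",
        page_timeout=60000,
        verbose=True,
    )

    base_domain = urlparse(BASE_URL).netloc
    pages = {}

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # Crawl the start page first; result.markdown already contains the
        # page text, so no custom extraction strategy is needed.
        start = await crawler.arun(url=BASE_URL, config=run_cfg)
        if not start.success:
            print("Error:", start.error_message)
            return
        pages[BASE_URL] = str(start.markdown)

        # Collect same-domain links discovered on the start page.
        # Assumption: result.links is a dict with an "internal" list whose
        # items carry an "href" key; relative links are skipped for simplicity.
        internal = (start.links or {}).get("internal", [])
        urls = sorted({
            link["href"] for link in internal
            if link.get("href") and urlparse(link["href"]).netloc == base_domain
        } - {BASE_URL})

        # Fetch the remaining pages in one batch.
        if urls:
            results = await crawler.arun_many(urls=urls, config=run_cfg)
            for res in results:
                if res.success:
                    pages[res.url] = str(res.markdown)

    with open("site_text.json", "w", encoding="utf-8") as f:
        json.dump(pages, f, indent=4, ensure_ascii=False)
    print(f"Saved text for {len(pages)} pages to site_text.json")

if __name__ == "__main__":
    asyncio.run(crawl_site_text())

For a site like web.lmarena.ai that renders content client-side, a stricter wait condition (for example wait_until="networkidle") or an explicit delay may also be needed before any text is available to extract.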

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

Windows 11

Python version

3.13.2

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response


Metadata


Labels

❗ Invalid (This doesn't seem right) · 🐞 Bug (Something isn't working)
