[Bug]: PruningContentFilter takes no effect #673

Closed
@phorn1

Description

crawl4ai version

0.4.248

Expected Behavior

I followed the example from this link and experimented with various configurations. Specifically, I adjusted the threshold and min_word_threshold parameters and expected the length of fit_markdown to change, but it did not. To test further, I tried extreme values (threshold close to 1 or 0, and min_word_threshold set to 1 and to 100,000), yet the output remained unchanged.

I also tried this on different URLs, and the behavior stayed the same.
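
The check described above can be sketched as a small sweep harness. This is a minimal sketch and not part of crawl4ai: `sweep_lengths`, `parameters_have_effect`, and the `run` callable are hypothetical names introduced for illustration. `run` is assumed to wrap one crawl with a given threshold and min_word_threshold and to return the resulting fit_markdown string.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_lengths(
    run: Callable[[float, int], str],
    thresholds: Iterable[float],
    min_words: Iterable[int],
) -> Dict[Tuple[float, int], int]:
    """Call run() once per (threshold, min_word_threshold) pair.

    Returns a mapping from each parameter pair to the length of the
    text that run() produced for it.
    """
    min_words = list(min_words)  # allow reuse across the outer loop
    return {
        (t, w): len(run(t, w))
        for t in thresholds
        for w in min_words
    }


def parameters_have_effect(lengths: Dict[Tuple[float, int], int]) -> bool:
    """True if at least two parameter pairs produced different lengths."""
    return len(set(lengths.values())) > 1


def constant_run(threshold: float, min_word_threshold: int) -> str:
    """Stand-in for a real crawl (assumption): ignores both parameters,
    mimicking the behavior reported in this issue."""
    return "x" * 3756


if __name__ == "__main__":
    lengths = sweep_lengths(constant_run, [0.01, 0.45, 0.99], [1, 5, 100_000])
    for (t, w), n in sorted(lengths.items()):
        print(f"threshold={t} min_word_threshold={w} -> length {n}")
    print("parameters have effect:", parameters_have_effect(lengths))
```

In a real run, `constant_run` would be replaced by a function that builds a `PruningContentFilter` with the given values, crawls the URL, and returns `result.markdown_v2.fit_markdown`, as in the reproduction script in this report.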

Current Behavior

This is the output regardless of how PruningContentFilter is configured:

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://news.ycombinator.com... | Status: True | Time: 0.01s
[COMPLETE] ● https://news.ycombinator.com... | Status: True | Total: 0.01s
Raw Markdown length: 17825
Fit Markdown length: 3756

Process finished with exit code 0

Is this reproducible?

Yes

Inputs Causing the Bug

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,           
        # "fixed" or "dynamic"
        threshold_type="dynamic",  
        # Ignore nodes with <5 words
        min_word_threshold=5      
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com", 
            config=config
        )

        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
            print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Steps to Reproduce

Code snippets

OS

macOS

Python version

3.12

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response
