
Better page scraping and results display#29

Merged
T-rav merged 8 commits into main from fix/html2text-mock
Oct 3, 2025
Conversation

Collaborator

@T-rav T-rav commented Oct 3, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced search previews with Markdown rendering in the web client.
    • Improved web content extraction for richer, cleaner results.
  • Changes
    • Demo UI and health endpoints updated from /client to /demo.
  • Refactor
    • Modularized service integrations and switched to environment-based API key configuration.
  • Chores
    • Standardized runtime to fixed port 8000 and simplified healthcheck configuration.
    • Updated Docker base image and compose settings for consistency.
    • Added new dependency to support improved extraction.

T-rav and others added 7 commits October 3, 2025 15:32
Fixed import errors after module refactoring:
- clients.serper_client.fetch_search_results
- clients.duckduckgo_client.fetch_duckduckgo_search_results
- services.search_processor.process_search_results
- models.search_result.SearchResult
- services.content_scraper.scrape_search_result
- Use os.environ for SERPER_API_KEY instead of mcp_server import

Also fixed Docker configuration:
- Use Python 3.11 in Dockerfile
- Copy all docker subdirectories
- Simplify docker-compose.yml port mapping

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
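For illustration, a minimal sketch of the environment-based key lookup this commit describes (the surrounding code in docker/api_tools.py is not shown, and the exact form of the old import is assumed):

import os

# Replaces the previous import of the key from mcp_server (exact form assumed).
# os.environ.get returns None when the variable is unset, so callers can
# detect a missing key and fall back to DuckDuckGo search.
SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
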
Renamed the demo UI endpoint from /client to /demo for clarity.

Changes:
- health.py: @app.get("/demo") instead of /client
- Updated all endpoint references in health.py
- Updated log messages in simple_demo.py
- Updated log messages in demo_server.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Improved content extraction to provide cleaner results:

Changes to html2text configuration:
- ignore_links = True to skip navigation clutter
- ignore_images = True for text-focused output
- single_line_break = True for more compact output
- skip_internal_links = True to reduce noise

Added _clean_soup() function to remove:
- Navigation elements (nav, header, footer, aside)
- UI elements (menus, sidebars, ads, popups)
- Social/share buttons and comments
- Scripts, styles, iframes

Added whitespace cleanup:
- Strip empty lines
- Normalize spacing between paragraphs

This should significantly improve readability and reduce errors from
complex navigation structures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
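A reduced sketch of the configuration this commit describes, using the real html2text options named above; _clean_soup is abbreviated (the class/ID-based removal of menus, sidebars, ads, and popups is omitted):

import html2text
from bs4 import BeautifulSoup

def _clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    """Remove navigation and UI chrome before conversion (reduced sketch)."""
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style", "iframe"]):
        tag.decompose()
    return soup

converter = html2text.HTML2Text()
converter.ignore_links = True          # skip navigation clutter
converter.ignore_images = True         # text-focused output
converter.single_line_break = True     # more compact output
converter.skip_internal_links = True   # reduce noise
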
Completely revamped scraping to extract clean text only:

New _extract_main_text() function:
- Extracts only p, h1-h6, li, blockquote, pre tags
- Filters out short text (< 20 chars) to skip noise
- Preserves basic markdown: headings, quotes, code blocks
- Returns plain text with minimal formatting

Changes to _convert_with_readability():
- Use Readability to extract main content first
- Then clean the soup to remove navigation
- Finally extract just the meaningful text
- Skip html2text entirely for main content

Result: Clean, readable article text without navigation,
menus, popups, or UI chrome. Focus on actual content.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
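A condensed sketch of what _extract_main_text() could look like given the description above; the exact filtering and markdown rules in the real function may differ (note a strict 20-char filter would also drop short headings, which the real code presumably handles):

from bs4 import BeautifulSoup

CONTENT_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote", "pre"]

def _extract_main_text(soup: BeautifulSoup, min_chars: int = 20) -> str:
    """Keep only meaningful text blocks, with minimal markdown (sketch)."""
    lines = []
    for tag in soup.find_all(CONTENT_TAGS):
        text = tag.get_text(" ", strip=True)
        if len(text) < min_chars:
            continue  # skip short noise such as button labels
        if tag.name.startswith("h"):
            lines.append("#" * int(tag.name[1]) + " " + text)  # heading
        elif tag.name == "blockquote":
            lines.append("> " + text)  # quote
        elif tag.name == "pre":
            lines.append("    " + text)  # code block, kept as indented text
        else:
            lines.append(text)
    return "\n\n".join(lines)
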
Switched to Trafilatura - the industry standard for web article extraction.

Why Trafilatura:
- Specifically designed for extracting clean article text
- Better than Readability at filtering navigation/UI
- Handles modern websites with complex layouts
- Outputs clean markdown natively
- No need for html2text or manual cleaning

Changes:
- Add trafilatura>=1.6.0 dependency
- Simplified content_scraper.py from 360 to 138 lines
- Removed all Readability, html2text, BeautifulSoup code
- Removed manual cleaning functions (no longer needed)
- Trafilatura.extract() with markdown output
- Fallback to snippet if extraction fails (< 100 chars)

Result: Clean, readable article content without navigation chrome.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
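A sketch of the simplified pipeline using Trafilatura's documented fetch_url/extract API; the function name and shape are illustrative, and output_format="markdown" assumes a Trafilatura release recent enough to support markdown output:

import trafilatura

def scrape_page(url: str, snippet: str, max_length: int = 5000) -> str:
    """Extract clean article text, falling back to the snippet (sketch)."""
    downloaded = trafilatura.fetch_url(url)
    extracted = trafilatura.extract(downloaded, output_format="markdown") if downloaded else None
    if not extracted or len(extracted) < 100:
        # Mirror the commit's fallback: use the search snippet when
        # extraction fails or returns fewer than 100 characters.
        return snippet + "\n\n[Note: full-page extraction unavailable]"
    return extracted[:max_length]
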
Added Marked.js library to render markdown content preview properly.

Changes:
- Include Marked.js from CDN
- Use marked.parse() to render markdown content
- Show first 2000 chars instead of 500
- Added comprehensive markdown CSS styling:
  - Proper heading sizes and spacing
  - Styled code blocks (dark theme)
  - Inline code styling
  - List indentation
  - Blockquote with colored border
  - Scrollable content area

Result: Content previews now render as formatted markdown with
proper headings, paragraphs, code blocks, and styling instead of
raw markdown text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Trafilatura often includes the article title as the first heading
in its markdown output, causing duplicate titles in the preview.

Added logic to detect and remove the first line if it matches
the article title we already display in metadata.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
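A sketch of the dedup check described here; the helper name and the exact-match comparison are illustrative (the real logic may normalize more aggressively):

def _strip_duplicate_title(markdown: str, title: str) -> str:
    """Drop a leading heading that repeats the article title (sketch)."""
    lines = markdown.splitlines()
    if lines and lines[0].lstrip("#").strip().lower() == title.strip().lower():
        return "\n".join(lines[1:]).lstrip("\n")
    return markdown
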

coderabbitai bot commented Oct 3, 2025

Warning

Rate limit exceeded

@T-rav has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 77ebfa6 and 5a1adcb.

📒 Files selected for processing (1)
  • docker/tests/unit/services/test_content_scraper.py (1 hunks)

Walkthrough

Refactors search/scrape tooling to modular clients/services, overhauls content extraction to Trafilatura with fallback, updates demo endpoints from /client to /demo, fixes ports to 8000 across Docker assets, broadens Docker copy scope, adds Markdown rendering in the web client, and adds trafilatura dependency.

Changes

  • Docker packaging and runtime (docker/Dockerfile, docker/docker-compose.yml): Pin Python base to 3.11-slim, copy the entire docker/ dir, expose fixed port 8000; compose uses a static 8000 mapping, removes HEALTH_PORT, sets a fixed healthcheck URL and start_period.
  • Demo endpoint path updates (docker/health.py, docker/demo_server.py, docker/simple_demo.py): Replace public route and log references from /client to /demo; adjust health endpoints and descriptive strings accordingly.
  • Search/API tools modularization (docker/api_tools.py): Switch imports from the monolithic mcp_server to clients/services modules; load SERPER_API_KEY from the environment; maintain function signatures.
  • Scraping pipeline overhaul (docker/services/content_scraper.py): Replace the html2text/BS4/Readability pipeline with Trafilatura extraction to Markdown; simplify helpers; add a fallback to the snippet if extraction fails; retain binary/plain handling and truncation.
  • Web client rendering (examples/webcat_client.html): Add Marked.js and styles; render content as Markdown with a larger excerpt and truncation indicator; keep existing heartbeat/error paths.
  • Dependency updates (pyproject.toml): Add dependency trafilatura>=1.6.0.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Client as Webcat Client (HTML)
  participant API as API Tools
  participant Scraper as services.content_scraper (Trafilatura)
  participant Source as Target Website

  User->>Client: Initiate search/scrape
  Client->>API: fetch_*_search_results(query)
  API->>Source: Perform search HTTP requests
  Source-->>API: Search results
  API->>Scraper: scrape_search_result(result.url, snippet)
  Scraper->>Source: GET URL
  Source-->>Scraper: HTML/content
  alt Trafilatura extracts content
    Scraper->>Scraper: extract (markdown), trim/concat, truncate
    Scraper-->>API: {title, url, markdown content}
  else Extraction fails
    Scraper-->>API: Fallback using snippet (+note)
  end
  API-->>Client: Processed results (Markdown)
  Client->>Client: marked.parse(markdown)
  Client-->>User: Rendered Markdown preview

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • added in docker changes #8 — Touches Docker packaging and scraper modules similar to this PR, indicating related restructuring of container setup and content extraction logic.

Poem

A rabbit hopped through ports made new,
From /client lanes to /demo’s view.
It nibbled leaves of Markdown bright,
Trafilatura brewed the bite.
With 8000 set and steady breeze,
The scraper sings through webs with ease.
Thump-thump—ship it, please! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title succinctly describes the main improvement of enhancing both page scraping and the way results are displayed, aligning well with the refactoring of the content scraper to use Trafilatura and the frontend updates for markdown rendering.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/webcat_client.html (1)

251-633: Sanitize rendered Markdown before injecting into the DOM

marked.parse(...) is fed straight into the template, so any HTML returned from remote pages (the very content we scrape) is executed in our client. That’s a direct XSS vector. Please sanitize the parsed HTML (e.g., with DOMPurify) before inserting it.

@@
-    <!-- Marked.js for markdown rendering -->
-    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
+    <!-- Marked.js for markdown rendering -->
+    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
+    <!-- DOMPurify to sanitize rendered markdown -->
+    <script src="https://cdn.jsdelivr.net/npm/dompurify@2.4.7/dist/purify.min.js"></script>
@@
-                            <div class="markdown-content">${marked.parse(result.content.substring(0, 2000))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div>
+                            <div class="markdown-content">${DOMPurify.sanitize(marked.parse(result.content.substring(0, 2000)))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div>
docker/services/content_scraper.py (1)

11-138: Restore compatibility for callers/tests expecting html2text

CI is failing (AttributeError: module 'services.content_scraper' has no attribute 'html2text') because existing tests still patch services.content_scraper.html2text. Please update the test fixtures (and any remaining call sites) to stub trafilatura.extract instead, or provide a compatibility shim so the attribute still exists. Without that adjustment, the suite cannot pass. (Pipeline failure reference: GitHub Actions – CI Pipeline / Docker MCP Tests)
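If the shim route is taken, a one-line re-import would keep the patched attribute resolvable while the tests migrate (this assumes html2text remains an installed dependency during the transition):

# docker/services/content_scraper.py (transitional shim, sketch)
import html2text  # noqa: F401 -- kept only so legacy tests that patch
# services.content_scraper.html2text can still resolve the attribute
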

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ace28ad and 77ebfa6.

📒 Files selected for processing (9)
  • docker/Dockerfile (2 hunks)
  • docker/api_tools.py (5 hunks)
  • docker/demo_server.py (1 hunks)
  • docker/docker-compose.yml (1 hunks)
  • docker/health.py (3 hunks)
  • docker/services/content_scraper.py (2 hunks)
  • docker/simple_demo.py (2 hunks)
  • examples/webcat_client.html (2 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Format all Python code with Black (line-length=88) and isort (profile="black")
Prefer absolute imports (e.g., from docker.mcp_server import ...)
Require type hints for all function signatures (enforced by mypy)
Use Google-style docstrings for public functions
Limit function nesting depth to 3 levels; refactor deeper logic into helpers
One class per file (applies to production code and test infrastructure)
Extract helper methods per logical concept; favor multiple small methods over one large method
Code organization: single concept per file, avoid deep directory nesting, co-locate related concerns, prefer explicit code with descriptive names, keep modules small and focused

Files:

  • docker/demo_server.py
  • docker/health.py
  • docker/api_tools.py
  • docker/simple_demo.py
  • docker/services/content_scraper.py
pyproject.toml

📄 CodeRabbit inference engine (CLAUDE.md)

pyproject.toml: Define Black, isort, mypy, pytest, and coverage config in pyproject.toml
Project requires Python 3.11 exactly; enforce via pyproject.toml

Files:

  • pyproject.toml
🧬 Code graph analysis (1)
docker/api_tools.py (4)
docker/tests/test_mcp_server.py (1)
  • SearchResult (20-25)
docker/services/content_scraper.py (1)
  • scrape_search_result (84-145)
docker/clients/serper_client.py (1)
  • fetch_search_results (38-64)
docker/services/search_processor.py (1)
  • process_search_results (15-39)
🪛 GitHub Actions: CI Pipeline
docker/services/content_scraper.py

[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text' encountered during test setup (likely due to a missing import or refactor affecting tests). Command failed: make test-coverage.

🪛 GitHub Actions: Docker MCP Tests
docker/services/content_scraper.py

[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text'. This caused the test 'test_truncates_content_exceeding_max_length' to fail during pytest.

🔇 Additional comments (1)
docker/api_tools.py (1)

365-368: LGTM! Environment-based configuration is appropriate.

Moving from centralized config imports to environment variable retrieval is a good practice for modularization and configuration management. The use of os.environ.get("SERPER_API_KEY") properly handles the case where the key might not be set (returns None), which aligns with the fallback logic to DuckDuckGo.
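As a sketch of that fallback path (the async signatures and module paths follow the file summaries above; the router function itself is hypothetical):

import os

from clients.duckduckgo_client import fetch_duckduckgo_search_results
from clients.serper_client import fetch_search_results

async def run_search(query: str):
    """Prefer Serper when a key is configured, else DuckDuckGo (sketch)."""
    if os.environ.get("SERPER_API_KEY"):
        return await fetch_search_results(query)
    return await fetch_duckduckgo_search_results(query)
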

Comment on lines +181 to +182
from models.search_result import SearchResult
from services.content_scraper import scrape_search_result

🛠️ Refactor suggestion | 🟠 Major

Use absolute imports per coding guidelines.

The coding guidelines specify: "Prefer absolute imports (e.g., from docker.mcp_server import ...)". These relative imports should be changed to absolute imports for consistency.

As per coding guidelines.

Apply this diff:

-            from models.search_result import SearchResult
-            from services.content_scraper import scrape_search_result
+            from docker.models.search_result import SearchResult
+            from docker.services.content_scraper import scrape_search_result
🤖 Prompt for AI Agents
In docker/api_tools.py around lines 181 to 182, the two relative imports ("from
models.search_result import SearchResult" and "from services.content_scraper
import scrape_search_result") violate the guideline to use absolute imports;
replace them with their absolute module paths (for example "from
docker.models.search_result import SearchResult" and "from
docker.services.content_scraper import scrape_search_result" or the
project-appropriate top-level package names) so imports are absolute and update
any import paths to match the repository package layout.

import asyncio

-from mcp_server import fetch_search_results
+from clients.serper_client import fetch_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from clients.serper_client import fetch_search_results
+    from docker.clients.serper_client import fetch_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py around line 267 the relative import "from
clients.serper_client import fetch_search_results" violates the project's coding
guidelines requiring absolute imports; update the statement to use the
package-absolute path (e.g., "from my_project.clients.serper_client import
fetch_search_results" or the correct top-level package name used across the
repo), ensure the import path matches the project's package layout and update
any related tests or import resolvers if necessary.

import asyncio

-from mcp_server import fetch_duckduckgo_search_results
+from clients.duckduckgo_client import fetch_duckduckgo_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from clients.duckduckgo_client import fetch_duckduckgo_search_results
+    from docker.clients.duckduckgo_client import fetch_duckduckgo_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py around line 288, the local-style import "from
clients.duckduckgo_client import fetch_duckduckgo_search_results" must be
changed to an absolute import from the project package root; replace it with the
package-root-qualified import (e.g. "from
<project_package>.clients.duckduckgo_client import
fetch_duckduckgo_search_results"), update any import paths/packaging if needed
so the module resolves, and run tests to ensure the absolute import works in the
runtime environment.

import asyncio

-from mcp_server import process_search_results
+from services.search_processor import process_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from services.search_processor import process_search_results
+    from docker.services.search_processor import process_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py at line 349, replace the current module import with the
project's absolute package path: change the import "from
services.search_processor import process_search_results" to import from the
repository's top-level package (e.g. "from
<root_package>.services.search_processor import process_search_results"),
ensuring you use the exact project root package name used across the codebase;
update any other imports in this file to match and run the test/import checks to
confirm no unresolved imports remain.

Fixed test_truncates_content_exceeding_max_length to patch
trafilatura.extract instead of html2text.HTML2Text since we
switched to Trafilatura for content extraction.

All 44 unit tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
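A sketch of the updated patch target (the patch path follows the module name from the CI error above and assumes content_scraper imports trafilatura at module level; the test body is elided):

from unittest.mock import patch

def test_truncates_content_exceeding_max_length() -> None:
    # Stub the new extraction entry point rather than the removed html2text.
    with patch("services.content_scraper.trafilatura") as mock_traf:
        mock_traf.extract.return_value = "x" * 10_000
        ...  # call scrape_search_result and assert the result is truncated
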
@T-rav T-rav merged commit f1edf93 into main Oct 3, 2025
11 checks passed
@T-rav T-rav deleted the fix/html2text-mock branch October 3, 2025 22:15
T-rav added a commit that referenced this pull request Oct 10, 2025
Better page scraping and results display
