
Better page scraping and results display#29

Merged
T-rav merged 8 commits into main from fix/html2text-mock
Oct 3, 2025
Conversation

Collaborator

@T-rav T-rav commented Oct 3, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced search previews with Markdown rendering in the web client.
    • Improved web content extraction for richer, cleaner results.
  • Changes
    • Demo UI and health endpoints updated from /client to /demo.
  • Refactor
    • Modularized service integrations and switched to environment-based API key configuration.
  • Chores
    • Standardized runtime to fixed port 8000 and simplified healthcheck configuration.
    • Updated Docker base image and compose settings for consistency.
    • Added new dependency to support improved extraction.

T-rav and others added 7 commits October 3, 2025 15:32
Fixed import errors after module refactoring:
- clients.serper_client.fetch_search_results
- clients.duckduckgo_client.fetch_duckduckgo_search_results
- services.search_processor.process_search_results
- models.search_result.SearchResult
- services.content_scraper.scrape_search_result
- Use os.environ for SERPER_API_KEY instead of mcp_server import

Also fixed Docker configuration:
- Use Python 3.11 in Dockerfile
- Copy all docker subdirectories
- Simplify docker-compose.yml port mapping

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
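For illustration, a minimal sketch of the environment-based key lookup this commit describes (the surrounding code in docker/api_tools.py is not shown, and the exact form of the old import is assumed):

import os

# Replaces the previous import of the key from mcp_server (exact form assumed).
# os.environ.get returns None when the variable is unset, so callers can
# detect a missing key and fall back to DuckDuckGo search.
SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
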
Renamed the demo UI endpoint from /client to /demo for clarity.

Changes:
- health.py: @app.get("/demo") instead of /client
- Updated all endpoint references in health.py
- Updated log messages in simple_demo.py
- Updated log messages in demo_server.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Improved content extraction to provide cleaner results:

Changes to html2text configuration:
- ignore_links = True to skip navigation clutter
- ignore_images = True for text-focused output
- single_line_break = True for more compact output
- skip_internal_links = True to reduce noise

Added _clean_soup() function to remove:
- Navigation elements (nav, header, footer, aside)
- UI elements (menus, sidebars, ads, popups)
- Social/share buttons and comments
- Scripts, styles, iframes

Added whitespace cleanup:
- Strip empty lines
- Normalize spacing between paragraphs

This should significantly improve readability and reduce errors from
complex navigation structures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
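A reduced sketch of the configuration this commit describes, using the real html2text options named above; _clean_soup is abbreviated (the class/ID-based removal of menus, sidebars, ads, and popups is omitted):

import html2text
from bs4 import BeautifulSoup

def _clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    """Remove navigation and UI chrome before conversion (reduced sketch)."""
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style", "iframe"]):
        tag.decompose()
    return soup

converter = html2text.HTML2Text()
converter.ignore_links = True          # skip navigation clutter
converter.ignore_images = True         # text-focused output
converter.single_line_break = True     # more compact output
converter.skip_internal_links = True   # reduce noise
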
Completely revamped scraping to extract clean text only:

New _extract_main_text() function:
- Extracts only p, h1-h6, li, blockquote, pre tags
- Filters out short text (< 20 chars) to skip noise
- Preserves basic markdown: headings, quotes, code blocks
- Returns plain text with minimal formatting

Changes to _convert_with_readability():
- Use Readability to extract main content first
- Then clean the soup to remove navigation
- Finally extract just the meaningful text
- Skip html2text entirely for main content

Result: Clean, readable article text without navigation,
menus, popups, or UI chrome. Focus on actual content.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
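A condensed sketch of what _extract_main_text() could look like given the description above; the exact filtering and markdown rules in the real function may differ (note a strict 20-char filter would also drop short headings, which the real code presumably handles):

from bs4 import BeautifulSoup

CONTENT_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote", "pre"]

def _extract_main_text(soup: BeautifulSoup, min_chars: int = 20) -> str:
    """Keep only meaningful text blocks, with minimal markdown (sketch)."""
    lines = []
    for tag in soup.find_all(CONTENT_TAGS):
        text = tag.get_text(" ", strip=True)
        if len(text) < min_chars:
            continue  # skip short noise such as button labels
        if tag.name.startswith("h"):
            lines.append("#" * int(tag.name[1]) + " " + text)  # heading
        elif tag.name == "blockquote":
            lines.append("> " + text)  # quote
        elif tag.name == "pre":
            lines.append("    " + text)  # code block, kept as indented text
        else:
            lines.append(text)
    return "\n\n".join(lines)
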
Switched to Trafilatura - the industry standard for web article extraction.

Why Trafilatura:
- Specifically designed for extracting clean article text
- Better than Readability at filtering navigation/UI
- Handles modern websites with complex layouts
- Outputs clean markdown natively
- No need for html2text or manual cleaning

Changes:
- Add trafilatura>=1.6.0 dependency
- Simplified content_scraper.py from 360 to 138 lines
- Removed all Readability, html2text, BeautifulSoup code
- Removed manual cleaning functions (no longer needed)
- Trafilatura.extract() with markdown output
- Fallback to snippet if extraction fails (< 100 chars)

Result: Clean, readable article content without navigation chrome.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
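A sketch of the simplified pipeline using Trafilatura's documented fetch_url/extract API; the function name and shape are illustrative, and output_format="markdown" assumes a Trafilatura release recent enough to support markdown output:

import trafilatura

def scrape_page(url: str, snippet: str, max_length: int = 5000) -> str:
    """Extract clean article text, falling back to the snippet (sketch)."""
    downloaded = trafilatura.fetch_url(url)
    extracted = trafilatura.extract(downloaded, output_format="markdown") if downloaded else None
    if not extracted or len(extracted) < 100:
        # Mirror the commit's fallback: use the search snippet when
        # extraction fails or returns fewer than 100 characters.
        return snippet + "\n\n[Note: full-page extraction unavailable]"
    return extracted[:max_length]
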
Added Marked.js library to render markdown content preview properly.

Changes:
- Include Marked.js from CDN
- Use marked.parse() to render markdown content
- Show first 2000 chars instead of 500
- Added comprehensive markdown CSS styling:
  - Proper heading sizes and spacing
  - Styled code blocks (dark theme)
  - Inline code styling
  - List indentation
  - Blockquote with colored border
  - Scrollable content area

Result: Content previews now render as formatted markdown with
proper headings, paragraphs, code blocks, and styling instead of
raw markdown text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Trafilatura often includes the article title as the first heading
in its markdown output, causing duplicate titles in the preview.

Added logic to detect and remove the first line if it matches
the article title we already display in metadata.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
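A sketch of the dedup check described here; the helper name and the exact-match comparison are illustrative (the real logic may normalize more aggressively):

def _strip_duplicate_title(markdown: str, title: str) -> str:
    """Drop a leading heading that repeats the article title (sketch)."""
    lines = markdown.splitlines()
    if lines and lines[0].lstrip("#").strip().lower() == title.strip().lower():
        return "\n".join(lines[1:]).lstrip("\n")
    return markdown
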

coderabbitai bot commented Oct 3, 2025

Warning

Rate limit exceeded

@T-rav has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 77ebfa6 and 5a1adcb.

📒 Files selected for processing (1)
  • docker/tests/unit/services/test_content_scraper.py (1 hunks)

Walkthrough

Refactors search/scrape tooling to modular clients/services, overhauls content extraction to Trafilatura with fallback, updates demo endpoints from /client to /demo, fixes ports to 8000 across Docker assets, broadens Docker copy scope, adds Markdown rendering in the web client, and adds trafilatura dependency.

Changes

  • Docker packaging and runtime (docker/Dockerfile, docker/docker-compose.yml): Pin Python base to 3.11-slim, copy the entire docker/ dir, expose fixed port 8000; compose uses a static 8000 mapping, removes HEALTH_PORT, sets a fixed healthcheck URL and start_period.
  • Demo endpoint path updates (docker/health.py, docker/demo_server.py, docker/simple_demo.py): Replace public route and log references from /client to /demo; adjust health endpoints and descriptive strings accordingly.
  • Search/API tools modularization (docker/api_tools.py): Switch imports from the monolithic mcp_server to clients/services modules; load SERPER_API_KEY from the environment; maintain function signatures.
  • Scraping pipeline overhaul (docker/services/content_scraper.py): Replace the html2text/BS4/Readability pipeline with Trafilatura extraction to Markdown; simplify helpers; add a fallback to the snippet if extraction fails; retain binary/plain handling and truncation.
  • Web client rendering (examples/webcat_client.html): Add Marked.js and styles; render content as Markdown with a larger excerpt and truncation indicator; keep existing heartbeat/error paths.
  • Dependency updates (pyproject.toml): Add dependency trafilatura>=1.6.0.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Client as Webcat Client (HTML)
  participant API as API Tools
  participant Scraper as services.content_scraper (Trafilatura)
  participant Source as Target Website

  User->>Client: Initiate search/scrape
  Client->>API: fetch_*_search_results(query)
  API->>Source: Perform search HTTP requests
  Source-->>API: Search results
  API->>Scraper: scrape_search_result(result.url, snippet)
  Scraper->>Source: GET URL
  Source-->>Scraper: HTML/content
  alt Trafilatura extracts content
    Scraper->>Scraper: extract (markdown), trim/concat, truncate
    Scraper-->>API: {title, url, markdown content}
  else Extraction fails
    Scraper-->>API: Fallback using snippet (+note)
  end
  API-->>Client: Processed results (Markdown)
  Client->>Client: marked.parse(markdown)
  Client-->>User: Rendered Markdown preview

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • added in docker changes #8 — Touches Docker packaging and scraper modules similar to this PR, indicating related restructuring of container setup and content extraction logic.

Poem

A rabbit hopped through ports made new,
From /client lanes to /demo’s view.
It nibbled leaves of Markdown bright,
Trafilatura brewed the bite.
With 8000 set and steady breeze,
The scraper sings through webs with ease.
Thump-thump—ship it, please! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title succinctly describes the main improvement of enhancing both page scraping and the way results are displayed, aligning well with the refactoring of the content scraper to use Trafilatura and the frontend updates for markdown rendering.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/webcat_client.html (1)

251-633: Sanitize rendered Markdown before injecting into the DOM

marked.parse(...) is fed straight into the template, so any HTML returned from remote pages (the very content we scrape) is executed in our client. That’s a direct XSS vector. Please sanitize the parsed HTML (e.g., with DOMPurify) before inserting it.

@@
-    <!-- Marked.js for markdown rendering -->
-    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
+    <!-- Marked.js for markdown rendering -->
+    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
+    <!-- DOMPurify to sanitize rendered markdown -->
+    <script src="https://cdn.jsdelivr.net/npm/dompurify@2.4.7/dist/purify.min.js"></script>
@@
-                            <div class="markdown-content">${marked.parse(result.content.substring(0, 2000))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div>
+                            <div class="markdown-content">${DOMPurify.sanitize(marked.parse(result.content.substring(0, 2000)))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div>
docker/services/content_scraper.py (1)

11-138: Restore compatibility for callers/tests expecting html2text

CI is failing (AttributeError: module 'services.content_scraper' has no attribute 'html2text') because existing tests still patch services.content_scraper.html2text. Please update the test fixtures (and any remaining call sites) to stub trafilatura.extract instead, or provide a compatibility shim so the attribute still exists. Without that adjustment, the suite cannot pass. (Pipeline failure reference: GitHub Actions – CI Pipeline / Docker MCP Tests)
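If the shim route is taken, a one-line re-import would keep the patched attribute resolvable while the tests migrate (this assumes html2text remains an installed dependency during the transition):

# docker/services/content_scraper.py (transitional shim, sketch)
import html2text  # noqa: F401 -- kept only so legacy tests that patch
# services.content_scraper.html2text can still resolve the attribute
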

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ace28ad and 77ebfa6.

📒 Files selected for processing (9)
  • docker/Dockerfile (2 hunks)
  • docker/api_tools.py (5 hunks)
  • docker/demo_server.py (1 hunks)
  • docker/docker-compose.yml (1 hunks)
  • docker/health.py (3 hunks)
  • docker/services/content_scraper.py (2 hunks)
  • docker/simple_demo.py (2 hunks)
  • examples/webcat_client.html (2 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Format all Python code with Black (line-length=88) and isort (profile="black")
Prefer absolute imports (e.g., from docker.mcp_server import ...)
Require type hints for all function signatures (enforced by mypy)
Use Google-style docstrings for public functions
Limit function nesting depth to 3 levels; refactor deeper logic into helpers
One class per file (applies to production code and test infrastructure)
Extract helper methods per logical concept; favor multiple small methods over one large method
Code organization: single concept per file, avoid deep directory nesting, co-locate related concerns, prefer explicit code with descriptive names, keep modules small and focused

Files:

  • docker/demo_server.py
  • docker/health.py
  • docker/api_tools.py
  • docker/simple_demo.py
  • docker/services/content_scraper.py
pyproject.toml

📄 CodeRabbit inference engine (CLAUDE.md)

pyproject.toml: Define Black, isort, mypy, pytest, and coverage config in pyproject.toml
Project requires Python 3.11 exactly; enforce via pyproject.toml

Files:

  • pyproject.toml
🧬 Code graph analysis (1)
docker/api_tools.py (4)
docker/tests/test_mcp_server.py (1)
  • SearchResult (20-25)
docker/services/content_scraper.py (1)
  • scrape_search_result (84-145)
docker/clients/serper_client.py (1)
  • fetch_search_results (38-64)
docker/services/search_processor.py (1)
  • process_search_results (15-39)
🪛 GitHub Actions: CI Pipeline
docker/services/content_scraper.py

[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text' encountered during test setup (likely due to a missing import or refactor affecting tests). Command failed: make test-coverage.

🪛 GitHub Actions: Docker MCP Tests
docker/services/content_scraper.py

[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text'. This caused the test 'test_truncates_content_exceeding_max_length' to fail during pytest.

🔇 Additional comments (1)
docker/api_tools.py (1)

365-368: LGTM! Environment-based configuration is appropriate.

Moving from centralized config imports to environment variable retrieval is a good practice for modularization and configuration management. The use of os.environ.get("SERPER_API_KEY") properly handles the case where the key might not be set (returns None), which aligns with the fallback logic to DuckDuckGo.
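As a sketch of that fallback path (the async signatures and module paths follow the file summaries above; the router function itself is hypothetical):

import os

from clients.duckduckgo_client import fetch_duckduckgo_search_results
from clients.serper_client import fetch_search_results

async def run_search(query: str):
    """Prefer Serper when a key is configured, else DuckDuckGo (sketch)."""
    if os.environ.get("SERPER_API_KEY"):
        return await fetch_search_results(query)
    return await fetch_duckduckgo_search_results(query)
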

Comment on lines +181 to +182
from models.search_result import SearchResult
from services.content_scraper import scrape_search_result

🛠️ Refactor suggestion | 🟠 Major

Use absolute imports per coding guidelines.

The coding guidelines specify: "Prefer absolute imports (e.g., from docker.mcp_server import ...)". These relative imports should be changed to absolute imports for consistency.

As per coding guidelines.

Apply this diff:

-            from models.search_result import SearchResult
-            from services.content_scraper import scrape_search_result
+            from docker.models.search_result import SearchResult
+            from docker.services.content_scraper import scrape_search_result
🤖 Prompt for AI Agents
In docker/api_tools.py around lines 181 to 182, the two relative imports ("from
models.search_result import SearchResult" and "from services.content_scraper
import scrape_search_result") violate the guideline to use absolute imports;
replace them with their absolute module paths (for example "from
docker.models.search_result import SearchResult" and "from
docker.services.content_scraper import scrape_search_result" or the
project-appropriate top-level package names) so imports are absolute and update
any import paths to match the repository package layout.

import asyncio

-from mcp_server import fetch_search_results
+from clients.serper_client import fetch_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from clients.serper_client import fetch_search_results
+    from docker.clients.serper_client import fetch_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py around line 267 the relative import "from
clients.serper_client import fetch_search_results" violates the project's coding
guidelines requiring absolute imports; update the statement to use the
package-absolute path (e.g., "from my_project.clients.serper_client import
fetch_search_results" or the correct top-level package name used across the
repo), ensure the import path matches the project's package layout and update
any related tests or import resolvers if necessary.

import asyncio

-from mcp_server import fetch_duckduckgo_search_results
+from clients.duckduckgo_client import fetch_duckduckgo_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from clients.duckduckgo_client import fetch_duckduckgo_search_results
+    from docker.clients.duckduckgo_client import fetch_duckduckgo_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py around line 288, the local-style import "from
clients.duckduckgo_client import fetch_duckduckgo_search_results" must be
changed to an absolute import from the project package root; replace it with the
package-root-qualified import (e.g. "from
<project_package>.clients.duckduckgo_client import
fetch_duckduckgo_search_results"), update any import paths/packaging if needed
so the module resolves, and run tests to ensure the absolute import works in the
runtime environment.

import asyncio

-from mcp_server import process_search_results
+from services.search_processor import process_search_results

🛠️ Refactor suggestion | 🟠 Major

Use absolute import per coding guidelines.

The import should use an absolute path consistent with the project's coding guidelines.

As per coding guidelines.

Apply this diff:

-    from services.search_processor import process_search_results
+    from docker.services.search_processor import process_search_results
🤖 Prompt for AI Agents
In docker/api_tools.py at line 349, replace the current module import with the
project's absolute package path: change the import "from
services.search_processor import process_search_results" to import from the
repository's top-level package (e.g. "from
<root_package>.services.search_processor import process_search_results"),
ensuring you use the exact project root package name used across the codebase;
update any other imports in this file to match and run the test/import checks to
confirm no unresolved imports remain.

Fixed test_truncates_content_exceeding_max_length to patch
trafilatura.extract instead of html2text.HTML2Text since we
switched to Trafilatura for content extraction.

All 44 unit tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
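A sketch of the updated patch target (the patch path follows the module name from the CI error above and assumes content_scraper imports trafilatura at module level; the test body is elided):

from unittest.mock import patch

def test_truncates_content_exceeding_max_length() -> None:
    # Stub the new extraction entry point rather than the removed html2text.
    with patch("services.content_scraper.trafilatura") as mock_traf:
        mock_traf.extract.return_value = "x" * 10_000
        ...  # call scrape_search_result and assert the result is truncated
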
@T-rav T-rav merged commit f1edf93 into main Oct 3, 2025
11 checks passed
@T-rav T-rav deleted the fix/html2text-mock branch October 3, 2025 22:15
T-rav added a commit that referenced this pull request Oct 10, 2025
Better page scraping and results display
