Conversation
Fixed import errors after module refactoring:
- clients.serper_client.fetch_search_results
- clients.duckduckgo_client.fetch_duckduckgo_search_results
- services.search_processor.process_search_results
- models.search_result.SearchResult
- services.content_scraper.scrape_search_result
- Use os.environ for SERPER_API_KEY instead of mcp_server import

Also fixed Docker configuration:
- Use Python 3.11 in Dockerfile
- Copy all docker subdirectories
- Simplify docker-compose.yml port mapping

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Renamed the demo UI endpoint from /client to /demo for clarity.
Changes:
- health.py: @app.get("/demo") instead of /client
- Updated all endpoint references in health.py
- Updated log messages in simple_demo.py
- Updated log messages in demo_server.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Improved content extraction to provide cleaner results.

Changes to html2text configuration:
- ignore_links = True to skip navigation clutter
- ignore_images = True for text-focused output
- single_line_break = True for more compact output
- skip_internal_links = True to reduce noise

Added _clean_soup() function to remove:
- Navigation elements (nav, header, footer, aside)
- UI elements (menus, sidebars, ads, popups)
- Social/share buttons and comments
- Scripts, styles, iframes

Added whitespace cleanup:
- Strip empty lines
- Normalize spacing between paragraphs

This should significantly improve readability and reduce errors from complex navigation structures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
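A minimal sketch of roughly what this commit describes, assuming BeautifulSoup handles the cleanup pass; aside from _clean_soup(), the helper name and the class-based selector list are illustrative, not the actual implementation:

```python
import html2text
from bs4 import BeautifulSoup


def _build_converter() -> html2text.HTML2Text:
    """Converter configured with the settings listed above."""
    converter = html2text.HTML2Text()
    converter.ignore_links = True         # skip navigation clutter
    converter.ignore_images = True        # text-focused output
    converter.single_line_break = True    # more compact output
    converter.skip_internal_links = True  # reduce noise
    return converter


def _clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    """Strip navigation/UI chrome before conversion (illustrative selectors)."""
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style", "iframe"]):
        tag.decompose()
    # Class-based UI elements: menus, sidebars, ads, popups, share buttons, comments.
    for marker in ["menu", "sidebar", "ad", "popup", "share", "comment"]:
        for tag in soup.find_all(class_=lambda c, m=marker: bool(c) and m in c.lower()):
            tag.decompose()
    return soup
```

The whitespace cleanup (stripping empty lines, normalizing paragraph spacing) would then run on the html2text output.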
Completely revamped scraping to extract clean text only.

New _extract_main_text() function:
- Extracts only p, h1-h6, li, blockquote, pre tags
- Filters out short text (< 20 chars) to skip noise
- Preserves basic markdown: headings, quotes, code blocks
- Returns plain text with minimal formatting

Changes to _convert_with_readability():
- Use Readability to extract main content first
- Then clean the soup to remove navigation
- Finally extract just the meaningful text
- Skip html2text entirely for main content

Result: Clean, readable article text without navigation, menus, popups, or UI chrome. Focus on actual content.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
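Roughly how _extract_main_text() could look given the description above; the tag list and 20-character threshold come from the commit, while the markdown formatting details are assumptions:

```python
from bs4 import BeautifulSoup


def _extract_main_text(soup: BeautifulSoup) -> str:
    """Keep only meaningful text blocks, preserving minimal markdown structure."""
    lines: list[str] = []
    for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "blockquote", "pre"]):
        text = tag.get_text(" ", strip=True)
        if len(text) < 20:  # skip short fragments such as menu items and buttons
            continue
        if tag.name.startswith("h"):
            lines.append("#" * int(tag.name[1]) + " " + text)  # keep heading levels
        elif tag.name == "blockquote":
            lines.append("> " + text)
        elif tag.name == "pre":
            fence = "`" * 3
            lines.append(f"{fence}\n{text}\n{fence}")  # preserve code blocks
        else:
            lines.append(text)
    return "\n\n".join(lines)
```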
Switched to Trafilatura - the industry standard for web article extraction.

Why Trafilatura:
- Specifically designed for extracting clean article text
- Better than Readability at filtering navigation/UI
- Handles modern websites with complex layouts
- Outputs clean markdown natively
- No need for html2text or manual cleaning

Changes:
- Add trafilatura>=1.6.0 dependency
- Simplified content_scraper.py from 360 to 138 lines
- Removed all Readability, html2text, BeautifulSoup code
- Removed manual cleaning functions (no longer needed)
- trafilatura.extract() with markdown output
- Fallback to snippet if extraction fails (< 100 chars)

Result: Clean, readable article content without navigation chrome.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
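A sketch of the Trafilatura-based path, using a hypothetical helper name (scrape_url) rather than the module's real function. The markdown output format exists in newer Trafilatura releases; with older versions satisfying the 1.6 baseline you would post-process text output instead. The 100-character fallback threshold comes from the commit:

```python
import trafilatura


def scrape_url(url: str, snippet: str) -> str:
    """Extract article content with Trafilatura, falling back to the search snippet."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded:
        extracted = trafilatura.extract(
            downloaded,
            output_format="markdown",  # native markdown output
            include_comments=False,
            include_tables=True,
        )
        if extracted and len(extracted) >= 100:
            return extracted
    # Extraction failed or returned too little: fall back to the search snippet.
    return f"{snippet}\n\n*Note: full page content could not be extracted.*"
```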
Added Marked.js library to render markdown content preview properly.

Changes:
- Include Marked.js from CDN
- Use marked.parse() to render markdown content
- Show first 2000 chars instead of 500
- Added comprehensive markdown CSS styling:
  - Proper heading sizes and spacing
  - Styled code blocks (dark theme)
  - Inline code styling
  - List indentation
  - Blockquote with colored border
  - Scrollable content area

Result: Content previews now render as formatted markdown with proper headings, paragraphs, code blocks, and styling instead of raw markdown text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Trafilatura often includes the article title as the first heading in its markdown output, causing duplicate titles in the preview. Added logic to detect and remove the first line if it matches the article title we already display in metadata.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
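A minimal sketch of that de-duplication step; the helper name and comparison details are assumptions:

```python
def _strip_duplicate_title(content: str, title: str) -> str:
    """Drop the leading heading when it repeats the title already shown in metadata."""
    if not content or not title:
        return content
    first_line, _, rest = content.partition("\n")
    # Trafilatura may emit the title as "# Title"; compare case-insensitively.
    if first_line.lstrip("# ").strip().lower() == title.strip().lower():
        return rest.lstrip("\n")
    return content
```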
Walkthrough

Refactors search/scrape tooling to modular clients/services, overhauls content extraction to Trafilatura with fallback, updates demo endpoints from /client to /demo, fixes ports to 8000 across Docker assets, broadens Docker copy scope, adds Markdown rendering in the web client, and adds the trafilatura dependency.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant Client as Webcat Client (HTML)
participant API as API Tools
participant Scraper as services.content_scraper (Trafilatura)
participant Source as Target Website
User->>Client: Initiate search/scrape
Client->>API: fetch_*_search_results(query)
API->>Source: Perform search HTTP requests
Source-->>API: Search results
API->>Scraper: scrape_search_result(result.url, snippet)
Scraper->>Source: GET URL
Source-->>Scraper: HTML/content
alt Trafilatura extracts content
Scraper->>Scraper: extract (markdown), trim/concat, truncate
Scraper-->>API: {title, url, markdown content}
else Extraction fails
Scraper-->>API: Fallback using snippet (+note)
end
API-->>Client: Processed results (Markdown)
Client->>Client: marked.parse(markdown)
Client-->>User: Rendered Markdown preview
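For orientation, here is a rough Python sketch of the flow the diagram describes, using the function names from this PR; whether each call is sync or async, and the exact result fields, are assumptions rather than the real signatures:

```python
import asyncio

from clients.serper_client import fetch_search_results
from services.content_scraper import scrape_search_result
from services.search_processor import process_search_results


async def search_and_scrape(query: str) -> list[dict]:
    """End-to-end path: search provider -> processed results -> scraped markdown."""
    raw_results = await fetch_search_results(query)
    processed = process_search_results(raw_results)
    scraped = []
    for result in processed:
        # Each scrape returns {title, url, markdown content} or a snippet fallback.
        scraped.append(await scrape_search_result(result.url, result.snippet))
    return scraped


# Example: asyncio.run(search_and_scrape("python web scraping"))
```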
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/webcat_client.html (1)
251-633: Sanitize rendered Markdown before injecting into the DOM
marked.parse(...)is fed straight into the template, so any HTML returned from remote pages (the very content we scrape) is executed in our client. That’s a direct XSS vector. Please sanitize the parsed HTML (e.g., with DOMPurify) before inserting it.@@ - <!-- Marked.js for markdown rendering --> - <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script> + <!-- Marked.js for markdown rendering --> + <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script> + <!-- DOMPurify to sanitize rendered markdown --> + <script src="https://cdn.jsdelivr.net/npm/dompurify@2.4.7/dist/purify.min.js"></script> @@ - <div class="markdown-content">${marked.parse(result.content.substring(0, 2000))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div> + <div class="markdown-content">${DOMPurify.sanitize(marked.parse(result.content.substring(0, 2000)))}${result.content.length > 2000 ? '<p><em>... [content truncated]</em></p>' : ''}</div>docker/services/content_scraper.py (1)
11-138: Restore compatibility for callers/tests expecting html2text

CI is failing (AttributeError: module 'services.content_scraper' has no attribute 'html2text') because existing tests still patch services.content_scraper.html2text. Please update the test fixtures (and any remaining call sites) to stub trafilatura.extract instead, or provide a compatibility shim so the attribute still exists. Without that adjustment, the suite cannot pass. (Pipeline failure reference: GitHub Actions – CI Pipeline / Docker MCP Tests)
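If the shim route were taken, it could be as small as keeping the attribute importable on the module; an illustrative sketch only (the PR ultimately updates the test instead, per the follow-up commit):

```python
# docker/services/content_scraper.py (top of module): compatibility shim sketch.
# Keeps an `html2text` attribute resolvable so legacy test patches don't raise
# AttributeError, even though Trafilatura now performs the extraction.
try:
    import html2text  # noqa: F401
except ImportError:  # dependency removed; expose a placeholder for patching
    html2text = None
```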
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docker/Dockerfile (2 hunks)
- docker/api_tools.py (5 hunks)
- docker/demo_server.py (1 hunk)
- docker/docker-compose.yml (1 hunk)
- docker/health.py (3 hunks)
- docker/services/content_scraper.py (2 hunks)
- docker/simple_demo.py (2 hunks)
- examples/webcat_client.html (2 hunks)
- pyproject.toml (1 hunk)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Format all Python code with Black (line-length=88) and isort (profile="black")
Prefer absolute imports (e.g., from docker.mcp_server import ...)
Require type hints for all function signatures (enforced by mypy)
Use Google-style docstrings for public functions
Limit function nesting depth to 3 levels; refactor deeper logic into helpers
One class per file (applies to production code and test infrastructure)
Extract helper methods per logical concept; favor multiple small methods over one large method
Code organization: single concept per file, avoid deep directory nesting, co-locate related concerns, prefer explicit code with descriptive names, keep modules small and focused
Files:
- docker/demo_server.py
- docker/health.py
- docker/api_tools.py
- docker/simple_demo.py
- docker/services/content_scraper.py
pyproject.toml
📄 CodeRabbit inference engine (CLAUDE.md)
pyproject.toml: Define Black, isort, mypy, pytest, and coverage config in pyproject.toml
Project requires Python 3.11 exactly; enforce via pyproject.toml
Files:
pyproject.toml
🧬 Code graph analysis (1)
docker/api_tools.py (4)
- docker/tests/test_mcp_server.py (1)
  - SearchResult (20-25)
- docker/services/content_scraper.py (1)
  - scrape_search_result (84-145)
- docker/clients/serper_client.py (1)
  - fetch_search_results (38-64)
- docker/services/search_processor.py (1)
  - process_search_results (15-39)
🪛 GitHub Actions: CI Pipeline
docker/services/content_scraper.py
[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text' encountered during test setup (likely due to a missing import or refactor affecting tests). Command failed: make test-coverage.
🪛 GitHub Actions: Docker MCP Tests
docker/services/content_scraper.py
[error] 1-1: AttributeError: module 'services.content_scraper' has no attribute 'html2text'. This caused the test 'test_truncates_content_exceeding_max_length' to fail during pytest.
🔇 Additional comments (1)
docker/api_tools.py (1)
365-368: LGTM! Environment-based configuration is appropriate.

Moving from centralized config imports to environment variable retrieval is a good practice for modularization and configuration management. The use of os.environ.get("SERPER_API_KEY") properly handles the case where the key might not be set (returns None), which aligns with the fallback logic to DuckDuckGo.
```python
from models.search_result import SearchResult
from services.content_scraper import scrape_search_result
```
🛠️ Refactor suggestion | 🟠 Major
Use absolute imports per coding guidelines.
The coding guidelines specify: "Prefer absolute imports (e.g., from docker.mcp_server import ...)". These relative imports should be changed to absolute imports for consistency.
As per coding guidelines.
Apply this diff:
- from models.search_result import SearchResult
- from services.content_scraper import scrape_search_result
+ from docker.models.search_result import SearchResult
+ from docker.services.content_scraper import scrape_search_result

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- from models.search_result import SearchResult
- from services.content_scraper import scrape_search_result
+ from docker.models.search_result import SearchResult
+ from docker.services.content_scraper import scrape_search_result
```
🤖 Prompt for AI Agents
In docker/api_tools.py around lines 181 to 182, the two relative imports ("from
models.search_result import SearchResult" and "from services.content_scraper
import scrape_search_result") violate the guideline to use absolute imports;
replace them with their absolute module paths (for example "from
docker.models.search_result import SearchResult" and "from
docker.services.content_scraper import scrape_search_result" or the
project-appropriate top-level package names) so imports are absolute and update
any import paths to match the repository package layout.
```diff
  import asyncio

- from mcp_server import fetch_search_results
+ from clients.serper_client import fetch_search_results
```
🛠️ Refactor suggestion | 🟠 Major
Use absolute import per coding guidelines.
The import should use an absolute path consistent with the project's coding guidelines.
As per coding guidelines.
Apply this diff:
- from clients.serper_client import fetch_search_results
+ from docker.clients.serper_client import fetch_search_results

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- from clients.serper_client import fetch_search_results
+ from docker.clients.serper_client import fetch_search_results
```
🤖 Prompt for AI Agents
In docker/api_tools.py around line 267 the relative import "from
clients.serper_client import fetch_search_results" violates the project's coding
guidelines requiring absolute imports; update the statement to use the
package-absolute path (e.g., "from my_project.clients.serper_client import
fetch_search_results" or the correct top-level package name used across the
repo), ensure the import path matches the project's package layout and update
any related tests or import resolvers if necessary.
```diff
  import asyncio

- from mcp_server import fetch_duckduckgo_search_results
+ from clients.duckduckgo_client import fetch_duckduckgo_search_results
```
🛠️ Refactor suggestion | 🟠 Major
Use absolute import per coding guidelines.
The import should use an absolute path consistent with the project's coding guidelines.
As per coding guidelines.
Apply this diff:
- from clients.duckduckgo_client import fetch_duckduckgo_search_results
+ from docker.clients.duckduckgo_client import fetch_duckduckgo_search_results

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- from clients.duckduckgo_client import fetch_duckduckgo_search_results
+ from docker.clients.duckduckgo_client import fetch_duckduckgo_search_results
```
🤖 Prompt for AI Agents
In docker/api_tools.py around line 288, the local-style import "from
clients.duckduckgo_client import fetch_duckduckgo_search_results" must be
changed to an absolute import from the project package root; replace it with the
package-root-qualified import (e.g. "from
<project_package>.clients.duckduckgo_client import
fetch_duckduckgo_search_results"), update any import paths/packaging if needed
so the module resolves, and run tests to ensure the absolute import works in the
runtime environment.
```diff
  import asyncio

- from mcp_server import process_search_results
+ from services.search_processor import process_search_results
```
🛠️ Refactor suggestion | 🟠 Major
Use absolute import per coding guidelines.
The import should use an absolute path consistent with the project's coding guidelines.
As per coding guidelines.
Apply this diff:
- from services.search_processor import process_search_results
+ from docker.services.search_processor import process_search_results

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- from services.search_processor import process_search_results
+ from docker.services.search_processor import process_search_results
```
🤖 Prompt for AI Agents
In docker/api_tools.py at line 349, replace the current module import with the
project's absolute package path: change the import "from
services.search_processor import process_search_results" to import from the
repository's top-level package (e.g. "from
<root_package>.services.search_processor import process_search_results"),
ensuring you use the exact project root package name used across the codebase;
update any other imports in this file to match and run the test/import checks to
confirm no unresolved imports remain.
Fixed test_truncates_content_exceeding_max_length to patch trafilatura.extract instead of html2text.HTML2Text since we switched to Trafilatura for content extraction. All 44 unit tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
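A sketch of what the updated test seam might look like, assuming the module is imported as services.content_scraper and that it calls trafilatura.extract directly; the exact patch target, truncation limit, and assertions depend on the real test:

```python
from unittest.mock import patch


def test_truncates_content_exceeding_max_length():
    long_markdown = "word " * 10_000  # well past any truncation limit
    with patch("services.content_scraper.trafilatura.extract", return_value=long_markdown):
        ...  # call scrape_search_result(...) and assert the returned content is truncated
```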
Better page scraping and results display
Summary by CodeRabbit