Conversation
- Added Llamaparse Support :)
Caution: Review failed. The pull request is closed.
Walkthrough
The changes introduce support for selecting between two document parsing services, "UNSTRUCTURED" and "LLAMACLOUD", via new environment variables in both backend and frontend configurations. The backend logic now conditionally processes uploaded files using the chosen service, and the frontend dynamically adjusts accepted file types. Documentation and example environment files are updated to reflect these options.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Frontend
participant Backend
participant LlamaCloud
participant Unstructured
User->>Frontend: Uploads file
Frontend->>Backend: Sends file (with ETL_SERVICE info)
alt ETL_SERVICE = LLAMACLOUD
Backend->>LlamaCloud: Parse file
LlamaCloud-->>Backend: Returns markdown documents
Backend->>Backend: Process markdown docs (add_received_file_document_using_llamacloud)
else ETL_SERVICE = UNSTRUCTURED
Backend->>Unstructured: Parse file
Unstructured-->>Backend: Returns processed elements
Backend->>Backend: Process elements (add_received_file_document_using_unstructured)
end
Backend-->>Frontend: Acknowledge upload
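The branch in the diagram above can be sketched in Python as a dispatch on the configured service. This is a minimal sketch, not the PR's actual code: `parse_with_llamacloud`, `parse_with_unstructured`, `PARSERS`, and `parse_uploaded_file` are hypothetical names, and the parser bodies are stubs standing in for the real network calls.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the real parser calls; the actual backend
# would call LlamaCloud or Unstructured here.
def parse_with_llamacloud(path: str) -> str:
    return f"llamacloud:{path}"


def parse_with_unstructured(path: str) -> str:
    return f"unstructured:{path}"


# Map each ETL_SERVICE value to the function that handles it.
PARSERS: Dict[str, Callable[[str], str]] = {
    "LLAMACLOUD": parse_with_llamacloud,
    "UNSTRUCTURED": parse_with_unstructured,
}


def parse_uploaded_file(etl_service: str, path: str) -> str:
    """Route an uploaded file to the parser selected by ETL_SERVICE."""
    parser = PARSERS.get(etl_service)
    if parser is None:
        raise ValueError(f"Unsupported ETL_SERVICE: {etl_service!r}")
    return parser(path)
```

A lookup table like this keeps the route handler free of a growing if/elif chain if further services are added later.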
Actionable comments posted: 6
🧹 Nitpick comments (4)
surfsense_backend/app/config/__init__.py (1)
101-101: Fix trailing whitespace issues.
Static analysis detected trailing whitespace on these lines. Please remove the trailing whitespace to maintain code quality standards.
 ETL_SERVICE = os.getenv("ETL_SERVICE")
-
+
 if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-
+
 elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-
-
+
+
Also applies to: 105-105, 109-109
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 101-101: Trailing whitespace
(C0303)
surfsense_backend/app/tasks/background_tasks.py (2)
292-292: Add a docstring to maintain consistency.
The function lacks a docstring while the new add_received_file_document_using_llamacloud function has one. For consistency and better documentation, please add a docstring.
 async def add_received_file_document_using_unstructured(
+    """
+    Process and store document content parsed by Unstructured.
+
+    Args:
+        session: Database session
+        file_name: Name of the processed file
+        unstructured_processed_elements: List of processed elements from Unstructured
+        search_space_id: ID of the search space
+
+    Returns:
+        Document object if successful, None if failed
+    """
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 292-292: Missing function or method docstring
(C0116)
[refactor] 292-292: Too many local variables (16/15)
(R0914)
379-380: Remove misleading comment.
The comment suggests combining multiple documents, but the code just assigns a single string.
-    # Combine all markdown documents into one
     file_in_markdown = llamacloud_markdown_document
surfsense_backend/app/routes/documents_routes.py (1)
10-10: Split long import statement for better readability.
The import line is 253 characters long, well over the 100-character limit. Consider splitting it across multiple lines.
-from app.tasks.background_tasks import add_received_markdown_file_document, add_extension_received_document, add_received_file_document_using_unstructured, add_crawled_url_document, add_youtube_video_document, add_received_file_document_using_llamacloud
+from app.tasks.background_tasks import (
+    add_received_markdown_file_document,
+    add_extension_received_document,
+    add_received_file_document_using_unstructured,
+    add_received_file_document_using_llamacloud,
+    add_crawled_url_document,
+    add_youtube_video_document
+)
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 10-10: Line too long (253/100)
(C0301)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
surfsense_backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
README.md (2 hunks)
surfsense_backend/.env.example (1 hunks)
surfsense_backend/app/config/__init__.py (1 hunks)
surfsense_backend/app/routes/documents_routes.py (4 hunks)
surfsense_backend/app/tasks/background_tasks.py (2 hunks)
surfsense_backend/pyproject.toml (1 hunks)
surfsense_web/.env.example (1 hunks)
surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1 hunks)
surfsense_web/content/docs/docker-installation.mdx (2 hunks)
surfsense_web/content/docs/manual-installation.mdx (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
surfsense_backend/app/routes/documents_routes.py (1)
surfsense_backend/app/tasks/background_tasks.py (2)
add_received_file_document_using_unstructured (292-357)
add_received_file_document_using_llamacloud (360-434)
🪛 Pylint (3.3.7)
surfsense_backend/app/config/__init__.py
[convention] 101-101: Trailing whitespace
(C0303)
[convention] 105-105: Trailing whitespace
(C0303)
[convention] 109-109: Trailing whitespace
(C0303)
[convention] 110-110: Trailing whitespace
(C0303)
surfsense_backend/app/tasks/background_tasks.py
[convention] 292-292: Missing function or method docstring
(C0116)
[refactor] 292-292: Too many local variables (16/15)
(R0914)
[convention] 389-389: Trailing whitespace
(C0303)
[convention] 391-391: Line too long (107/100)
(C0301)
[refactor] 360-360: Too many local variables (16/15)
(R0914)
[warning] 391-391: Use lazy % formatting in logging functions
(W1203)
[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'
(W0707)
surfsense_backend/app/routes/documents_routes.py
[convention] 10-10: Line too long (253/100)
(C0301)
[convention] 104-104: Trailing whitespace
(C0303)
[convention] 195-195: Trailing whitespace
(C0303)
[convention] 227-227: Trailing whitespace
(C0303)
[convention] 236-236: Trailing whitespace
(C0303)
[convention] 239-239: Trailing whitespace
(C0303)
[convention] 246-246: Trailing whitespace
(C0303)
[convention] 249-249: Trailing whitespace
(C0303)
[convention] 253-253: Trailing whitespace
(C0303)
[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)
(C0415)
[warning] 210-210: Reimport 'os' (imported line 18)
(W0404)
[convention] 210-210: Import outside toplevel (os)
(C0415)
[warning] 213-214: No exception type(s) specified
(W0702)
[convention] 224-224: Import outside toplevel (llama_cloud_services.LlamaParse)
(C0415)
[convention] 225-225: Import outside toplevel (llama_cloud_services.parse.utils.ResultType)
(C0415)
[warning] 241-241: Reimport 'os' (imported line 18)
(W0404)
[convention] 241-241: Import outside toplevel (os)
(C0415)
[warning] 244-245: No exception type(s) specified
(W0702)
[convention] 483-483: Trailing newlines
(C0305)
🪛 LanguageTool
README.md
[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...
(PERSONAL_PERSONNEL)
🪛 Ruff (0.11.9)
surfsense_backend/app/tasks/background_tasks.py
434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
surfsense_backend/app/routes/documents_routes.py
211-214: Use contextlib.suppress(Exception) instead of try-except-pass
Replace with contextlib.suppress(Exception)
(SIM105)
213-213: Do not use bare except
(E722)
242-245: Use contextlib.suppress(Exception) instead of try-except-pass
Replace with contextlib.suppress(Exception)
(SIM105)
244-244: Do not use bare except
(E722)
🔇 Additional comments (11)
surfsense_backend/pyproject.toml (1)
20-20: Details
✅ Verification successful
Verify the new dependency version and security status.
The addition of llama-cloud-services>=0.6.25 looks correct for supporting the new LlamaCloud ETL service. Please ensure this version exists and check for any security advisories.
🏁 Script executed:
#!/bin/bash
# Description: Verify llama-cloud-services package version and security status

# Check if the specified version exists on PyPI
curl -s https://pypi.org/pypi/llama-cloud-services/json | jq -r '.releases | keys[]' | grep -E '^0\.6\.(2[5-9]|[3-9][0-9])' | head -5

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "llama-cloud-services") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
Length of output: 496
Dependency Approved: llama-cloud-services>=0.6.25
Version 0.6.25 is available on PyPI and no PIP security advisories were found.
• surfsense_backend/pyproject.toml (line 20): llama-cloud-services>=0.6.25
surfsense_web/.env.example (1)
2-3: LGTM! Frontend ETL service configuration added correctly.
The addition of NEXT_PUBLIC_ETL_SERVICE with clear value options aligns well with the backend configuration. The NEXT_PUBLIC_ prefix correctly exposes this to the client-side code in Next.js.
surfsense_backend/.env.example (1)
35-38: Well-organized configuration section for ETL service selection.
The new "File Parser Service" section provides clear organization and documentation for the ETL service options. The variable naming is consistent and the available options are clearly documented.
surfsense_web/content/docs/manual-installation.mdx (2)
64-66: Excellent documentation of the new ETL service configuration!
The documentation clearly explains the new ETL_SERVICE environment variable and the conditional API key requirements. This will help users understand they only need to obtain the API key for their chosen ETL service.
187-187: Good addition of frontend ETL service configuration.
The NEXT_PUBLIC_ETL_SERVICE variable properly ensures frontend-backend consistency for file format handling in the upload interface.
README.md (2)
34-34: Updated file extension count reflects expanded support.
The increase from "34+ file extensions" to "50+ file extensions" accurately reflects the expanded capabilities when using LlamaCloud as the ETL service.
69-96: Excellent categorization of supported file formats!
The detailed breakdown of supported file formats by ETL service and category (Documents & Text, Presentations, etc.) provides clear guidance for users. This makes it easy to understand the capabilities and differences between UNSTRUCTURED and LLAMACLOUD services.
surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1)
45-133: Excellent implementation of dynamic file type support!
The refactoring properly:
- Separates audio files (always supported via Whisper) from document parsing files
- Uses environment variable to conditionally select between LLAMACLOUD and UNSTRUCTURED file types
- Includes comprehensive MIME type mappings for both services
- Maintains clean separation of concerns with well-organized type definitions
The implementation correctly handles the frontend-backend consistency for file upload validation based on the configured ETL service.
surfsense_web/content/docs/docker-installation.mdx (2)
93-95: Consistent documentation across installation methods.
The ETL service configuration documentation is perfectly consistent with the manual installation guide, ensuring users have clear guidance regardless of their deployment method.
141-141: Frontend environment variable properly documented.
The NEXT_PUBLIC_ETL_SERVICE variable documentation correctly emphasizes the need for frontend-backend consistency and its impact on file format support.
surfsense_backend/app/routes/documents_routes.py (1)
250-260:Details
❌ Incorrect review comment
Review document processing logic for LlamaCloud.
The loop processes each markdown document individually, but add_received_file_document_using_llamacloud expects a single document string, not a list. This could lead to:
- Multiple database entries for a single uploaded file
- Only the last document being retained if documents override each other
Consider either:
- Combining all markdown documents before processing, or
- Updating the function to handle multiple documents appropriately
🏁 Script executed:
#!/bin/bash
# Check how LlamaCloud's get_markdown_documents behaves
rg -A 10 "get_markdown_documents|aget_markdown_documents" --type py
Length of output: 1197
🏁 Script executed:
#!/bin/bash
set -e
# Find definition of add_received_file_document_using_llamacloud
rg -R "def add_received_file_document_using_llamacloud" -A 5 --type py
Length of output: 486
🏁 Script executed:
#!/bin/bash
set -e
# Find all occurrences of add_received_file_document_using_llamacloud to locate its definition
rg "add_received_file_document_using_llamacloud" -n --type py
Length of output: 607
🏁 Script executed:
#!/bin/bash
set -e
# Extract the full implementation of add_received_file_document_using_llamacloud
rg -n "async def add_received_file_document_using_llamacloud" -A 50 --type py surfsense_backend/app/tasks/background_tasks.py
Length of output: 2205
🏁 Script executed:
#!/bin/bash
set -e
# Show context around the usage of markdown_documents in documents_routes.py
rg -n "markdown_documents" -C 10 --type py surfsense_backend/app/routes/documents_routes.py
Length of output: 1184
Ignore loop retention concerns for LlamaCloud processing
The call to
markdown_documents = await result.aget_markdown_documents(split_by_page=False)
with split_by_page=False yields a single-element list, so the for-loop executes exactly once. Also, add_received_file_document_using_llamacloud's signature and implementation expect a single markdown string (not a list), so the current per-item invocation is correct and will not create duplicate or overridden entries. No changes required.
Likely an incorrect or invalid review comment.
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 253-253: Trailing whitespace
(C0303)
# ETL Service
ETL_SERVICE = os.getenv("ETL_SERVICE")

if ETL_SERVICE == "UNSTRUCTURED":
    # Unstructured API Key
    UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")

elif ETL_SERVICE == "LLAMACLOUD":
    # LlamaCloud API Key
    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
🛠️ Refactor suggestion
Add validation and error handling for ETL service configuration.
The conditional loading logic is correct, but consider adding validation to ensure robustness:
- Validate ETL_SERVICE value - Currently no validation if an invalid value is provided
- Error handling for missing API keys - No check if the required API key is missing
- Default behavior - No fallback if ETL_SERVICE is not set
 # ETL Service
 ETL_SERVICE = os.getenv("ETL_SERVICE")
-if ETL_SERVICE == "UNSTRUCTURED":
+if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-
-elif ETL_SERVICE == "LLAMACLOUD":
+    if not UNSTRUCTURED_API_KEY:
+        raise ValueError("UNSTRUCTURED_API_KEY is required when ETL_SERVICE is set to 'UNSTRUCTURED'")
+
+elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-
-
+    if not LLAMA_CLOUD_API_KEY:
+        raise ValueError("LLAMA_CLOUD_API_KEY is required when ETL_SERVICE is set to 'LLAMACLOUD'")
+
+elif ETL_SERVICE is not None:
+    raise ValueError(f"Invalid ETL_SERVICE value: '{ETL_SERVICE}'. Must be 'UNSTRUCTURED' or 'LLAMACLOUD'")
+else:
+    raise ValueError("ETL_SERVICE environment variable is required")
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 101-101: Trailing whitespace
(C0303)
[convention] 105-105: Trailing whitespace
(C0303)
[convention] 109-109: Trailing whitespace
(C0303)
🤖 Prompt for AI Agents
In surfsense_backend/app/config/__init__.py around lines 99 to 109, add
validation to check if ETL_SERVICE has a valid value (e.g., "UNSTRUCTURED" or
"LLAMACLOUD") and raise an error if it is invalid or not set. Also, after
loading the API key for the selected service, verify that the key is present and
raise an error if missing. Finally, implement a default behavior or error
handling for when ETL_SERVICE is not set to ensure robustness.
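The validation described above can also be written as a small function that takes the environment as an argument, which makes it testable without touching real environment variables. This is a sketch under assumptions: `load_etl_config` and `REQUIRED_KEY_BY_SERVICE` are hypothetical names that do not appear in the PR; only the variable names and error messages are taken from the suggestion.

```python
from typing import Mapping, Tuple

# Each supported ETL_SERVICE value and the API key variable it requires.
REQUIRED_KEY_BY_SERVICE = {
    "UNSTRUCTURED": "UNSTRUCTURED_API_KEY",
    "LLAMACLOUD": "LLAMA_CLOUD_API_KEY",
}


def load_etl_config(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (service, api_key), or raise ValueError if misconfigured."""
    service = env.get("ETL_SERVICE")
    if service is None:
        raise ValueError("ETL_SERVICE environment variable is required")
    if service not in REQUIRED_KEY_BY_SERVICE:
        raise ValueError(
            f"Invalid ETL_SERVICE value: '{service}'. "
            "Must be 'UNSTRUCTURED' or 'LLAMACLOUD'"
        )
    key_name = REQUIRED_KEY_BY_SERVICE[service]
    api_key = env.get(key_name)
    if not api_key:  # covers both a missing and an empty value
        raise ValueError(
            f"{key_name} is required when ETL_SERVICE is set to '{service}'"
        )
    return service, api_key
```

At startup this would be called as `load_etl_config(os.environ)`, so a misconfigured deployment fails immediately rather than on the first upload.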
        raise db_error
    except Exception as e:
        await session.rollback()
        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
Use explicit exception chaining.
When re-raising exceptions, use from e to preserve the exception chain for better debugging.
- raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
+ raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}") from e
🧰 Tools
🪛 Ruff (0.11.9)
434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🪛 Pylint (3.3.7)
[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'
(W0707)
🤖 Prompt for AI Agents
In surfsense_backend/app/tasks/background_tasks.py at line 434, the RuntimeError
is raised without explicit exception chaining. Modify the raise statement to use
'from e' to preserve the original exception context, improving debugging by
maintaining the exception chain.
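The effect of `from e` can be seen in a small standalone example. This is an illustrative sketch rather than the PR's code: `parse_page_count` is a hypothetical function chosen only because `int()` gives an easy way to trigger an inner exception.

```python
def parse_page_count(raw: str) -> int:
    """Convert a raw string to an int, wrapping failures in RuntimeError."""
    try:
        return int(raw)
    except ValueError as e:
        # "from e" stores the original exception on __cause__, so the
        # traceback shows both errors and marks one as the direct cause.
        raise RuntimeError(f"Failed to parse page count: {e}") from e


try:
    parse_page_count("not-a-number")
except RuntimeError as err:
    original = err.__cause__  # the ValueError raised by int()
```

Without `from e`, Python still records the inner exception on `__context__`, but the traceback describes it as occurring "during handling" of another error instead of as the direct cause.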
    llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
    search_space_id: ID of the search space
Fix docstring parameter description.
The docstring incorrectly describes the parameter as a list when it's actually a single string.
- llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
+ llamacloud_markdown_document: Markdown content from LlamaCloud parsing
🤖 Prompt for AI Agents
In surfsense_backend/app/tasks/background_tasks.py around lines 372 to 373, the
docstring incorrectly describes the parameter 'llamacloud_markdown_documents' as
a list, but it is actually a single string. Update the docstring to reflect that
'llamacloud_markdown_documents' is a string, not a list, ensuring the parameter
description accurately matches its type.
    except:
        pass
Replace bare except with specific exception handling.
Using bare except: can mask unexpected errors. Use specific exception types or at least Exception.
- except:
- pass
+ except (OSError, IOError):
+     pass  # File might already be deleted
Also applies to: 244-245
🧰 Tools
🪛 Ruff (0.11.9)
211-214: Use contextlib.suppress(Exception) instead of try-except-pass
Replace with contextlib.suppress(Exception)
(SIM105)
213-213: Do not use bare except
(E722)
🪛 Pylint (3.3.7)
[warning] 213-214: No exception type(s) specified
(W0702)
🤖 Prompt for AI Agents
In surfsense_backend/app/routes/documents_routes.py at lines 213-214 and
244-245, replace the bare except clauses with specific exception handling by
catching Exception or more specific exceptions if known. This prevents masking
unexpected errors and improves error visibility and debugging.
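For the temporary-file cleanup case, the two tool suggestions can be combined: `contextlib.suppress` replaces the try/except/pass, and a narrow exception type avoids hiding unrelated errors. A minimal sketch, assuming the cleanup is a plain file delete; `remove_temp_file` is a hypothetical helper name, not a function in the PR.

```python
import contextlib
import os


def remove_temp_file(path: str) -> None:
    """Delete a temporary file, ignoring only filesystem errors."""
    # OSError covers a missing file, permission problems, and similar cases.
    # In Python 3, IOError is an alias of OSError, so listing both is redundant.
    with contextlib.suppress(OSError):
        os.unlink(path)
```

Suppressing `OSError` rather than `Exception` means a programming error such as passing `None` as the path still raises a `TypeError` instead of being silently ignored.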
from langchain_unstructured import UnstructuredLoader
🛠️ Refactor suggestion
Move imports to module level.
Importing modules inside functions is an anti-pattern that affects performance and readability. Move all imports to the top of the file.
Add these imports at the module level:
from langchain_unstructured import UnstructuredLoader
from llama_cloud_services import LlamaParse
from llama_cloud_services.parse.utils import ResultType
Then remove the redundant import os statements on lines 210 and 241 since os is already imported at line 18.
Also applies to: 210-210, 224-225, 241-241
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 195-195: Trailing whitespace
(C0303)
[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)
(C0415)
🤖 Prompt for AI Agents
In surfsense_backend/app/routes/documents_routes.py around lines 194-195 and
also lines 210, 224-225, and 241, move the imports of UnstructuredLoader,
LlamaParse, and ResultType to the top of the file with other module-level
imports. Remove the redundant os imports at lines 210 and 241 since os is
already imported at line 18. This will improve performance and readability by
avoiding repeated imports inside functions.
Actionable comments posted: 0
🧹 Nitpick comments (10)
README.md (10)
34-34: Revise phrasing for clarity and grammar
The parenthetical clause “(Documents, images, videos and supports 50+ file extensions)” reads awkwardly. Consider rephrasing to something like:
“Documents, images, and videos (supporting 50+ file extensions)”
🧰 Tools
🪛 LanguageTool
[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...(PERSONAL_PERSONNEL)
72-72: Consider sorting file extensions alphabetically
Alphabetical ordering will improve scanability for end users.
74-74: Consider sorting file extensions alphabetically
Apply the same alphabetical ordering to the Unstructured list for consistency.
77-77: Consider sorting file extensions alphabetically
Consistency across lists helps users locate their format more quickly.
82-82: Consider sorting file extensions alphabetically
An ordered list will aid readability, especially as the list grows.
84-84: Consider sorting file extensions alphabetically
Match the ordering style used in other categories.
87-87: Consider sorting file extensions alphabetically
Helps users quickly locate a specific image format.
89-89: Consider sorting file extensions alphabetically
Maintain the same pattern across LlamaCloud and Unstructured lists.
91-92: Clarify scope and ordering for audio/video formats
- Specify that these “Always Supported” formats apply to both ETL services.
- Sort the extensions alphabetically for consistency.
95-95: Consider sorting file extensions alphabetically
Ensure consistency with other lists by ordering .eml, .msg, .p7s.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (2 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...
(PERSONAL_PERSONNEL)
🔇 Additional comments (8)
README.md (8)
67-67: Approve new section header
The “Supported File Extensions” section clearly groups all format lists under a single heading.
69-69: Verify ETL service support counts
The note states “LlamaCloud supports 100+ formats, while Unstructured supports 34+ core formats.” Ensure these numbers match the actual lists (and link to upstream docs if available).
71-71: Approve ‘Documents & Text’ subheading
This category clearly delineates text-based document formats.
76-76: Approve ‘Presentations’ subheading
Clear categorization for slide deck formats.
79-79: Approve Unstructured presentations list
The two core formats are correctly called out.
81-81: Approve ‘Spreadsheets & Data’ subheading
Appropriate grouping for tabular and data file types.
86-86: Approve ‘Images’ subheading
Image formats are now clearly separated.
94-94: Approve ‘Email & Communication’ subheading
This clearly identifies message-based file types.
Actionable comments posted: 0
🧹 Nitpick comments (3)
README.md (3)
34-34: Refine phrasing for clarity and grammar
The current sentence
Save content from your own personal files (Documents, images, videos and supports 50+ file extensions) to your own personal knowledge base.
is awkward and contains redundancy (“and supports”). Consider rephrasing to improve readability and consistency, for example:
Save content from your personal files (documents, images, videos, etc.), supporting **50+ file extensions**, to your knowledge base.
🧰 Tools
🪛 LanguageTool
[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...(PERSONAL_PERSONNEL)
67-69: Link to configuration and ensure consistency with environment variable naming
This “Note” refers to your ETL service names (“LlamaCloud” and “Unstructured”), but the actual env var (ETL_SERVICE) expects uppercase values (LLAMACLOUD, UNSTRUCTURED). To avoid confusion:
- Add a link to the .env.example files (surfsense_backend/.env.example, surfsense_web/.env.example)
- Spell out the exact values users should set
e.g.,
> **Note**: File format support depends on the `ETL_SERVICE` setting (e.g., `LLAMACLOUD` or `UNSTRUCTURED`). See `surfsense_backend/.env.example` for details.
71-95: Enhance readability and maintainability of extension lists
Maintaining a long, flat list of extensions in markdown can be error-prone and hard for users to scan. You might consider:
- Converting each category into a Markdown table with columns for LlamaCloud vs. Unstructured
- Wrapping each category in a collapsible <details> block
- Generating the lists automatically from a shared source or script
This will improve both UX and ease future updates.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (2 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...
(PERSONAL_PERSONNEL)
feat: Removed Hard Dependency on Unstructured.io
- Added DOCLING as third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection
Addresses MODSetter#161 - Add Docling Support as an ETL_SERVICE
Follows same pattern as LlamaCloud integration (PR MODSetter#123)
Description
Removed the hard dependency on Unstructured.io because it recently limited new sign-ups.
Motivation and Context
Fixes #113