feat: Removed Hard Dependency on Unstructured.io by MODSetter · Pull Request #123 · MODSetter/SurfSense

MODSetter · 2025-05-31T02:19:21Z

Added Llamaparse Support :)

Description

Removed the hard dependency on Unstructured.io because it recently limited new sign-ups.

Motivation and Context

Fixes #113

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance improvement (non-breaking change which enhances performance)
Documentation update
Breaking change (fix or feature that would cause existing functionality to change)

Testing

I have tested these changes locally
I have added/updated unit tests
I have added/updated integration tests

Checklist:

My code follows the code style of this project
My change requires documentation updates
I have updated the documentation accordingly
My change requires dependency updates
I have updated the dependencies accordingly
My code builds clean without any errors or warnings
All new and existing tests passed

Summary by CodeRabbit

New Features
- Added support for selecting between two document parsing services (Unstructured and LlamaCloud), enabling broader file format compatibility for uploads.
- Expanded the range of supported file formats for document uploads, with LlamaCloud supporting 50+ formats and Unstructured supporting 34+ core formats.
- The upload interface now dynamically adapts to show supported file types based on the selected parsing service.
Documentation
- Updated installation and configuration guides to detail new environment variables and clarify API key requirements for each parsing service.
- README now provides categorized, detailed lists of supported file formats for both services.
Chores
- Example environment files updated to reflect new configuration options and variables.

vercel · 2025-05-31T02:19:25Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
surf-sense-frontend	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 31, 2025 2:32am

coderabbitai · 2025-05-31T02:19:27Z

Caution

Review failed

The pull request is closed.

Walkthrough

The changes introduce support for selecting between two document parsing services, "UNSTRUCTURED" and "LLAMACLOUD", via new environment variables in both backend and frontend configurations. The backend logic now conditionally processes uploaded files using the chosen service, and the frontend dynamically adjusts accepted file types. Documentation and example environment files are updated to reflect these options.

Changes

File(s)	Change Summary
README.md	Expanded and restructured supported file format documentation, distinguishing between LlamaCloud and Unstructured services, with categorized extension lists.
surfsense_backend/.env.example surfsense_web/.env.example	Added `ETL_SERVICE` and corresponding API key variables for backend and frontend; reintroduced and reorganized `UNSTRUCTURED_API_KEY`; added `LLAMA_CLOUD_API_KEY` and `NEXT_PUBLIC_ETL_SERVICE`.
surfsense_backend/app/config/__init__.py	Modified config to conditionally load the appropriate API key based on `ETL_SERVICE` value.
surfsense_backend/app/routes/documents_routes.py	Updated imports and background processing logic to branch between Unstructured and LlamaCloud file processing based on `ETL_SERVICE`.
surfsense_backend/app/tasks/background_tasks.py	Renamed and split background task functions for Unstructured and LlamaCloud; added new function to process LlamaCloud markdown documents.
surfsense_backend/pyproject.toml	Added `llama-cloud-services >=0.6.25` as a dependency.
surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx	Refactored accepted file types logic to dynamically select supported formats based on `NEXT_PUBLIC_ETL_SERVICE`, expanding support for LlamaCloud.
surfsense_web/content/docs/docker-installation.mdx surfsense_web/content/docs/manual-installation.mdx	Updated documentation to explain new ETL service selection, related API keys, and frontend variable; clarified configuration for both Unstructured and LlamaCloud services.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant LlamaCloud
    participant Unstructured

    User->>Frontend: Uploads file
    Frontend->>Backend: Sends file (with ETL_SERVICE info)
    alt ETL_SERVICE = LLAMACLOUD
        Backend->>LlamaCloud: Parse file
        LlamaCloud-->>Backend: Returns markdown documents
        Backend->>Backend: Process markdown docs (add_received_file_document_using_llamacloud)
    else ETL_SERVICE = UNSTRUCTURED
        Backend->>Unstructured: Parse file
        Unstructured-->>Backend: Returns processed elements
        Backend->>Backend: Process elements (add_received_file_document_using_unstructured)
    end
    Backend-->>Frontend: Acknowledge upload
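
The branching in the diagram above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the PR: `parse_with_unstructured`, `parse_with_llamacloud`, `PARSERS`, and `process_upload` are hypothetical names, and the parser functions are stubs standing in for the real calls to Unstructured and LlamaCloud.

```python
import asyncio
from typing import Awaitable, Callable, Dict


# Hypothetical stand-ins for the two parsing paths. The real backend calls
# the Unstructured or LlamaCloud service here and then stores the result.
async def parse_with_unstructured(path: str) -> str:
    return f"unstructured:{path}"


async def parse_with_llamacloud(path: str) -> str:
    return f"llamacloud:{path}"


# Map each supported ETL_SERVICE value to its parsing coroutine.
PARSERS: Dict[str, Callable[[str], Awaitable[str]]] = {
    "UNSTRUCTURED": parse_with_unstructured,
    "LLAMACLOUD": parse_with_llamacloud,
}


async def process_upload(path: str, etl_service: str) -> str:
    """Route an uploaded file to the parser selected by ETL_SERVICE."""
    try:
        parser = PARSERS[etl_service]
    except KeyError:
        raise ValueError(f"Unsupported ETL_SERVICE: {etl_service!r}") from None
    return await parser(path)


if __name__ == "__main__":
    print(asyncio.run(process_upload("report.pdf", "LLAMACLOUD")))
```

A lookup table keeps the set of valid service names in one place, so an unknown value fails loudly instead of silently skipping the file.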

Possibly related PRs

Fix #33: Refactored code #88: Related by touching the add_received_file_document function but differs in purpose and implementation details, focusing on formatting and adding YouTube video ingestion.
chore: Fixed markdown handling & incorrect imports. #76: Related in handling file processing and background tasks but differs by not supporting multiple ETL services or changing processing logic fundamentally.

Suggested reviewers

MODSetter

Poem

A rabbit leaps through code anew,
With LlamaCloud and Unstructured too!
Now docs of many stripes and spots,
Can hop right in—no second thoughts.
Environment keys in tidy rows,
The upload garden wider grows.
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0dbcf56 and 0365feb.

📒 Files selected for processing (1)

surfsense_backend/app/tasks/background_tasks.py (3 hunks)

recurseml · 2025-05-31T02:20:46Z

✨ No issues found! Your code is sparkling clean! ✨

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (4)

surfsense_backend/app/config/__init__.py (1)
101-101: Fix trailing whitespace issues.

Static analysis detected trailing whitespace on these lines. Please remove the trailing whitespace to maintain code quality standards.
 ETL_SERVICE = os.getenv("ETL_SERVICE")
-    
+
 if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-        
+        
 elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-        
-        
+        
+        
Also applies to: 105-105, 109-109

🧰 Tools

🪛 Pylint (3.3.7)

[convention] 101-101: Trailing whitespace

(C0303)
surfsense_backend/app/tasks/background_tasks.py (2)
292-292: Add a docstring to maintain consistency.

The function lacks a docstring while the new add_received_file_document_using_llamacloud function has one. For consistency and better documentation, please add a docstring.
 async def add_received_file_document_using_unstructured(
+    """
+    Process and store document content parsed by Unstructured.
+
+    Args:
+        session: Database session
+        file_name: Name of the processed file
+        unstructured_processed_elements: List of processed elements from Unstructured
+        search_space_id: ID of the search space
+
+    Returns:
+        Document object if successful, None if failed
+    """
🧰 Tools

🪛 Pylint (3.3.7)

[convention] 292-292: Missing function or method docstring

(C0116)

[refactor] 292-292: Too many local variables (16/15)

(R0914)

379-380: Remove misleading comment.

The comment suggests combining multiple documents, but the code just assigns a single string.
-        # Combine all markdown documents into one
         file_in_markdown = llamacloud_markdown_document
surfsense_backend/app/routes/documents_routes.py (1)
10-10: Split long import statement for better readability.

The import line is 253 characters long, well over the 100-character limit. Consider splitting it across multiple lines.
-from app.tasks.background_tasks import add_received_markdown_file_document, add_extension_received_document, add_received_file_document_using_unstructured, add_crawled_url_document, add_youtube_video_document, add_received_file_document_using_llamacloud
+from app.tasks.background_tasks import (
+    add_received_markdown_file_document,
+    add_extension_received_document,
+    add_received_file_document_using_unstructured,
+    add_received_file_document_using_llamacloud,
+    add_crawled_url_document,
+    add_youtube_video_document
+)
🧰 Tools

🪛 Pylint (3.3.7)

[convention] 10-10: Line too long (253/100)

(C0301)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5737ea8 and 73751c0.

⛔ Files ignored due to path filters (1)

surfsense_backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

README.md (2 hunks)
surfsense_backend/.env.example (1 hunks)
surfsense_backend/app/config/__init__.py (1 hunks)
surfsense_backend/app/routes/documents_routes.py (4 hunks)
surfsense_backend/app/tasks/background_tasks.py (2 hunks)
surfsense_backend/pyproject.toml (1 hunks)
surfsense_web/.env.example (1 hunks)
surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1 hunks)
surfsense_web/content/docs/docker-installation.mdx (2 hunks)
surfsense_web/content/docs/manual-installation.mdx (2 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

surfsense_backend/app/routes/documents_routes.py (1)

surfsense_backend/app/tasks/background_tasks.py (2)

add_received_file_document_using_unstructured (292-357)

add_received_file_document_using_llamacloud (360-434)

🪛 Pylint (3.3.7)

surfsense_backend/app/config/__init__.py

[convention] 101-101: Trailing whitespace

(C0303)

[convention] 105-105: Trailing whitespace

(C0303)

[convention] 109-109: Trailing whitespace

(C0303)

[convention] 110-110: Trailing whitespace

(C0303)

surfsense_backend/app/tasks/background_tasks.py

[convention] 292-292: Missing function or method docstring

(C0116)

[refactor] 292-292: Too many local variables (16/15)

(R0914)

[convention] 389-389: Trailing whitespace

(C0303)

[convention] 391-391: Line too long (107/100)

(C0301)

[refactor] 360-360: Too many local variables (16/15)

(R0914)

[warning] 391-391: Use lazy % formatting in logging functions

(W1203)

[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'

(W0707)

surfsense_backend/app/routes/documents_routes.py

[convention] 10-10: Line too long (253/100)

(C0301)

[convention] 104-104: Trailing whitespace

(C0303)

[convention] 195-195: Trailing whitespace

(C0303)

[convention] 227-227: Trailing whitespace

(C0303)

[convention] 236-236: Trailing whitespace

(C0303)

[convention] 239-239: Trailing whitespace

(C0303)

[convention] 246-246: Trailing whitespace

(C0303)

[convention] 249-249: Trailing whitespace

(C0303)

[convention] 253-253: Trailing whitespace

(C0303)

[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)

(C0415)

[warning] 210-210: Reimport 'os' (imported line 18)

(W0404)

[convention] 210-210: Import outside toplevel (os)

(C0415)

[warning] 213-214: No exception type(s) specified

(W0702)

[convention] 224-224: Import outside toplevel (llama_cloud_services.LlamaParse)

(C0415)

[convention] 225-225: Import outside toplevel (llama_cloud_services.parse.utils.ResultType)

(C0415)

[warning] 241-241: Reimport 'os' (imported line 18)

(W0404)

[convention] 241-241: Import outside toplevel (os)

(C0415)

[warning] 244-245: No exception type(s) specified

(W0702)

[convention] 483-483: Trailing newlines

(C0305)

🪛 LanguageTool

README.md

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)

🪛 Ruff (0.11.9)

surfsense_backend/app/tasks/background_tasks.py

434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

surfsense_backend/app/routes/documents_routes.py

211-214: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)

213-213: Do not use bare except

(E722)

242-245: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)

244-244: Do not use bare except

(E722)

🔇 Additional comments (11)

surfsense_backend/pyproject.toml (1)

20-20:

✅ Verification successful

Verify the new dependency version and security status.

The addition of llama-cloud-services>=0.6.25 looks correct for supporting the new LlamaCloud ETL service. Please ensure this version exists and check for any security advisories.

🏁 Script executed:
#!/bin/bash
# Description: Verify llama-cloud-services package version and security status

# Check if the specified version exists on PyPI
curl -s https://pypi.org/pypi/llama-cloud-services/json | jq -r '.releases | keys[]' | grep -E '^0\.6\.(2[5-9]|[3-9][0-9])' | head -5

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "llama-cloud-services") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
Length of output: 496
Dependency Approved: llama-cloud-services>=0.6.25

Version 0.6.25 is available on PyPI and no PIP security advisories were found.

• surfsense_backend/pyproject.toml (line 20): llama-cloud-services>=0.6.25
surfsense_web/.env.example (1)

2-3: LGTM! Frontend ETL service configuration added correctly.

The addition of NEXT_PUBLIC_ETL_SERVICE with clear value options aligns well with the backend configuration. The NEXT_PUBLIC_ prefix correctly exposes this to the client-side code in Next.js.

surfsense_backend/.env.example (1)

35-38: Well-organized configuration section for ETL service selection.

The new "File Parser Service" section provides clear organization and documentation for the ETL service options. The variable naming is consistent and the available options are clearly documented.

surfsense_web/content/docs/manual-installation.mdx (2)

64-66: Excellent documentation of the new ETL service configuration!

The documentation clearly explains the new ETL_SERVICE environment variable and the conditional API key requirements. This will help users understand they only need to obtain the API key for their chosen ETL service.

187-187: Good addition of frontend ETL service configuration.

The NEXT_PUBLIC_ETL_SERVICE variable properly ensures frontend-backend consistency for file format handling in the upload interface.

README.md (2)

34-34: Updated file extension count reflects expanded support.

The increase from "34+ file extensions" to "50+ file extensions" accurately reflects the expanded capabilities when using LlamaCloud as the ETL service.

🧰 Tools

🪛 LanguageTool

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)

69-96: Excellent categorization of supported file formats!

The detailed breakdown of supported file formats by ETL service and category (Documents & Text, Presentations, etc.) provides clear guidance for users. This makes it easy to understand the capabilities and differences between UNSTRUCTURED and LLAMACLOUD services.

surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1)

45-133: Excellent implementation of dynamic file type support!

The refactoring properly:
- Separates audio files (always supported via Whisper) from document parsing files
- Uses environment variable to conditionally select between LLAMACLOUD and UNSTRUCTURED file types
- Includes comprehensive MIME type mappings for both services
- Maintains clean separation of concerns with well-organized type definitions

The implementation correctly handles the frontend-backend consistency for file upload validation based on the configured ETL service.
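
The same selection rule could also be mirrored on the backend as an upload check. The Python sketch below is only an illustration of that idea: the extension sets are small placeholder subsets (the real lists live in the README and in page.tsx), and `AUDIO_EXTENSIONS`, `DOCUMENT_EXTENSIONS`, `accepted_extensions`, and `is_accepted` are hypothetical names. Falling back to audio-only for an unknown service is an assumption of this sketch, not behavior confirmed by the PR.

```python
from pathlib import Path
from typing import Dict, Set

# Placeholder subsets for illustration only; not the project's real lists.
AUDIO_EXTENSIONS: Set[str] = {".mp3", ".wav"}
DOCUMENT_EXTENSIONS: Dict[str, Set[str]] = {
    "UNSTRUCTURED": {".pdf", ".docx", ".eml"},
    "LLAMACLOUD": {".pdf", ".docx", ".epub"},
}


def accepted_extensions(etl_service: str) -> Set[str]:
    """Audio is always accepted; document types depend on the ETL service."""
    # Assumption: an unknown service contributes no document types.
    return AUDIO_EXTENSIONS | DOCUMENT_EXTENSIONS.get(etl_service, set())


def is_accepted(file_name: str, etl_service: str) -> bool:
    """Check a file name against the accepted set, ignoring extension case."""
    return Path(file_name).suffix.lower() in accepted_extensions(etl_service)
```

Keeping the audio set separate from the per-service document sets matches the separation described in the comment above.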

surfsense_web/content/docs/docker-installation.mdx (2)

93-95: Consistent documentation across installation methods.

The ETL service configuration documentation is perfectly consistent with the manual installation guide, ensuring users have clear guidance regardless of their deployment method.

141-141: Frontend environment variable properly documented.

The NEXT_PUBLIC_ETL_SERVICE variable documentation correctly emphasizes the need for frontend-backend consistency and its impact on file format support.
surfsense_backend/app/routes/documents_routes.py (1)

250-260:

❌ Incorrect review comment

Review document processing logic for LlamaCloud.

The loop processes each markdown document individually, but add_received_file_document_using_llamacloud expects a single document string, not a list. This could lead to:
- Multiple database entries for a single uploaded file
- Only the last document being retained if documents override each other

Consider either:
- Combining all markdown documents before processing, or
- Updating the function to handle multiple documents appropriately

🏁 Script executed:
#!/bin/bash
# Check how LlamaCloud's get_markdown_documents behaves
rg -A 10 "get_markdown_documents|aget_markdown_documents" --type py
Length of output: 1197

🏁 Script executed:
#!/bin/bash
set -e
# Find definition of add_received_file_document_using_llamacloud
rg -R "def add_received_file_document_using_llamacloud" -A 5 --type py
Length of output: 486

🏁 Script executed:
#!/bin/bash
set -e
# Find all occurrences of add_received_file_document_using_llamacloud to locate its definition
rg "add_received_file_document_using_llamacloud" -n --type py
Length of output: 607

🏁 Script executed:
#!/bin/bash
set -e
# Extract the full implementation of add_received_file_document_using_llamacloud
rg -n "async def add_received_file_document_using_llamacloud" -A 50 --type py surfsense_backend/app/tasks/background_tasks.py
Length of output: 2205

🏁 Script executed:
#!/bin/bash
set -e
# Show context around the usage of markdown_documents in documents_routes.py
rg -n "markdown_documents" -C 10 --type py surfsense_backend/app/routes/documents_routes.py
Length of output: 1184
Ignore loop retention concerns for LlamaCloud processing

The call to
markdown_documents = await result.aget_markdown_documents(split_by_page=False)
with split_by_page=False yields a single element list, so the for-loop executes exactly once. Also, add_received_file_document_using_llamacloud’s signature and implementation expect a single markdown string (not a list), so the current per-item invocation is correct and will not create duplicate or overridden entries. No changes required.

Likely an incorrect or invalid review comment.
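
If page splitting were ever enabled, the first option raised in the original comment (combining the markdown documents before processing) might look like the sketch below. `MarkdownDoc` is a hypothetical stand-in for whatever object the parser returns, assumed here to expose a `text` attribute; `combine_markdown` is likewise an illustrative name.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MarkdownDoc:
    """Hypothetical stand-in for a parsed document with a `text` field."""
    text: str


def combine_markdown(docs: Iterable[MarkdownDoc]) -> str:
    """Join non-empty document texts with a blank line between them."""
    parts = [doc.text.strip() for doc in docs if doc.text.strip()]
    return "\n\n".join(parts)
```

With a single-element list the function simply returns that document's text, so it would not change the current behavior.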

🧰 Tools

🪛 Pylint (3.3.7)

[convention] 253-253: Trailing whitespace

(C0303)

coderabbitai · 2025-05-31T02:26:11Z

surfsense_backend/app/config/__init__.py

+    # ETL Service
+    ETL_SERVICE = os.getenv("ETL_SERVICE")

+    if ETL_SERVICE == "UNSTRUCTURED":
+        # Unstructured API Key
+        UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
+
+    elif ETL_SERVICE == "LLAMACLOUD":
+        # LlamaCloud API Key
+        LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
+


🛠️ Refactor suggestion

Add validation and error handling for ETL service configuration.

The conditional loading logic is correct, but consider adding validation to ensure robustness:

- Validate ETL_SERVICE value - Currently no validation if an invalid value is provided
- Error handling for missing API keys - No check if the required API key is missing
- Default behavior - No fallback if ETL_SERVICE is not set

 # ETL Service
 ETL_SERVICE = os.getenv("ETL_SERVICE")

-if ETL_SERVICE == "UNSTRUCTURED":
+if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-
-elif ETL_SERVICE == "LLAMACLOUD":
+    if not UNSTRUCTURED_API_KEY:
+        raise ValueError("UNSTRUCTURED_API_KEY is required when ETL_SERVICE is set to 'UNSTRUCTURED'")
+
+elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-
-
+    if not LLAMA_CLOUD_API_KEY:
+        raise ValueError("LLAMA_CLOUD_API_KEY is required when ETL_SERVICE is set to 'LLAMACLOUD'")
+
+elif ETL_SERVICE is not None:
+    raise ValueError(f"Invalid ETL_SERVICE value: '{ETL_SERVICE}'. Must be 'UNSTRUCTURED' or 'LLAMACLOUD'")
+else:
+    raise ValueError("ETL_SERVICE environment variable is required")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-# ETL Service
-ETL_SERVICE = os.getenv("ETL_SERVICE")
-
-if ETL_SERVICE == "UNSTRUCTURED":
-    # Unstructured API Key
-    UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-
-elif ETL_SERVICE == "LLAMACLOUD":
-    # LlamaCloud API Key
-    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
+# ETL Service
+ETL_SERVICE = os.getenv("ETL_SERVICE")
+
+if ETL_SERVICE == "UNSTRUCTURED":
+    # Unstructured API Key
+    UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
+    if not UNSTRUCTURED_API_KEY:
+        raise ValueError("UNSTRUCTURED_API_KEY is required when ETL_SERVICE is set to 'UNSTRUCTURED'")
+
+elif ETL_SERVICE == "LLAMACLOUD":
+    # LlamaCloud API Key
+    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
+    if not LLAMA_CLOUD_API_KEY:
+        raise ValueError("LLAMA_CLOUD_API_KEY is required when ETL_SERVICE is set to 'LLAMACLOUD'")
+
+elif ETL_SERVICE is not None:
+    raise ValueError(f"Invalid ETL_SERVICE value: '{ETL_SERVICE}'. Must be 'UNSTRUCTURED' or 'LLAMACLOUD'")
+else:
+    raise ValueError("ETL_SERVICE environment variable is required")

🧰 Tools

🪛 Pylint (3.3.7)

[convention] 101-101: Trailing whitespace

(C0303)

[convention] 105-105: Trailing whitespace

(C0303)

[convention] 109-109: Trailing whitespace

(C0303)

🤖 Prompt for AI Agents

In surfsense_backend/app/config/__init__.py around lines 99 to 109, add validation to check if ETL_SERVICE has a valid value (e.g., "UNSTRUCTURED" or "LLAMACLOUD") and raise an error if it is invalid or not set. Also, after loading the API key for the selected service, verify that the key is present and raise an error if missing. Finally, implement a default behavior or error handling for when ETL_SERVICE is not set to ensure robustness.
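
One way to make the validation above easy to unit test is to move it into a function that takes the environment as a mapping. The sketch below follows the same rules as the suggestion but is not part of the PR; `VALID_ETL_SERVICES` and `resolve_etl_config` are hypothetical names introduced for illustration.

```python
from typing import Mapping, Tuple

# Each supported service and the environment variable holding its API key.
VALID_ETL_SERVICES = {
    "UNSTRUCTURED": "UNSTRUCTURED_API_KEY",
    "LLAMACLOUD": "LLAMA_CLOUD_API_KEY",
}


def resolve_etl_config(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (service, api_key) or raise ValueError on a bad configuration."""
    service = env.get("ETL_SERVICE")
    if service is None:
        raise ValueError("ETL_SERVICE environment variable is required")
    if service not in VALID_ETL_SERVICES:
        allowed = ", ".join(sorted(VALID_ETL_SERVICES))
        raise ValueError(
            f"Invalid ETL_SERVICE value: {service!r}. Must be one of: {allowed}"
        )
    key_name = VALID_ETL_SERVICES[service]
    api_key = env.get(key_name)
    if not api_key:
        raise ValueError(
            f"{key_name} is required when ETL_SERVICE is set to {service!r}"
        )
    return service, api_key
```

In application code this could be called as `resolve_etl_config(os.environ)`, while tests can pass a plain dictionary.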

surfsense_backend/app/tasks/background_tasks.py

coderabbitai · 2025-05-31T02:26:12Z

surfsense_backend/app/tasks/background_tasks.py

+        raise db_error
+    except Exception as e:
+        await session.rollback()
+        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")


⚠️ Potential issue

Use explicit exception chaining.

When re-raising exceptions, use from e to preserve the exception chain for better debugging.

-        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
+        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}") from e

Suggested change

-        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
+        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}") from e

🧰 Tools

🪛 Ruff (0.11.9)

434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🪛 Pylint (3.3.7)

[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'

(W0707)

🤖 Prompt for AI Agents

In surfsense_backend/app/tasks/background_tasks.py at line 434, the RuntimeError is raised without explicit exception chaining. Modify the raise statement to use 'from e' to preserve the original exception context, improving debugging by maintaining the exception chain.
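
As a small standalone illustration of what `from e` changes: the original exception is stored on the new exception's `__cause__` attribute, and the traceback reports it as the direct cause. The functions below are hypothetical examples, unrelated to the project's code.

```python
def parse_page_count(raw: str) -> int:
    """Convert a raw value to int, wrapping failures in a RuntimeError."""
    try:
        return int(raw)
    except ValueError as e:
        # `from e` records the original error on __cause__.
        raise RuntimeError(f"Failed to parse page count: {e}") from e


def cause_of_failure(raw: str) -> type:
    """Return the type of the chained cause, or NoneType if parsing worked."""
    try:
        parse_page_count(raw)
    except RuntimeError as err:
        return type(err.__cause__)
    return type(None)
```

Without `from e`, Python still links the two exceptions through `__context__`, but `__cause__` stays `None` and the traceback describes the second error as occurring during handling of the first.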

coderabbitai · 2025-05-31T02:26:12Z

surfsense_backend/app/tasks/background_tasks.py

+        llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
+        search_space_id: ID of the search space


⚠️ Potential issue

Fix docstring parameter description.

The docstring incorrectly describes the parameter as a list when it's actually a single string.

-        llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
+        llamacloud_markdown_document: Markdown content from LlamaCloud parsing

Suggested change

-        llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
-        search_space_id: ID of the search space
+        llamacloud_markdown_document: Markdown content from LlamaCloud parsing
+        search_space_id: ID of the search space

🤖 Prompt for AI Agents

In surfsense_backend/app/tasks/background_tasks.py around lines 372 to 373, the docstring incorrectly describes the parameter 'llamacloud_markdown_documents' as a list, but it is actually a single string. Update the docstring to reflect that 'llamacloud_markdown_documents' is a string, not a list, ensuring the parameter description accurately matches its type.

coderabbitai · 2025-05-31T02:26:12Z

surfsense_backend/app/routes/documents_routes.py

+                except:
+                    pass


⚠️ Potential issue

Replace bare except with specific exception handling.

Using bare except: can mask unexpected errors. Use specific exception types or at least Exception.

-                except:
-                    pass
+                except (OSError, IOError):
+                    pass  # File might already be deleted

Also applies to: 244-245

🧰 Tools

🪛 Ruff (0.11.9)

211-214: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)

213-213: Do not use bare except

(E722)

🪛 Pylint (3.3.7)

[warning] 213-214: No exception type(s) specified

(W0702)

🤖 Prompt for AI Agents

In surfsense_backend/app/routes/documents_routes.py at lines 213-214 and 244-245, replace the bare except clauses with specific exception handling by catching Exception or more specific exceptions if known. This prevents masking unexpected errors and improves error visibility and debugging.
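
A narrower alternative to both the bare `except` and `contextlib.suppress(Exception)` is to suppress only `OSError` around the cleanup call. The sketch below assumes the cleanup step deletes a temporary file; `remove_quietly` is a hypothetical helper name. In Python 3, `IOError` is an alias of `OSError`, so listing both is redundant.

```python
import contextlib
import os
import tempfile


def remove_quietly(path: str) -> None:
    """Delete a file, ignoring only filesystem errors such as a missing file."""
    with contextlib.suppress(OSError):
        os.unlink(path)


if __name__ == "__main__":
    fd, temp_path = tempfile.mkstemp()
    os.close(fd)
    remove_quietly(temp_path)  # deletes the file
    remove_quietly(temp_path)  # already gone, no exception raised
```

Unrelated errors (for example a `TypeError` from passing the wrong argument) still propagate, which is the point of narrowing the exception type.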

coderabbitai · 2025-05-31T02:26:12Z

surfsense_backend/app/routes/documents_routes.py

+                from langchain_unstructured import UnstructuredLoader
+                


🛠️ Refactor suggestion

Move imports to module level.

Importing modules inside functions is an anti-pattern that affects performance and readability. Move all imports to the top of the file.

Add these imports at the module level:

from langchain_unstructured import UnstructuredLoader
from llama_cloud_services import LlamaParse
from llama_cloud_services.parse.utils import ResultType

Then remove the redundant import os statements on lines 210 and 241 since os is already imported at line 18.

Also applies to: 210-210, 224-225, 241-241

🧰 Tools

🪛 Pylint (3.3.7)

[convention] 195-195: Trailing whitespace

(C0303)

[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)

(C0415)

🤖 Prompt for AI Agents

In surfsense_backend/app/routes/documents_routes.py around lines 194-195 and also lines 210, 224-225, and 241, move the imports of UnstructuredLoader, LlamaParse, and ResultType to the top of the file with other module-level imports. Remove the redundant os imports at lines 210 and 241 since os is already imported at line 18. This will improve performance and readability by avoiding repeated imports inside functions.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (10)

README.md (10)

34-34: Revise phrasing for clarity and grammar
The parenthetical clause “(Documents, images, videos and supports 50+ file extensions)” reads awkwardly. Consider rephrasing to something like:
“Documents, images, and videos (supporting 50+ file extensions)”

🧰 Tools

🪛 LanguageTool

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)

72-72: Consider sorting file extensions alphabetically
Alphabetical ordering will improve scanability for end users.

74-74: Consider sorting file extensions alphabetically
Apply the same alphabetical ordering to the Unstructured list for consistency.

77-77: Consider sorting file extensions alphabetically
Consistency across lists helps users locate their format more quickly.

82-82: Consider sorting file extensions alphabetically
An ordered list will aid readability, especially as the list grows.

84-84: Consider sorting file extensions alphabetically
Match the ordering style used in other categories.

87-87: Consider sorting file extensions alphabetically
Helps users quickly locate a specific image format.

89-89: Consider sorting file extensions alphabetically
Maintain the same pattern across LlamaCloud and Unstructured lists.

91-92: Clarify scope and ordering for audio/video formats

Specify that these “Always Supported” formats apply to both ETL services.

Sort the extensions alphabetically for consistency.

95-95: Consider sorting file extensions alphabetically
Ensure consistency with other lists by ordering .eml, .msg, .p7s.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73751c0 and b67b497.

📒 Files selected for processing (1)

README.md (2 hunks)


🔇 Additional comments (8)

README.md (8)

67-67: Approve new section header
The “Supported File Extensions” section clearly groups all format lists under a single heading.

69-69: Verify ETL service support counts
The note states “LlamaCloud supports 100+ formats, while Unstructured supports 34+ core formats.” Ensure these numbers match the actual lists (and link to upstream docs if available).

71-71: Approve ‘Documents & Text’ subheading
This category clearly delineates text-based document formats.

76-76: Approve ‘Presentations’ subheading
Clear categorization for slide deck formats.

79-79: Approve Unstructured presentations list
The two core formats are correctly called out.

81-81: Approve ‘Spreadsheets & Data’ subheading
Appropriate grouping for tabular and data file types.

86-86: Approve ‘Images’ subheading
Image formats are now clearly separated.

94-94: Approve ‘Email & Communication’ subheading
This clearly identifies message-based file types.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

README.md (3)
34-34: Refine phrasing for clarity and grammar
The current sentence

Save content from your own personal files (Documents, images, videos and supports 50+ file extensions) to your own personal knowledge base .
is awkward and contains redundancy (“and supports”). Consider rephrasing to improve readability and consistency, for example:
Save content from your personal files (documents, images, videos, etc.) — supporting **50+ file extensions** — to your knowledge base.
🧰 Tools

🪛 LanguageTool

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)

67-69: Link to configuration and ensure consistency with environment variable naming
This “Note” refers to your ETL service names (“LlamaCloud” and “Unstructured”), but the actual env var (ETL_SERVICE) expects uppercase values (LLAMACLOUD, UNSTRUCTURED). To avoid confusion:

Add a link to the .env.example files (surfsense_backend/.env.example, surfsense_web/.env.example)

Spell out the exact values users should set, e.g.:
> **Note**: File format support depends on the `ETL_SERVICE` setting (e.g., `LLAMACLOUD` or `UNSTRUCTURED`). See `surfsense_backend/.env.example` for details.
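
The point about exact values can also be enforced at startup. The following is a rough sketch, not the project's actual configuration code: the helper name `read_etl_service` is hypothetical, and normalizing case before validating is a design choice assumed here rather than something the PR does.

```python
import os

# The two values named in the review comment above.
SUPPORTED_ETL_SERVICES = {"LLAMACLOUD", "UNSTRUCTURED"}


def read_etl_service(env=None) -> str:
    """Return a validated ETL_SERVICE value (hypothetical helper)."""
    env = os.environ if env is None else env
    # Tolerate "LlamaCloud" or " llamacloud " by normalizing first.
    value = env.get("ETL_SERVICE", "").strip().upper()
    if value not in SUPPORTED_ETL_SERVICES:
        allowed = ", ".join(sorted(SUPPORTED_ETL_SERVICES))
        raise ValueError(
            f"ETL_SERVICE must be one of: {allowed} (got {value!r})"
        )
    return value


# Usage with an explicit mapping instead of the real environment.
service = read_etl_service({"ETL_SERVICE": "LlamaCloud"})
```

Failing early with the list of allowed values gives users the same hint the README note is meant to provide.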
71-95: Enhance readability and maintainability of extension lists
Maintaining a long, flat list of extensions in markdown can be error-prone and hard for users to scan. You might consider:

Converting each category into a Markdown table with columns for LlamaCloud vs. Unstructured

Wrapping each category in a collapsible <details> block

Generating the lists automatically from a shared source or script
This will improve both UX and ease future updates.
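
One way to read the "shared source" suggestion is a small script that renders the README tables from a single mapping. This is only a sketch: the function name `render_category` is hypothetical, and the extension data below is illustrative rather than the project's real support matrix.

```python
# Illustrative data only; the authoritative lists live in README.md.
SUPPORTED = {
    "Email & Communication": {
        "LlamaCloud": [".msg", ".eml"],
        "Unstructured": [".p7s", ".eml", ".msg"],
    },
}


def render_category(category: str, by_service: dict) -> str:
    """Render one category as a Markdown table, sorted alphabetically."""
    lines = [
        f"### {category}",
        "",
        "| Service | Extensions |",
        "| --- | --- |",
    ]
    for service in sorted(by_service):
        extensions = ", ".join(f"`{ext}`" for ext in sorted(by_service[service]))
        lines.append(f"| {service} | {extensions} |")
    return "\n".join(lines)


markdown = "\n\n".join(
    render_category(name, services) for name, services in SUPPORTED.items()
)
print(markdown)
```

Sorting inside the renderer also covers the alphabetical-ordering nitpicks, since the source lists can then be kept in any order.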

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b67b497 and 0dbcf56.

📒 Files selected for processing (1)

README.md (2 hunks)


feat: Removed Hard Dependency on Unstructured.io

- Added DOCLING as third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection

Addresses MODSetter#161 - Add Docling Support as an ETL_SERVICE

Follows same pattern as LlamaCloud integration (PR MODSetter#123)

feat: Removed Hard Dependency on Unstructured.io

73751c0

- Added Llamaparse Support :)

coderabbitai bot reviewed May 31, 2025

MODSetter added 2 commits May 30, 2025 19:27

readme fix

b67b497

bs fix

0dbcf56

coderabbitai bot reviewed May 31, 2025

vercel bot deployed to Preview May 31, 2025 02:30 View deployment

fix for content hashing

0365feb

coderabbitai bot reviewed May 31, 2025

vercel bot deployed to Preview May 31, 2025 02:32 View deployment

MODSetter merged commit 42b57e5 into main May 31, 2025
2 of 3 checks passed

MODSetter mentioned this pull request May 31, 2025

Remove Hard Dependency of unstructured.io API key #113

Closed

This was referenced Jun 3, 2025

fix: docs #140

Merged

chore: updated docs #148

Merged

AbdullahAlMousawi pushed a commit to AbdullahAlMousawi/SurfSense that referenced this pull request Jul 14, 2025

Merge pull request MODSetter#123 from MODSetter/dev

a258f71

feat: Removed Hard Dependency on Unstructured.io

AbdullahAlMousawi mentioned this pull request Jul 20, 2025

feat: Add Docling support as ETL_SERVICE option #211

Merged

4 tasks

CREDO23 pushed a commit to CREDO23/SurfSense that referenced this pull request Jul 25, 2025

Merge pull request MODSetter#123 from MODSetter/dev

628113d

feat: Removed Hard Dependency on Unstructured.io

coderabbitai bot mentioned this pull request Oct 24, 2025

feat: frontend docker to use nextjs production build #431

Merged

16 tasks

-	raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
+	raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}") from e
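
The difference between the two `raise` lines is exception chaining. A small self-contained sketch of what `from e` preserves follows; `parse_with_llamacloud` is a stand-in that always fails, not the real LlamaCloud call, and `process_file` is a hypothetical wrapper:

```python
def parse_with_llamacloud(path: str) -> str:
    """Stand-in for the real parsing call; always fails for the demo."""
    raise ConnectionError("upstream unavailable")


def process_file(path: str) -> str:
    try:
        return parse_with_llamacloud(path)
    except Exception as e:
        # `from e` records the original error as __cause__, so the
        # traceback shows both the wrapper and the root failure.
        raise RuntimeError(
            f"Failed to process file document using LlamaCloud: {e}"
        ) from e


try:
    process_file("example.pdf")
except RuntimeError as err:
    caught = err
```

With the chained form, `caught.__cause__` is the original `ConnectionError`, which makes the root failure easier to find when debugging.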

		llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
		search_space_id: ID of the search space


recurseml bot commented May 31, 2025
