feat: Removed Hard Dependency on Unstructured.io#123

Merged: MODSetter merged 4 commits into main from dev on May 31, 2025

@MODSetter (Owner) commented May 31, 2025

  • Added LlamaParse support :)

Description

Removed the hard dependency on Unstructured.io because of their recently limited sign-ups.

Motivation and Context

Fixes #113

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance improvement (non-breaking change which enhances performance)
  • Documentation update
  • Breaking change (fix or feature that would cause existing functionality to change)

Testing

  • I have tested these changes locally
  • I have added/updated unit tests
  • I have added/updated integration tests

Checklist:

  • My code follows the code style of this project
  • My change requires documentation updates
  • I have updated the documentation accordingly
  • My change requires dependency updates
  • I have updated the dependencies accordingly
  • My code builds clean without any errors or warnings
  • All new and existing tests passed

Summary by CodeRabbit

  • New Features

    • Added support for selecting between two document parsing services (Unstructured and LlamaCloud), enabling broader file format compatibility for uploads.
    • Expanded the range of supported file formats for document uploads, with LlamaCloud supporting 50+ formats and Unstructured supporting 34+ core formats.
    • The upload interface now dynamically adapts to show supported file types based on the selected parsing service.
  • Documentation

    • Updated installation and configuration guides to detail new environment variables and clarify API key requirements for each parsing service.
    • README now provides categorized, detailed lists of supported file formats for both services.
  • Chores

    • Example environment files updated to reflect new configuration options and variables.

vercel bot commented May 31, 2025

The latest updates on your projects:

| Name | Status | Updated (UTC) |
|---|---|---|
| surf-sense-frontend | ✅ Ready | May 31, 2025 2:32am |

coderabbitai bot commented May 31, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The changes introduce support for selecting between two document parsing services, "UNSTRUCTURED" and "LLAMACLOUD", via new environment variables in both backend and frontend configurations. The backend logic now conditionally processes uploaded files using the chosen service, and the frontend dynamically adjusts accepted file types. Documentation and example environment files are updated to reflect these options.
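
A minimal sketch of the branching described in the walkthrough, assuming the service name comes from the ETL_SERVICE environment variable. The two parser functions and the `PARSERS` table are hypothetical stand-ins for illustration, not the project's actual task functions.

```python
import os
from typing import Callable, Dict, Optional


def parse_with_unstructured(path: str) -> str:
    # Hypothetical stand-in for the Unstructured code path.
    return f"unstructured:{path}"


def parse_with_llamacloud(path: str) -> str:
    # Hypothetical stand-in for the LlamaCloud code path.
    return f"llamacloud:{path}"


# Maps each supported ETL_SERVICE value to its parser.
PARSERS: Dict[str, Callable[[str], str]] = {
    "UNSTRUCTURED": parse_with_unstructured,
    "LLAMACLOUD": parse_with_llamacloud,
}


def parse_file(path: str, etl_service: Optional[str] = None) -> str:
    """Route a file to the parser selected by ETL_SERVICE."""
    service = etl_service if etl_service is not None else os.getenv("ETL_SERVICE")
    try:
        parser = PARSERS[service]
    except KeyError:
        # Unknown or unset service: fail loudly instead of skipping the file.
        raise ValueError(f"Unsupported ETL_SERVICE: {service!r}") from None
    return parser(path)
```

A lookup table keeps the set of valid service names in one place, so an unknown value fails with a clear error rather than falling through silently.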

Changes

| File(s) | Change Summary |
|---|---|
| README.md | Expanded and restructured supported file format documentation, distinguishing between LlamaCloud and Unstructured services, with categorized extension lists. |
| surfsense_backend/.env.example, surfsense_web/.env.example | Added ETL_SERVICE and corresponding API key variables for backend and frontend; reintroduced and reorganized UNSTRUCTURED_API_KEY; added LLAMA_CLOUD_API_KEY and NEXT_PUBLIC_ETL_SERVICE. |
| surfsense_backend/app/config/__init__.py | Modified config to conditionally load the appropriate API key based on ETL_SERVICE value. |
| surfsense_backend/app/routes/documents_routes.py | Updated imports and background processing logic to branch between Unstructured and LlamaCloud file processing based on ETL_SERVICE. |
| surfsense_backend/app/tasks/background_tasks.py | Renamed and split background task functions for Unstructured and LlamaCloud; added new function to process LlamaCloud markdown documents. |
| surfsense_backend/pyproject.toml | Added llama-cloud-services >=0.6.25 as a dependency. |
| surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx | Refactored accepted file types logic to dynamically select supported formats based on NEXT_PUBLIC_ETL_SERVICE, expanding support for LlamaCloud. |
| surfsense_web/content/docs/docker-installation.mdx, surfsense_web/content/docs/manual-installation.mdx | Updated documentation to explain new ETL service selection, related API keys, and frontend variable; clarified configuration for both Unstructured and LlamaCloud services. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant LlamaCloud
    participant Unstructured

    User->>Frontend: Uploads file
    Frontend->>Backend: Sends file (with ETL_SERVICE info)
    alt ETL_SERVICE = LLAMACLOUD
        Backend->>LlamaCloud: Parse file
        LlamaCloud-->>Backend: Returns markdown documents
        Backend->>Backend: Process markdown docs (add_received_file_document_using_llamacloud)
    else ETL_SERVICE = UNSTRUCTURED
        Backend->>Unstructured: Parse file
        Unstructured-->>Backend: Returns processed elements
        Backend->>Backend: Process elements (add_received_file_document_using_unstructured)
    end
    Backend-->>Frontend: Acknowledge upload

Possibly related PRs

Suggested reviewers

  • MODSetter

Poem

A rabbit leaps through code anew,
With LlamaCloud and Unstructured too!
Now docs of many stripes and spots,
Can hop right in—no second thoughts.
Environment keys in tidy rows,
The upload garden wider grows.
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0dbcf56 and 0365feb.

📒 Files selected for processing (1)
  • surfsense_backend/app/tasks/background_tasks.py (3 hunks)

recurseml bot commented May 31, 2025

✨ No issues found! Your code is sparkling clean! ✨

coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
surfsense_backend/app/config/__init__.py (1)

101-101: Fix trailing whitespace issues.

Static analysis detected trailing whitespace on these lines. Please remove the trailing whitespace to maintain code quality standards.

 ETL_SERVICE = os.getenv("ETL_SERVICE")
-    
+
 if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-        
+
 elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-        
-        
+

Also applies to: 105-105, 109-109
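
The cleanup Pylint asks for (C0303) can be done mechanically. The helper below is a small sketch, assuming LF line endings; the function name is hypothetical, and in practice a formatter or an editor setting would normally handle this.

```python
def strip_trailing_whitespace(source: str) -> str:
    """Remove trailing spaces and tabs from every line, keeping line breaks."""
    lines = source.split("\n")
    # rstrip(" \t") leaves indentation and the line content untouched.
    return "\n".join(line.rstrip(" \t") for line in lines)
```

Whitespace-only lines become empty lines, which is exactly the change the diff above is meant to show.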

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 101-101: Trailing whitespace

(C0303)

surfsense_backend/app/tasks/background_tasks.py (2)

292-292: Add a docstring to maintain consistency.

The function lacks a docstring while the new add_received_file_document_using_llamacloud function has one. For consistency and better documentation, please add a docstring.

 async def add_received_file_document_using_unstructured(
+    """
+    Process and store document content parsed by Unstructured.
+
+    Args:
+        session: Database session
+        file_name: Name of the processed file
+        unstructured_processed_elements: List of processed elements from Unstructured
+        search_space_id: ID of the search space
+
+    Returns:
+        Document object if successful, None if failed
+    """
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 292-292: Missing function or method docstring

(C0116)


[refactor] 292-292: Too many local variables (16/15)

(R0914)


379-380: Remove misleading comment.

The comment suggests combining multiple documents, but the code just assigns a single string.

-        # Combine all markdown documents into one
         file_in_markdown = llamacloud_markdown_document
surfsense_backend/app/routes/documents_routes.py (1)

10-10: Split long import statement for better readability.

The import line is 253 characters long, well over the 100-character limit. Consider splitting it across multiple lines.

-from app.tasks.background_tasks import add_received_markdown_file_document, add_extension_received_document, add_received_file_document_using_unstructured, add_crawled_url_document, add_youtube_video_document, add_received_file_document_using_llamacloud
+from app.tasks.background_tasks import (
+    add_received_markdown_file_document,
+    add_extension_received_document,
+    add_received_file_document_using_unstructured,
+    add_received_file_document_using_llamacloud,
+    add_crawled_url_document,
+    add_youtube_video_document
+)
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 10-10: Line too long (253/100)

(C0301)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5737ea8 and 73751c0.

⛔ Files ignored due to path filters (1)
  • surfsense_backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • README.md (2 hunks)
  • surfsense_backend/.env.example (1 hunks)
  • surfsense_backend/app/config/__init__.py (1 hunks)
  • surfsense_backend/app/routes/documents_routes.py (4 hunks)
  • surfsense_backend/app/tasks/background_tasks.py (2 hunks)
  • surfsense_backend/pyproject.toml (1 hunks)
  • surfsense_web/.env.example (1 hunks)
  • surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1 hunks)
  • surfsense_web/content/docs/docker-installation.mdx (2 hunks)
  • surfsense_web/content/docs/manual-installation.mdx (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
surfsense_backend/app/routes/documents_routes.py (1)
surfsense_backend/app/tasks/background_tasks.py (2)
  • add_received_file_document_using_unstructured (292-357)
  • add_received_file_document_using_llamacloud (360-434)
🪛 Pylint (3.3.7)
surfsense_backend/app/config/__init__.py

[convention] 101-101: Trailing whitespace

(C0303)


[convention] 105-105: Trailing whitespace

(C0303)


[convention] 109-109: Trailing whitespace

(C0303)


[convention] 110-110: Trailing whitespace

(C0303)

surfsense_backend/app/tasks/background_tasks.py

[convention] 292-292: Missing function or method docstring

(C0116)


[refactor] 292-292: Too many local variables (16/15)

(R0914)


[convention] 389-389: Trailing whitespace

(C0303)


[convention] 391-391: Line too long (107/100)

(C0301)


[refactor] 360-360: Too many local variables (16/15)

(R0914)


[warning] 391-391: Use lazy % formatting in logging functions

(W1203)


[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'

(W0707)

surfsense_backend/app/routes/documents_routes.py

[convention] 10-10: Line too long (253/100)

(C0301)


[convention] 104-104: Trailing whitespace

(C0303)


[convention] 195-195: Trailing whitespace

(C0303)


[convention] 227-227: Trailing whitespace

(C0303)


[convention] 236-236: Trailing whitespace

(C0303)


[convention] 239-239: Trailing whitespace

(C0303)


[convention] 246-246: Trailing whitespace

(C0303)


[convention] 249-249: Trailing whitespace

(C0303)


[convention] 253-253: Trailing whitespace

(C0303)


[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)

(C0415)


[warning] 210-210: Reimport 'os' (imported line 18)

(W0404)


[convention] 210-210: Import outside toplevel (os)

(C0415)


[warning] 213-214: No exception type(s) specified

(W0702)


[convention] 224-224: Import outside toplevel (llama_cloud_services.LlamaParse)

(C0415)


[convention] 225-225: Import outside toplevel (llama_cloud_services.parse.utils.ResultType)

(C0415)


[warning] 241-241: Reimport 'os' (imported line 18)

(W0404)


[convention] 241-241: Import outside toplevel (os)

(C0415)


[warning] 244-245: No exception type(s) specified

(W0702)


[convention] 483-483: Trailing newlines

(C0305)

🪛 LanguageTool
README.md

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)

🪛 Ruff (0.11.9)
surfsense_backend/app/tasks/background_tasks.py

434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

surfsense_backend/app/routes/documents_routes.py

211-214: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)


213-213: Do not use bare except

(E722)


242-245: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)


244-244: Do not use bare except

(E722)

🔇 Additional comments (11)
surfsense_backend/pyproject.toml (1)

20-20:

✅ Verification successful

Verify the new dependency version and security status.

The addition of llama-cloud-services>=0.6.25 looks correct for supporting the new LlamaCloud ETL service. Please ensure this version exists and check for any security advisories.


🏁 Script executed:

#!/bin/bash
# Description: Verify llama-cloud-services package version and security status

# Check if the specified version exists on PyPI
curl -s https://pypi.org/pypi/llama-cloud-services/json | jq -r '.releases | keys[]' | grep -E '^0\.6\.(2[5-9]|[3-9][0-9])' | head -5

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "llama-cloud-services") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 496


Dependency Approved: llama-cloud-services>=0.6.25

Version 0.6.25 is available on PyPI and no PIP security advisories were found.

• surfsense_backend/pyproject.toml (line 20): llama-cloud-services>=0.6.25
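
For readers who want to check an installed version against the `>=0.6.25` floor locally, here is a rough sketch. It assumes plain dotted numeric releases (no pre-release or local suffixes); both function names are hypothetical, and a real project would more likely rely on a dedicated version-parsing library.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '0.6.25' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(installed: str, minimum: str = "0.6.25") -> bool:
    """Return True when `installed` is at least `minimum`."""
    # Tuples compare element by element, so 0.6.9 sorts below 0.6.25,
    # which a plain string comparison would get wrong.
    return parse_version(installed) >= parse_version(minimum)
```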

surfsense_web/.env.example (1)

2-3: LGTM! Frontend ETL service configuration added correctly.

The addition of NEXT_PUBLIC_ETL_SERVICE with clear value options aligns well with the backend configuration. The NEXT_PUBLIC_ prefix correctly exposes this to the client-side code in Next.js.

surfsense_backend/.env.example (1)

35-38: Well-organized configuration section for ETL service selection.

The new "File Parser Service" section provides clear organization and documentation for the ETL service options. The variable naming is consistent and the available options are clearly documented.

surfsense_web/content/docs/manual-installation.mdx (2)

64-66: Excellent documentation of the new ETL service configuration!

The documentation clearly explains the new ETL_SERVICE environment variable and the conditional API key requirements. This will help users understand they only need to obtain the API key for their chosen ETL service.


187-187: Good addition of frontend ETL service configuration.

The NEXT_PUBLIC_ETL_SERVICE variable properly ensures frontend-backend consistency for file format handling in the upload interface.

README.md (2)

34-34: Updated file extension count reflects expanded support.

The increase from "34+ file extensions" to "50+ file extensions" accurately reflects the expanded capabilities when using LlamaCloud as the ETL service.

🧰 Tools
🪛 LanguageTool

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)


69-96: Excellent categorization of supported file formats!

The detailed breakdown of supported file formats by ETL service and category (Documents & Text, Presentations, etc.) provides clear guidance for users. This makes it easy to understand the capabilities and differences between UNSTRUCTURED and LLAMACLOUD services.

surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx (1)

45-133: Excellent implementation of dynamic file type support!

The refactoring properly:

  • Separates audio files (always supported via Whisper) from document parsing files
  • Uses environment variable to conditionally select between LLAMACLOUD and UNSTRUCTURED file types
  • Includes comprehensive MIME type mappings for both services
  • Maintains clean separation of concerns with well-organized type definitions

The implementation correctly handles the frontend-backend consistency for file upload validation based on the configured ETL service.
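
The frontend logic itself is TypeScript, but the selection rule can be sketched in Python, for example as a backend-side mirror of the same check. The extension sets below are illustrative subsets chosen for this example, not the project's real lists, and the fallback to UNSTRUCTURED for any other value is an assumption.

```python
from typing import FrozenSet, Optional

# Illustrative subsets only; the project's real lists are much longer.
AUDIO_EXTENSIONS = frozenset({".mp3", ".wav"})
LLAMACLOUD_EXTENSIONS = frozenset({".pdf", ".docx", ".pptx", ".xlsx"})
UNSTRUCTURED_EXTENSIONS = frozenset({".pdf", ".docx", ".eml"})


def accepted_extensions(etl_service: Optional[str]) -> FrozenSet[str]:
    """Audio is always accepted; document types depend on the ETL service."""
    if etl_service == "LLAMACLOUD":
        return AUDIO_EXTENSIONS | LLAMACLOUD_EXTENSIONS
    # Assumed fallback: any other value is treated as UNSTRUCTURED.
    return AUDIO_EXTENSIONS | UNSTRUCTURED_EXTENSIONS
```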

surfsense_web/content/docs/docker-installation.mdx (2)

93-95: Consistent documentation across installation methods.

The ETL service configuration documentation is perfectly consistent with the manual installation guide, ensuring users have clear guidance regardless of their deployment method.


141-141: Frontend environment variable properly documented.

The NEXT_PUBLIC_ETL_SERVICE variable documentation correctly emphasizes the need for frontend-backend consistency and its impact on file format support.

surfsense_backend/app/routes/documents_routes.py (1)

250-260:

❌ Incorrect review comment

Review document processing logic for LlamaCloud.

The loop processes each markdown document individually, but add_received_file_document_using_llamacloud expects a single document string, not a list. This could lead to:

  1. Multiple database entries for a single uploaded file
  2. Only the last document being retained if documents override each other

Consider either:

  • Combining all markdown documents before processing, or
  • Updating the function to handle multiple documents appropriately

🏁 Script executed:

#!/bin/bash
# Check how LlamaCloud's get_markdown_documents behaves
rg -A 10 "get_markdown_documents|aget_markdown_documents" --type py

Length of output: 1197


🏁 Script executed:

#!/bin/bash
set -e
# Find definition of add_received_file_document_using_llamacloud
rg -R "def add_received_file_document_using_llamacloud" -A 5 --type py

Length of output: 486


🏁 Script executed:

#!/bin/bash
set -e
# Find all occurrences of add_received_file_document_using_llamacloud to locate its definition
rg "add_received_file_document_using_llamacloud" -n --type py

Length of output: 607


🏁 Script executed:

#!/bin/bash
set -e
# Extract the full implementation of add_received_file_document_using_llamacloud
rg -n "async def add_received_file_document_using_llamacloud" -A 50 --type py surfsense_backend/app/tasks/background_tasks.py

Length of output: 2205


🏁 Script executed:

#!/bin/bash
set -e
# Show context around the usage of markdown_documents in documents_routes.py
rg -n "markdown_documents" -C 10 --type py surfsense_backend/app/routes/documents_routes.py

Length of output: 1184


Ignore loop retention concerns for LlamaCloud processing

The call to

markdown_documents = await result.aget_markdown_documents(split_by_page=False)

with split_by_page=False yields a single-element list, so the for-loop executes exactly once. Also, add_received_file_document_using_llamacloud’s signature and implementation expect a single markdown string (not a list), so the current per-item invocation is correct and will not create duplicate or overridden entries. No changes required.

Likely an incorrect or invalid review comment.

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 253-253: Trailing whitespace

(C0303)

Comment on lines +99 to +109
# ETL Service
ETL_SERVICE = os.getenv("ETL_SERVICE")

if ETL_SERVICE == "UNSTRUCTURED":
    # Unstructured API Key
    UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")

elif ETL_SERVICE == "LLAMACLOUD":
    # LlamaCloud API Key
    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")

🛠️ Refactor suggestion

Add validation and error handling for ETL service configuration.

The conditional loading logic is correct, but consider adding validation to ensure robustness:

  1. Validate ETL_SERVICE value - Currently no validation if an invalid value is provided
  2. Error handling for missing API keys - No check if the required API key is missing
  3. Default behavior - No fallback if ETL_SERVICE is not set

 # ETL Service
 ETL_SERVICE = os.getenv("ETL_SERVICE")
 
 if ETL_SERVICE == "UNSTRUCTURED":
     # Unstructured API Key
     UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
-    
+    if not UNSTRUCTURED_API_KEY:
+        raise ValueError("UNSTRUCTURED_API_KEY is required when ETL_SERVICE is set to 'UNSTRUCTURED'")
+
 elif ETL_SERVICE == "LLAMACLOUD":
     # LlamaCloud API Key
     LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
-    
-    
+    if not LLAMA_CLOUD_API_KEY:
+        raise ValueError("LLAMA_CLOUD_API_KEY is required when ETL_SERVICE is set to 'LLAMACLOUD'")
+
+elif ETL_SERVICE is not None:
+    raise ValueError(f"Invalid ETL_SERVICE value: '{ETL_SERVICE}'. Must be 'UNSTRUCTURED' or 'LLAMACLOUD'")
+else:
+    raise ValueError("ETL_SERVICE environment variable is required")
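
The same three checks can also be written as a function that takes the environment as a mapping, which makes them easy to unit-test without touching the real process environment. This is a sketch under that assumption; `load_etl_config` and `API_KEY_VARS` are hypothetical names, not part of the project.

```python
import os
from typing import Mapping, Optional, Tuple

# Maps each supported service to the environment variable holding its key.
API_KEY_VARS = {
    "UNSTRUCTURED": "UNSTRUCTURED_API_KEY",
    "LLAMACLOUD": "LLAMA_CLOUD_API_KEY",
}


def load_etl_config(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (service, api_key) or raise ValueError on bad configuration."""
    env = os.environ if env is None else env
    service = env.get("ETL_SERVICE")
    if service is None:
        raise ValueError("ETL_SERVICE environment variable is required")
    if service not in API_KEY_VARS:
        raise ValueError(f"Invalid ETL_SERVICE value: {service!r}")
    key_var = API_KEY_VARS[service]
    api_key = env.get(key_var)
    if not api_key:
        # Covers both a missing variable and an empty string.
        raise ValueError(f"{key_var} is required when ETL_SERVICE is {service!r}")
    return service, api_key
```

Calling `load_etl_config({"ETL_SERVICE": "LLAMACLOUD", "LLAMA_CLOUD_API_KEY": "..."})` returns the pair, while an unset service, an unknown service, or a missing key each raise `ValueError`.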
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 101-101: Trailing whitespace

(C0303)


[convention] 105-105: Trailing whitespace

(C0303)


[convention] 109-109: Trailing whitespace

(C0303)

🤖 Prompt for AI Agents
In surfsense_backend/app/config/__init__.py around lines 99 to 109, add
validation to check if ETL_SERVICE has a valid value (e.g., "UNSTRUCTURED" or
"LLAMACLOUD") and raise an error if it is invalid or not set. Also, after
loading the API key for the selected service, verify that the key is present and
raise an error if missing. Finally, implement a default behavior or error
handling for when ETL_SERVICE is not set to ensure robustness.

        raise db_error
    except Exception as e:
        await session.rollback()
        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")

⚠️ Potential issue

Use explicit exception chaining.

When re-raising exceptions, use from e to preserve the exception chain for better debugging.

-        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}")
+        raise RuntimeError(f"Failed to process file document using LlamaCloud: {str(e)}") from e
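
To see what `from e` changes, here is a small self-contained sketch. `process_document` is a hypothetical function invented for the example; only the `raise ... from e` pattern mirrors the suggestion.

```python
def process_document(content: str) -> str:
    """Hypothetical processing step that wraps low-level failures."""
    try:
        if not content:
            raise ValueError("empty content")
        return content.upper()
    except Exception as e:
        # `from e` records the original error on __cause__, so the traceback
        # reports it as the direct cause of the RuntimeError.
        raise RuntimeError(f"Failed to process file document: {e}") from e
```

A caller can then inspect `err.__cause__` to reach the original `ValueError` when handling the `RuntimeError`.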
🧰 Tools
🪛 Ruff (0.11.9)

434-434: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🪛 Pylint (3.3.7)

[warning] 434-434: Consider explicitly re-raising using 'raise RuntimeError(f'Failed to process file document using LlamaCloud: {str(e)}') from e'

(W0707)

🤖 Prompt for AI Agents
In surfsense_backend/app/tasks/background_tasks.py at line 434, the RuntimeError
is raised without explicit exception chaining. Modify the raise statement to use
'from e' to preserve the original exception context, improving debugging by
maintaining the exception chain.

Comment on lines +372 to +373
llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
search_space_id: ID of the search space

⚠️ Potential issue

Fix docstring parameter description.

The docstring incorrectly describes the parameter as a list when it's actually a single string.

-        llamacloud_markdown_documents: List of markdown content from LlamaCloud parsing
+        llamacloud_markdown_document: Markdown content from LlamaCloud parsing
🤖 Prompt for AI Agents
In surfsense_backend/app/tasks/background_tasks.py around lines 372 to 373, the
docstring names the parameter 'llamacloud_markdown_documents' and describes it as
a list, but the function takes a single string, 'llamacloud_markdown_document'.
Update the docstring entry so that its name and description match the actual
parameter.

Comment on lines +213 to +214
            except:
                pass

⚠️ Potential issue

Replace bare except with specific exception handling.

Using bare except: can mask unexpected errors. Use specific exception types or at least Exception.

-            except:
-                pass
+            except OSError:
+                pass  # File might already be deleted

Also applies to: 244-245
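
Ruff's SIM105 hint points at `contextlib.suppress` as an alternative to try/except/pass. The sketch below assumes the suppressed call is a temporary-file deletion, as the "File might already be deleted" comment implies; `remove_temp_file` is a hypothetical helper name.

```python
import contextlib
import os
import tempfile


def remove_temp_file(path: str) -> None:
    """Delete a temporary file, ignoring only filesystem errors."""
    # OSError covers a missing file or a permission problem; unrelated
    # exceptions (for example a TypeError from a bad argument) still surface.
    with contextlib.suppress(OSError):
        os.unlink(path)
```

Suppressing `OSError` rather than `Exception` keeps the cleanup tolerant of an already-deleted file without hiding programming errors.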

🧰 Tools
🪛 Ruff (0.11.9)

211-214: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)


213-213: Do not use bare except

(E722)

🪛 Pylint (3.3.7)

[warning] 213-214: No exception type(s) specified

(W0702)

🤖 Prompt for AI Agents
In surfsense_backend/app/routes/documents_routes.py at lines 213-214 and
244-245, replace the bare except clauses with specific exception handling by
catching Exception or more specific exceptions if known. This prevents masking
unexpected errors and improves error visibility and debugging.

Comment on lines +194 to +195
from langchain_unstructured import UnstructuredLoader

🛠️ Refactor suggestion

Move imports to module level.

Importing modules inside functions hides the module's dependencies and re-runs the import lookup on every call, which hurts readability and adds a small overhead. Move all imports to the top of the file.

Add these imports at the module level:

from langchain_unstructured import UnstructuredLoader
from llama_cloud_services import LlamaParse
from llama_cloud_services.parse.utils import ResultType

Then remove the redundant import os statements on lines 210 and 241 since os is already imported at line 18.

Also applies to: 210-210, 224-225, 241-241

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 195-195: Trailing whitespace

(C0303)


[convention] 194-194: Import outside toplevel (langchain_unstructured.UnstructuredLoader)

(C0415)

🤖 Prompt for AI Agents
In surfsense_backend/app/routes/documents_routes.py around lines 194-195 and
also lines 210, 224-225, and 241, move the imports of UnstructuredLoader,
LlamaParse, and ResultType to the top of the file with other module-level
imports. Remove the redundant os imports at lines 210 and 241 since os is
already imported at line 18. This will improve performance and readability by
avoiding repeated imports inside functions.
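
One caveat when hoisting these imports: an unconditional module-level import makes both parser packages required at startup. If they are meant to stay optional (an assumption, since this thread does not say), availability can be checked once at module level without importing anything. The sketch below uses only the standard library; the package names come from the comment above, while `is_installed`, the `HAS_*` flags, and `missing_parser_packages` are hypothetical names.

```python
import importlib.util
from typing import List


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Evaluated once, when this module is first imported.
# Flag names are hypothetical; package names are taken from the review comment.
HAS_UNSTRUCTURED = is_installed("langchain_unstructured")
HAS_LLAMA_CLOUD = is_installed("llama_cloud_services")


def missing_parser_packages() -> List[str]:
    """List the optional parser packages that are not installed."""
    flags = {
        "langchain_unstructured": HAS_UNSTRUCTURED,
        "llama_cloud_services": HAS_LLAMA_CLOUD,
    }
    return [name for name, present in flags.items() if not present]
```

The actual import can then sit at module level behind the matching flag, which keeps the dependency list visible at the top of the file.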

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (10)
README.md (10)

34-34: Revise phrasing for clarity and grammar
The parenthetical clause “(Documents, images, videos and supports 50+ file extensions)” reads awkwardly. Consider rephrasing to something like:
“Documents, images, and videos (supporting 50+ file extensions)”

🧰 Tools
🪛 LanguageTool

[misspelling] ~34-~34: Did you mean the noun “personnel”?
Context: ...ng Support** Save content from your own personal files *(Documents, images, videos and s...

(PERSONAL_PERSONNEL)


72-72: Consider sorting file extensions alphabetically
Alphabetical ordering will improve scanability for end users.


74-74: Consider sorting file extensions alphabetically
Apply the same alphabetical ordering to the Unstructured list for consistency.


77-77: Consider sorting file extensions alphabetically
Consistency across lists helps users locate their format more quickly.


82-82: Consider sorting file extensions alphabetically
An ordered list will aid readability, especially as the list grows.


84-84: Consider sorting file extensions alphabetically
Match the ordering style used in other categories.


87-87: Consider sorting file extensions alphabetically
Helps users quickly locate a specific image format.


89-89: Consider sorting file extensions alphabetically
Maintain the same pattern across LlamaCloud and Unstructured lists.


91-92: Clarify scope and ordering for audio/video formats

  • Specify that these “Always Supported” formats apply to both ETL services.
  • Sort the extensions alphabetically for consistency.

95-95: Consider sorting file extensions alphabetically
Ensure consistency with other lists by ordering .eml, .msg, .p7s.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73751c0 and b67b497.

📒 Files selected for processing (1)
  • README.md (2 hunks)
🔇 Additional comments (8)
README.md (8)

67-67: Approve new section header
The “Supported File Extensions” section clearly groups all format lists under a single heading.


69-69: Verify ETL service support counts
The note states “LlamaCloud supports 50+ formats, while Unstructured supports 34+ core formats.” Ensure these numbers match the actual lists (and link to upstream docs if available).


71-71: Approve ‘Documents & Text’ subheading
This category clearly delineates text-based document formats.


76-76: Approve ‘Presentations’ subheading
Clear categorization for slide deck formats.


79-79: Approve Unstructured presentations list
The two core formats are correctly called out.


81-81: Approve ‘Spreadsheets & Data’ subheading
Appropriate grouping for tabular and data file types.


86-86: Approve ‘Images’ subheading
Image formats are now clearly separated.


94-94: Approve ‘Email & Communication’ subheading
This clearly identifies message-based file types.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
README.md (3)

34-34: Refine phrasing for clarity and grammar
The current sentence

Save content from your own personal files (Documents, images, videos and supports 50+ file extensions) to your own personal knowledge base .
is awkward and contains redundancy (“and supports”). Consider rephrasing to improve readability and consistency, for example:

Save content from your personal files (documents, images, videos, etc.) — supporting **50+ file extensions** — to your knowledge base.

67-69: Link to configuration and ensure consistency with environment variable naming
This “Note” refers to your ETL service names (“LlamaCloud” and “Unstructured”), but the actual env var (ETL_SERVICE) expects uppercase values (LLAMACLOUD, UNSTRUCTURED). To avoid confusion:

  • Add a link to the .env.example files (surfsense_backend/.env.example, surfsense_web/.env.example)
  • Spell out the exact values users should set, e.g.:
> **Note**: File format support depends on the `ETL_SERVICE` setting (e.g., `LLAMACLOUD` or `UNSTRUCTURED`). See `surfsense_backend/.env.example` for details.
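
On the backend side, the same naming mismatch can be caught early by validating the variable at startup. This is a minimal sketch under stated assumptions: `read_etl_service` and `VALID_ETL_SERVICES` are hypothetical names, the two accepted values are the ones named in this comment, and normalizing case and whitespace is a leniency added here, not something the project is known to do.

```python
import os
from typing import Mapping

# The two values named in the review comment.
VALID_ETL_SERVICES = ("UNSTRUCTURED", "LLAMACLOUD")


def read_etl_service(env: Mapping[str, str] = os.environ) -> str:
    """Read ETL_SERVICE, normalize it, and fail loudly on anything unexpected."""
    raw = env.get("ETL_SERVICE", "")
    value = raw.strip().upper()  # assumption: tolerate lowercase and stray spaces
    if value not in VALID_ETL_SERVICES:
        raise ValueError(
            f"ETL_SERVICE must be one of {VALID_ETL_SERVICES}, got {raw!r}"
        )
    return value


# Usage with an explicit mapping instead of the real environment.
selected = read_etl_service({"ETL_SERVICE": "llamacloud "})
```

Passing the mapping as a parameter keeps the function testable without touching the process environment.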

71-95: Enhance readability and maintainability of extension lists
Maintaining a long, flat list of extensions in markdown can be error-prone and hard for users to scan. You might consider:

  • Converting each category into a Markdown table with columns for LlamaCloud vs. Unstructured
  • Wrapping each category in a collapsible <details> block
  • Generating the lists automatically from a shared source or script
    This will improve both UX and ease future updates.
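
The third option, generating the lists from a shared source, could look like the sketch below. It is illustrative only: `render_extension_lists` is a hypothetical function, the output format is one possible Markdown layout, and the sample data is limited to the three email extensions mentioned earlier in this review. Sorting inside the generator also covers the alphabetical-ordering nitpicks.

```python
from typing import Dict, List


def render_extension_lists(categories: Dict[str, List[str]]) -> str:
    """Render one Markdown line per category with de-duplicated, sorted extensions."""
    lines = []
    for category, extensions in categories.items():
        ordered = sorted({ext.lower() for ext in extensions})
        formatted = ", ".join(f"`{ext}`" for ext in ordered)
        lines.append(f"**{category}**: {formatted}")
    return "\n".join(lines)


# Sample data: only the extensions named in this review thread.
example = {"Email & Communication": [".p7s", ".msg", ".eml"]}
rendered = render_extension_lists(example)
print(rendered)
```

The same dictionary could feed the upload UI and the README, so the two cannot drift apart.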
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b67b497 and 0dbcf56.

📒 Files selected for processing (1)
  • README.md (2 hunks)

@MODSetter MODSetter merged commit 42b57e5 into main May 31, 2025
2 of 3 checks passed
This was referenced Jun 3, 2025
AbdullahAlMousawi pushed a commit to AbdullahAlMousawi/SurfSense that referenced this pull request Jul 14, 2025
feat: Removed Hard Dependency on Unstructured.io
AbdullahAlMousawi added a commit to AbdullahAlMousawi/SurfSense that referenced this pull request Jul 20, 2025
- Added DOCLING as third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection

Addresses MODSetter#161 - Add Docling Support as an ETL_SERVICE
Follows same pattern as LlamaCloud integration (PR MODSetter#123)
CREDO23 pushed a commit to CREDO23/SurfSense that referenced this pull request Jul 25, 2025
feat: Removed Hard Dependency on Unstructured.io
CREDO23 pushed a commit to CREDO23/SurfSense that referenced this pull request Jul 25, 2025
- Added DOCLING as third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection

Addresses MODSetter#161 - Add Docling Support as an ETL_SERVICE
Follows same pattern as LlamaCloud integration (PR MODSetter#123)
aptdnfapt pushed a commit to aptdnfapt/SurfSense that referenced this pull request Oct 19, 2025
- Added DOCLING as third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection

Addresses MODSetter#161 - Add Docling Support as an ETL_SERVICE
Follows same pattern as LlamaCloud integration (PR MODSetter#123)