Fix LibreOffice support for Office document processing #200

Open · 3 commits to merge into base: main

rex993 (Contributor) commented Jun 15, 2025

  • Install libreoffice-nogui in Docker container
  • Add soffice availability check on initialization
  • Implement unified Office-to-PDF conversion method
  • Add support for DOCX, PPTX, and XLSX files
  • Include additional unstructured extras for Office formats
  • Add unstructured support for Office documents when ColPali is disabled
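
The availability check and the extension handling listed above could look roughly like the following. This is a minimal sketch, not the code from this PR: `find_soffice`, `is_office_document`, and `OFFICE_EXTENSIONS` are hypothetical names, and it assumes the binary is discoverable on `PATH` as `soffice` or `libreoffice`.

```python
import shutil
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical constant: the three formats named in this PR.
OFFICE_EXTENSIONS = frozenset({".docx", ".pptx", ".xlsx"})


def find_soffice(candidates: Sequence[str] = ("soffice", "libreoffice")) -> Optional[str]:
    """Return the full path of the first candidate binary found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def is_office_document(filename: str) -> bool:
    """True when the file extension is one of the supported Office formats."""
    return Path(filename).suffix.lower() in OFFICE_EXTENSIONS


if __name__ == "__main__":
    # Example of an initialization-time check: report once whether conversion is possible.
    soffice_path = find_soffice()
    if soffice_path is None:
        print("LibreOffice not found; Office conversion disabled")
    else:
        print(f"Using {soffice_path}")
```

Returning `None` instead of raising lets the caller decide whether a missing LibreOffice install is fatal or only disables Office conversion.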

rex993 added 2 commits June 15, 2025 12:49
- Install libreoffice-nogui in Docker container
- Add soffice availability check on initialization
- Implement unified Office-to-PDF conversion method
- Add support for DOCX, PPTX, and XLSX files
- Include additional unstructured extras for Office formats

jazzberry-ai bot commented Jun 15, 2025

Bug Report

| Name | Severity | Example test case | Description |
| --- | --- | --- | --- |
| Overly aggressive text sanitization | Medium | Upload a document with specific Unicode characters. | The _sanitize_text function in core/parser/morphik_parser.py might remove valid Unicode characters, leading to data loss. |
| Unhandled soffice failure | Medium | Attempt to convert an Office document to PDF when the output directory is not writable. | The _convert_office_to_pdf function in core/services/document_service.py doesn't handle the case where the soffice command fails to create the PDF file in the expected location, potentially leading to uninformative error messages. |
| Missing empty input file check | Low | Upload an empty Office document. | The _convert_office_to_pdf function in core/services/document_service.py doesn't check whether the input file is empty before calling soffice, potentially leading to unnecessary processing and empty PDF files. |
| Missing runtime soffice availability check | Low | Start the service and then uninstall LibreOffice. | The _convert_office_to_pdf function in core/services/document_service.py only checks for soffice availability on initialization and does not account for it becoming unavailable later. |
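
One way the three conversion findings in this report could be addressed is sketched below. It is an illustration under stated assumptions, not the actual `_convert_office_to_pdf` from core/services/document_service.py: the function name `convert_office_to_pdf`, the `OfficeConversionError` class, and the 120-second default timeout are all hypothetical. The `soffice` flags used (`--headless`, `--convert-to`, `--outdir`) are standard LibreOffice command-line options.

```python
import os
import shutil
import subprocess
from pathlib import Path


class OfficeConversionError(RuntimeError):
    """Hypothetical error type: an Office document could not be converted to PDF."""


def convert_office_to_pdf(input_path: str, output_dir: str, timeout: float = 120.0) -> Path:
    src = Path(input_path)
    out_dir = Path(output_dir)

    # Empty-input check: skip soffice entirely for missing or zero-byte files.
    if not src.is_file() or src.stat().st_size == 0:
        raise OfficeConversionError(f"Input file is missing or empty: {src}")

    # Runtime availability check: look soffice up on every call, not only at startup.
    soffice = shutil.which("soffice")
    if soffice is None:
        raise OfficeConversionError("soffice executable not found on PATH")

    # Fail early with a clear message when the output directory is unusable.
    if not out_dir.is_dir() or not os.access(out_dir, os.W_OK):
        raise OfficeConversionError(f"Output directory is not writable: {out_dir}")

    cmd = [soffice, "--headless", "--convert-to", "pdf", "--outdir", str(out_dir), str(src)]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise OfficeConversionError(f"soffice timed out after {timeout}s") from exc

    # A zero exit status alone is not treated as success: the PDF must exist and be non-empty.
    pdf_path = out_dir / f"{src.stem}.pdf"
    if result.returncode != 0 or not pdf_path.is_file() or pdf_path.stat().st_size == 0:
        detail = result.stderr.strip() or "no PDF produced"
        raise OfficeConversionError(f"soffice failed (exit {result.returncode}): {detail}")
    return pdf_path
```

The input check runs before the `soffice` lookup so that an empty upload is rejected with the same message whether or not LibreOffice is installed.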

Comments? Email us.

jazzberry-ai bot commented Jun 15, 2025

An error occurred.

This error may be due to rate limits. If this error persists, please email us.
