Bug: Local folder adapter skips PDF files on subsequent syncs probably due to binary file detection #16

@Inzaghi1983b

Hi, thank you for your great app. Unfortunately, it does not work properly for me.

Description

The local folder adapter successfully uploads PDF files during the first sync, but skips them on all subsequent syncs, classifying them as "binary files".

Steps to Reproduce

  1. Configure local folder adapter with a folder containing PDF files
  2. Start the container - PDFs are uploaded successfully
  3. Restart the container or wait for next scheduled sync
  4. Check logs - PDFs are now skipped

Expected Behavior

PDF files should be uploaded consistently on every sync, as they are valid document types for knowledge bases and Open WebUI's RAG system can process them.

Actual Behavior

time="2025-11-15T19:19:16Z" level=debug msg="Skipping binary file: /sync-folder/document.pdf"
time="2025-11-15T19:19:16Z" level=debug msg="Skipping binary file: /sync-folder/contract.pdf"

PDFs are skipped due to the isBinaryFile() check in internal/adapter/local.go (lines 126-129 and 237-242).

Root Cause

The isBinaryFile() function checks for null bytes (0x00) in the file content. PDF files commonly contain null bytes as part of their binary structure, so the check flags them as binary and they are skipped even though they are valid documents.

// internal/adapter/local.go:237
func (l *LocalFolderAdapter) isBinaryFile(content []byte) bool {
    for i := 0; i < len(content) && i < 1024; i++ {
        if content[i] == 0 {  // PDFs contain null bytes!
            return true
        }
    }
    // ...
}
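
To make the effect of the null-byte scan easier to see, here is a minimal standalone sketch. It mirrors only the loop shown above; the rest of isBinaryFile() is not reproduced here. The helper name hasNullByteInPrefix and the sample byte slices are assumptions made for illustration, and the "PDF-like" data is synthetic, not a real PDF.

```go
package main

import (
	"bytes"
	"fmt"
)

// hasNullByteInPrefix is a hypothetical helper (not from the project).
// It reports whether any of the first `limit` bytes of content is 0x00,
// which is what the loop quoted from local.go checks with limit = 1024.
func hasNullByteInPrefix(content []byte, limit int) bool {
	if len(content) > limit {
		content = content[:limit]
	}
	return bytes.IndexByte(content, 0) >= 0
}

func main() {
	// Synthetic sample: a PDF-style header followed by a null byte.
	pdfLike := []byte("%PDF-1.7\n%\xe2\xe3\xcf\xd3\n\x00\x01stream")
	plain := []byte("# Notes\nplain text only\n")

	fmt.Println(hasNullByteInPrefix(pdfLike, 1024)) // true
	fmt.Println(hasNullByteInPrefix(plain, 1024))   // false
}
```

Any file with a 0x00 in its first 1024 bytes would be treated the same way, regardless of its extension.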

Environment

  • Version: ghcr.io/castai/openwebui-content-sync:latest (as of 2025-11-15)
  • Docker: Latest
  • OS: Synology DSM

Proposed Solution

Option 1 (Recommended): Remove binary file check entirely, as Open WebUI can handle all document types:

// Remove lines 126-129 in local.go

Option 2: Whitelist common document formats:

allowedExtensions := []string{".pdf", ".docx", ".doc", ".txt", ".md", ".csv"}
if l.isBinaryFile(content) && !hasAllowedExtension(path, allowedExtensions) {
    skip()
}
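
A slightly fuller sketch of Option 2, in case it helps. hasAllowedExtension is not defined anywhere in the current code as far as I know, so the version below is an assumption, as are looksBinary (a stand-in for isBinaryFile(), whose full body is not shown above) and shouldSkip (which replaces the skip() placeholder with a boolean result).

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// Document formats that should be synced even if they look binary.
var allowedExtensions = []string{".pdf", ".docx", ".doc", ".txt", ".md", ".csv"}

// hasAllowedExtension (hypothetical) compares the file extension
// case-insensitively against the allowlist.
func hasAllowedExtension(path string, allowed []string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, a := range allowed {
		if ext == a {
			return true
		}
	}
	return false
}

// looksBinary is a stand-in for isBinaryFile(): it only implements the
// null-byte scan over the first 1024 bytes.
func looksBinary(content []byte) bool {
	if len(content) > 1024 {
		content = content[:1024]
	}
	return bytes.IndexByte(content, 0) >= 0
}

// shouldSkip (hypothetical) skips binary-looking files unless their
// extension is on the allowlist.
func shouldSkip(path string, content []byte) bool {
	return looksBinary(content) && !hasAllowedExtension(path, allowedExtensions)
}

func main() {
	fmt.Println(shouldSkip("/sync-folder/document.pdf", []byte("%PDF\x00")))
	fmt.Println(shouldSkip("/sync-folder/tool.exe", []byte("MZ\x00")))
}
```

With this approach, text files never reach the extension check, and only binary-looking files with an unlisted extension are skipped.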

Option 3: Check file extension instead of content:

skipExtensions := []string{".exe", ".zip", ".tar", ".gz", ".jpg", ".png", ".gif"}
if hasSkipExtension(path, skipExtensions) {
    skip()
}
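
Option 3 could look roughly like the sketch below. hasSkipExtension does not exist in the code base to my knowledge, so its signature and behavior here are assumptions. Note that filepath.Ext only returns the final suffix, so "backup.tar.gz" is matched through ".gz".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Extensions that should never be uploaded (list taken from the proposal above).
var skipExtensions = []string{".exe", ".zip", ".tar", ".gz", ".jpg", ".png", ".gif"}

// hasSkipExtension (hypothetical) reports whether the path ends in one of
// the listed extensions, ignoring case. File content is not inspected.
func hasSkipExtension(path string, skip []string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, s := range skip {
		if ext == s {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"/sync-folder/contract.pdf",
		"/sync-folder/photo.JPG",
		"/sync-folder/backup.tar.gz",
	} {
		fmt.Printf("%s skip=%v\n", p, hasSkipExtension(p, skipExtensions))
	}
}
```

The trade-off is that a denylist lets through any binary format that is not listed, so the list would need maintenance.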

Impact

This bug prevents automatic synchronization of PDF documents, which are essential for legal, technical, and business knowledge bases.

Additional Context

First sync works because the file index is empty. Subsequent syncs read from the index but still run the binary check during file scanning, causing PDFs to be skipped.
