fix: skip empty documents before vector embedding #35763

Merged
fatelei merged 4 commits into langgenius:main from princepal9120:prince/filter-empty-vector-docs on May 4, 2026

Conversation

@princepal9120
Contributor

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

Fixes #35737.

This adds a defensive filter before vector embedding so blank text chunks are skipped instead of being sent to the embedding provider. It covers both Vector.create() and Vector.add_texts() so malformed chunker output cannot create MissingParameter: input[n].text failures during indexing.

Added unit coverage for mixed non-empty/empty inputs and all-empty inputs.
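
As a rough illustration of the filter described above (not the merged code), a minimal sketch might look like the following. The `Document` dataclass is a hypothetical stand-in for the project's document model, and the assumption that chunk text lives in a `page_content` attribute is mine; only the idea of dropping blank chunks before embedding comes from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for the project's document model (assumption).
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_empty_text_documents(documents: list[Document]) -> list[Document]:
    """Return only the documents whose text is a non-blank string."""
    return [
        doc
        for doc in documents
        # Whitespace-only text is treated the same as an empty string.
        if isinstance(doc.page_content, str) and doc.page_content.strip()
    ]


# Mixed input: blank chunks are dropped, order of the rest is preserved.
mixed = [Document("hello"), Document(""), Document("   \n"), Document("world")]
kept = filter_empty_text_documents(mixed)

# All-empty input: nothing is left to embed.
none_left = filter_empty_text_documents([Document(""), Document("\t")])
```

With a filter like this, the caller can return early when the result is empty, so the embedding provider is never called with a blank `text` field.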

From Codex

Screenshots

| Before | After |
| --- | --- |
| Empty chunks could be sent to embedding providers and fail indexing. | Empty chunks are skipped before embedding and vector creation. |

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

Validation run:

  • git diff --check
  • python3 -m py_compile core/rag/datasource/vdb/vector_factory.py tests/unit_tests/core/rag/datasource/vdb/test_vector_factory.py
  • python3 -m ruff check core/rag/datasource/vdb/vector_factory.py tests/unit_tests/core/rag/datasource/vdb/test_vector_factory.py

Targeted pytest was attempted, but the local environment could not finish dependency setup: the runner disk is at 99% and uv failed while extracting mysql-connector-python with "No space left on device". Running with the system Python then failed on the missing graphon dependency.

@dosubot (bot) added the size:S label (This PR changes 10-29 lines, ignoring generated files.) on May 1, 2026
@autofix-ci autofix-ci Bot requested review from Yeuoly and crazywoola as code owners May 2, 2026 15:51
Comment thread api/core/rag/datasource/vdb/vector_factory.py Outdated
@Qodo-Free-For-OSS

Hi, Vector.add_texts() runs duplicate_check before removing empty documents, causing unnecessary text_exists() calls for documents that will be skipped anyway.

Severity: remediation recommended | Category: performance

How to fix: Filter empty documents before duplicate_check

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

Vector.add_texts() performs duplicate checks before filtering out empty/blank documents. This can trigger unnecessary text_exists() calls against the vector store for chunks that will be dropped.

Issue Context

  • Empty documents are now filtered by _filter_empty_text_documents().
  • Duplicate checks use _filter_duplicate_texts() which calls self.text_exists(doc_id).

Fix Focus Areas

  • api/core/rag/datasource/vdb/vector_factory.py[217-226]
  • api/core/rag/datasource/vdb/vector_factory.py[278-288]

Expected change

Reorder logic in add_texts() to:

  1. filter empty documents
  2. then apply duplicate_check (if enabled)
  3. embed/create only if documents remain
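
The ordering above can be sketched as follows. This is an illustrative toy, not the project's `Vector` class: `SketchVector`, its in-memory `existing_ids` set, the `embedded` list, and the `doc_id` metadata key are all assumptions made for the example. Only the names `add_texts`, `duplicate_check`, `text_exists`, `_filter_empty_text_documents`, and `_filter_duplicate_texts` come from the comment.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for the project's document model (assumption).
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchVector:
    """Toy class that records calls so the filter ordering is observable."""

    def __init__(self, existing_ids=None):
        self.existing_ids = set(existing_ids or [])
        self.text_exists_calls = []  # every doc_id passed to text_exists()
        self.embedded = []  # texts that would be sent for embedding

    def text_exists(self, doc_id):
        # Stands in for a lookup against the vector store.
        self.text_exists_calls.append(doc_id)
        return doc_id in self.existing_ids

    def _filter_empty_text_documents(self, documents):
        return [d for d in documents if d.page_content and d.page_content.strip()]

    def _filter_duplicate_texts(self, documents):
        # Assumes the id is stored under metadata["doc_id"].
        return [d for d in documents if not self.text_exists(d.metadata.get("doc_id"))]

    def add_texts(self, documents, duplicate_check=False):
        # 1. Drop blank chunks first, so they never reach text_exists().
        documents = self._filter_empty_text_documents(documents)
        # 2. Only then run the optional duplicate check.
        if duplicate_check:
            documents = self._filter_duplicate_texts(documents)
        # 3. Embed/create only if documents remain.
        if not documents:
            return
        self.embedded.extend(d.page_content for d in documents)


vector = SketchVector(existing_ids={"a"})
vector.add_texts(
    [
        Document("hello", {"doc_id": "a"}),  # duplicate, skipped
        Document("   ", {"doc_id": "b"}),  # blank, skipped before the lookup
        Document("world", {"doc_id": "c"}),  # new, embedded
    ],
    duplicate_check=True,
)
```

In this sketch the blank chunk `b` never triggers a `text_exists()` call, which is the saving the comment is asking for.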

Found by Qodo code review

@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on May 4, 2026
@fatelei fatelei added this pull request to the merge queue May 4, 2026
Merged via the queue into langgenius:main with commit 4b7dc17 May 4, 2026
27 checks passed
Labels

lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

General Chunker returns empty chunks causing embedding failure (MissingParameter: input[n].text)

4 participants