fix: replace UnicodeDecodeError to prevent large payload logging #4071

CyMule · 2025-07-25T14:47:13Z

Replace UnicodeDecodeError with UnprocessableEntityError in encoding detection to avoid logging entire file contents. A UnicodeDecodeError keeps the complete input data in its .object attribute, which causes memory issues in logging and error reporting systems when the file is large.
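For illustration, here is a minimal sketch of the behavior described above, using only the standard library. The 1 MB payload of `0xff` bytes is an arbitrary choice for the demonstration and is not taken from the PR.

```python
# 1 MB of bytes that are not valid UTF-8 (0xff can never start a UTF-8 sequence).
payload = b"\xff" * 1_000_000

try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    # Keep a reference, since the "as" name is cleared when the block ends.
    captured = exc

# The exception carries the whole input, not only the offending byte.
print(len(captured.object))
# start/end mark where decoding failed inside that input.
print(captured.start, captured.end)
# The human-readable message itself stays short.
print(len(str(captured)))
```

Anything that serializes the exception object rather than just its message can therefore end up handling the full payload.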

yuming-long · 2025-07-25T20:27:58Z

unstructured/file_utils/encoding.py

+            raise UnprocessableEntityError(
+                f"File encoding detection failed: detected '{encoding}' but decode failed. "
+                f"File may be corrupted or in an unsupported format."
+            ) from None


why from None

APM would otherwise capture the original exception, which contains the full bytes we tried to decode. The None suppresses the exception chaining and, importantly, keeps those bytes from being passed to APM

gotcha thanks!
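A small sketch of what `from None` changes on the raised exception may help here. The `UnprocessableEntityError` class and the `decode_or_fail` helper below are hypothetical stand-ins written for this example, not the project's actual definitions.

```python
class UnprocessableEntityError(Exception):
    """Hypothetical stand-in for the project's exception class."""


def decode_or_fail(data: bytes, encoding: str) -> str:
    """Hypothetical helper mirroring the raise pattern in the diff."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        raise UnprocessableEntityError(
            f"File encoding detection failed: detected '{encoding}' but decode failed."
        ) from None


try:
    decode_or_fail(b"\xff" * 1024, "utf-8")
except UnprocessableEntityError as exc:
    err = exc

print(err.__cause__)                   # explicit cause is cleared by "from None"
print(err.__suppress_context__)        # flag that tells formatters to skip the context
print(type(err.__context__).__name__)  # the implicit context is still attached
print(hasattr(err, "object"))          # the new exception has no payload attribute
```

Note that Python still records the original exception on `__context__`; `from None` sets `__suppress_context__` so that the standard traceback machinery does not display it. Whether a given APM client honors that flag is an assumption worth checking against that client's documentation.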

yuming-long · 2025-07-25T20:28:12Z

@claude review this?

claude · 2025-07-25T20:28:32Z

Claude finished @yuming-long's task

Code Review In Progress

Todo List:

Read and analyze all changed files
Review the encoding.py changes for correctness
Address the "from None" review comment question
Check test coverage and quality
Verify the memory issue fix is properly implemented
Provide comprehensive review feedback

Review Findings

✅ Excellent Fix for Memory Issue

The core change replacing UnicodeDecodeError with UnprocessableEntityError is well-justified. The PR correctly identifies that UnicodeDecodeError.object automatically stores the complete input data, which can cause serious memory issues with large files in logging systems.

✅ "from None" Usage is Appropriate

Addressing @yuming-long's question on line 111:

The from None syntax is correct here because:

Suppresses exception chaining - The original UnicodeDecodeError, which holds the large file data in its .object attribute, is no longer reported as the cause of the new error
Memory safety - This is the core purpose of the PR - to prevent memory cascade issues
Clean error messages - Users get clear, actionable error messages without confusing exception chains

✅ Comprehensive Test Coverage

The new tests in test_encoding.py are exceptional:

Tests both failure scenarios (charset detection + decode failure)
Memory safety verification - Explicitly checks exceptions don't have .object attribute
Size assertions - Verifies exceptions have small memory footprint (<10KB)
Realistic test data - Uses 1MB of invalid data to simulate real-world conditions
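The checks listed above could look roughly like the following sketch. It is not the PR's actual test code: the exception class, the `decode_detected` helper, and the way the 10KB bound is measured (message length and `sys.getsizeof`) are all assumptions made for illustration.

```python
import sys


class UnprocessableEntityError(Exception):
    """Hypothetical stand-in for the project's exception class."""


def decode_detected(data: bytes, encoding: str) -> str:
    """Hypothetical helper that decodes with an already detected encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        raise UnprocessableEntityError(
            f"File encoding detection failed: detected '{encoding}' but decode failed."
        ) from None


def test_decode_failure_does_not_retain_payload():
    # 1 MB of invalid UTF-8, mirroring the "realistic test data" point.
    payload = b"\xff" * (1024 * 1024)

    try:
        decode_detected(payload, "utf-8")
    except UnprocessableEntityError as exc:
        err = exc
    else:
        raise AssertionError("expected UnprocessableEntityError")

    # Memory safety: no .object attribute and no raw bytes in the args.
    assert not hasattr(err, "object")
    assert not any(isinstance(arg, (bytes, bytearray)) for arg in err.args)

    # Size assertions: message and exception object stay well under 10KB.
    assert len(str(err)) < 10_000
    assert sys.getsizeof(err) < 10_000


test_decode_failure_does_not_retain_payload()
```

The final call runs the check directly; under pytest the function would be collected by name instead.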

✅ Code Quality

Clear, detailed comments explaining the rationale
Consistent error messaging
Proper changelog documentation
Appropriate version bump

Minor Suggestions

Consider adding a brief docstring example in the functions showing the new exception type
The test could potentially verify that the original UnicodeDecodeError would have been larger (for demonstration)

Overall Assessment

LGTM ✅ This is a well-crafted fix that properly addresses a legitimate memory issue with thorough testing and clear documentation.

mateuszkuprowski

LGTM!
Love what you've done here!

yuming-long

🎉

fix: prevent memory cascade from UnicodeDecodeError in large files

667d060

CyMule changed the title ~~fix: prevent memory cascade from UnicodeDecodeError in large files~~ fix: replace UnicodeDecodeError to prevent large payload logging Jul 25, 2025

changelog

01cc9ef

CyMule marked this pull request as draft July 25, 2025 14:54

CyMule added 2 commits July 25, 2025 11:22

lint

e20602c

version

266ecb2

CyMule marked this pull request as ready for review July 25, 2025 18:32

CyMule added 3 commits July 25, 2025 15:09

add tests

3842090

lint

5005e47

try again

5fb9cfd

yuming-long reviewed Jul 25, 2025

mateuszkuprowski approved these changes Jul 25, 2025

yuming-long approved these changes Jul 25, 2025

CyMule added this pull request to the merge queue Jul 25, 2025

Merged via the queue into main with commit b8c14a7 Jul 25, 2025
44 checks passed

CyMule deleted the fix/encoding-memory-cascade branch July 25, 2025 21:38
