Skip to content

Conversation

@CyMule
Copy link
Contributor

@CyMule CyMule commented Jul 25, 2025

Replace UnicodeDecodeError with UnprocessableEntityError in encoding detection to avoid logging entire file contents. UnicodeDecodeError.object automatically stores complete input data, causing memory issues with large files in logging and error reporting systems.

@CyMule CyMule changed the title fix: prevent memory cascade from UnicodeDecodeError in large files fix: replace UnicodeDecodeError to prevent large payload logging Jul 25, 2025
@CyMule CyMule marked this pull request as draft July 25, 2025 14:54
@CyMule CyMule marked this pull request as ready for review July 25, 2025 18:32
raise UnprocessableEntityError(
f"File encoding detection failed: detected '{encoding}' but decode failed. "
f"File may be corrupted or in an unsupported format."
) from None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why from None

Copy link
Contributor Author

@CyMule CyMule Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APM would end up capturing the original exception which would contain the full bytes we tried to decode. The None prevents the exception chaining and importantly prevents the bytes from being passed to APM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha thanks!

@yuming-long
Copy link
Contributor

@claude review this?

@claude
Copy link

claude bot commented Jul 25, 2025

Claude finished @yuming-long's task —— View job


Code Review In Progress

Todo List:

  • Read and analyze all changed files
  • Review the encoding.py changes for correctness
  • Address the "from None" review comment question
  • Check test coverage and quality
  • Verify the memory issue fix is properly implemented
  • Provide comprehensive review feedback

Review Findings

Excellent Fix for Memory Issue

The core change replacing UnicodeDecodeError with UnprocessableEntityError is well-justified. The PR correctly identifies that UnicodeDecodeError.object automatically stores the complete input data, which can cause serious memory issues with large files in logging systems.

"from None" Usage is Appropriate

Addressing @yuming-long's question on line 111:

The from None syntax is correct here because:

  1. Prevents exception chaining - Avoids keeping references to the original UnicodeDecodeError which contains the large file data in its .object attribute
  2. Memory safety - This is the core purpose of the PR - to prevent memory cascade issues
  3. Clean error messages - Users get clear, actionable error messages without confusing exception chains

Comprehensive Test Coverage

The new tests in test_encoding.py are exceptional:

  • Tests both failure scenarios (charset detection + decode failure)
  • Memory safety verification - Explicitly checks exceptions don't have .object attribute
  • Size assertions - Verifies exceptions have small memory footprint (<10KB)
  • Realistic test data - Uses 1MB of invalid data to simulate real-world conditions

Code Quality

  • Clear, detailed comments explaining the rationale
  • Consistent error messaging
  • Proper changelog documentation
  • Appropriate version bump

Minor Suggestions

  1. Consider adding a brief docstring example in the functions showing the new exception type
  2. The test could potentially verify that the original UnicodeDecodeError would have been larger (for demonstration)

Overall Assessment

LGTM ✅ This is a well-crafted fix that properly addresses a legitimate memory issue with thorough testing and clear documentation.


Copy link
Contributor

@mateuszkuprowski mateuszkuprowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Love what you've done here!

Copy link
Contributor

@yuming-long yuming-long left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@CyMule CyMule added this pull request to the merge queue Jul 25, 2025
Merged via the queue into main with commit b8c14a7 Jul 25, 2025
44 checks passed
@CyMule CyMule deleted the fix/encoding-memory-cascade branch July 25, 2025 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants