- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.1k
email date format flexibility #4072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves email date parsing flexibility in the email partitioner to handle non-standard date formats that occur in .eml files, preventing downgrades to hi-res partitioning due to parsing failures.
- Replaces email.utils.parsedate_to_datetime()withdateutil.parser.parse()for more robust date parsing
- Adds exception handling to gracefully fall back when date parsing fails
- Includes test coverage for both ISO-8601 and RFC-2822 date formats
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description | 
|---|---|
| unstructured/partition/email.py | Updates date parsing logic to use dateutil.parser with fallback handling | 
| test_unstructured/partition/test_email.py | Adds parameterized tests for various date formats | 
| example-docs/eml/test-iso-8601-date.eml | Test file with ISO-8601 date format | 
| example-docs/eml/test-rfc2822-date.eml | Test file with RFC-2822 date format | 
| CHANGELOG.md | Documents the enhancement | 
| unstructured/version.py | Version bump to 0.18.13 | 
        
          
                unstructured/partition/email.py
              
                Outdated
          
        
      | @classmethod | ||
| def iter_elements( | ||
| cls, attachment: EmailMessage, ctx: EmailPartitioningContext | ||
| cls, attachment: MIMEPart, ctx: EmailPartitioningContext | 
    
      
    
      Copilot
AI
    
    
    
      Aug 8, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type annotation change from EmailMessage to MIMEPart appears unrelated to the date parsing enhancement and could introduce breaking changes. This change should be verified to ensure it doesn't break existing functionality or should be moved to a separate PR.
        
          
                unstructured/partition/email.py
              
                Outdated
          
        
      | @classmethod | ||
| def iter_elements( | ||
| cls, attachment: EmailMessage, ctx: EmailPartitioningContext | ||
| cls, attachment: MIMEPart, ctx: EmailPartitioningContext | 
    
      
    
      Copilot
AI
    
    
    
      Aug 8, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type annotation change from EmailMessage to MIMEPart is inconsistent with the PR's stated purpose of improving date parsing and could introduce breaking changes. The change should be justified or moved to a separate PR.
| @claude please review | 
| Claude finished @potter-potter's task —— View job PR Review: Email Date Format FlexibilityTodo List: 
 SummaryThis PR successfully addresses the issue of email date parsing failures when processing .eml files with non-standard date formats. The change from  Code Review AnalysisMain Implementation Changes (unstructured/partition/email.py:284-289): try:
    sent_date = parser.parse(date_str)
except (parser.ParserError, TypeError, ValueError):
    return NoneKey Benefits: 
 Technical AssessmentSecurity: ✅ No concerns - uses established library safely Test QualityThe parameterized tests effectively cover: 
 Minor Recommendations
 Overall Assessment✅ APPROVED - This is a solid enhancement that solves a real problem. The implementation is correct, well-tested, and maintains backward compatibility while adding robustness for non-standard date formats. The approach of using  | 
| try: | ||
| sent_date = parser.parse(date_str) | ||
| except (parser.ParserError, TypeError, ValueError): | ||
| return None | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a test for this route?
This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: potter-potter <potter-potter@users.noreply.github.com>
This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: potter-potter <potter-potter@users.noreply.github.com>
we are seeing some .eml files come through the VLM partitioner. Which then downgrades to hi-res i believe.
For some reason they have a date format that is not standard email format. But it is still legitimate.
This uses a more robust date package to parse the date. This package is already installed.