[ImportDocument] Fix parsing of watermarked PDF files #3029

Merged
Powlinett merged 4 commits into master from issue/2469-import-document
Dec 3, 2024

Conversation

Powlinett
Member

@Powlinett Powlinett commented Nov 25, 2024

Proposed changes

  • Change the LAParams settings passed to the pdfminer library so that all text elements are parsed
  • Iterate recursively over the layout elements so that text is handled at any nesting level
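
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the PR description does not say which LAParams setting was changed, so `all_texts=True` (which asks pdfminer.six to run layout analysis on text inside figures) is an assumption, and the helper names `iter_text_elements` and `extract_all_text` are hypothetical.

```python
from typing import Any, Iterator, List


def iter_text_elements(element: Any) -> Iterator[str]:
    """Yield the stripped text of every text-bearing element, at any depth."""
    # pdfminer.six text containers (text boxes, text lines) expose get_text().
    get_text = getattr(element, "get_text", None)
    if callable(get_text):
        text = get_text().strip()
        if text:
            yield text
        return
    # Containers such as pages and figures are iterable: descend into them.
    try:
        children = iter(element)
    except TypeError:
        return  # leaf without text (image, line, rectangle, ...)
    for child in children:
        yield from iter_text_elements(child)


def extract_all_text(pdf_path: str) -> List[str]:
    """Collect text from every page of a PDF, including nested layers."""
    # Imported lazily so the walker above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # Assumption: all_texts=True is the relevant setting for watermarked files.
    laparams = LAParams(all_texts=True)
    texts: List[str] = []
    for page in extract_pages(pdf_path, laparams=laparams):
        texts.extend(iter_text_elements(page))
    return texts
```

The walker is duck-typed rather than tied to specific layout classes, so it keeps descending through whatever container wraps the text, which is the situation a watermark layer is likely to create.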

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality using different use cases
  • I added/updated the relevant documentation (either on GitHub or on Notion)
    • An investigation page has been opened on Notion
  • Where necessary I refactored code to improve the overall quality

Further comments

A test file is provided on the Notion investigation page to test the feature.

+ change LAParams to parse more text elements
@Powlinett Powlinett added the "filigran team" and "do not merge" labels Nov 25, 2024
@Powlinett Powlinett self-assigned this Nov 25, 2024
@Powlinett Powlinett merged commit 213b7bc into master Dec 3, 2024
4 checks passed
@Powlinett Powlinett deleted the issue/2469-import-document branch December 3, 2024 16:22
@helene-nguyen helene-nguyen removed the "do not merge" label Dec 4, 2024
Labels
filigran team use to identify PR from the Filigran team
Development

Successfully merging this pull request may close these issues.

[importDocument] - Unable to extract information from PDF with a watermarking image
3 participants