MAINT: Simplify file identifiers generation #2003

Open
wants to merge 9 commits into
base: main

exiledkingcc
Contributor

@exiledkingcc exiledkingcc commented Jul 22, 2023

The PDF standard says:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of the preceding algorithm might use different formats for the
current time, causing them to produce different file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.

The identifiers are also used for encryption.

Why this PR improves pypdf:

  • Performance: Before this PR, identifier generation is too expensive for large PDF files.
  • Deterministic execution: For AES-encrypted PDFs, the output is not deterministic anyway.
  • Changing hash when encrypting: When PdfWriter.encrypt is called, the identifiers are generated from the unencrypted PDF stream.
    When PdfWriter.write is then called, the content of the PDF file is encrypted, so the hash changes.
    For an encrypted PDF, the identifiers must be generated before writing to the stream, because the identifier is used to calculate the key,
    so the identifiers cannot be a hash of the PDF stream content.
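
The third point can be illustrated with a minimal sketch. This is not pypdf's code: `content_id`, `derive_key`, and `toy_encrypt` are hypothetical helpers, and the XOR step is only a stand-in for real PDF encryption. It shows that the key needs the identifier before any encrypted bytes exist, and that a hash of the unencrypted bytes no longer matches a hash of what is actually written.

```python
import hashlib


def content_id(data: bytes) -> bytes:
    """Hypothetical content-hash identifier: MD5 of the given bytes."""
    return hashlib.md5(data).digest()


def derive_key(password: bytes, file_id: bytes) -> bytes:
    """Hypothetical key derivation; real PDF key derivation uses more inputs."""
    return hashlib.md5(password + file_id).digest()


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR stand-in for encryption (applying it twice restores the input)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"%PDF-1.7 example body"

# The identifier has to exist first, because the key depends on it.
id_before = content_id(plain)
key = derive_key(b"secret", id_before)

# What ends up in the output differs from what was hashed.
written = toy_encrypt(plain, key)
id_after = content_id(written)

print(id_before.hex())
print(id_after.hex())
```

The two printed digests differ, so an identifier defined as "hash of the written content" cannot be the same value that was fed into the key derivation.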

@codecov

codecov bot commented Jul 22, 2023

Codecov Report

Attention: Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.13%. Comparing base (4a41c53) to head (e325c79).

Files Patch % Lines
pypdf/_writer.py 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2003      +/-   ##
==========================================
- Coverage   95.14%   95.13%   -0.02%     
==========================================
  Files          51       51              
  Lines        8551     8551              
  Branches     1706     1707       +1     
==========================================
- Hits         8136     8135       -1     
  Misses        261      261              
- Partials      154      155       +1     


@MartinThoma MartinThoma changed the title MAINT: simplify file identifiers generation MAINT: Simplify file identifiers generation Jul 23, 2023
return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
def _compute_document_identifier(self) -> ByteStringObject:
md5 = hashlib.md5()
md5.update(str(time.time()).encode("utf-8"))
Member

This makes document-generation non-deterministic, right?
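
A minimal sketch of the determinism question, under stated assumptions: `time_based_id` mirrors the `md5.update(str(time.time())...)` lines shown in the diff, while the injectable `clock` parameter and `content_based_id` are hypothetical additions for illustration and are not part of pypdf.

```python
import hashlib
import time
from typing import Callable


def time_based_id(clock: Callable[[], float] = time.time) -> bytes:
    """Identifier seeded from the current time; `clock` is a hypothetical hook."""
    md5 = hashlib.md5()
    md5.update(str(clock()).encode("utf-8"))
    return md5.digest()


def content_based_id(data: bytes) -> bytes:
    """Identifier derived only from the document bytes."""
    return hashlib.md5(data).digest()


doc = b"same document bytes"

# A content hash is reproducible across runs.
print(content_based_id(doc) == content_based_id(doc))

# A time seed is reproducible only if the clock is pinned.
pinned = lambda: 1690000000.0
print(time_based_id(pinned) == time_based_id(pinned))

# Different timestamps give different identifiers for the same document.
print(time_based_id(lambda: 1.0) == time_based_id(lambda: 2.0))
```

With the default `time.time` clock, two runs over the same input would generally produce different identifiers, which is the non-determinism raised above. Pinning the clock, as in the sketch, is one possible way to keep reproducible output for tests.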

@MartinThoma MartinThoma added the on-hold PR requests that need clarification before they can be merged. A comment must give details label Jul 29, 2023
MartinThoma added a commit that referenced this pull request Dec 23, 2023
See #2003

Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
MartinThoma added a commit that referenced this pull request Dec 23, 2023
#2003

Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
@@ -1246,7 +1244,7 @@ def generate_file_identifiers(self) -> None:
             id2 = self._compute_document_identifier()
         else:
             id1 = self._compute_document_identifier()
-            id2 = id1
+            id2 = ByteStringObject(id1.original_bytes)
Member

id1 is a ByteStringObject already. So .original_bytes just returns id1. Then wrapping it in ByteStringObject doesn't do anything, right?
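
The question can be checked with a small stand-in. `StandInByteString` below is a hypothetical `bytes` subclass, not pypdf's `ByteStringObject`, and it assumes that `original_bytes` simply returns the object itself, as the comment above describes.

```python
class StandInByteString(bytes):
    """Hypothetical stand-in for a bytes subclass such as ByteStringObject."""

    @property
    def original_bytes(self) -> bytes:
        # Assumption: the property returns the object itself.
        return self


id1 = StandInByteString(b"\x01\x02\x03")
id2 = StandInByteString(id1.original_bytes)

print(id2 == id1)  # True: same byte value
print(id2 is id1)  # False: a separate object
```

Under that assumption, the wrapped value is equal to `id1`, and the only difference from `id2 = id1` is that `id2` becomes a distinct object. Whether that distinction matters depends on how the writer treats the two entries.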

return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
md5 = hashlib.md5()
md5.update(str(time.time()).encode("utf-8"))
md5.update(str(self.fileobj).encode("utf-8"))
Member

Is self.fileobj equivalent to self._write_pdf_structure(stream)?
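
As a general Python illustration of this question (the thread does not state what `self.fileobj` holds, so treating it as a stream here is an assumption), `str()` of a stream object yields its repr rather than its contents:

```python
import io

# Two streams holding identical bytes.
a = io.BytesIO(b"same content")
b = io.BytesIO(b"same content")

# str() gives something like "<_io.BytesIO object at 0x...>",
# which reflects the object's identity, not the bytes it holds.
print(str(a))
print(str(b))

print(a.getvalue() == b.getvalue())
print(str(a) == str(b))
```

If `self.fileobj` were such a stream, hashing `str(self.fileobj)` would therefore not be equivalent to hashing the written PDF structure. It would add a value that varies with the object rather than with the document.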

@py-pdf py-pdf deleted a comment from exiledkingcc Jul 20, 2024