MAINT: Simplify file identifiers generation #2003

exiledkingcc · 2023-07-22T16:16:42Z

The PDF standard says:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of the preceding algorithm might use different formats for the
current time, causing them to produce different file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.

The identifiers are also used for encryption.

Why this PR improves pypdf:

Performance: Before this PR, identifier generation was too expensive for large PDF files.
Deterministic execution: For AES-encrypted PDFs, the output is not deterministic anyway.
Changing hash when encrypting: When PdfWriter.encrypt is called, the identifiers are generated from the unencrypted PDF stream. When PdfWriter.write is then called, the content of the PDF file is encrypted, so the hash changes.
For encrypted PDFs, the identifiers must be generated before writing to the stream, because the identifier is used to calculate the key. The identifiers therefore cannot be a hash of the PDF stream content.
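
The ordering argument above can be illustrated with a small sketch. It is not pypdf code: `make_identifier` and `toy_encrypt` are hypothetical helpers, the key derivation is a toy, and XOR stands in for real encryption. It only shows that an identifier independent of the content is available before the key is derived, while a hash of the content changes once the content is encrypted.

```python
import hashlib
import time


def make_identifier(seed: str) -> bytes:
    """Hypothetical helper: a 16-byte identifier that does not depend on file content."""
    return hashlib.md5(seed.encode("utf-8")).digest()


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR stand-in for encryption (assumption; NOT the PDF algorithm)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"%PDF-1.7 example body"

# The identifier exists before anything is written or encrypted.
file_id = make_identifier(str(time.time()))

# Toy key derivation that needs the identifier as an input.
key = hashlib.md5(b"password" + file_id).digest()
encrypted = toy_encrypt(plain, key)

# A content hash taken before encryption no longer matches the written bytes.
hash_before = hashlib.md5(plain).hexdigest()
hash_after = hashlib.md5(encrypted).hexdigest()
```

Because the key depends on the identifier and the written bytes depend on the key, deriving the identifier from the written bytes would be circular.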

codecov · 2023-07-22T16:30:50Z

Codecov Report

Attention: Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.13%. Comparing base (4a41c53) to head (e325c79).

Files	Patch %	Lines
pypdf/_writer.py	90.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2003      +/-   ##
==========================================
- Coverage   95.14%   95.13%   -0.02%     
==========================================
  Files          51       51              
  Lines        8551     8551              
  Branches     1706     1707       +1     
==========================================
- Hits         8136     8135       -1     
  Misses        261      261              
- Partials      154      155       +1

MartinThoma · 2023-07-29T09:24:31Z

pypdf/_writer.py

-        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
+    def _compute_document_identifier(self) -> ByteStringObject:
+        md5 = hashlib.md5()
+        md5.update(str(time.time()).encode("utf-8"))


This makes document-generation non-deterministic, right?
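
A minimal sketch of the concern, assuming the identifier is an MD5 over the current time only (`time_based_identifier` is a hypothetical stand-in, not the PR's method). It also shows one way a test could pin the clock with `unittest.mock` to get repeatable output.

```python
import hashlib
import time
from unittest import mock


def time_based_identifier() -> bytes:
    """Hypothetical stand-in: MD5 digest of the current time as a string."""
    return hashlib.md5(str(time.time()).encode("utf-8")).digest()


# With the clock pinned, repeated calls give the same identifier.
with mock.patch("time.time", return_value=1690000000.0):
    first = time_based_identifier()
    second = time_based_identifier()

# With a different clock value, the identifier changes.
with mock.patch("time.time", return_value=1690000001.0):
    third = time_based_identifier()
```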

MartinThoma · 2023-12-23T20:14:45Z

pypdf/_writer.py

@@ -1246,7 +1244,7 @@ def generate_file_identifiers(self) -> None:
            id2 = self._compute_document_identifier()
        else:
            id1 = self._compute_document_identifier()
-            id2 = id1
+            id2 = ByteStringObject(id1.original_bytes)


id1 is a ByteStringObject already. So .original_bytes just returns id1. Then wrapping it in ByteStringObject doesn't do anything, right?
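
A sketch of what the wrap does, using a stand-in class rather than pypdf itself. `ByteStringStandIn` is hypothetical; it only assumes, as described above, that the class derives from `bytes` and that `original_bytes` returns the object itself.

```python
class ByteStringStandIn(bytes):
    """Hypothetical stand-in, NOT pypdf's ByteStringObject."""

    @property
    def original_bytes(self) -> bytes:
        # Assumption taken from the comment above: return the object itself.
        return self


id1 = ByteStringStandIn(b"\x01\x02\x03")
id2 = ByteStringStandIn(id1.original_bytes)

same_value = id1 == id2   # the bytes compare equal
same_object = id1 is id2  # but constructing a bytes subclass yields a new object
```

Under these assumptions the wrap leaves the value unchanged and only produces a distinct object. Whether that distinction matters to the writer is a separate question.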

MartinThoma · 2023-12-23T20:17:45Z

pypdf/_writer.py

-        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
+        md5 = hashlib.md5()
+        md5.update(str(time.time()).encode("utf-8"))
+        md5.update(str(self.fileobj).encode("utf-8"))


Is self.fileobj equivalent to self._write_pdf_structure(stream)?
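
For context, a sketch of what `str()` yields for a stream object, assuming `fileobj` is something like an `io.BytesIO` (it may be something else in practice, such as a path). The default string form of a stream describes the object, not its content.

```python
import io

# Two live streams holding identical content.
a = io.BytesIO(b"same content")
b = io.BytesIO(b"same content")

# str() falls back to the default repr, which names the type and the
# memory address of the object rather than the bytes it holds.
text_a = str(a)
text_b = str(b)
```

Under that assumption, hashing `str(fileobj)` mixes in a per-object value, which is not the same as hashing the bytes the stream contains.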

MAINT: simplify file identifiers generation

5fd1e91

MartinThoma changed the title ~~MAINT: simplify file identifiers generation~~ MAINT: Simplify file identifiers generation Jul 23, 2023

MartinThoma reviewed Jul 29, 2023

MartinThoma added the on-hold label (PR requests that need clarification before they can be merged. A comment must give details.) Jul 29, 2023

exiledkingcc force-pushed the simplify branch 2 times, most recently from 9092a14 to 5fd1e91 September 11, 2023 06:55

exiledkingcc and others added 3 commits September 11, 2023 15:09

Merge remote-tracking branch 'origin/main' into simplify

741185d

Merge branch 'main' into simplify

b1b5b61

Merge branch 'main' into simplify

ffd4407

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute

44574a2

See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

MartinThoma mentioned this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute #2361

Merged

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute (#2361)

beca111

See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

Merge branch 'main' into simplify

7a286b9

MartinThoma reviewed Dec 23, 2023

Update pypdf/_writer.py

f095cdc

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: File identifier generation restructuring

3d84ba8

#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

MartinThoma mentioned this pull request Dec 23, 2023

STY: File identifier generation restructuring #2362

Merged

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: File identifier generation restructuring (#2362)

ec85a27

#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

Merge branch 'main' into simplify

f8c7bf6

MartinThoma mentioned this pull request Dec 23, 2023

DOC: Quote specs in generate_file_identifiers #2363

Merged

Merge branch 'main' into simplify

40bb17f

MartinThoma reviewed Dec 23, 2023

py-pdf deleted a comment from exiledkingcc Jul 20, 2024

Merge branch 'main' into simplify

e325c79
