Skip to content

bpo-45863: tarfile: don't zero out header fields unnecessarily #29693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 9, 2022

Conversation

jmroot
Copy link
Contributor

@jmroot jmroot commented Nov 22, 2021

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

https://bugs.python.org/issue45863

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Dec 23, 2021
@jmroot
Copy link
Contributor Author

jmroot commented Jan 13, 2022

Commenting to clear stale flag since this is still awaiting review.

@jmroot
Copy link
Contributor Author

jmroot commented Feb 5, 2022

@vstinner Changes made to address concerns brought up in our conversation on IRC.

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
@jmroot jmroot requested a review from vstinner February 8, 2022 10:27
b'././@PaxHeader' + bytes(86) \
+ b'0000000\x000000000\x000000000\x0000000000020\x0000000000000\x00010205\x00 x' \
+ bytes(100) + b'ustar\x0000'+ bytes(247) \
+ b'16 mtime=1000.1\n' + bytes(496) + b'foo' + bytes(97) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info contains mtime rounded to an int, but in the binary header, mtime can be a floatting point number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The mtime is rounded in the info after calling create_pax_header, and appears in the overall header twice: unrounded in the pax extended header (formatted as a decimal number in a UTF-8 string) and rounded in the ustar header (octal 1750, seen in the next line).


# The existing pax header has priority.
if needs_pax and name not in pax_headers:
pax_headers[name] = str(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does PAX have a limit for the number of digits after the dot for a floating point number?

For example, mtime=1/7 is formatted as 0.14285714285714285. The number 0.142,857,142,857,142,85 has 17 digits after the dot. A resolution of 10^-17 seconds, whereas the most accurate clocks have a resolution of 1 ns (10^-9) on Linux.

$ echo Hello > document.txt
$ tar -H posix -cf archive.tar document.txt
$ hexdump -C archive.tar 
(...)00000200  33 30 20 6d 74 69 6d 65  3d 31 36 34 34 33 32 35  |30 mtime=1644325|
00000210  30 33 33 2e 31 38 32 37  34 34 32 32 38 0a 33 30  |033.182744228.30|
00000220  20 61 74 69 6d 65 3d 31  36 34 34 33 32 35 30 33  | atime=164432503|
00000230  33 2e 31 38 32 37 34 34  32 32 38 0a 33 30 20 63  |3.182744228.30 c|
00000240  74 69 6d 65 3d 31 36 34  34 33 32 35 30 33 33 2e  |time=1644325033.|
00000250  31 38 32 37 34 34 32 32  38 0a 00 00 00 00 00 00  |182744228.......|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
(...)

$ python3
>>> import os
>>> st=os.stat("document.txt")
>>> st.st_atime_ns
1644325037796763404
>>> st.st_mtime_ns
1644325033182744228
>>> st.st_ctime_ns
1644325033182744228

I get:

  • mtime=1644325033.182744228
  • atime=1644325033.182744228
  • ctime=1644325033.182744228

9 digits after the dot for mtime, atime and ctime.

Maybe for these 9 header names, round to 9 digits if there are more than 9 digits?

Maybe it's ok to have more digits. I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard doesn't specify a maximum number of digits. It suggests to use enough digits to allow the timestamp to be reconstructed exactly (if extracting on a system with the same timestamp precision as the originating system). Most implementations use nanosecond precision.

Doing additional rounding of the value in the pax extended header would be a behavioural change beyond what I'm setting out to do here. I suspect any potential issues with using more digits than an extractor can handle would be rare anyway, since you would have to deliberately use a made-up number rather than just dividing st_mtime_ns by 10^9.

# values that should be kept
t = tarfile.TarInfo()
t.name = "foo"
t.mtime = 1000.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000.1 cannot be represented exactly in binary IEEE 754:

>>> (1001.1).hex()
'0x1.f48cccccccccdp+9'

Maybe to avoid any rounding issue in the test, we could use 1000.5 number which can be represented exaclty?

>>> (1000.5).hex()
'0x1.f440000000000p+9'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make that change if you want, but there are going to be quite a few real-world timestamps that don't have an exact representation in floating point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And come to think of it, 1000.5 has the disadvantage of being exactly halfway between two integers, so the result of rounding it changes depending on the prevailing rounding strategy. I think round() is supposed to guarantee round-to-even, but it still adds a little ambiguity for humans.

All in all, switching to fixed-point would probably be a good idea. But that's again rather out of scope for what I'm trying to accomplish here. :)

@vstinner
Copy link
Member

vstinner commented Feb 9, 2022

Can you please add a NEWS entry for this change? See for blurb or blurb-it.

With no NEWS, I'm ok to merge the change in main, but not to backport it.

@jmroot
Copy link
Contributor Author

jmroot commented Feb 9, 2022

Happy to write a NEWS entry, but TBH I was put off by the tool dependencies apparently needed to generate the required filename.

@jmroot
Copy link
Contributor Author

jmroot commented Feb 9, 2022

OK, I see now that blurb-it is an option, which is at least a little more convenient than installing blurb locally.

@jmroot
Copy link
Contributor Author

jmroot commented Feb 9, 2022

Sorry for manually squashing and force pushing BTW, I initially missed the part in the docs where it says not to.

@jmroot
Copy link
Contributor Author

jmroot commented Feb 9, 2022

@vstinner News entry added and CI has passed.

@vstinner vstinner merged commit bf2d44f into python:main Feb 9, 2022
@vstinner vstinner added needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes labels Feb 9, 2022
@miss-islington
Copy link
Contributor

Thanks @jmroot for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington
Copy link
Contributor

Thanks @jmroot for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@bedevere-bot
Copy link

GH-31232 is a backport of this pull request to the 3.9 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.9 only security fixes label Feb 9, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 9, 2022
…nGH-29693)

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
(cherry picked from commit bf2d44f)

Co-authored-by: Joshua Root <jmr@macports.org>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 9, 2022
…nGH-29693)

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
(cherry picked from commit bf2d44f)

Co-authored-by: Joshua Root <jmr@macports.org>
@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Feb 9, 2022
@bedevere-bot
Copy link

GH-31233 is a backport of this pull request to the 3.10 branch.

miss-islington added a commit that referenced this pull request Feb 9, 2022
)

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
(cherry picked from commit bf2d44f)

Co-authored-by: Joshua Root <jmr@macports.org>
miss-islington added a commit that referenced this pull request Feb 9, 2022
)

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
(cherry picked from commit bf2d44f)

Co-authored-by: Joshua Root <jmr@macports.org>
@jmroot jmroot deleted the tarfile-headers branch February 9, 2022 21:36
@jmroot
Copy link
Contributor Author

jmroot commented Feb 9, 2022

Thanks!

hello-adam pushed a commit to hello-adam/cpython that referenced this pull request Jun 2, 2022
…nGH-29693)

Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.

Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
(cherry picked from commit bf2d44f)

Co-authored-by: Joshua Root <jmr@macports.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale PR or inactive for long period of time.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants