-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
bpo-45863: tarfile: don't zero out header fields unnecessarily #29693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b6a4c39
to
8aedc8a
Compare
This PR is stale because it has been open for 30 days with no activity. |
Commenting to clear stale flag since this is still awaiting review. |
8aedc8a
to
8e5afe3
Compare
@vstinner Changes made to address concerns brought up in our conversation on IRC. |
Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour.
8e5afe3
to
0684c9a
Compare
b'././@PaxHeader' + bytes(86) \ | ||
+ b'0000000\x000000000\x000000000\x0000000000020\x0000000000000\x00010205\x00 x' \ | ||
+ bytes(100) + b'ustar\x0000'+ bytes(247) \ | ||
+ b'16 mtime=1000.1\n' + bytes(496) + b'foo' + bytes(97) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
info contains mtime rounded to an int, but in the binary header, mtime can be a floatting point number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. The mtime is rounded in the info after calling create_pax_header
, and appears in the overall header twice: unrounded in the pax extended header (formatted as a decimal number in a UTF-8 string) and rounded in the ustar header (octal 1750, seen in the next line).
|
||
# The existing pax header has priority. | ||
if needs_pax and name not in pax_headers: | ||
pax_headers[name] = str(val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does PAX have a limit for the number of digits after the dot for a floating point number?
For example, mtime=1/7 is formatted as 0.14285714285714285
. The number 0.142,857,142,857,142,85
has 17 digits after the dot. A resolution of 10^-17 seconds, whereas the most accurate clocks have a resolution of 1 ns (10^-9) on Linux.
$ echo Hello > document.txt
$ tar -H posix -cf archive.tar document.txt
$ hexdump -C archive.tar
(...)00000200 33 30 20 6d 74 69 6d 65 3d 31 36 34 34 33 32 35 |30 mtime=1644325|
00000210 30 33 33 2e 31 38 32 37 34 34 32 32 38 0a 33 30 |033.182744228.30|
00000220 20 61 74 69 6d 65 3d 31 36 34 34 33 32 35 30 33 | atime=164432503|
00000230 33 2e 31 38 32 37 34 34 32 32 38 0a 33 30 20 63 |3.182744228.30 c|
00000240 74 69 6d 65 3d 31 36 34 34 33 32 35 30 33 33 2e |time=1644325033.|
00000250 31 38 32 37 34 34 32 32 38 0a 00 00 00 00 00 00 |182744228.......|
00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
(...)
$ python3
>>> import os
>>> st=os.stat("document.txt")
>>> st.st_atime_ns
1644325037796763404
>>> st.st_mtime_ns
1644325033182744228
>>> st.st_ctime_ns
1644325033182744228
I get:
mtime=1644325033.182744228
atime=1644325033.182744228
ctime=1644325033.182744228
9 digits after the dot for mtime, atime and ctime.
Maybe for these 9 header names, round to 9 digits if there are more than 9 digits?
Maybe it's ok to have more digits. I don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard doesn't specify a maximum number of digits. It suggests to use enough digits to allow the timestamp to be reconstructed exactly (if extracting on a system with the same timestamp precision as the originating system). Most implementations use nanosecond precision.
Doing additional rounding of the value in the pax extended header would be a behavioural change beyond what I'm setting out to do here. I suspect any potential issues with using more digits than an extractor can handle would be rare anyway, since you would have to deliberately use a made-up number rather than just dividing st_mtime_ns
by 10^9.
# values that should be kept | ||
t = tarfile.TarInfo() | ||
t.name = "foo" | ||
t.mtime = 1000.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000.1
cannot be represented exactly in binary IEEE 754:
>>> (1001.1).hex()
'0x1.f48cccccccccdp+9'
Maybe to avoid any rounding issue in the test, we could use 1000.5
number which can be represented exaclty?
>>> (1000.5).hex()
'0x1.f440000000000p+9'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make that change if you want, but there are going to be quite a few real-world timestamps that don't have an exact representation in floating point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And come to think of it, 1000.5 has the disadvantage of being exactly halfway between two integers, so the result of rounding it changes depending on the prevailing rounding strategy. I think round()
is supposed to guarantee round-to-even, but it still adds a little ambiguity for humans.
All in all, switching to fixed-point would probably be a good idea. But that's again rather out of scope for what I'm trying to accomplish here. :)
Can you please add a NEWS entry for this change? See for blurb or blurb-it. With no NEWS, I'm ok to merge the change in main, but not to backport it. |
Happy to write a NEWS entry, but TBH I was put off by the tool dependencies apparently needed to generate the required filename. |
OK, I see now that blurb-it is an option, which is at least a little more convenient than installing blurb locally. |
Sorry for manually squashing and force pushing BTW, I initially missed the part in the docs where it says not to. |
@vstinner News entry added and CI has passed. |
GH-31232 is a backport of this pull request to the 3.9 branch. |
…nGH-29693) Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour. (cherry picked from commit bf2d44f) Co-authored-by: Joshua Root <jmr@macports.org>
…nGH-29693) Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour. (cherry picked from commit bf2d44f) Co-authored-by: Joshua Root <jmr@macports.org>
GH-31233 is a backport of this pull request to the 3.10 branch. |
) Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour. (cherry picked from commit bf2d44f) Co-authored-by: Joshua Root <jmr@macports.org>
) Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour. (cherry picked from commit bf2d44f) Co-authored-by: Joshua Root <jmr@macports.org>
Thanks! |
…nGH-29693) Numeric fields of type float, notably mtime, can't be represented exactly in the ustar header, so the pax header is used. But it is helpful to set them to the nearest int (i.e. second rather than nanosecond precision mtimes) in the ustar header as well, for the benefit of unarchivers that don't understand the pax header. Add test for tarfile.TarInfo.create_pax_header to confirm correct behaviour. (cherry picked from commit bf2d44f) Co-authored-by: Joshua Root <jmr@macports.org>
Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.
https://bugs.python.org/issue45863