Skip to content

Tar: Refine GNU timestamp handling #115211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 10, 2025

Conversation

carlossanlop
Copy link
Contributor

@carlossanlop carlossanlop commented May 1, 2025

Bugs being fixed in this PR:

  • atime and ctime are problematic fields because the GNU spec decided to write them in the header location where other formats expect to find the prefix field. This causes other tools to consider our entries malformed when we write those two values (the filenames were being prefixed with the octal bytes of the atime and/or ctime). In previous fixes we were being conservative and setting the value to zeros, but that wasn't enough. The proper fix is to not set these two fields to default(DateTime.MinValue) by default and translate that value internally to null bytes. There are only two cases where we will write an actual value for atime and ctime:

    1. When the user sets them manually.
    2. When writing with TarWriter a GNU entry that came from another archive (using a TarReader), and the entry happens to have these values set.
  • The checksum was being written with an extra trailing null byte. The spec indicates that the checksum field needs to be written in octal, have leading zero characters, and end with a null byte and a space. When we were internally converting the decimal value to octal, the returned span was getting a null char added at the end. This was not necessary for checksum, but it is necessary for other fields.

  • In the GNU format, if the name field was 100 chars in length including the terminating null char, we were writing a GNU LongPath entry to store that long name field. The bug was that other formats don't count the null terminating character towards these 100 fields. This was causing us to write extra LongPath entries that would increase the size of the archive unnecessarily. The fix is to exclude the nullchar from that count, and if the length is exactly 100 chars, write name in the regular header field, no extra entries needed.

- LongLink/LongPath entries should have their atime and ctime values set to null characters. Many tools think the prefix field is being used instead of the body, which causes them to prepend the atime or ctime raw value as part of the path.
- Regular GNU entries should also write their atime and ctime as nulls if the value is UnixEpoch. Once again, same problem, some tools think the prefix value is being used and prepend that to the path.
- Also, let's be even more strict about the uid, gid, uname and gname stored in LongLink/LongPath entries. While I don't think these cause issues, we currently don't match what other tools do: set root as the owner of these entries.
…h. The user should manually set the value of these properties, and they should know there are tools that cannot handle these fields.
Add some improvements around checksum validation to make it easier to update values in the future.
@carlossanlop carlossanlop added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it area-System.Formats.Tar labels May 1, 2025
@carlossanlop carlossanlop self-assigned this May 1, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

@@ -780,11 +783,22 @@ private int WritePosixAndGnuSharedFields(Span<byte> buffer)
return checksum;
}

private int WriteAsGnuTimestamp(DateTimeOffset timestamp, Span<byte> buffer)
{
if (_typeFlag is TarEntryType.LongLink or TarEntryType.LongPath || timestamp == DateTimeOffset.UnixEpoch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as the other PR: there should be no need to handle DateTimeOffset.UnixEpoch different from any other date.

Copy link
Contributor Author

@carlossanlop carlossanlop May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, after analyzing GNU archives generated by other tools, we realized two things:

  • That atime and ctime are rarely set by other tools when used to created archives with entries in GNU format.

  • That when other tools find an atime and/or ctime in GNU entries, they will misinterpret the value and think its the prefix field that is being set. This causes those tools to prepend this "prefix" to the name field. The consecuence is that the structure of the archive ends up being incorrect, and extraction creates unexpected directories.

The main reason to do all this is because we are still seeing failures in VMR in various Linux distros due to the above.

We found passages in the GNU tar docs indicating that it was not a good decision to put these GNU fields where other entry formats expect a prefix, because this causes incompatibility problems.

I understand your guidance for mtime, it was the right thing to do for that field, and that improvement was preserved from the previous PR. But it should not apply to atime and ctime: UnixEpoch should be treated as nulls.

For these reasons, we have decided to avoid setting these values by default. The user will now have to manually set them if they so desire, and we will add warnings in our docs that this could cause incompatibility issues. We will also mark this as a breaking change in .NET 10.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, you're using UnixEpoch as a sentinel value to know if atime and ctime should not be written.

That atime and ctime are rarely set by other tools when used to created archives with entries in GNU format.

Why don't you follow this behavior and omit writing atime and ctime in the GNU format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omit writing atime and ctime in the GNU format?

The rationale for this: it's worse for a user that didn't read the documentation and potentially introduce custom handling for the GNU format to end up with an archive that some tools can't handle than to end up with an archive that doesn't have atime/ctime (which is what the other tools do for this format).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you follow this behavior and omit writing atime and ctime in the GNU format?

That's what this PR is attempting to do. Not set it to default anywhere from now on.

But to make sure I understand your question - are you talking about setting it to UnixEpoch, but instead set it to default(DateTimeOffset), which is equivalent to MinValue? It's a struct so I need to set it to something.

I assume you are considering UnixEpoch as a valid value that users might want to set at some point and have it show up in atime and ctime.

Please correct me if I'm misunderstanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what this PR is attempting to do. Not set it to default anywhere from now on.

Except when converting from pax to gnu, but from the other comment, I think that's also wrong, and will make sure to omit passing the atime/ctime from pax to gnu too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not set it to default anywhere from now on.

Yes, defaulting to not writing these is good.

are you talking about setting it to UnixEpoch, but instead set it to default(DateTimeOffset), which is equivalent to MinValue? It's a struct so I need to set it to something.

Yes, this is a better value to mean: don't write.

nit: I think using default and default(...) in the code would be more readible than using MinValue.

I think the handling of this value should be agnostic of the format. That is: it should not be specific to the WriteAsGnuTimestamp method. It should apply to any call to WriteAsTimestamp.

Why don't you follow this behavior and omit writing atime and ctime in the GNU format?

The suggestion is to only ever write atime/ctime when the format is Pax.

I don't think the other formats have real support for these timestamps which means that writing them leads to tar files that some tools won't accept.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can use default.

I see what you mean about never setting atime/ctime except in Pax. That makes sense. The more I read about atime/ctime in the various docs/specs, the more I find that adding atime/ctime where prefix would normally be is generally considered a design mistake and is not common to use it. It confirms what you've been saying.

The only other case where we should always write atime/ctime is when writing entries from one TarReader into a TarWriter, and an entry from TarReader happens to have atime and/or ctime set.

{
if (_typeFlag is TarEntryType.LongLink or TarEntryType.LongPath || timestamp == DateTimeOffset.UnixEpoch)
{
buffer.Clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are various methods in this file that assume the buffer to be zero-ed out beforehand. So the call to buffer.Clear() should not be needed.

I assume the intent of this change is to not write dates for TarEntryType.LongLink and TarEntryType.LongPath. Instead of introducing WriteAsGnuTimestamp, consider moving it to the caller:

private int WriteGnuFields(Span<byte> buffer)
{
    int checksum = 0;
    if (_typeFlag is not TarEntryType.LongLink and not TarEntryType.LongPath)
    {
      checksum += WriteAsTimestamp(_aTime, buffer.Slice(FieldLocations.ATime, FieldLengths.ATime));
      checksum += WriteAsTimestamp(_cTime, buffer.Slice(FieldLocations.CTime, FieldLengths.CTime));
    }
    ...

Copy link
Contributor Author

@carlossanlop carlossanlop May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to answer to this comment, which happened before the above discussion and clarification of the intent of this PR: Which is to not write ctime and atime anywhere by default.

…at other tools do when generating entries whose name field is exactly 100 bytes without a nullchar: they don't need a GNU LongPath entry.
@carlossanlop carlossanlop removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it labels May 7, 2025
@carlossanlop carlossanlop marked this pull request as ready for review May 7, 2025 05:10
@carlossanlop
Copy link
Contributor Author

I marked this as ready for review and I filled out the description template to explain the 3 bugs being fixed in this PR.

@carlossanlop carlossanlop changed the title [Draft] Tar: Refine GNU timestamp handling Tar: Refine GNU timestamp handling May 7, 2025
@carlossanlop
Copy link
Contributor Author

Can I please get a review?

/// <remarks>When creating an instance using the <see cref="GnuTarEntry(TarEntryType, string)"/> constructor, only the following entry types are supported:
/// <remarks>
/// <para>When creating an instance of <see cref="GnuTarEntry"/> using this constructor, the <see cref="AccessTime"/> and <see cref="ChangeTime"/> properties are set to <see langword="default" />, which in the entry header <c>atime</c> and <c>ctime</c> fields is written as null bytes. This ensures compatibility with other tools that are unable to read the <c>atime</c> and <c>ctime</c> in <see cref="TarEntryFormat.Gnu"/> entries, as these two fields are not POSIX compatible because other formats expect the <c>prefix</c> field in the same header location where <see cref="TarEntryFormat.Gnu"/> writes <c>atime</c> and <c>ctime</c>.</para>
/// <para>When creating an instance using the <see cref="GnuTarEntry(TarEntryType, string)"/> constructor, only the following entry types are supported:</para>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I'm surprised this constructor initializes timestamps to UtcNow.
I would expect this class to be purely a holder of data that is passed to it.

checksum += WriteAsTimestamp(_cTime, buffer.Slice(FieldLocations.CTime, FieldLengths.CTime));
int checksum = 0;

if (_typeFlag is not TarEntryType.LongLink and not TarEntryType.LongPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo .NET implementation could choose to not write these timestamps and tell users to use the Pax format if they want these stored.

It would simplify the overall logic, and eliminate the risk of creating archives some tools won't accept.

Copy link
Contributor Author

@carlossanlop carlossanlop May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo .NET implementation could choose to not write these timestamps and tell users to use the Pax format if they want these stored.

It would simplify the overall logic, and eliminate the risk of creating archives some tools won't accept.

In other words, are you suggesting we make the AccessTime and ChangeTime properties of GnuTarEntry completely no-op?

I'm not convinced that's the best approach for every case. I would prefer to let the user decide in at least a few cases.

Let me list what the PR currently proposes:

  • PaxTarEntry copy constructor: If we call this by passing a GnuTarEntry that has atime/ctime set, we preserve them in ExtendedAttributes.
  • Regular GnuTarEntry constructors: set them to default.
  • Obtaining an entry from TarReader: Preserve atime/ctime if they exist in the entry returned by GetNextEntry.
  • Writing with TarWriter.WriteEntry(GnuTarEntry): We preserve atime/ctime if they are present in the passed GnuTarEntry instance.
  • TarFile: any interaction with GnuTarEntry instances that contain atime/ctime set will preserve them.

Now, this is what I could change:

  • GnuTarEntry copy constructor: Currently, if we call this copy constructor by passing another GnuTarEntry that has atime/ctime set, these two values would be copied too. Instead, I can change that so that we set atime/ctime to default too, and we document this in the constructor summary and remarks. If the user really wants to preserve them, then we document that they need to manually set via AccessTime and ChangeTime them after creating the new copied instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm proposing that only for Pax these timestamps would be written. The properties on the GnuTarEntry class could either throw on set, or change the backing value.

I think it's not clear for a user that setting the properties can lead to tar files that are not accepted by some tools. I find a broken tar file worse than one that no longer includes the last access and creation times.

Copy link
Contributor Author

@carlossanlop carlossanlop May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with other folks in the team. We think we should:

  • Let users set AccessTime and ChangeTime manually if they so desire. We just need to be very explicit of the disadvantages of using these fields, both in the summary and in the remarks.
  • Preserve atime/ctime as is when using TarWriter.WriteEntry(GnuTarEntry). We should not lose data.
  • The GnuTarEntry copy constructor should only preserve the atime/ctime fields if the passed entry instance is also GnuTarEntry. Once again, preservation of data. The only other case would be when copying from PAX to GNU, but in that case, we won't transfer atime/ctime from the ExtendedAttributes to the GNU entry. Document this as well. Edit: Ah, there's the case of converting a GnuTarEntry to PAX. In that case, we also should preserve atime and ctime.
  • When creating a GnuTarEntry from an item in the filesystem, do not collect atime/ctime. If the user wants it, they'll have to set it manually. We should document this too.
  • All other cases must set atime/ctime to default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main thing were we would make a different trade off is that I'd never write a an atime/ctime in non-pax. I don't think you can count on users to read the documentation for knowing the impact of using these fields. As a maintainer, it's your decision.

When creating a GnuTarEntry from an item in the filesystem, do not collect atime/ctime. If the user wants it, they'll have to set it manually. We should document this too.

fyi, tar doesn't include atime/ctime by default for pax.

Shouldn't UstarTarEntry and V7TarEntry have the same behavior as GnuTarEntry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I opened an issue to track this comment so I can address it in another PR. I'd like to do tests first. #115434

{
if (_aTime != default)
{
checksum += WriteAsTimestamp(_aTime, buffer.Slice(FieldLocations.ATime, FieldLengths.ATime));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to move the handling of the default value inside the WriteAsTimestamp method so it applies to any timestamp written.

Similarly, a ReadTimestamp method can be made responsible for handling it, including the mapping of all null bytes (which is now handled in ReadGnuAttributes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but mtime uses it too. Shouldn't mtime write all zeros if the value to write is MinValue? For that reason, I decided to leave it like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't mtime write all zeros if the value to write is MinValue?

I don't think that is the case for PAX currently. That is the suggestion.

And also for reading, all zero bytes would be mapped to default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I'd like to double check with other tools to confirm they won't complain if they see mtime set to all nulls.

I opened an issue to track this and test it with other tools: #115434

Copy link
Member

@ericstj ericstj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src changes look good. I scanned through tests and left a couple comments, my assumption is that we got a lot of good compatibility coverage with the manual round tripping tests that were done and confirmed to produce zero-change to existing archives.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants