
Conversation

@steveloughran
Contributor

  • changes the default value to keep
  • doesn't log it at info
  • updated docs
  • cut all marker tool commands from the SDK qualification commands

How was this patch tested?

  • module tests in progress
  • also plan to run some CLI commands to verify the log is quiet.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Change-Id: I093a452193e404a7227fb79329b4b9b63da2ea99
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 6s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 1s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 37m 41s trunk passed
+1 💚 compile 0m 38s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 29s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 32s trunk passed
+1 💚 mvnsite 0m 36s trunk passed
+1 💚 javadoc 0m 27s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 30s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 12s trunk passed
+1 💚 shadedclient 23m 39s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 26s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 31s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 24s the patch passed
-1 ❌ blanks 0m 0s /blanks-eol.txt The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 💚 checkstyle 0m 18s the patch passed
+1 💚 mvnsite 0m 28s the patch passed
+1 💚 javadoc 0m 14s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 22s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 3s the patch passed
+1 💚 shadedclient 23m 28s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 20s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 33s The patch does not generate ASF License warnings.
99m 20s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/1/artifact/out/Dockerfile
GITHUB PR #5689
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 967319691b4d 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 8fd2f4f
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/1/testReport/
Max. process+thread count 577 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@ayushtkn
Member

Just passing by. AFAIK changing a configuration's default is an incompatible change and can be done only in a minor release, so you can do it only for 3.4.0:

> Hadoop-defined properties (names and meanings) SHALL be considered [Public](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html#Public) and [Stable](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html#Stable). The units implied by a Hadoop-defined property MUST NOT change, even across major versions. Default values of Hadoop-defined properties SHALL be considered [Public](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html#Public) and [Evolving](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html#Evolving).

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Hadoop_Configuration_Files

@steveloughran
Contributor Author

@ayushtkn really? "default values are immutable"? Not sure about that, as a lot of things change implicitly, or for good reason ("the defaults weren't good").

while the names+meanings+units are all stable, default values are public/evolving

@ayushtkn
Member

Yep, it is like that; there have been lots of discussions and tickets around this. Example: HDFS-13505, which is also marked as incompatible and was pushed only to trunk. This comment also says the same thing: (https://issues.apache.org/jira/browse/HDFS-13505?focusedCommentId=16854777&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16854777)

Maybe the reason is this: suppose someone explicitly wanted the conf to be "delete" for whatever reason, and since the default value was also "delete", they never configured it. If you now change the default to "keep", that user has to configure "delete" explicitly to preserve their old behaviour.

Not against this change, just sharing the general position around config defaults, from what I have read and know about compatibility :-)
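That mitigation can be sketched concretely (the bucket name and path here are hypothetical): a user who depends on the old behaviour can pin the property explicitly, so a later default flip to `keep` changes nothing for their jobs.

```shell
# Hypothetical bucket: pin the old "delete" policy explicitly so this
# client's semantics survive an upgrade that flips the default to "keep".
hadoop fs -D fs.s3a.directory.marker.retention=delete -ls s3a://example-bucket/data/
```

The same setting can be made persistent in `core-site.xml`, or scoped to one bucket via the usual per-bucket override pattern (`fs.s3a.bucket.example-bucket.directory.marker.retention`).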

@steveloughran
Contributor Author

noted. well, let's target 3.4 at the very least and tag as incompatible

Member

@ayushtkn left a comment


LGTM, if the tests are green

@steveloughran
Contributor Author

Thanks. I think the tests were all good but will rerun to be 100% sure.

Comment on lines +33 to +39
| Branch | Compatible Since | Supported | Released |
|------------|------------------|-----------|----------|
| Hadoop 2.x | 2.10.2 | Read-only | 05/2022 |
| Hadoop 3.0 | n/a | WONTFIX | |
| Hadoop 3.1 | n/a | WONTFIX | |
| Hadoop 3.2 | 3.2.2 | Read-only | 01/2022 |
| Hadoop 3.3 | 3.3.1 | Done | 01/2021 |
Contributor


Thanks for updating this with the extra info.

Do we know why the Hadoop webpages aren't formatting the original table? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html#The_Problem_with_Directory_Markers

Contributor Author


yeah, there was a format error, already fixed.

Contributor


nice, that's great then

Comment on lines 17 to 19
This document discusses an performance feature of the S3A
connector: directory markers are not deleted unless the
client is explicitly configured to do so.
Contributor


if this PR gets updated, small, typo to fix

Suggested change
This document discusses an performance feature of the S3A
connector: directory markers are not deleted unless the
client is explicitly configured to do so.
This document discusses a performance feature of the S3A
connector: directory markers are not deleted unless the
client is explicitly configured to do so.

Comment on lines +179 to 180
And, historically, when a path is listed, if a marker to that path is found, *it
has been interpreted as an empty directory.*
Contributor


(This isn't added in this PR but...) is this really true?

I tried an integ test using listFiles on the Hadoop 3.0 code base. It seemed happy. Is it worth being specific with what will or won't make this assumption?

Contributor Author


It is for some specific codepaths which do a probe that explicitly looks for empty dirs: rm and mv in particular.

Contributor


ACK.

For public documentation, probably makes sense to keep it as is. It was just a little confusing when digging into the detail.

markers are managed when new files are created

*Default* `delete`: a request is issued to delete any parental directory markers
1.`delete`: a request is issued to delete any parental directory markers
Contributor


markdown won't like this

Suggested change
1.`delete`: a request is issued to delete any parental directory markers
1. `delete`: a request is issued to delete any parental directory markers

@dannycjones
Contributor

Sorry for jumping in at the last minute with these.

Basically want to try and make sure users are able to reason as easily as possible about when it's safe to flip over from delete to keep.

@steveloughran
Contributor Author

It's safe when you don't have incompatible clients trying to write to the same directory tree, more specifically clients using rename and delete, which effectively means hive, mapreduce, spark jobs etc. The s3a committer doesn't use rename to commit work, but when spark starts a job it will delete the dest directory, so if delete() mistakes a marker at the root for an empty dir, it may not delete the old data.
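One way to check a store before flipping, sketched with the S3A marker tool this PR trims from the SDK qualification commands (the bucket name and path are hypothetical); the `-audit` mode only reports, it changes nothing:

```shell
# Scan a directory tree and report any surviving directory markers found
# under the path; useful before switching a shared bucket over to "keep".
hadoop s3guard markers -audit -verbose s3a://example-bucket/tables/
```

If the audit finds no retained markers under paths that older clients rename or delete, switching the policy is low-risk for that tree.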

Reran CLI examples and updated the output to be current.

Change-Id: I203a24502a7c3b855e39d8a99ccd07949aa410c0
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 4s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 5s trunk passed
+1 💚 compile 0m 37s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 30s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 31s trunk passed
+1 💚 mvnsite 0m 37s trunk passed
+1 💚 javadoc 0m 28s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 29s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 14s trunk passed
+1 💚 shadedclient 23m 40s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 27s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 31s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 24s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 19s the patch passed
+1 💚 mvnsite 0m 29s the patch passed
+1 💚 javadoc 0m 13s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 22s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 4s the patch passed
+1 💚 shadedclient 23m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 20s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 34s The patch does not generate ASF License warnings.
103m 13s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/2/artifact/out/Dockerfile
GITHUB PR #5689
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux f4f59a935c3f 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 1b33b71
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/2/testReport/
Max. process+thread count 588 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5689/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

@dannycjones: you happy with the changes now?
I've got Ayush's upvote already

Contributor

@dannycjones left a comment


Sorry, reviewed just now. LGTM!

@steveloughran steveloughran merged commit 7bb09f1 into apache:trunk Jun 8, 2023
steveloughran added a commit to steveloughran/hadoop that referenced this pull request Jul 19, 2023
…rything but the switch

This change has all of PR apache#5689 *except* for changing the
default value of marker retention from delete to keep.

1. leaves the default value of fs.s3a.directory.marker.retention
   at "delete"
2. no longer prints a message when an S3A FS instance is
   instantiated with any option other than delete.
3. Updates the directory marker documentation

Switching to marker retention improves performance
on any S3 bucket as there are no needless marker DELETE requests,
leading to a reduction in write IOPS and any delays waiting
for the DELETE call to finish.

There are *very* significant improvements on versioned buckets,
where tombstone markers slow down LIST operations: the more
tombstones there are, the worse query planning gets.

Having versioning enabled on production stores is the foundation
of any data protection strategy, so this has tangible benefits
in production.

Marker retention is *not* compatible with older hadoop releases;
specifically
- Hadoop branch 2 < 2.10.2
- Any release of Hadoop 3.0.x and Hadoop 3.1.x
- Hadoop 3.2.0 and 3.2.1
- Hadoop 3.3.0
Incompatible releases have no problems reading data in stores
where markers are retained, but can get confused when deleting
or renaming directories.

Contributed by Steve Loughran

Change-Id: Ic9a05357a4b1b1ff6dfecf8b0f30e1eeedb2fe75
@steveloughran
Contributor Author

Update: I have a PR of this for branch-3.3 which does everything except changing the default and documenting this change:

#5859

This is so those releases stop warning about incompatibility, and to keep the code more in sync between branches.

steveloughran added a commit to steveloughran/hadoop that referenced this pull request Jul 19, 2023
…rything but the switch
steveloughran added a commit to steveloughran/hadoop that referenced this pull request Jul 20, 2023
…rything but the switch
steveloughran added a commit that referenced this pull request Jul 20, 2023
…rything but the switch (#5859)



