-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-18752. Change fs.s3a.directory.marker.retention to "keep" #5689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-18752. Change fs.s3a.directory.marker.retention to "keep" #5689
Conversation
* changes the default value * doesn't log it at info * updated docs * cut all marker tool commands from the SDK qualification commands * Change-Id: I093a452193e404a7227fb79329b4b9b63da2ea99
|
💔 -1 overall
This message was automatically generated. |
|
Just Passing by |
|
@ayushtkn really? default values are immutable. not sure about that as a lot of things change implicitly, or for good reason "the defaults weren't good". while the names+meanings+units are all stable, default values are public/evolving |
|
yep, it is like that, lot of discussions and tickets around this, example: HDFS-13505, this is also marked as incompatible and was pushed only to trunk. This comment also says the same thing (https://issues.apache.org/jira/browse/HDFS-13505?focusedCommentId=16854777&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16854777) may be the reason is like if someone had a use case like he explicitly wanted the conf to be "delete" for whatever reasons and the default value was also "delete", he didn't configure it considering the default value, now if you change it to "keep", that guy who explicitly wanted the value to be delete, he has to change and have to configure it to "delete" to preserve his old behaviour. Not against this change, just telling the generic stuff around config defaults, what I have read or know about the compat :-) |
|
noted. well, let's target 3.4 at the very least and tag as incompatible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, if the tests are green
|
Thanks. I think the tests were all good but will rerun to be 100% sure. |
| | Branch | Compatible Since | Supported | Released | | ||
| |------------|------------------|-----------|----------| | ||
| | Hadoop 2.x | 2.10.2 | Read-only | 05/2022 | | ||
| | Hadoop 3.0 | n/a | WONTFIX | | | ||
| | Hadoop 3.1 | n/a | WONTFIX | | | ||
| | Hadoop 3.2 | 3.2.2 | Read-only | 01/2022 | | ||
| | Hadoop 3.3 | 3.3.1 | Done | 01/2021 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this with the extra info.
Do we know why the Hadoop webpages aren't formatting the original table? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html#The_Problem_with_Directory_Markers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, there was some format error already fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, that's great then
| This document discusses an performance feature of the S3A | ||
| connector: directory markers are not deleted unless the | ||
| client is explicitly configured to do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this PR gets updated, small, typo to fix
| This document discusses an performance feature of the S3A | |
| connector: directory markers are not deleted unless the | |
| client is explicitly configured to do so. | |
| This document discusses a performance feature of the S3A | |
| connector: directory markers are not deleted unless the | |
| client is explicitly configured to do so. |
| And, historically, when a path is listed, if a marker to that path is found, *it | ||
| has been interpreted as an empty directory.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This isn't added in this PR but...) is this really true?
I tried an integ test using listFiles on the Hadoop 3.0 code base. It seemed happy. Is it worth being specific with what will or won't make this assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is for some specific codepaths which do a probe which explicitly looks for empty dirs, rm and mv in particular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK.
For public documentation, probably makes sense to keep it as is. It was just a little confusing when digging into the detail.
| markers are managed when new files are created | ||
|
|
||
| *Default* `delete`: a request is issued to delete any parental directory markers | ||
| 1.`delete`: a request is issued to delete any parental directory markers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
markdown won't like this
| 1.`delete`: a request is issued to delete any parental directory markers | |
| 1. `delete`: a request is issued to delete any parental directory markers |
|
Sorry for jumping in at the last minute with these. Basically want to try and make sure users are able to reason as easily as possible about when its safe to flip over from |
|
safe when you don't have incompatible clients trying to write to the same directory tree, more specifically using rename and delete, which effectively means "hive, mapreduce, spark jobs etc". the s3a committer doesn't use rename to commit work, but when spark starts a job it will delete the dest directory -so if delete() mistakes a marker at the root for an empty dir, it may not delete the old data. |
Reran CLI examples and updated the output to be current. Change-Id: I203a24502a7c3b855e39d8a99ccd07949aa410c0
|
💔 -1 overall
This message was automatically generated. |
|
@dannycjones: you happy with the changes now? |
dannycjones
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, reviewed just now. LGTM!
…rything but the switch This change has all of PR apache#5689 *except* for changing the default value of marker retention from keep to delete. 1. leaves the default value of fs.s3a.directory.marker.retention at "delete" 2. no longer prints a message when an S3A FS instances is instantiated with any option other than delete. 3. Updates the directory marker documentation Switching to marker retention improves performance on any S3 bucket as there are no needless marker DELETE requests -leading to a reduction in write IOPS and and any delays waiting for the DELETE call to finish. There are *very* significant improvements on versioned buckets, where tombstone markers slow down LIST operations: the more tombstones there are, the worse query planning gets. Having versioning enabled on production stores is the foundation of any data protection strategy, so this has tangible benefits in production. Marker deletion is *not* compatible with older hadoop releases; specifically - Hadoop branch 2 < 2.10.2 - Any release of Hadoop 3.0.x and Hadoop 3.1.x - Hadoop 3.2.0 and 3.2.1 - Hadoop 3.3.0 Incompatible releases have no problems reading data in stores where markers are retained, but can get confused when deleting or renaming directories. Contributed by Steve Loughran Change-Id: Ic9a05357a4b1b1ff6dfecf8b0f30e1eeedb2fe75
|
Update: I have a pr of this for branch-3.3 which does everything but changing the default/documenting this change this is so those releases stop warning about incompatibility and to keep the code more in sync between branches. |
…rything but the switch This change has all of PR apache#5689 *except* for changing the default value of marker retention from keep to delete. 1. leaves the default value of fs.s3a.directory.marker.retention at "delete" 2. no longer prints a message when an S3A FS instances is instantiated with any option other than delete. 3. Updates the directory marker documentation Switching to marker retention improves performance on any S3 bucket as there are no needless marker DELETE requests -leading to a reduction in write IOPS and and any delays waiting for the DELETE call to finish. There are *very* significant improvements on versioned buckets, where tombstone markers slow down LIST operations: the more tombstones there are, the worse query planning gets. Having versioning enabled on production stores is the foundation of any data protection strategy, so this has tangible benefits in production. Marker deletion is *not* compatible with older hadoop releases; specifically - Hadoop branch 2 < 2.10.2 - Any release of Hadoop 3.0.x and Hadoop 3.1.x - Hadoop 3.2.0 and 3.2.1 - Hadoop 3.3.0 Incompatible releases have no problems reading data in stores where markers are retained, but can get confused when deleting or renaming directories. Contributed by Steve Loughran Change-Id: Ic9a05357a4b1b1ff6dfecf8b0f30e1eeedb2fe75
…rything but the switch This change has all of PR apache#5689 *except* for changing the default value of marker retention from keep to delete. 1. leaves the default value of fs.s3a.directory.marker.retention at "delete" 2. no longer prints a message when an S3A FS instances is instantiated with any option other than delete. 3. Updates the directory marker documentation Switching to marker retention improves performance on any S3 bucket as there are no needless marker DELETE requests -leading to a reduction in write IOPS and and any delays waiting for the DELETE call to finish. There are *very* significant improvements on versioned buckets, where tombstone markers slow down LIST operations: the more tombstones there are, the worse query planning gets. Having versioning enabled on production stores is the foundation of any data protection strategy, so this has tangible benefits in production. Marker deletion is *not* compatible with older hadoop releases; specifically - Hadoop branch 2 < 2.10.2 - Any release of Hadoop 3.0.x and Hadoop 3.1.x - Hadoop 3.2.0 and 3.2.1 - Hadoop 3.3.0 Incompatible releases have no problems reading data in stores where markers are retained, but can get confused when deleting or renaming directories. Contributed by Steve Loughran Change-Id: Ic9a05357a4b1b1ff6dfecf8b0f30e1eeedb2fe75
…rything but the switch (#5859) This change has all of PR #5689 *except* for changing the default value of marker retention from keep to delete. 1. leaves the default value of fs.s3a.directory.marker.retention at "delete" 2. no longer prints a message when an S3A FS instances is instantiated with any option other than delete. 3. Updates the directory marker documentation Switching to marker retention improves performance on any S3 bucket as there are no needless marker DELETE requests -leading to a reduction in write IOPS and and any delays waiting for the DELETE call to finish. There are *very* significant improvements on versioned buckets, where tombstone markers slow down LIST operations: the more tombstones there are, the worse query planning gets. Having versioning enabled on production stores is the foundation of any data protection strategy, so this has tangible benefits in production. Marker deletion is *not* compatible with older hadoop releases; specifically - Hadoop branch 2 < 2.10.2 - Any release of Hadoop 3.0.x and Hadoop 3.1.x - Hadoop 3.2.0 and 3.2.1 - Hadoop 3.3.0 Incompatible releases have no problems reading data in stores where markers are retained, but can get confused when deleting or renaming directories. Contributed by Steve Loughran
How was this patch tested?
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?