-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations #7614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return continuation; | ||
| } | ||
|
|
||
| private void filterDuplicateEntriesForBlobClient( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add javadocs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| } | ||
|
|
||
| @Test | ||
| public void testDuplicateEntriesAcrossListBlobIterations() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add one more test where there are more than one files in the directory and max list result is 1 and multiple list status calls and verify the overall file statuses don't have any duplicate entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicate can only happen in case of non-empty explicit directories. The store won't allow me to create duplicate paths otherwise.
What I can do is modify this test to have more than one entry. A mix of all types of entries and make sure we still don't have any duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified test as above
This comment was marked as outdated.
This comment was marked as outdated.
| // In both cases, we will add this. | ||
| nameToEntryMap.put(entryName, fileStatus); | ||
| fileStatuses.add(fileStatus); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be replaced with else if instead of nested if.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
| * @return the list of FileStatus objects | ||
| */ | ||
| public List<FileStatus> getFileStatusList() { | ||
| public List<VersionedFileStatus> getFileStatusList() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc update is needed.
Returns list of VersionedFileStatus objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
| * @param fileStatusList the list of FileStatus objects | ||
| */ | ||
| public void setFileStatusList(final List<FileStatus> fileStatusList) { | ||
| public void setFileStatusList(final List<VersionedFileStatus> fileStatusList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, javadoc update is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
|
||
| import org.apache.hadoop.fs.FileStatus; | ||
|
|
||
| public class ListUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add javadocs for class and its functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
|
|
||
| public class ListUtils { | ||
|
|
||
| public static List<FileStatus> getUniqueListResult(List<FileStatus> originalList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic looks a bit complex to read, can be refactored. One possible solution could be
public static List getUniqueListResult(List originalList) {
if (originalList == null || originalList.isEmpty()) {
return originalList;
}
List uniqueList = new ArrayList<>();
TreeMap<String, FileStatus> currentGroupMap = new TreeMap<>();
String currentPrefix = null;
for (FileStatus fileStatus : originalList) {
String fileName = fileStatus.getPath().getName();
if (currentPrefix == null || !fileName.startsWith(currentPrefix)) {
// Start of a new group
currentPrefix = fileName;
currentGroupMap.clear();
}
if (!currentGroupMap.containsKey(fileName)) {
currentGroupMap.put(fileName, fileStatus);
uniqueList.add(fileStatus);
}
}
return uniqueList;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice Suggestion.
Taken
| List<VersionedFileStatus> fileStatusListInCurrItr = listResponseData.getFileStatusList(); | ||
| if (fileStatusListInCurrItr != null && !fileStatusListInCurrItr.isEmpty()) { | ||
| fileStatuses.addAll(fileStatusListInCurrItr); | ||
| fileStatusList.addAll(fileStatusListInCurrItr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we are iterating over a list of VersionedFileStatus and maintaing a list of FileStatus, shouldn't the new list also be of VersionedFileStatus elements ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elements are always of type VersionedFileStatus only. Just the reference type is getting changed because store.listStatus() works with FileStatus only. The final list to be returned has to be of type FileStatus. Although the objects hold by it are always VersionedFileStatus and can be type casted wherever needed.
| originalList.add(getFileStatusObject(new Path("/abc"))); | ||
| originalList.add(getFileStatusObject(new Path("/abc.bak1"))); | ||
| originalList.add(getFileStatusObject(new Path("/abc"))); | ||
| validateList( originalList, 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
This comment was marked as outdated.
This comment was marked as outdated.
| assertExplicitDirectoryFileStatus(fileStatuses[6], fs.makeQualified(new Path("/e"))); | ||
|
|
||
| // Assert that there are no duplicates in the returned file statuses. | ||
| for (int i = 0; i < fileStatuses.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we use a set here instead of a double loop?
something like-
Set uniquePaths = new HashSet<>();
for (FileStatus fileStatus : fileStatuses) {
Assertions.assertThat(uniquePaths.add(fileStatus.getPath()))
.describedAs("Duplicate entries found")
.isTrue();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion.
Taken
| public class TestListUtils { | ||
|
|
||
| @Test | ||
| public void testRemoveDuplicates() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing java doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| validateList(originalList, 2); | ||
| } | ||
|
|
||
| private void validateList(List<FileStatus> originalList, int expectedSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java doc missing for this and below method as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
LGTM ! |
bhattmanish98
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
This comment was marked as outdated.
This comment was marked as outdated.
|
🎊 +1 overall
This message was automatically generated. |
============================================================
|
…t Listing Across Iterations (apache#7614) Contributed by Anuj Modi Reviewed by Anmol Asrani, Manish Bhatt, Manika Joshi Signed off by Anuj Modi<anujmodi@apache.org>
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-19543
On FNS-Blob, the List Blobs API is known to return duplicate entries for non-empty explicit directories. One entry corresponds to the directory itself, and another corresponds to the marker blob that the driver internally creates and maintains to mark that path as a directory. We already know about this behaviour, and it was handled to remove such duplicate entries from the set of entries that were returned as part of current list iterations.
Due to a possible partition split, if such duplicate entries happen to be returned in separate iterations, there is no handling on this, and the caller might get back the result with duplicate entries, as happened in this case. The logic to remove duplicates was designed before the realization of the partition split.
This PR fixes this bug
How was this patch tested?
A new test for the failing scenario was added and existing test suite was ran to validate changes across all combinations.