
Conversation

Contributor

@anujmodi2021 anujmodi2021 commented Apr 15, 2025

Description of PR

JIRA: https://issues.apache.org/jira/browse/HADOOP-19543
On FNS-Blob, the List Blobs API is known to return duplicate entries for non-empty explicit directories: one entry corresponds to the directory itself, and another corresponds to the marker blob that the driver internally creates and maintains to mark that path as a directory. This behaviour was already known, and handling was in place to remove such duplicates from the set of entries returned within a single list iteration.

Due to a possible partition split, however, such duplicate entries can be returned in separate iterations. There was no handling for this case, so the caller could get back a result with duplicate entries, as happened here. The de-duplication logic was designed before the partition-split behaviour was understood.

This PR fixes that bug.
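To make the failure mode concrete, here is a minimal, self-contained sketch (plain Java, with string paths standing in for FileStatus entries; `dedupAcrossPages` is a hypothetical helper, not the actual driver code) of why de-duplication has to carry state across list iterations rather than filter each page in isolation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CrossPageDedup {

  // De-duplicate entries across all pages by carrying a seen-set across
  // iterations. Per-page filtering alone misses duplicates when a partition
  // split places the directory entry and its marker blob in different pages.
  static List<String> dedupAcrossPages(List<List<String>> pages) {
    Set<String> seen = new LinkedHashSet<>();
    for (List<String> page : pages) {
      seen.addAll(page);  // entries already seen in earlier pages are ignored
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    // "/dir" appears twice: once for the directory entry and once for its
    // marker blob, returned in separate list iterations.
    List<List<String>> pages = Arrays.asList(
        Arrays.asList("/dir"),
        Arrays.asList("/dir", "/dir/file1"));
    System.out.println(dedupAcrossPages(pages));  // [/dir, /dir/file1]
  }
}
```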

How was this patch tested?

A new test for the failing scenario was added, and the existing test suite was run to validate the changes across all combinations.

@anujmodi2021 anujmodi2021 changed the title [WIP] List Duplicate Bug Fix HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations Apr 15, 2025
return continuation;
}

private void filterDuplicateEntriesForBlobClient(
Contributor

add javadocs

Contributor Author

Added

}

@Test
public void testDuplicateEntriesAcrossListBlobIterations() throws Exception {
Contributor

You can add one more test where there is more than one file in the directory and max list results is 1, so that multiple list status calls are made; verify that the overall file statuses don't have any duplicate entries.

Contributor Author

The duplicate can only happen in the case of non-empty explicit directories; the store won't allow me to create duplicate paths otherwise.
What I can do is modify this test to have more than one entry, with a mix of all entry types, and make sure we still don't have any duplicates.

Contributor Author

Modified test as above


// In both cases, we will add this.
nameToEntryMap.put(entryName, fileStatus);
fileStatuses.add(fileStatus);
} else {
Contributor

This can be replaced with an else if instead of a nested if.

Contributor Author

Taken

* @return the list of FileStatus objects
*/
public List<FileStatus> getFileStatusList() {
public List<VersionedFileStatus> getFileStatusList() {
Contributor

Javadoc update is needed: this now returns a list of VersionedFileStatus objects.

Contributor Author

Taken

* @param fileStatusList the list of FileStatus objects
*/
public void setFileStatusList(final List<FileStatus> fileStatusList) {
public void setFileStatusList(final List<VersionedFileStatus> fileStatusList) {
Contributor

Same as above, a javadoc update is needed.

Contributor Author

Taken


@anujmodi2021 anujmodi2021 marked this pull request as ready for review April 15, 2025 14:42

@anujmodi2021 anujmodi2021 marked this pull request as draft April 16, 2025 06:48
@anujmodi2021 anujmodi2021 changed the title HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations WIP: HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations Apr 16, 2025
@anujmodi2021 anujmodi2021 changed the title WIP: HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations HADOOP-19543. WIP: [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations Apr 16, 2025

@anujmodi2021 anujmodi2021 changed the title HADOOP-19543. WIP: [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations HADOOP-19543. [ABFS][FnsOverBlob] Remove Duplicates from Blob Endpoint Listing Across Iterations Apr 17, 2025
@anujmodi2021 anujmodi2021 marked this pull request as ready for review April 17, 2025 05:49

import org.apache.hadoop.fs.FileStatus;

public class ListUtils {
Contributor

add javadocs for class and its functions

Contributor Author

Added


public class ListUtils {

public static List<FileStatus> getUniqueListResult(List<FileStatus> originalList) {
Contributor

The logic looks a bit complex to read and can be refactored. One possible solution could be:

public static List<FileStatus> getUniqueListResult(List<FileStatus> originalList) {
  if (originalList == null || originalList.isEmpty()) {
    return originalList;
  }

  List<FileStatus> uniqueList = new ArrayList<>();
  TreeMap<String, FileStatus> currentGroupMap = new TreeMap<>();
  String currentPrefix = null;

  for (FileStatus fileStatus : originalList) {
    String fileName = fileStatus.getPath().getName();

    if (currentPrefix == null || !fileName.startsWith(currentPrefix)) {
      // Start of a new group
      currentPrefix = fileName;
      currentGroupMap.clear();
    }

    if (!currentGroupMap.containsKey(fileName)) {
      currentGroupMap.put(fileName, fileStatus);
      uniqueList.add(fileStatus);
    }
  }

  return uniqueList;
}

Contributor Author

Nice Suggestion.
Taken

List<VersionedFileStatus> fileStatusListInCurrItr = listResponseData.getFileStatusList();
if (fileStatusListInCurrItr != null && !fileStatusListInCurrItr.isEmpty()) {
fileStatuses.addAll(fileStatusListInCurrItr);
fileStatusList.addAll(fileStatusListInCurrItr);
Contributor

Here we are iterating over a list of VersionedFileStatus and maintaining a list of FileStatus; shouldn't the new list also be of VersionedFileStatus elements?

Contributor Author

The elements are always of type VersionedFileStatus; only the reference type changes, because store.listStatus() works with FileStatus only. The final list to be returned has to be of type FileStatus, although the objects it holds are always VersionedFileStatus and can be type-cast wherever needed.
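To illustrate the point, a small sketch (with hypothetical stand-in classes, not the real ABFS types): a List<FileStatus> reference can hold VersionedFileStatus objects, and individual elements can be cast back wherever the subtype is needed.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceTypeDemo {

  // Minimal stand-in for FileStatus.
  static class FileStatus {
    final String path;
    FileStatus(String path) { this.path = path; }
  }

  // Hypothetical stand-in for the VersionedFileStatus subtype.
  static class VersionedFileStatus extends FileStatus {
    final String version;
    VersionedFileStatus(String path, String version) {
      super(path);
      this.version = version;
    }
  }

  public static void main(String[] args) {
    // The list's reference type is FileStatus, but every element it holds
    // is actually a VersionedFileStatus.
    List<FileStatus> statuses = new ArrayList<>();
    statuses.add(new VersionedFileStatus("/a", "v1"));

    FileStatus first = statuses.get(0);
    System.out.println(first instanceof VersionedFileStatus);  // true

    // Cast back where the subtype's fields are needed.
    System.out.println(((VersionedFileStatus) first).version);  // v1
  }
}
```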

originalList.add(getFileStatusObject(new Path("/abc")));
originalList.add(getFileStatusObject(new Path("/abc.bak1")));
originalList.add(getFileStatusObject(new Path("/abc")));
validateList( originalList, 5);
Contributor

extra spaces

Contributor Author

Taken


assertExplicitDirectoryFileStatus(fileStatuses[6], fs.makeQualified(new Path("/e")));

// Assert that there are no duplicates in the returned file statuses.
for (int i = 0; i < fileStatuses.length; i++) {
Contributor

Could we use a set here instead of a double loop? Something like:

Set<Path> uniquePaths = new HashSet<>();
for (FileStatus fileStatus : fileStatuses) {
  Assertions.assertThat(uniquePaths.add(fileStatus.getPath()))
      .describedAs("Duplicate entries found")
      .isTrue();
}

Contributor Author

Nice suggestion.
Taken

public class TestListUtils {

@Test
public void testRemoveDuplicates() {
Contributor

Missing javadoc.

Contributor Author

Added

validateList(originalList, 2);
}

private void validateList(List<FileStatus> originalList, int expectedSize) {
Contributor

Javadoc is missing for this and the method below as well.

Contributor Author

Added


@anmolanmol1234
Contributor

LGTM !

Contributor

@bhattmanish98 bhattmanish98 left a comment

+1 LGTM


@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 48s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 49s trunk passed
+1 💚 compile 0m 43s trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04
+1 💚 compile 0m 36s trunk passed with JDK Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
+1 💚 checkstyle 0m 32s trunk passed
+1 💚 mvnsite 0m 42s trunk passed
+1 💚 javadoc 0m 40s trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 32s trunk passed with JDK Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
+1 💚 spotbugs 1m 9s trunk passed
+1 💚 shadedclient 39m 5s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 34s the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04
+1 💚 javac 0m 34s the patch passed
+1 💚 compile 0m 29s the patch passed with JDK Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
+1 💚 javac 0m 29s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 20s the patch passed
+1 💚 mvnsite 0m 31s the patch passed
+1 💚 javadoc 0m 28s the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 25s the patch passed with JDK Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
+1 💚 spotbugs 1m 7s the patch passed
+1 💚 shadedclient 38m 22s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 3m 1s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
134m 13s
Subsystem Report/Notes
Docker ClientAPI=1.49 ServerAPI=1.49 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7614/11/artifact/out/Dockerfile
GITHUB PR #7614
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux fcdab91dfebb 5.15.0-131-generic #141-Ubuntu SMP Fri Jan 10 21:18:28 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / d21f9f6
Default Java Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_442-8u442-b06us1-0ubuntu120.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7614/11/testReport/
Max. process+thread count 524 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7614/11/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@anujmodi2021
Contributor Author

============================================================
HNS-OAuth-DFS

[ERROR] org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemRename.testRenamePathRetryIdempotency Time elapsed: 1.173 s <<< ERROR!

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 3
[ERROR] Tests run: 798, Failures: 0, Errors: 1, Skipped: 154
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 7
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 23

============================================================
HNS-SharedKey-DFS

[ERROR] org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemRename.testRenamePathRetryIdempotency Time elapsed: 1.058 s <<< ERROR!

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 4
[ERROR] Tests run: 801, Failures: 0, Errors: 1, Skipped: 107
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 7
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 10

============================================================
NonHNS-SharedKey-DFS

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 640, Failures: 0, Errors: 0, Skipped: 212
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 11

============================================================
AppendBlob-HNS-OAuth-DFS

[ERROR] org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemRename.testRenamePathRetryIdempotency Time elapsed: 1.059 s <<< ERROR!

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 3
[ERROR] Tests run: 798, Failures: 0, Errors: 1, Skipped: 165
[WARNING] Tests run: 134, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 23

============================================================
NonHNS-SharedKey-Blob

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 643, Failures: 0, Errors: 0, Skipped: 141
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 11

============================================================
NonHNS-OAuth-DFS

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 637, Failures: 0, Errors: 0, Skipped: 214
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 24

============================================================
NonHNS-OAuth-Blob

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 640, Failures: 0, Errors: 0, Skipped: 143
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 24

============================================================
AppendBlob-NonHNS-OAuth-Blob

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 638, Failures: 0, Errors: 0, Skipped: 161
[WARNING] Tests run: 134, Failures: 0, Errors: 0, Skipped: 4
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 24

============================================================
HNS-Oauth-DFS-IngressBlob

[ERROR] org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemRename.testRenamePathRetryIdempotency Time elapsed: 1.864 s <<< ERROR!

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 3
[ERROR] Tests run: 672, Failures: 0, Errors: 1, Skipped: 161
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 7
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 23

============================================================
NonHNS-OAuth-DFS-IngressBlob

[WARNING] Tests run: 175, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 637, Failures: 0, Errors: 0, Skipped: 212
[WARNING] Tests run: 157, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 272, Failures: 0, Errors: 0, Skipped: 24

@anujmodi2021 anujmodi2021 merged commit 810c42f into apache:trunk Apr 18, 2025
4 checks passed
anujmodi2021 added a commit to ABFSDriver/AbfsHadoop that referenced this pull request Apr 18, 2025
…t Listing Across Iterations (apache#7614)

Contributed by Anuj Modi
Reviewed by Anmol Asrani, Manish Bhatt, Manika Joshi

Signed off by Anuj Modi<anujmodi@apache.org>
anujmodi2021 added a commit that referenced this pull request Apr 18, 2025
…t Listing Across Iterations (#7614) (#7632)

Contributed by Anuj Modi
Reviewed by Anmol Asrani, Manish Bhatt, Manika Joshi

Signed off by Anuj Modi<anujmodi@apache.org>
@anujmodi2021 anujmodi2021 deleted the listDuplicate branch April 29, 2025 17:20