HADOOP-17428. ABFS: Implementation for getContentSummary #2549
Conversation
TEST RESULTS HNS Account Location: East US 2
Please check the comments
This is potentially going to speed up hive unmanaged tables because they call getContentSummary() sometimes (which is really wasteful, obviously).
I like the design and how you have pulled the processing out of the main store class for ease of maintenance/test.
Looking at the code, we need a shared thread pool for the store; other patches are pushing that need as well, for example openFile().withStatus.
One thing I'd like is for the store/filesystem IOStatistics to collect some stats on use and duration of getContentSummary calls, so we can determine how often this is being used and how much of a delay it is causing if the application is blocking awaiting the answer. Like I say: hive needs to get rid of this, and once they become more aware of the costs they may consider it.
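A minimal sketch of the kind of use/duration tracking the reviewer is asking for. The class and method names here are illustrative stand-ins, not Hadoop's real IOStatistics API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for per-operation statistics; real code would
// plug into Hadoop's IOStatistics interfaces instead.
public class OpStats {
  private final AtomicLong invocations = new AtomicLong();
  private final AtomicLong totalDurationMs = new AtomicLong();

  // Record one getContentSummary call: how often it is used and how
  // long the caller was blocked waiting for the answer.
  public <T> T track(Callable<T> op) throws Exception {
    long start = System.currentTimeMillis();
    try {
      return op.call();
    } finally {
      invocations.incrementAndGet();
      totalDurationMs.addAndGet(System.currentTimeMillis() - start);
    }
  }

  public long invocations() { return invocations.get(); }
  public long totalDurationMs() { return totalDurationMs.get(); }
}
```

With counters like these, operators could see both call frequency and total blocked time for the operation.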
      tracingContext);
} catch (InterruptedException e) {
  LOG.debug("Thread interrupted");
  throw new IOException(e);
InterruptedIOException
changed
  throw new IOException(e);
} catch (ExecutionException ex) {
  LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
  throw new IOException(ex);
prefer PathIOException with path included
done
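A sketch of the exception translation the review settled on: `InterruptedException` becomes `InterruptedIOException` (preserving interrupt semantics), and `ExecutionException` becomes an IOException that names the path. Hadoop's `PathIOException` is the class the reviewer suggests; the plain `IOException` below is a stand-in so the sketch stays self-contained:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutionException;

public class ExceptionTranslation {
  public static IOException translate(String path, Exception e) {
    if (e instanceof InterruptedException) {
      Thread.currentThread().interrupt(); // restore interrupt status
      InterruptedIOException ioe =
          new InterruptedIOException("Interrupted: " + path);
      ioe.initCause(e);
      return ioe;
    }
    if (e instanceof ExecutionException) {
      // Real code would use org.apache.hadoop.fs.PathIOException(path, cause)
      return new IOException("getContentSummary failed on " + path,
          e.getCause());
    }
    return new IOException(e);
  }
}
```

Translating to `InterruptedIOException` matters because callers checking for interruption can then distinguish it from ordinary IO failures.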
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class ContentSummaryProcessor {
nit: javadocs
added description
private final ExecutorService executorService = new ThreadPoolExecutor(
    CORE_POOL_SIZE, MAX_THREAD_COUNT, KEEP_ALIVE_TIME, TimeUnit.SECONDS,
    new SynchronousQueue<>());
private final CompletionService<Void> completionService =
Abfs Store to create the executor so that all ops which do async IO (this, block uploads, prefetch, openFile async...) can share a single pool
done
Changes: abfsStore creates thread pool executor; each individual getContentSummary call creates a separate completionService instance (which uses the common thread pool)
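The agreed structure can be sketched as follows: the store owns one shared pool, and each getContentSummary call wraps it in its own `ExecutorCompletionService` so one call's results never interleave with another's. The pool sizes here are illustrative, not the patch's actual constants:

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SharedPoolSketch {
  // One pool per store, shared by all async operations
  // (getContentSummary, block uploads, prefetch, openFile async...).
  private final ExecutorService sharedPool = new ThreadPoolExecutor(
      4, 16, 10, TimeUnit.SECONDS, new SynchronousQueue<>(),
      new ThreadPoolExecutor.CallerRunsPolicy());

  // Each call gets its own completion service over the common pool.
  public int runTasks(int n) throws InterruptedException, ExecutionException {
    CompletionService<Integer> cs = new ExecutorCompletionService<>(sharedPool);
    for (int i = 0; i < n; i++) {
      final int v = i;
      cs.submit(() -> v);
    }
    int sum = 0;
    for (int i = 0; i < n; i++) {
      sum += cs.take().get(); // collect results as they complete
    }
    return sum;
  }

  public void shutdown() { sharedPool.shutdown(); }
}
```

`CallerRunsPolicy` is one way to keep a `SynchronousQueue`-backed pool from rejecting work when all threads are busy; the real patch may handle saturation differently.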
    throws IOException, InterruptedException {
  FileStatus[] fileStatuses = abfsStore.listStatus(path, tracingContext);

  for (FileStatus fileStatus : fileStatuses) {
If you supported paged results, you could start queuing subdir work while still iterating through the list.
In directories with many pages of results including directories, this could speed up processing, as subdirectory scanning could start as soon as the first page of results had been retrieved.
yes, we should be able to incorporate that using the listiterator. Thanks, will make the change
Trying to confirm the advantage of processing page-wise listStatus results; would like to know your opinion. I analyzed the time taken by a direct listStatus call vs using the list iterator (queueing subdirs while iterating), but am getting ambiguous results.
The tests involved creating a directory tree and calling getContentSummary on the top folder, as the primary use of this API might be on the root of an account.
Expt 1: Directory tree with 12 levels (tree height = 12), where each level comprises one dir and 1-2 files.
Expt 2: Same 12-level structure as Expt 1, with a branch (of 2 subdir levels) around the mid-level, i.e., two subdirs at level 5, each having a subdir. All directories in the tree have ~15 files.
Expt 3: Same as Expt 2, but with each dir having more than 5000 files (so listStatus results are fetched in multiple pages).
The analysis was done for both lexicographical positions of the directory with respect to the files at the same level, since that determines whether the directory is fetched first. The time taken was measured from the first ListStatus REST call to the DeleteFileSystem call (after the last listing), which eliminates differences in file/dir creation time.
| Expt | Dir after files | Dir before files |
|------|-----------------|------------------|
| 1    | LS (few ms)     | LS               |
| 2    | LS (0.5s)       | Itr (8.7s)       |
| 3    | LS (3s)         | Itr (4.5s)       |

LS(t): direct ListStatus call was faster by t. Itr(t): list iterator was faster by t.
Using iterator seems beneficial for some scenarios, will go ahead with it.
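The page-wise idea discussed above can be sketched with a simulated paged listing: subdirectories found on each page are queued for scanning immediately, before later pages have been fetched. The paging here is faked with a fixed page size; the real code would use the ABFS list iterator:

```java
import java.util.ArrayList;
import java.util.List;

public class PagedListingSketch {
  // Entries ending in "/" stand in for directories.
  // Returns the order in which directories were queued for scanning.
  public static List<String> queueDirsPageWise(List<String> entries,
      int pageSize) {
    List<String> queued = new ArrayList<>();
    for (int start = 0; start < entries.size(); start += pageSize) {
      // Simulate fetching one page of listing results.
      List<String> page =
          entries.subList(start, Math.min(start + pageSize, entries.size()));
      // Subdirectories on this page can be handed to worker threads
      // right away, overlapping subdir scans with remaining list calls.
      for (String e : page) {
        if (e.endsWith("/")) {
          queued.add(e);
        }
      }
    }
    return queued;
  }
}
```

With a blocking full listing, no subdirectory work could start until every page had arrived; here the first directory is queued after the first page.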
import static org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations.DEFAULT_AZURE_LIST_MAX_RESULTS;
import static org.apache.hadoop.test.LambdaTestUtils.intercept;

public class TestGetContentSummary extends AbstractAbfsIntegrationTest {
if this runs against a real store, the name needs to begin with ITest.
to make this a unit test, look at how ITestAbfsListStatusRemoteIterator returns a mock listing API.
you could simply generate a deep/wide tree of results for listings to find, either by creating a tree of entries to simulate a directory tree, or programmatically creating them (e.g. declaring that each dir has 10 subdirs and 100 files & so returning them)
renamed, will send requests to store
AzureBlobFileSystem fs = getFileSystem();
fs.mkdirs(new Path("/testFolder"));
Path filePath = new Path("/testFolder/testFile");
fs.create(filePath);
- ContentTestUtils has helper methods here
- `fs.create()` returns a stream you can write to
modified
final List<Future<Void>> tasks = new ArrayList<>();
ExecutorService es = Executors.newFixedThreadPool(10);
for (int i = 0; i < numFiles; i++) {
  final Path fileName = new Path(directory + "/test" + i);
change to `new Path(directory, String.format("test-%04d", i));` as this ensures ordering of paths.
done
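Why the reviewer asked for zero-padded names: `test-2` sorts after `test-10` lexicographically, while `test-0002` sorts before `test-0010`, so padded names keep listing order and numeric order aligned. A small pure-JDK illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PaddedNames {
  // Generate n test file names, padded or not, then sort them the way
  // a lexicographic listing would return them.
  public static List<String> names(int n, boolean padded) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      out.add(padded ? String.format("test-%04d", i) : "test-" + i);
    }
    Collections.sort(out); // listings come back in lexicographic order
    return out;
  }
}
```

With 11 unpadded names, the third entry in sorted order is `test-10`; with padding it is `test-0002`, matching creation order.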
afraid my manifest committer changes have broken this. can you rebase and we can get this in.
Adding an implementation of the HDFS method getContentSummary, which takes a Path argument and returns details such as the file/directory count and the space utilized under that path.
Tests added to check the information in the returned ContentSummary instance for files and directories at different levels of a filesystem.
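A much-simplified model of what the patch computes: walk a tree and aggregate file count, directory count, and bytes used. The real implementation does this concurrently against the ABFS listing API; the in-memory tree below is a stand-in so the aggregation logic can be shown on its own:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ContentSummarySketch {
  // A node is a directory if children != null, else a file of `size` bytes.
  public static class Node {
    final long size;
    final List<Node> children;
    public Node(long size) { this.size = size; this.children = null; }
    public Node(List<Node> children) { this.size = 0; this.children = children; }
  }

  // Returns {fileCount, directoryCount, spaceUsed}; atomics mirror the
  // concurrent accumulation the real processor needs across worker threads.
  public static long[] summarize(Node root) {
    AtomicLong files = new AtomicLong();
    AtomicLong dirs = new AtomicLong();
    AtomicLong bytes = new AtomicLong();
    walk(root, files, dirs, bytes);
    return new long[] { files.get(), dirs.get(), bytes.get() };
  }

  private static void walk(Node n, AtomicLong files, AtomicLong dirs,
      AtomicLong bytes) {
    if (n.children == null) {
      files.incrementAndGet();
      bytes.addAndGet(n.size);
    } else {
      dirs.incrementAndGet(); // the directory itself counts, as in HDFS
      for (Node c : n.children) {
        walk(c, files, dirs, bytes);
      }
    }
  }
}
```

For a root directory holding one 10-byte file and one subdirectory with one 5-byte file, this yields 2 files, 2 directories, and 15 bytes.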