HADOOP-17428. ABFS: Implementation for getContentSummary #4


Draft: sumangala-patki wants to merge 45 commits into trunk

Conversation

sumangala-patki (Collaborator)

No description provided.

@sumangala-patki changed the title from "ABFS: Implementation for getContentSummary" to "HADOOP-17428. ABFS: Implementation for getContentSummary" on Dec 15, 2020

@steveloughran left a comment


sorry, missed this completely.

It turns out that hive still uses this call when looking at unmanaged tables. This is a fundamental issue which hive should fix.

In the meantime, parallel scanning can help. Indeed, if the server could send back a summary, even better.

(One thing I'd like there is to see if hive can cope with just the number and size of files; this can help in the scan, depending on how it's done.)

For ABFS, parallel tree scan seems the best client-side approach.
For S3A we do something serialized but not parallelized.

If I could switch to only returning file count/size I could do deep scans.

In cloudstore I mix the two: an initial tree scan of a few levels, feeding into a thread pool, under which a deep list(path, recursive=true) scan is kicked off.
Performance is variable, as efficiency depends on the tree structure.
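A rough sketch of that hybrid shape (class and method names, the depth cutoff, and the aggregation are assumptions for illustration, not cloudstore's actual code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Sketch: shallow fan-out over the first levels, then one deep listing per subtree. */
class HybridTreeScan {
  private final FileSystem fs;
  private final ExecutorService pool;
  private final AtomicLong fileCount = new AtomicLong();
  private final AtomicLong totalBytes = new AtomicLong();

  HybridTreeScan(FileSystem fs, ExecutorService pool) {
    this.fs = fs;
    this.pool = pool;
  }

  void scan(Path root, int fanOutDepth) throws Exception {
    List<Future<Void>> tasks = new ArrayList<>();
    fanOut(root, fanOutDepth, tasks);
    for (Future<Void> t : tasks) {
      t.get();                                    // propagate worker failures
    }
  }

  /** Walk the first few levels serially, handing each deep subtree to the pool. */
  private void fanOut(Path dir, int depth, List<Future<Void>> tasks)
      throws IOException {
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isFile()) {
        fileCount.incrementAndGet();
        totalBytes.addAndGet(st.getLen());
      } else if (depth > 1) {
        fanOut(st.getPath(), depth - 1, tasks);   // keep fanning out
      } else {
        tasks.add(pool.submit(() -> deepList(st.getPath())));
      }
    }
  }

  /** One list(path, recursive=true) call covering a whole subtree. */
  private Void deepList(Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      LocatedFileStatus st = it.next();
      fileCount.incrementAndGet();
      totalBytes.addAndGet(st.getLen());
    }
    return null;
  }
}
```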

Returning to this patch...

Maybe it should go into hadoop-common under org.apache.hadoop.fs.impl for reuse. But reusability complicates the code. For that reason, I'm going to say: don't bother. Really, it's hive's job to fix their code.

Listing should be done through an incremental call. On a very wide directory, this will allow subdir scans to be kicked off before the full dir listing has finished.
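A minimal sketch of that incremental pattern, using the public FileSystem.listStatusIterator() call (the pool handling is simplified; see the caveat in the comments):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

final class IncrementalScan {
  /** Sum file lengths under dir, launching subdir scans as entries stream in. */
  static long scan(FileSystem fs, Path dir, ExecutorService pool)
      throws Exception {
    long bytes = 0;
    List<Future<Long>> subdirScans = new ArrayList<>();
    RemoteIterator<FileStatus> entries = fs.listStatusIterator(dir);
    while (entries.hasNext()) {
      FileStatus st = entries.next();
      if (st.isDirectory()) {
        // kicked off before the full listing of 'dir' has finished paging in
        subdirScans.add(pool.submit(() -> scan(fs, st.getPath(), pool)));
      } else {
        bytes += st.getLen();
      }
    }
    // caveat: blocking on child futures from pool threads can deadlock a small
    // fixed pool; a real implementation needs a smarter completion strategy
    for (Future<Long> f : subdirScans) {
      bytes += f.get();
    }
    return bytes;
  }
}
```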

Duration-tracking IOStatistics will help assess how often the call is used and what it costs. There is already [a statistic name](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/StoreStatisticNames.java#L67) for this.
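A sketch of how that wiring could look; getDurationTrackerFactory() and contentSummaryProcessor are hypothetical names for whatever the filesystem actually exposes:

```java
import java.io.IOException;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.StoreStatisticNames;

import static org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration;

// inside the filesystem class; tracks invocation count and duration under
// the existing op_get_content_summary statistic name
public ContentSummary getContentSummary(Path path) throws IOException {
  return trackDuration(getDurationTrackerFactory(),          // hypothetical accessor
      StoreStatisticNames.OP_GET_CONTENT_SUMMARY,
      () -> contentSummaryProcessor.getContentSummary(path)); // hypothetical field
}
```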

The ABFS store needs its own thread pool. I think the ongoing work on stream read/write parallelisation optimisation can add this, if it's not there already. A reference to the thread pool can be passed in, with some restrictions on what fraction of it can be used for the scan.

private final AtomicLong totalBytes = new AtomicLong(0L);
private final AtomicInteger numTasks = new AtomicInteger(0);
private final ListingSupport abfsStore;
private final ExecutorService executorService = new ThreadPoolExecutor(

@steveloughran:

I think the FS itself needs to create a thread pool from which it can hand a subset off to operations. That adds a single place to control size, and by allowing reuse across operations, reduces startup costs.
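A minimal sketch of that pattern (the class and method names are invented for illustration; hadoop-common's SemaphoredDelegatingExecutor already implements the real thing):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

/** Hands out a bounded slice of a shared, filesystem-owned pool. */
class BoundedPoolSlice {
  private final ExecutorService sharedPool; // created once per FS instance
  private final Semaphore permits;          // caps this operation's share

  BoundedPoolSlice(ExecutorService sharedPool, int maxConcurrent) {
    this.sharedPool = sharedPool;
    this.permits = new Semaphore(maxConcurrent);
  }

  <T> Future<T> submit(Callable<T> task) throws InterruptedException {
    permits.acquire();                      // block until this slice has capacity
    return sharedPool.submit(() -> {
      try {
        return task.call();
      } finally {
        permits.release();                  // free the slot for the next task
      }
    });
  }
}
```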

@sumangala-patki (Collaborator, Author):

true, a thread pool per FS would be better. Will make the change in the official PR (apache#2549)

import static org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations.DEFAULT_AZURE_LIST_MAX_RESULTS;
import static org.apache.hadoop.test.LambdaTestUtils.intercept;

public class TestGetContentSummary extends AbstractAbfsIntegrationTest {

@steveloughran:

  1. Should use the ITest prefix.
  2. Will need to use a subdir to avoid problems on parallel runs.

As the operation takes an interface purely for the list callbacks, it should be possible to simulate a large scan just by generating a listing, without needing a store (see the sketch after this list). That would be useful, as it would run under yetus.
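A sketch of such a synthetic listing source. The real patch would implement whatever listing-callback interface getContentSummary takes; the shape below (a RemoteIterator of FileStatus) is an assumption:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Fabricates a directory listing of 'count' plain files, no store involved. */
class FakeListing implements RemoteIterator<FileStatus> {
  private final Path dir;
  private final int count;
  private final long fileLen;
  private int emitted;

  FakeListing(Path dir, int count, long fileLen) {
    this.dir = dir;
    this.count = count;
    this.fileLen = fileLen;
  }

  @Override
  public boolean hasNext() {
    return emitted < count;
  }

  @Override
  public FileStatus next() throws IOException {
    // synthesize a file entry: length, not-a-dir, replication, block size,
    // modification time, path
    return new FileStatus(fileLen, false, 1, 1024, 0,
        new Path(dir, "file-" + emitted++));
  }
}
```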

@sumangala-patki (Collaborator, Author):

  1. Will rename.
  2. Will use the path() method to create unique paths under the test dir.

Simulating a scan for the yetus run: need to find out how it's done; will add.

AzureBlobFileSystem fs = getFileSystem();
fs.mkdirs(new Path("/testFolder"));
Path filePath = new Path("/testFolder/testFile");
fs.create(filePath);

@steveloughran:

this returns a stream which must be closed; the append() isn't needed unless that is what you want to test

@sumangala-patki (Collaborator, Author):

agree, we need to write to the file but append isn't required. Will use the stream obtained from create() and close it after the write
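A minimal sketch of that fix, assuming it sits in the ITest class so path() comes from the test base; the payload size is arbitrary:

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

@Test
public void testFileSize() throws Exception {
  byte[] data = new byte[1024];                 // illustrative payload
  Path filePath = path("testFile");             // unique path under the test dir
  // write through the stream create() returns; try-with-resources closes it
  try (FSDataOutputStream out = getFileSystem().create(filePath)) {
    out.write(data);
  }
}
```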

for (Future<Void> task : tasks) {
task.get();
}
FSDataOutputStream out = getFileSystem()

@steveloughran:

just use touch()

@sumangala-patki (Collaborator, Author):

will do in official PR
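For reference, the contract test utilities provide this in one call (getFileSystem() and path() as in the test base above):

```java
import static org.apache.hadoop.fs.contract.ContractTestUtils.touch;

// creates an empty file and closes the stream in one call
touch(getFileSystem(), path("testFile"));
```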

tracingContext);
} catch (InterruptedException e) {
LOG.debug("Thread interrupted");
throw new IOException(e);

@steveloughran:

InterruptedIOException

@sumangala-patki (Collaborator, Author):

will modify
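A sketch of the suggested change (the wrapper method is illustrative; restoring the interrupt flag is standard practice, though not explicitly requested in the review):

```java
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class TaskAwait {
  /** Convert an interrupt into InterruptedIOException, keeping the flag set. */
  static <T> T await(Future<T> task)
      throws InterruptedIOException, ExecutionException {
    try {
      return task.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();       // restore the interrupt status
      // InterruptedIOException has no cause-taking constructor; use initCause
      throw (InterruptedIOException)
          new InterruptedIOException("getContentSummary interrupted").initCause(e);
    }
  }
}
```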

throw new IOException(e);
} catch(ExecutionException ex) {
LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
throw new IOException(ex);

@steveloughran:

prefer PathIOException with path included

@sumangala-patki (Collaborator, Author):

will modify
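A sketch of that suggestion (the surrounding method is illustrative; 'path' stands for whatever path the summary was requested on):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathIOException;

final class TaskJoin {
  /** Report the path being summarized when any worker task fails. */
  static void awaitAll(Path path, Iterable<Future<Void>> tasks)
      throws PathIOException, InterruptedException {
    for (Future<Void> task : tasks) {
      try {
        task.get();
      } catch (ExecutionException ex) {
        // PathIOException carries the failing path in its message
        throw new PathIOException(path.toString(), ex.getCause());
      }
    }
  }
}
```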

@sumangala-patki (Collaborator, Author):

Hi @steveloughran, thank you for reviewing the PR. Unfortunately, this happens to be a draft PR I had opened against the trunk of my forked repo (not apache hadoop) and kept as Draft since it was not the official one. Though the content stays updated, since it's the same branch, the official PR is apache#2549. Sorry for not marking this as closed; I did not realize it would be readily visible for review. Moreover, the PR link in the JIRA has disappeared, probably because GitHub/Hadoop seem to have recently changed the PR title format.
I will address your comments above; please feel free to add any further comments on the official PR.
