HADOOP-17428. ABFS: Implementation for getContentSummary #4
base: trunk
Conversation
Sorry, missed this completely.
It turns out that Hive still uses this call when looking at unmanaged tables. This is a fundamental issue which Hive should fix.
In the meantime, parallel scanning can help; indeed, if you could get a summary from the server, even better.
(One thing I'd like there is to see whether Hive can cope with just the number and size of files; this can help in the scan, depending on how it's done.)
For ABFS, parallel tree scan seems the best client-side approach.
For S3A we do something serialized but not parallelized.
If I could switch to only returning file count/size I could do deep scans.
In cloudstore I mix the two: an initial treescan of a few levels, feeding into a thread pool, under which a deep list(path, recursive=true) scan is kicked off.
Performance is variable, as efficiency depends on the shape of the tree.
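A minimal sketch of that mixed scan, assuming a generic Hadoop FileSystem; the class name MixedTreeScan and the one-level-shallow split are illustrative, not the cloudstore implementation:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class MixedTreeScan {
  private final AtomicLong fileCount = new AtomicLong();
  private final AtomicLong totalBytes = new AtomicLong();

  /** Shallow scan of the first level; each subdirectory feeds a deep
   *  list(path, recursive=true) task into the pool. */
  public void scan(FileSystem fs, Path root, ExecutorService pool)
      throws IOException, InterruptedException, ExecutionException {
    List<Future<Void>> tasks = new ArrayList<>();
    for (FileStatus status : fs.listStatus(root)) {
      if (status.isDirectory()) {
        Path dir = status.getPath();
        tasks.add(pool.submit(() -> deepScan(fs, dir)));
      } else {
        fileCount.incrementAndGet();
        totalBytes.addAndGet(status.getLen());
      }
    }
    for (Future<Void> task : tasks) {
      task.get(); // propagate any scan failure
    }
  }

  private Void deepScan(FileSystem fs, Path dir) throws IOException {
    // listFiles(dir, true) is the deep recursive listing; on object
    // stores this maps to a flat, paged enumeration of the subtree.
    RemoteIterator<? extends FileStatus> files = fs.listFiles(dir, true);
    while (files.hasNext()) {
      FileStatus status = files.next();
      fileCount.incrementAndGet();
      totalBytes.addAndGet(status.getLen());
    }
    return null;
  }
}
```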
Returning to this patch...
Maybe it should go into hadoop-common under org.apache.hadoop.fs.impl for reuse. That complicates the code, as reusability always does. For that reason, I'm going to say don't bother. Really, it's Hive's job to fix their code.
Listing should be done through an incremental call; on a very wide directory, this will allow subdir scans to be kicked off before the full directory listing has finished.
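A sketch of that incremental pattern, assuming a shared pool; completion tracking of the submitted futures is elided, and the class name is illustrative:

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class IncrementalScan {
  private final AtomicLong fileCount = new AtomicLong();
  private final AtomicLong totalBytes = new AtomicLong();

  /** Page through the listing; a subdir scan is submitted as soon as
   *  each subdirectory is returned, not after the whole (possibly very
   *  wide) directory has been listed. */
  Void scan(FileSystem fs, Path dir, ExecutorService pool)
      throws IOException {
    RemoteIterator<FileStatus> entries = fs.listStatusIterator(dir);
    while (entries.hasNext()) {
      FileStatus entry = entries.next();
      if (entry.isDirectory()) {
        // kicked off before the rest of the directory has been listed
        pool.submit(() -> scan(fs, entry.getPath(), pool));
      } else {
        fileCount.incrementAndGet();
        totalBytes.addAndGet(entry.getLen());
      }
    }
    return null;
  }
}
```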
IOStatistics duration tracking of the call will help assess its use and cost. There is already [a statistic name](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/StoreStatisticNames.java#L67) for this.
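A hedged sketch of wiring that in, assuming OP_GET_CONTENT_SUMMARY is the constant the link points at and that the filesystem exposes a DurationTrackerFactory; getIOStatisticsStore() and contentSummaryProcessor are hypothetical names:

```java
import static org.apache.hadoop.fs.statistics.StoreStatisticNames.OP_GET_CONTENT_SUMMARY;
import static org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration;

public ContentSummary getContentSummary(Path path) throws IOException {
  // trackDuration times the callable and records success/failure
  // against the named statistic.
  return trackDuration(getIOStatisticsStore(), OP_GET_CONTENT_SUMMARY,
      () -> contentSummaryProcessor.getContentSummary(path));
}
```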
The ABFS store needs its own thread pool. I think the ongoing work on stream read/write parallelisation optimisation can add this, if it's not there already. A reference to the thread pool can be passed in, with some restrictions on what fraction of it can be used for the scan.
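A minimal sketch of that restriction, assuming a shared per-FS pool: a semaphore caps how many of the pool's threads one scan may hold (class and field names are illustrative).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

class BoundedScanSubmitter {
  private final ExecutorService sharedPool; // owned by the FileSystem
  private final Semaphore permits;          // this scan's share of the pool

  BoundedScanSubmitter(ExecutorService sharedPool, int maxScanPermits) {
    this.sharedPool = sharedPool;
    this.permits = new Semaphore(maxScanPermits);
  }

  void submit(Runnable task) throws InterruptedException {
    permits.acquire(); // block once the scan's share is in use
    sharedPool.execute(() -> {
      try {
        task.run();
      } finally {
        permits.release();
      }
    });
  }
}
```

Hadoop's org.apache.hadoop.util.SemaphoredDelegatingExecutor already implements this pattern and could be reused here.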
private final AtomicLong totalBytes = new AtomicLong(0L);
private final AtomicInteger numTasks = new AtomicInteger(0);
private final ListingSupport abfsStore;
private final ExecutorService executorService = new ThreadPoolExecutor(
I think the FS itself needs to create a thread pool from which it can hand a subset off to processors. That adds a single place to control size and, by allowing reuse across operations, reduces startup costs.
true, a thread pool per FS would be better. Will make the change in the official PR (apache#2549)
import static org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations.DEFAULT_AZURE_LIST_MAX_RESULTS;
import static org.apache.hadoop.test.LambdaTestUtils.intercept;

public class TestGetContentSummary extends AbstractAbfsIntegrationTest {
- should be ITest prefix
- will need to use a subdir to avoid problems on parallel runs
As the operation takes an interface purely for the list callbacks, it should be possible to simulate a large scan just by generating a listing, without needing a store. This would be useful, as it would run under yetus.
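A sketch of that simulation; ListingSource is a hypothetical stand-in for the real listing callback interface (ListingSupport in this patch), generating a synthetic tree with no store behind it:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

/** Hypothetical stand-in for the listing callback interface. */
interface ListingSource {
  FileStatus[] listStatus(Path path) throws IOException;
}

/** Synthetic tree: {@code width} subdirectories per directory down to
 *  {@code depth} levels, plus one single-byte file per directory. */
class SyntheticListingSource implements ListingSource {
  private final int width;
  private final int depth;

  SyntheticListingSource(int width, int depth) {
    this.width = width;
    this.depth = depth;
  }

  @Override
  public FileStatus[] listStatus(Path path) {
    int level = path.depth();
    boolean hasSubdirs = level < depth;
    FileStatus[] entries = new FileStatus[hasSubdirs ? width + 1 : 1];
    // one file per directory: length 1, not a directory
    entries[0] = new FileStatus(1, false, 1, 1, 0,
        new Path(path, "file-" + level));
    for (int i = 1; i < entries.length; i++) {
      entries[i] = new FileStatus(0, true, 1, 1, 0,
          new Path(path, "dir-" + i));
    }
    return entries;
  }
}
```

Driving the content summary processor with this source lets a unit test assert the expected file count and byte total from the width/depth arithmetic, without any network access.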
- will rename
- will use the path() method to create unique paths under the test dir
Simulating a scan for the yetus run: need to find out how it's done, will add.
AzureBlobFileSystem fs = getFileSystem();
fs.mkdirs(new Path("/testFolder"));
Path filePath = new Path("/testFolder/testFile");
fs.create(filePath);
This returns a stream which must be closed; the append() isn't needed unless that is what you want to test.
Agree, we need to write to the file, but append isn't required. Will use the stream obtained from create() and close it after the write.
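A sketch of that change, assuming the test-base path() helper mentioned above is used for a unique path under the test directory:

```java
// write through the stream returned by create(), closing it afterwards;
// no separate append() call is needed
Path filePath = path("testFolder/testFile");
byte[] data = new byte[1024];
try (FSDataOutputStream out = fs.create(filePath)) {
  out.write(data);
}
```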
for (Future<Void> task : tasks) {
  task.get();
}
FSDataOutputStream out = getFileSystem()
just use touch()
will do in official PR
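For reference, the helper suggested above: touch() in ContractTestUtils creates an empty file and closes the stream in one call (testFilePath is illustrative):

```java
import static org.apache.hadoop.fs.contract.ContractTestUtils.touch;

// replaces create() + close for an empty marker file
touch(getFileSystem(), testFilePath);
```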
    tracingContext);
} catch (InterruptedException e) {
  LOG.debug("Thread interrupted");
  throw new IOException(e);
InterruptedIOException
will modify
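A sketch of the agreed fix: restore the interrupt flag and raise an InterruptedIOException (the message text is illustrative):

```java
} catch (InterruptedException e) {
  LOG.debug("Thread interrupted", e);
  Thread.currentThread().interrupt();
  throw (IOException) new InterruptedIOException(
      "getContentSummary interrupted").initCause(e);
}
```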
  throw new IOException(e);
} catch (ExecutionException ex) {
  LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
  throw new IOException(ex);
prefer PathIOException with path included
will modify
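A sketch of the agreed fix, assuming the path being summarized is in scope as path:

```java
} catch (ExecutionException ex) {
  LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
  // PathIOException includes the failing path in the error text
  throw new PathIOException(path.toString(), ex.getCause());
}
```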
Hi @steveloughran, thank you for reviewing the PR. Unfortunately, this happens to be a draft PR I opened against the trunk of my forked repo (not apache/hadoop) and kept as a draft since it was not the official one. The content is up to date, since it's the same branch, but the official PR is at link. Sorry for not marking this as closed; I did not realize it would be readily visible for review. Moreover, the PR link in the JIRA has disappeared, probably because GitHub/Hadoop seem to have recently changed the PR title format.
Force-pushed fa72f2d to 16d9436.