HADOOP-17428. ABFS: Implementation for getContentSummary #4


Draft: sumangala-patki wants to merge 45 commits into trunk

Conversation

sumangala-patki (Collaborator)

No description provided.

@sumangala-patki changed the title from "ABFS: Implementation for getContentSummary" to "HADOOP-17428. ABFS: Implementation for getContentSummary" on Dec 15, 2020

@steveloughran left a comment


sorry, missed this completely.

It turns out that hive still uses this call when looking at unmanaged tables. This is a fundamental issue which hive should fix.

In the meantime, parallel scanning can help. Indeed, if the server could send back a summary, even better.

(One thing I'd like there is to see if hive can cope with just the number and size of files; this can help in the scan, depending on how it's done.)

For ABFS, parallel tree scan seems the best client-side approach.
For S3A we do something serialized but not parallelized.

If I could switch to only returning file count/size I could do deep scans.

In cloudstore I mix the two: an initial tree scan of a few levels, feeding into a thread pool, under which a deep list(path, recursive=true) scan is kicked off.
Performance is variable, as efficiency depends on the tree structure.
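A rough sketch of that hybrid shape (class and method names, the depth cutoff, and the aggregation are assumptions for illustration, not cloudstore's actual code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Sketch: shallow fan-out over the first levels, then one deep listing per subtree. */
class HybridTreeScan {
  private final FileSystem fs;
  private final ExecutorService pool;
  private final AtomicLong fileCount = new AtomicLong();
  private final AtomicLong totalBytes = new AtomicLong();

  HybridTreeScan(FileSystem fs, ExecutorService pool) {
    this.fs = fs;
    this.pool = pool;
  }

  void scan(Path root, int fanOutDepth) throws Exception {
    List<Future<Void>> tasks = new ArrayList<>();
    fanOut(root, fanOutDepth, tasks);
    for (Future<Void> t : tasks) {
      t.get();                                    // propagate worker failures
    }
  }

  /** Walk the first few levels serially, handing each deep subtree to the pool. */
  private void fanOut(Path dir, int depth, List<Future<Void>> tasks)
      throws IOException {
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isFile()) {
        fileCount.incrementAndGet();
        totalBytes.addAndGet(st.getLen());
      } else if (depth > 1) {
        fanOut(st.getPath(), depth - 1, tasks);   // keep fanning out
      } else {
        tasks.add(pool.submit(() -> deepList(st.getPath())));
      }
    }
  }

  /** One list(path, recursive=true) call covering a whole subtree. */
  private Void deepList(Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      LocatedFileStatus st = it.next();
      fileCount.incrementAndGet();
      totalBytes.addAndGet(st.getLen());
    }
    return null;
  }
}
```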

Returning to this patch...

Maybe it should go into hadoop-common under org.apache.hadoop.fs.impl for reuse. But reusability complicates the code. For that reason, I'm going to say: don't bother. Really, it's hive's job to fix their code.

Listing should be done through an incremental call. On a very wide directory, this will allow subdir scans to be kicked off before the full dir listing has finished.
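A minimal sketch of that incremental pattern, using the public FileSystem.listStatusIterator() call (the pool handling is simplified; see the caveat in the comments):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

final class IncrementalScan {
  /** Sum file lengths under dir, launching subdir scans as entries stream in. */
  static long scan(FileSystem fs, Path dir, ExecutorService pool)
      throws Exception {
    long bytes = 0;
    List<Future<Long>> subdirScans = new ArrayList<>();
    RemoteIterator<FileStatus> entries = fs.listStatusIterator(dir);
    while (entries.hasNext()) {
      FileStatus st = entries.next();
      if (st.isDirectory()) {
        // kicked off before the full listing of 'dir' has finished paging in
        subdirScans.add(pool.submit(() -> scan(fs, st.getPath(), pool)));
      } else {
        bytes += st.getLen();
      }
    }
    // caveat: blocking on child futures from pool threads can deadlock a small
    // fixed pool; a real implementation needs a smarter completion strategy
    for (Future<Long> f : subdirScans) {
      bytes += f.get();
    }
    return bytes;
  }
}
```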

Duration-tracking IOStatistics will help assess how often the call is used and what it costs. There is already [a statistic name](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/StoreStatisticNames.java#L67) for this.
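A sketch of how that wiring could look; getDurationTrackerFactory() and contentSummaryProcessor are hypothetical names for whatever the filesystem actually exposes:

```java
import java.io.IOException;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.StoreStatisticNames;

import static org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration;

// inside the filesystem class; tracks invocation count and duration under
// the existing op_get_content_summary statistic name
public ContentSummary getContentSummary(Path path) throws IOException {
  return trackDuration(getDurationTrackerFactory(),          // hypothetical accessor
      StoreStatisticNames.OP_GET_CONTENT_SUMMARY,
      () -> contentSummaryProcessor.getContentSummary(path)); // hypothetical field
}
```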

The ABFS store needs its own thread pool. I think the ongoing work on stream read/write parallelisation optimisation can add this, if it's not there already. A reference to the thread pool can be passed in, with some restrictions on what fraction of it can be used for the scan.

private final AtomicLong totalBytes = new AtomicLong(0L);
private final AtomicInteger numTasks = new AtomicInteger(0);
private final ListingSupport abfsStore;
private final ExecutorService executorService = new ThreadPoolExecutor(

@steveloughran:

I think the FS itself needs to create a thread pool from which it can hand a subset off to operations. That adds a single place to control size, and by allowing reuse across operations, reduces startup costs.
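A minimal sketch of that pattern (the class and method names are invented for illustration; hadoop-common's SemaphoredDelegatingExecutor already implements the real thing):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

/** Hands out a bounded slice of a shared, filesystem-owned pool. */
class BoundedPoolSlice {
  private final ExecutorService sharedPool; // created once per FS instance
  private final Semaphore permits;          // caps this operation's share

  BoundedPoolSlice(ExecutorService sharedPool, int maxConcurrent) {
    this.sharedPool = sharedPool;
    this.permits = new Semaphore(maxConcurrent);
  }

  <T> Future<T> submit(Callable<T> task) throws InterruptedException {
    permits.acquire();                      // block until this slice has capacity
    return sharedPool.submit(() -> {
      try {
        return task.call();
      } finally {
        permits.release();                  // free the slot for the next task
      }
    });
  }
}
```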

@sumangala-patki (Collaborator, Author):

true, a thread pool per FS would be better. Will make the change in the official PR (apache#2549)

import static org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations.DEFAULT_AZURE_LIST_MAX_RESULTS;
import static org.apache.hadoop.test.LambdaTestUtils.intercept;

public class TestGetContentSummary extends AbstractAbfsIntegrationTest {

@steveloughran:

  1. Should use the ITest prefix.
  2. Will need to use a subdir to avoid problems on parallel runs.

As the operation takes an interface purely for the list callbacks, it should be possible to simulate a large scan just by generating a listing, without needing a store (see the sketch after this list). That would be useful, as it would run under yetus.
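A sketch of such a synthetic listing source. The real patch would implement whatever listing-callback interface getContentSummary takes; the shape below (a RemoteIterator of FileStatus) is an assumption:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Fabricates a directory listing of 'count' plain files, no store involved. */
class FakeListing implements RemoteIterator<FileStatus> {
  private final Path dir;
  private final int count;
  private final long fileLen;
  private int emitted;

  FakeListing(Path dir, int count, long fileLen) {
    this.dir = dir;
    this.count = count;
    this.fileLen = fileLen;
  }

  @Override
  public boolean hasNext() {
    return emitted < count;
  }

  @Override
  public FileStatus next() throws IOException {
    // synthesize a file entry: length, not-a-dir, replication, block size,
    // modification time, path
    return new FileStatus(fileLen, false, 1, 1024, 0,
        new Path(dir, "file-" + emitted++));
  }
}
```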

@sumangala-patki (Collaborator, Author):

  1. Will rename.
  2. Will use the path() method to create unique paths under the test dir.

Simulating a scan for the yetus run: need to find out how it's done; will add.

AzureBlobFileSystem fs = getFileSystem();
fs.mkdirs(new Path("/testFolder"));
Path filePath = new Path("/testFolder/testFile");
fs.create(filePath);

@steveloughran:

this returns a stream which must be closed; the append() isn't needed unless that is what you want to test

@sumangala-patki (Collaborator, Author):

agree, we need to write to the file but append isn't required. Will use the stream obtained from create() and close it after the write
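A minimal sketch of that fix, assuming it sits in the ITest class so path() comes from the test base; the payload size is arbitrary:

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

@Test
public void testFileSize() throws Exception {
  byte[] data = new byte[1024];                 // illustrative payload
  Path filePath = path("testFile");             // unique path under the test dir
  // write through the stream create() returns; try-with-resources closes it
  try (FSDataOutputStream out = getFileSystem().create(filePath)) {
    out.write(data);
  }
}
```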

for (Future<Void> task : tasks) {
task.get();
}
FSDataOutputStream out = getFileSystem()

@steveloughran:

just use touch()

@sumangala-patki (Collaborator, Author):

will do in official PR
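For reference, the contract test utilities provide this in one call (getFileSystem() and path() as in the test base above):

```java
import static org.apache.hadoop.fs.contract.ContractTestUtils.touch;

// creates an empty file and closes the stream in one call
touch(getFileSystem(), path("testFile"));
```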

tracingContext);
} catch (InterruptedException e) {
LOG.debug("Thread interrupted");
throw new IOException(e);

@steveloughran:

InterruptedIOException

@sumangala-patki (Collaborator, Author):

will modify
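A sketch of the suggested change (the wrapper method is illustrative; restoring the interrupt flag is standard practice, though not explicitly requested in the review):

```java
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class TaskAwait {
  /** Convert an interrupt into InterruptedIOException, keeping the flag set. */
  static <T> T await(Future<T> task)
      throws InterruptedIOException, ExecutionException {
    try {
      return task.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();       // restore the interrupt status
      // InterruptedIOException has no cause-taking constructor; use initCause
      throw (InterruptedIOException)
          new InterruptedIOException("getContentSummary interrupted").initCause(e);
    }
  }
}
```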

throw new IOException(e);
} catch(ExecutionException ex) {
LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
throw new IOException(ex);

@steveloughran:

prefer PathIOException with path included

@sumangala-patki (Collaborator, Author):

will modify
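A sketch of that suggestion (the surrounding method is illustrative; 'path' stands for whatever path the summary was requested on):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathIOException;

final class TaskJoin {
  /** Report the path being summarized when any worker task fails. */
  static void awaitAll(Path path, Iterable<Future<Void>> tasks)
      throws PathIOException, InterruptedException {
    for (Future<Void> task : tasks) {
      try {
        task.get();
      } catch (ExecutionException ex) {
        // PathIOException carries the failing path in its message
        throw new PathIOException(path.toString(), ex.getCause());
      }
    }
  }
}
```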

@sumangala-patki (Collaborator, Author):

Hi @steveloughran, thank you for reviewing the PR. Unfortunately, this happens to be a draft PR I had opened against the trunk of my forked repo (not apache hadoop) and kept as Draft since it was not the official one. Though the content stays updated, since it's the same branch, the official PR is apache#2549. Sorry for not marking this as closed; I did not realize it would be readily visible for review. Moreover, the PR link in the JIRA has disappeared, probably because GitHub/Hadoop seem to have recently changed the PR title format.
I will address your comments above; please feel free to add any further comments on the official PR.
