
HADOOP-17833. Improve Magic Committer performance #3289

Conversation

steveloughran
Contributor

@steveloughran steveloughran commented Aug 10, 2021

Speeding up the magic committer, with the key changes being:

  • All writes under __magic trigger marker retention
    (no DELETEs after file/dir creation).
  • create(path, overwrite) skips all overwrite checks, including
    the LIST call intended to stop files being created over directories.
  • A thread pool is used for more parallelism in task commit.
  • open() of .pending/.pendingset files skips the HEAD call when the open goes straight from a directory listing; this saves one HEAD per task in job commit (see the sketch below).
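As a rough sketch of that last point (this is not the committer's own code, and the bucket/paths are made up), passing the FileStatus from a directory listing into the openFile() builder lets S3A skip its own HEAD probe before reading a .pendingset file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.functional.FutureIO;

public class OpenPendingSetFromListing {
  public static void main(String[] args) throws Exception {
    // hypothetical job attempt directory under __magic
    Path taskDir = new Path("s3a://example-bucket/output/__magic/job-0001/tasks");
    FileSystem fs = taskDir.getFileSystem(new Configuration());
    for (FileStatus st : fs.listStatus(taskDir)) {
      if (st.getPath().getName().endsWith(".pendingset")) {
        // the listing already supplied the length (and etag) in the FileStatus,
        // so the open does not need its own HEAD before the GET
        try (FSDataInputStream in = FutureIO.awaitFuture(
            fs.openFile(st.getPath())
              .withFileStatus(st)
              .build())) {
          // ... deserialize the pendingset JSON from "in"
        }
      }
    }
  }
}
```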

This is still a WIP as it needs:

  • cost tests to verify the optimisations are active
  • testing through Spark

There are lots of changes in the tests because the committer has added
a CommitContext class which manages the lifecycle of
the thread pool and a set of thread-local JSON serializers;
this is now what is passed around in the internal committer methods,
which breaks tests that call into them.

It is a better design (one we should have done from the start);
the manifest committer is even better, as all of its operation "stages"
are modular. It just means that a lot of tests stopped compiling,
and, as usual, the mock tests played up.

Removed the injection/handling of inconsistent S3
from the committer tests: it is not needed and simply
complicated the code.

  1. Incremental directory listings, with processing as the pages of results come in.
  2. Maximised parallelism when parsing and processing files.
  3. Per-thread JSON serializers; no sharing or repeated creation.
  4. When creating files under __magic paths, all safety checks (overwriting files, overwriting dirs) are skipped. This benefits Parquet even more than most formats, as it seems to save files with overwrite=false.
  5. We also skip deleting marker paths, even if the FS isn't configured to do this, knowing that all these markers will be purged in job commit. This behaviour is also available through the createFile() API (see the sketch after this list), which is used when saving the intermediate manifests and the _SUCCESS file.
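A minimal sketch of that createFile() usage, assuming the fs.s3a.create.performance option added in this PR and a made-up destination path; .opt() means filesystems which don't understand the key simply ignore it:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FastSuccessFileWrite {
  public static void main(String[] args) throws Exception {
    // hypothetical destination for the job's _SUCCESS summary
    Path success = new Path("s3a://example-bucket/output/_SUCCESS");
    FileSystem fs = success.getFileSystem(new Configuration());
    try (FSDataOutputStream out = fs.createFile(success)
        .overwrite(true)
        .recursive()
        .opt("fs.s3a.create.performance", "true")  // skip the LIST/HEAD overwrite checks
        .build()) {
      out.write("{}".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```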

The write optimisations should provide significant benefits when writing files: they remove at least one LIST per creation, for Parquet a HEAD as well, and O(depth) DELETE calls which just generate write load and risk of throttling. There is still more IO taking place for each magic file than for writing a small simple file (it is always a multipart upload and we have to add a marker file too), and Spark also has to add an extra getXAttrs call for each file when building its intermediate stats.

Also of note

  • The committer can write the summary _SUCCESS file to the path fs.s3a.committer.summary.report.directory, which can be in a different file system/bucket if desired, with the job ID as the filename. This can be used to collect statistics for all jobs, even when successive jobs write to the same directory tree (see the configuration sketch after this list). This is the same as the manifest committer; that committer just collects more stats on its operations.
  • There's reuse of the Hadoop common code and statistic names from the ManifestCommitter.
  • The hadoop-aws Maven build blocks all imports of MapReduce code in the production source except in the s3a.committer source tree, and even there we are selective and exclude all classes we know get referenced elsewhere.
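A hedged sketch of enabling that report directory (the property key is the one named above; the bucket name is made up):

```java
import org.apache.hadoop.conf.Configuration;

public class SummaryReportConfig {
  public static Configuration withSummaryReports() {
    Configuration conf = new Configuration();
    // each job then writes a copy of its _SUCCESS summary into this directory,
    // using the job ID as the filename, independent of the job's own output path
    conf.set("fs.s3a.committer.summary.report.directory",
        "s3a://example-reports-bucket/committer-summaries");
    return conf;
  }
}
```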

@steveloughran steveloughran added the fs/s3 changes related to hadoop-aws; submitter must declare test endpoint label Aug 10, 2021
@steveloughran steveloughran marked this pull request as draft August 10, 2021 18:37
@steveloughran
Copy link
Contributor Author

TESTING IN PROGRESS

@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from 208929e to a2166e1 Compare August 11, 2021 11:25
@apache apache deleted a comment from hadoop-yetus Aug 19, 2021
@steveloughran
Copy link
Contributor Author

Plan: lift some of the statistic names from the manifest committer and do the same reporting as in the manifest committer; will also include list costs in the results. (Side issue: thinking about whether the JSON deserializer could build stats on reading costs, which could then be collected too, to measure the cost of ser/deser and, by collecting stream read/write costs, those steps.)

@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from d6a0dcf to f850021 Compare August 23, 2021 12:20
@apache apache deleted a comment from hadoop-yetus Aug 23, 2021
@apache apache deleted a comment from hadoop-yetus Aug 24, 2021
@apache apache deleted a comment from hadoop-yetus Aug 24, 2021
@apache apache deleted a comment from hadoop-yetus Aug 24, 2021
@apache apache deleted a comment from hadoop-yetus Aug 25, 2021
@steveloughran steveloughran marked this pull request as ready for review August 28, 2021 13:45
/**
* Create a stub commit context for tests.
* There's no job context and the thread pool is
* not set up.
Contributor Author

nit: it is if a count is passed in.

@@ -86,6 +86,6 @@ log4j.logger.org.apache.hadoop.fs.s3a.S3AStorageStatistics=INFO

# Auditing operations in all detail
# Log before a request is made to S3
#log4j.logger.org.apache.hadoop.fs.s3a.audit=DEBUG
log4j.logger.org.apache.hadoop.fs.s3a.audit=DEBUG
Contributor Author

think I'll pull this out. Handy for me, but still a bit noisy

@steveloughran
Contributor Author

This is at a point where it's ready for some review and for whatever benchmarking people can do. I've cut out a lot of HTTP IO per file create and load.

@mukund-thakur @mehakmeet @dongjoon-hyun @bogthe

The only other big thing to consider is whether we could route the parallel POST calls in job commit through a fork/join thread pool, and whether that would deliver better throughput because there is no need to yield to the OS scheduler to pick up the next bit of work.
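Purely as an illustration of that idea (this is not committer code; the pool size, list, and completeUpload() stand-in are all made up), running the per-upload POSTs on a work-stealing fork/join pool could look like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class ForkJoinCommitSketch {
  // stand-in for the "complete multipart upload" POST issued for each pending commit
  static void completeUpload(String uploadId) {
    System.out.println("completing " + uploadId);
  }

  public static void main(String[] args) throws Exception {
    List<String> pending = Arrays.asList("upload-1", "upload-2", "upload-3");
    ForkJoinPool pool = new ForkJoinPool(8);  // work-stealing pool sized for the commit phase
    // running the parallel stream inside pool.submit() makes the stream use this pool
    pool.submit(() ->
        pending.parallelStream().forEach(ForkJoinCommitSketch::completeUpload))
        .join();
    pool.shutdown();
  }
}
```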

I am happy to do a live shared-screen review of this PR next week, if people want to discuss things that way.

@dongjoon-hyun
Member

Thank you, @steveloughran !

@apache apache deleted a comment from hadoop-yetus Aug 31, 2021
@steveloughran
Contributor Author

steveloughran commented Sep 7, 2021

hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/auth/delegation/ITestSessionDelegationTokens.java:207:6:[deprecation] <T>assertThat(String,T,Matcher<? super T>) in Assert has been deprecated
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/auth/delegation/ITestSessionDelegationInFileystem.java:260:44:[unchecked] unchecked cast
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3ATemporaryCredentials.java:204:6:[deprecation] <T>assertThat(String,T,Matcher<? super T>) in Assert has been deprecated

./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/CommitContext.java:385:    public PoolSubmitter(ExecutorService executor) {:5: Redundant 'public' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/ITestCommitOperationCost.java:212:          );: 'method call rparen' has incorrect indentation level 10, expected level should be 6. [Indentation]

@apache apache deleted a comment from hadoop-yetus Sep 10, 2021
@steveloughran
Contributor Author

Data has 42 rows clustered true for 20000000
Generating table call_center in database to s3a://perf-team-west1-bucket/perf-team-data/tpcds/magic2/sf1000-parquet/useDecimal=true,useDate=true,filterNull=false/call_center with save mode Overwrite.
java.lang.NullPointerException
  at org.apache.hadoop.fs.s3a.commit.CommitContext.<init>(CommitContext.java:128)
  at org.apache.hadoop.fs.s3a.commit.CommitOperations.createCommitContext(CommitOperations.java:658)
  at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitter.initiateJobOperation(AbstractS3ACommitter.java:796)
  at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitter.abortJob(AbstractS3ACommitter.java:840)
  at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:224)
  at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.abortJob(PathOutputCommitProtocol.scala:206)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:169)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:141)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:165)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from 9d37211 to 30748c9 Compare April 7, 2022 20:41
@apache apache deleted a comment from hadoop-yetus Apr 8, 2022
@steveloughran
Contributor Author

steveloughran commented Apr 13, 2022

./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/CommitContext.java:348:  private class PoolSubmitter implements TaskPool.Submitter, Closeable {: Class PoolSubmitter should be declared as final. [FinalClass]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/files/PersistentCommitData.java:105:    return serializer.load(fs, path,status);:36: ',' is not followed by whitespace. [WhitespaceAfter]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/CreateFileBuilder.java:22:import java.util.Collections;:8: Unused import - java.util.Collections. [UnusedImports]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/MkdirOperation.java:190:    void createFakeDirectory(final Path dir) throws IOException;:30: Redundant 'final' modifier. [RedundantModifier]
./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/WriteOperationHelper.java:326:   * {@link S3AFileSystem#finishedWrite(String, long, String, String, org.apache.hadoop.fs.s3a.impl.PutObjectOptions)}: Line is longer than 100 characters (found 118). [LineLength]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/ITestCommitOperationCost.java:256:          commitOperations.commitOrFail(singleCommit);: 'block' child has incorrect indentation level 10, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/ITestCommitOperationCost.java:257:          IOStatistics st = commitOperations.getIOStatistics();: 'block' child has incorrect indentation level 10, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/ITestCommitOperationCost.java:258:          return ioStatisticsToPrettyString(st);: 'block' child has incorrect indentation level 10, expected level should be 6. [Indentation]
./hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/ITestS3ADeleteCost.java:284:        );: 'method call rparen' has incorrect indentation level 8, expected level should be 4. [Indentation]

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java:1442: warning: no @throws for java.io.IOException
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/MagicCommitIntegration.java:94: warning: no @param for trackerStatistics
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/files/PersistentCommitData.java:121: warning: no @param for path
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicCommitTracker.java:80: warning: no @param for trackerStatistics


Code	Warning
IS	Inconsistent synchronization of org.apache.hadoop.fs.s3a.commit.CommitContext.outerSubmitter; locked 60% of time
[Bug type IS2_INCONSISTENT_SYNC (click for details)](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3289/11/artifact/out/new-spotbugs-hadoop-tools_hadoop-aws.html#IS2_INCONSISTENT_SYNC)
In class org.apache.hadoop.fs.s3a.commit.CommitContext
Field org.apache.hadoop.fs.s3a.commit.CommitContext.outerSubmitter
Synchronized 60% of the time
Unsynchronized access at CommitContext.java:[line 291]
Unsynchronized access at CommitContext.java:[line 170]
Synchronized access at CommitContext.java:[line 332]
Synchronized access at CommitContext.java:[line 330]
Synchronized access at CommitContext.java:[line 332]
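For context, a generic illustration (not the actual CommitContext code) of how this class of spotbugs warning is usually resolved: every access to the lazily created field happens under the same lock.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustration only: a lazily created executor consistently guarded by "this". */
public class SubmitterHolder implements Closeable {
  private ExecutorService outerSubmitter;  // guarded by "this"

  public synchronized ExecutorService getOuterSubmitter() {
    if (outerSubmitter == null) {
      outerSubmitter = Executors.newFixedThreadPool(4);
    }
    return outerSubmitter;
  }

  @Override
  public synchronized void close() {
    if (outerSubmitter != null) {
      outerSubmitter.shutdown();
      outerSubmitter = null;
    }
  }
}
```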

@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from dd5ef82 to 87ae7e5 Compare April 27, 2022 17:25
@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from 87ae7e5 to 3b13315 Compare May 5, 2022 10:27
@apache apache deleted a comment from hadoop-yetus May 6, 2022
Change-Id: I95412b3fe13a54389521423a4a8f5a9d6e1209da
Added enforcer rules to restrict use of mapred imports
in production code to selected packages and classes
under oah.fs.s3a.commit

Replaced an import in CommitterConstants with the string.

Moved some private committer implementation classes into
a new package oah.fs.s3a.commit.impl, a package which
is allowed to use MR classes.

This locks down use of mapred code and reduces the risk that
there may be a transitive dependency on the libraries in the
filesystem class itself.

Change-Id: Idc0120d3903b9d7e88267384a5380884f1b4e374
Change-Id: Ibda8ca966b6973c4c5a398bf1691325eff7b4388
As part of this change, reviewing HADOOP-17584.

The problem here is that magic committer MR recovery would use the same job
ID for the path, so if a task attempt ta0 from job attempt 1 were to
commit, it would scan its directory tree and find any files created by
attempt 0.

- "Job path" is the path with job id/uuid only.
- Job attempt path is used for a subdirectory
- magic committer works with a job attempt path, so is unique on a second
  attempt
- job abort/cleanup cleans up the job path, not just the single event
- and for magic committer, everything in _temporary

Before anyone complains that the magic committer will be deleting more stuff
in job cleanup, it was already stopping uploads and deleting the __magic
dir.

Change-Id: I69e65059517c1e4ca087b58606d381c79ada7505
Wrapping up createFile() with the ability to set headers on the file

builder.must("fs.s3a.create.header.my-header","my-value");

This will set a user metadata entry "my-header", which the xattr API will return
as "header.my-header".

This forced me to make PutOptions something passed around more rigorously and
used as the parameter class for innerCreateFile. This is ultimately good, as it
lets us set new options later (encryption, storage class, etc.) if we so choose.

Also: S3A path capabilities let you probe for this prefix, as well as the
performance one; I think this should be the standard behaviour for custom options.

Change-Id: Icb5a1e923a3bc50f42e825d34b16da0cf622c289
Change-Id: If4dbd151c4099618f5e1a26f0229d019893d367e
Change-Id: I8a3272b4653fedd023d2490a77ca91d8baef1129
Change-Id: Ie944c2fc1cb7f1634076a7df37148b51d7f5a741
rebase onto trunk with  HADOOP-12020/storage class; use PutObjectOptions
as the place to pass the option from createFile() to the requests.

The feature itself is not implemented, just prepared for.

Change-Id: I5d2e57e672f49021c3cb8bfa3b1a3daf09c61a38
@steveloughran steveloughran force-pushed the s3/HADOOP-17833-magic-committer-performance branch from 73aaec9 to 69eedd9 Compare June 8, 2022 19:00
Change-Id: Ia2fe020bb3672fefd42f1ceec8af0337c4f00b73
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 47s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 3s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 40 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 37s Maven dependency ordering for branch
+1 💚 mvninstall 24m 41s trunk passed
+1 💚 compile 23m 8s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 20m 36s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 32s trunk passed
+1 💚 mvnsite 5m 14s trunk passed
+1 💚 javadoc 4m 15s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 3m 55s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 7m 3s trunk passed
+1 💚 shadedclient 22m 7s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 22m 39s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 32s Maven dependency ordering for patch
+1 💚 mvninstall 2m 30s the patch passed
+1 💚 compile 22m 15s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 22m 15s root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 generated 0 new + 2889 unchanged - 3 fixed = 2889 total (was 2892)
+1 💚 compile 20m 32s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 20m 32s root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 generated 0 new + 2687 unchanged - 3 fixed = 2687 total (was 2690)
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 6s /results-checkstyle-root.txt root: The patch generated 1 new + 30 unchanged - 6 fixed = 31 total (was 36)
+1 💚 mvnsite 5m 14s the patch passed
+1 💚 javadoc 4m 3s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 3m 53s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 7m 32s the patch passed
+1 💚 shadedclient 22m 9s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 26s hadoop-common in the patch passed.
+1 💚 unit 6m 45s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 3m 12s hadoop-aws in the patch passed.
+1 💚 asflicense 1m 37s The patch does not generate ASF License warnings.
260m 8s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3289/32/artifact/out/Dockerfile
GITHUB PR #3289
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint xmllint
uname Linux 9d1cb87b1ef4 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c44f869
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3289/32/testReport/
Max. process+thread count 2008 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3289/32/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jun 10, 2022
@apache apache deleted a comment from hadoop-yetus Jun 10, 2022
@steveloughran
Contributor Author

testing: s3 london

@apache apache deleted a comment from hadoop-yetus Jun 10, 2022
Contributor

@mukund-thakur mukund-thakur left a comment

Looks good to me. Ran the ITs as well; no failures.

conscious decision to choose speed over safety and
that the outcome was their own fault.

Accordingly: *Use if and only if you are confident that the conditions are met.*
Contributor

Initially I was worried about inconsistencies leading to escalations but by the end I think we are clear enough. Nice doc.

Contributor Author

we aren't actually any more vulnerable than when someone creates a file under a file, which they can do today.

@@ -592,13 +634,41 @@ public void jobCompleted(boolean success) {
}

/**
* Begin the final commit.
* Crate a commit context for a job or task.
Contributor

nit: create

Change-Id: Iade3656fdf8c57284e7360c317e629b1cac8b109
@mukund-thakur
Contributor

LGTM +1
Thanks @steveloughran; there are 3 pending checkstyle issues though.

@steveloughran
Contributor Author

Thanks.
The checkstyles are fixed locally; only two are real checkstyle issues, the other is a line length in the javadocs.

Change-Id: I2dedac520cd5d22317f4fc7170ca25786c647004
Change-Id: I359af3578fc6dd7d669a0762a8869e4cd3c1ea7c
@steveloughran steveloughran merged commit e199da3 into apache:trunk Jun 17, 2022
steveloughran added a commit to steveloughran/hadoop that referenced this pull request Jun 20, 2022
Speed up the magic committer, with the key changes being:

* Writes under __magic always retain directory markers.

* File creation under __magic skips all overwrite checks,
  including the LIST call intended to stop files being
  created over dirs.
* mkdirs under __magic probes the path for existence
  but does not look any further.

Extra parallelism in task and job commit directory scanning.
Use of createFile and openFile with parameters which allow
HEAD checks to be skipped.

The committer can write the summary _SUCCESS file to the path
`fs.s3a.committer.summary.report.directory`, which can be in a
different file system/bucket if desired, using the job id as
the filename.

Also: HADOOP-15460. S3A FS to add `fs.s3a.create.performance`

Application code can set the createFile() option
fs.s3a.create.performance to true to disable the same
safety checks when writing under magic directories.
Use with care.

The createFile option prefix `fs.s3a.create.header.`
can be used to add custom headers to S3 objects when
created.

Contributed by Steve Loughran.

Change-Id: I9e086423f02eb25b6e70fc1c12a13e0a5afe9cb9
@apache apache deleted a comment from hadoop-yetus Jun 20, 2022
@apache apache deleted a comment from hadoop-yetus Jun 20, 2022
@apache apache deleted a comment from hadoop-yetus Jun 20, 2022
steveloughran added a commit that referenced this pull request Jun 21, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022