Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KYUUBI #6691] A new Spark SQL command to merge small files #6695

Closed
wants to merge 35 commits into from

Conversation

gabrywu
Copy link
Member

@gabrywu gabrywu commented Sep 13, 2024

🔍 Description

Issue References 🔗

This pull request closing #6691

Describe Your Solution 🔧

There are many cases in which a SQL generate small files, we MUST merge them into bigger ones.
I create a new Spark SQL command to merge small files, which doesn't read-write all of the records of a table, it just merges files in a binary level. Take a CSV table for example, it only appends the byte array from one file to another one, without reading & writing records

Syntax here

compact table table_name [INTO ${targetFileSize} ${targetFileSizeUnit} ] [ cleanup | retain | list ]
-- targetFileSizeUnit can be 'm','mb'
-- cleanup means cleaning compact staging folders, which contains original small files, default behavior
-- retain means retaining compact staging folders, for testing, and we can recover with the staging data
-- list means this command only get the merging result, and don't run actually
recover compact table table_name
-- recover a table if compact table command fails

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests


Checklist 📝

Be nice. Be informative.

@pan3793
Copy link
Member

pan3793 commented Sep 13, 2024

cc @ulysses-you

@gabrywu gabrywu changed the title Compact table A new Spark SQL command to merge small files Sep 13, 2024
@gabrywu gabrywu requested a review from cxzl25 September 15, 2024 06:11
@codecov-commenter
Copy link

codecov-commenter commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 582 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (353877b) to head (fd39f66).
Report is 26 commits behind head on master.

Files with missing lines Patch % Lines
...ache/kyuubi/sql/compact/SmallFileCollectExec.scala 0.00% 75 Missing ⚠️
.../kyuubi/sql/compact/merge/AbstractFileMerger.scala 0.00% 61 Missing ⚠️
...yuubi/sql/compact/RecoverCompactTableCommand.scala 0.00% 46 Missing ⚠️
...ache/kyuubi/sql/compact/CompactTableResolver.scala 0.00% 43 Missing ⚠️
...a/org/apache/kyuubi/sql/compact/CompactTable.scala 0.00% 38 Missing ⚠️
...e/kyuubi/sql/compact/merge/ParquetFileMerger.scala 0.00% 36 Missing ⚠️
...apache/kyuubi/sql/compact/SmallFileMergeExec.scala 0.00% 34 Missing ⚠️
...uubi/sql/compact/CachePerformanceViewCommand.scala 0.00% 33 Missing ⚠️
.../apache/kyuubi/sql/compact/CompactTableUtils.scala 0.00% 33 Missing ⚠️
...a/org/apache/spark/sql/SparkInternalExplorer.scala 0.00% 26 Missing ⚠️
... and 12 more
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #6695    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files         682     706    +24     
  Lines       42192   42861   +669     
  Branches     5755    5851    +96     
=======================================
- Misses      42192   42861   +669     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gabrywu
Copy link
Member Author

gabrywu commented Sep 15, 2024

@cxzl25 @ulysses-you all tests pass

@github-actions github-actions bot added the kind:documentation Documentation is a feature! label Sep 16, 2024
@ulysses-you
Copy link
Contributor

thank you @gabrywu , about the syntax, is there any reason to introduce compact table ? Why not use procedure like

Call kyuubi.compact(table => 'xx', targetSizeInBytes => 'xxx', mode => 'xxx')

@gabrywu
Copy link
Member Author

gabrywu commented Sep 18, 2024

thank you @gabrywu , about the syntax, is there any reason to introduce compact table ? Why not use procedure like

Call kyuubi.compact(table => 'xx', targetSizeInBytes => 'xxx', mode => 'xxx')

The answer is same to the question, why use procedure?
I created this command based on spark2.x when the call procedure was not the mainstream proposal.
And by the way, a call procedure is becoming a mainstream proposal?

@gabrywu gabrywu requested a review from cxzl25 September 18, 2024 10:25
@gabrywu
Copy link
Member Author

gabrywu commented Sep 18, 2024

@cxzl25 what do you think of the call kyuubi.compact procedure?

@gabrywu
Copy link
Member Author

gabrywu commented Sep 19, 2024

@AngersZhuuuu can you help to review this PR?

@pan3793
Copy link
Member

pan3793 commented Sep 19, 2024

For the syntax part, given Delta and Iceberg's dominance in the lakehouse market, I suggest following either Delta's VACCUM or Iceberg's CALL syntax.

Additional information:

  • Kyuubi Spark extension already adopted Delta ZORDER syntax
  • Spark 4.0 adopting the Iceberg CALL syntax, see SPARK-48781

@gabrywu
Copy link
Member Author

gabrywu commented Sep 19, 2024

we'd better talk about the syntax and make a final decision in the dev emails dev@kyuubi.apache.org, otherwise, the upcoming PR will still not use CALL procedure
@pan3793 , @ulysses-you , @cxzl25

@gabrywu
Copy link
Member Author

gabrywu commented Sep 19, 2024

An email thread to decide which one should be used, command or call procedure
[VOTE][DISCUSS] A Spark SQL command or Call procedure

@turboFei
Copy link
Member

do you support to compact one partition for partitioned table?

@gabrywu
Copy link
Member Author

gabrywu commented Sep 25, 2024

do you support to compact one partition for partitioned table?

hi, @turboFei I removed this feature from this PR. only support partition table internally.

gabrywu added a commit to gabrywu/kyuubi that referenced this pull request Sep 25, 2024
@gabrywu
Copy link
Member Author

gabrywu commented Sep 26, 2024

Close this PR and will create a new one if apache/spark/pull/47190 is released in next Spark version v4.0.0

@gabrywu gabrywu closed this Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants