[SPARK] Reuse time formatters instance in value serialization of TRowSet generation #5811

bowenliang123 · 2023-12-04T02:59:38Z

🔍 Description

Issue References 🔗

Subtask of #5808.

Describe Your Solution 🔧

Serializing values to Hive-style strings with HiveResult.toHiveString requires a TimeFormatters instance to handle the date/time data types. Reusing a pre-created TimeFormatters instance, instead of creating a new one for each value, dramatically reduces that overhead and speeds up TRowSet generation.

This may also reduce the memory footprint and the number of operations performed during TRowSet generation.

This also aligns with Spark's RowSetUtils, the RowSet generation implemented for SPARK-39041 by @yaooqinn, which explicitly declares a TimeFormatters instance.
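The idea can be sketched without Spark. This is a minimal illustration only: `Formatters`, `newFormatters`, `perCell` and `perRowSet` are hypothetical names, and `java.time` formatters stand in for Spark's `TimeFormatters`. The counter only makes the number of formatter creations visible.

```scala
import java.time.{Instant, LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

object FormatterReuseSketch {
  // Hypothetical stand-in for Spark's TimeFormatters; not the real class.
  final case class Formatters(date: DateTimeFormatter, timestamp: DateTimeFormatter)

  // Counts how many formatter pairs were built (illustration only).
  var created = 0

  def newFormatters(zone: ZoneId): Formatters = {
    created += 1
    Formatters(
      DateTimeFormatter.ISO_LOCAL_DATE,
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zone))
  }

  def render(value: Any, f: Formatters): String = value match {
    case d: LocalDate => f.date.format(d)
    case i: Instant => f.timestamp.format(i)
    case other => other.toString
  }

  // One formatter pair per cell: the pattern this PR moves away from.
  def perCell(cells: Seq[Any], zone: ZoneId): Seq[String] =
    cells.map(c => render(c, newFormatters(zone)))

  // One formatter pair per row set, shared by every cell.
  def perRowSet(cells: Seq[Any], zone: ZoneId): Seq[String] = {
    val f = newFormatters(zone)
    cells.map(c => render(c, f))
  }
}
```

Both functions produce the same strings; only the number of formatter instances created differs.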

Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

Checklists

📝 Author Self Checklist

- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

Be nice. Be informative.

codecov-commenter · 2023-12-04T04:22:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (58887dc) 61.29% compared to head (2270991) 61.27%.
Report is 1 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #5811      +/-   ##
============================================
- Coverage     61.29%   61.27%   -0.02%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         36070    36085      +15     
  Branches       4951     4951              
============================================
+ Hits          22108    22112       +4     
- Misses        11565    11571       +6     
- Partials       2397     2402       +5

wForget · 2023-12-04T05:05:30Z

...ls/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala

  def toHiveString(valueAndType: (Any, DataType), nested: Boolean = false): String = {
    // compatible w/ Spark 3.1 and above
-    val timeFormatters = HiveResult.getTimeFormatters
+    val timeFormatters: TimeFormatters = valueAndType match {


We may need to create one timeFormatters per dataset instead of one per cell.

Refer to the implementation in spark: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala#L37

Good idea.
Maybe this is what you are suggesting: allow callers to reuse an existing timeFormatters, and create a new instance if none is provided.

  def toHiveString(
      valueAndType: (Any, DataType),
      nested: Boolean = false,
      timeFormatters: TimeFormatters = HiveResult.getTimeFormatters): String = {
    HiveResult.toHiveString(valueAndType, nested, timeFormatters)
  }

The reused timeFormatters may need to be scoped to a thread rather than to a dataset, especially considering possible parallel execution at the column level or row level.

I think we can reuse it in the toTRowSet method

Sure. Adopted.
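A self-contained sketch of the pattern agreed on in this thread: an optional formatter parameter with a default, plus a caller that creates the formatter once and passes it to every cell. `getFormatter` and `toStrings` are hypothetical stand-ins for `HiveResult.getTimeFormatters` and a `toTRowSet`-like method, and a `java.time` formatter replaces Spark's `TimeFormatters`.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DefaultFormatterSketch {
  // Hypothetical stand-in for HiveResult.getTimeFormatters.
  def getFormatter: DateTimeFormatter = DateTimeFormatter.ISO_LOCAL_DATE

  // A Scala default argument is evaluated on every call that omits it,
  // so callers that want reuse must pass an instance explicitly.
  def toHiveString(value: Any, formatter: DateTimeFormatter = getFormatter): String =
    value match {
      case d: LocalDate => formatter.format(d)
      case other => other.toString
    }

  // A toTRowSet-like caller: create the formatter once, pass it to every cell.
  def toStrings(rows: Seq[Seq[Any]]): Seq[Seq[String]] = {
    val formatter = getFormatter
    rows.map(_.map(v => toHiveString(v, formatter)))
  }
}
```

Keeping the default preserves existing call sites, while hot loops can opt in to reuse.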

bowenliang123 · 2023-12-06T14:31:12Z

...ls/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala


-  def toHiveString(valueAndType: (Any, DataType), nested: Boolean = false): String = {
-    // compatible w/ Spark 3.1 and above
-    val timeFormatters = HiveResult.getTimeFormatters


PS: This is the key change of this PR: skipping the recreation of time formatters.

~~I think the reason that the time formatter is created each time is because the session time zone can be modified by querying set spark.sql.session.timeZone=<new time zone>~~

wForget · 2023-12-07T02:40:12Z

.../kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala

      timestampAsString: Boolean): DataFrame = {

    val quotedCol = (name: String) => col(quoteIfNeeded(name))
+    val tf = getTimeFormatters


I'm not sure whether HiveResult.getTimeFormatters is thread-safe and serializable. The convertTopLevelComplexTypeToHiveString method only handles complex types in arrow mode. @cfmcgrady, do you have any suggestions?

Shall we roll back to creating TimeFormatters inside convertTopLevelComplexTypeToHiveString, given that HiveResult.getTimeFormatters is probably not thread-safe?

> Shall we roll back to create TimeFormatters inside convertTopLevelComplexTypeToHiveString, if HiveResult.getTimeFormatters is probably not thread-safe ?

In that case, we could just make an adjustment like this:

RowSet.toHiveString((row, st), nested = true, timeFormatters = getTimeFormatters)

It may not be thread-safe, so we should try to limit the scope in which it is used.

FYI: https://github.com/apache/spark/blob/43ca0b929ab3c2f10d1879e5df622195564f8885/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L198

https://github.com/apache/spark/blob/43ca0b929ab3c2f10d1879e5df622195564f8885/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L532

> > Shall we roll back to create TimeFormatters inside convertTopLevelComplexTypeToHiveString, if HiveResult.getTimeFormatters is probably not thread-safe ?
>
> In that case, we could just do an adjustment like this
>
> RowSet.toHiveString((row, st), nested = true, timeFormatters = getTimeFormatters)

OK, let's put it in this style.
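To illustrate the "control its usage scope" point, here is a small sketch in which every call builds its own formatter, so no instance is shared between threads. `java.text.SimpleDateFormat` is used only because it is a well-known formatter that is unsafe to share across threads; it is an assumption for illustration, not what Spark or Kyuubi uses here, and all names are hypothetical.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object ScopedFormatterSketch {
  // Builds a fresh formatter; SimpleDateFormat instances must not be shared across threads.
  def newFormatter(): SimpleDateFormat = {
    val f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }

  // The formatter is created inside the call, so it never escapes to another thread,
  // yet it is still reused for every value of the batch.
  def formatAll(values: Seq[Date]): Seq[String] = {
    val formatter = newFormatter()
    values.map(v => formatter.format(v))
  }

  // Each thread formats its own batch with its own formatter instance.
  def formatInParallel(batches: Seq[Seq[Date]]): List[Seq[String]] = {
    val results = new Array[Seq[String]](batches.length)
    val threads = batches.zipWithIndex.map { case (batch, i) =>
      new Thread(() => { results(i) = formatAll(batch) })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    results.toList
  }
}
```

The trade-off is one formatter creation per call rather than per value, which keeps most of the benefit of reuse without sharing state across threads.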

pan3793 · 2023-12-07T03:22:35Z

...ls/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala

  def toColumnBasedSet(rows: Seq[Row], schema: StructType): TRowSet = {
    val rowSize = rows.length
    val tRowSet = new TRowSet(0, new java.util.ArrayList[TRow](rowSize))
+    val timeFormatters = getTimeFormatters


This is a good place, as the time formatter is session time zone aware.

  def getTimeFormatters: TimeFormatters = {
    val dateFormatter = DateFormatter()
    val timestampFormatter = TimestampFormatter.getFractionFormatter(
      DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
    TimeFormatters(dateFormatter, timestampFormatter)
  }

Yes, it's aware of the session config.
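Why the creation point matters can be shown with a small sketch: a formatter captures the time zone at the moment it is built, so one created at the start of each row set conversion picks up the session setting current at that time. `sessionTimeZone` is a hypothetical stand-in for `SQLConf.get.sessionLocalTimeZone`, and a `java.time` formatter replaces Spark's `TimestampFormatter`.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object SessionZoneSketch {
  // Hypothetical stand-in for SQLConf.get.sessionLocalTimeZone.
  @volatile var sessionTimeZone: String = "UTC"

  // Mirrors the shape of getTimeFormatters: the zone is read when the formatter is built.
  def getTimestampFormatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of(sessionTimeZone))

  // Builds one formatter before and one after a session time zone change.
  def demo(): (String, String) = {
    sessionTimeZone = "UTC"
    val before = getTimestampFormatter
    sessionTimeZone = "+08:00"
    val after = getTimestampFormatter
    // The earlier instance keeps UTC; only the later one sees the new zone.
    (before.format(Instant.EPOCH), after.format(Instant.EPOCH))
  }
}
```

An instance cached beyond a single conversion would keep the old zone, which is why creating it once per call, rather than once globally, is the safer scope.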

...uubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/schema/RowSetSuite.scala

bowenliang123 · 2023-12-07T09:26:57Z

The green block shows an approximately 5x to 30x improvement in serializing values of the decimal, date, timestamp, array, map, interval, localdate and instant types.

...ls/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala

pan3793

LGTM, only a minor comment

pan3793 · 2023-12-07T09:34:56Z

@bowenliang123 the benchmark result looks impressive. It would be better to group the results by data type.

wForget

LGTM, Thanks.

Just a little advice: we can add a comment to the convertTopLevelComplexTypeToHiveString method, which still has a similar performance issue.

cfmcgrady

LGTM. good catch!

bowenliang123 · 2023-12-07T10:30:52Z

> LGTM, Thanks.
>
> Just a little advice, we can add a comment for the convertTopLevelComplexTypeToHiveString method, which still has similar performance issue.

Thanks, a TODO comment has been added to the convertTopLevelComplexTypeToHiveString method.

bowenliang123 · 2023-12-07T10:31:26Z

> LGTM. good catch!

Thanks, and most of the credit for this PR goes to @wForget.

…ization of TRowSet generation

# 🔍 Description
## Issue References 🔗

Subtask of #5808.

## Describe Your Solution 🔧

Value serialization to Hive style string by `HiveResult.toHiveString` requires a `TimeFormatters` instance handling the date/time data types. Reusing the pre-created time formatters's instance, it dramatically reduces the overhead and improves the TRowset generation.

This may help to reduce memory footprints and fewer operations for TRowSet generation performance.

This is also aligned to the Spark's `RowSetUtils` in the implementation of RowSet generation for [SPARK-39041](https://issues.apache.org/jira/browse/SPARK-39041) by yaooqinn , with explicitly declared TimeFormatters instance.

## Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [x] Minimum number of approvals
- [x] No changes are requested

**Be nice. Be informative.**

Closes #5811 from bowenliang123/rowset-timeformatter.

Closes #5811

2270991 [Bowen Liang] Reuse time formatters instance in value serialization of TRowSet

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 4463cc8)
Signed-off-by: Cheng Pan <chengpan@apache.org>

pan3793 · 2023-12-07T12:22:45Z

Thanks, merged to master/1.8

…se notes

# 🔍 Description
## Issue References 🔗

Currently, we use a rather primitive way to manually write release notes from scratch, and some of the mechanical and repetitive work can be simplified by the scripts.

## Describe Your Solution 🔧

Adds a script to simplify the process of creating release notes.

Note: it just simplifies some processes, the release manager still needs to tune the outputs by hand.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

```
RELEASE_TAG=v1.8.1 PREVIOUS_RELEASE_TAG=v1.8.0 build/release/pre_gen_release_notes.py
```

```
$ head build/release/commits-v1.8.1.txt
[KYUUBI #5981] Deploy Spark Hive connector with Scala 2.13 to Maven Central
[KYUUBI #6058] Make Jetty server stop timeout configurable
[KYUUBI #5952][1.8] Disconnect connections without running operations after engine maxlife time graceful period
[KYUUBI #6048] Assign serviceNode and add volatile for variables
[KYUUBI #5991] Error on reading Atlas properties composed of multi values
[KYUUBI #6045] [REST] Sync the AdminRestApi with the AdminResource Apis
[KYUUBI #6047] [CI] Free up disk space
[KYUUBI #6036] JDBC driver conditional sets fetchSize on opening session
[KYUUBI #6028] Exited spark-submit process should not block batch submit queue
[KYUUBI #6018] Speed up GetTables operation for Spark session catalog
```

```
$ head build/release/contributors-v1.8.1.txt
* Shaoyun Chen -- [KYUUBI #5857][KYUUBI #5720][KYUUBI #5785][KYUUBI #5617]
* Chao Chen -- [KYUUBI #5750]
* Flyangz -- [KYUUBI #5832]
* Pengqi Li -- [KYUUBI #5713]
* Bowen Liang -- [KYUUBI #5730][KYUUBI #5802][KYUUBI #5767][KYUUBI #5831][KYUUBI #5801][KYUUBI #5754][KYUUBI #5626][KYUUBI #5811][KYUUBI #5853][KYUUBI #5765]
* Paul Lin -- [KYUUBI #5799][KYUUBI #5814]
* Senmiao Liu -- [KYUUBI #5969][KYUUBI #5244]
* Xiao Liu -- [KYUUBI #5962]
* Peiyue Liu -- [KYUUBI #5331]
* Junjie Ma -- [KYUUBI #5789]
```

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6074 from pan3793/release-script.

Closes #6074

3d5ec20 [Cheng Pan] credits
1765279 [Cheng Pan] Add a script to simplify the process of creating release notes

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>

github-actions bot added the module:spark label Dec 4, 2023

bowenliang123 mentioned this pull request Dec 4, 2023

[Umbrella] Improvements and evaluation for TRowSet generation of Spark Engine #5808

Open

12 tasks

wForget reviewed Dec 4, 2023

wForget mentioned this pull request Dec 4, 2023

[SPARK] Improvement for DecimalType serialization in toTColumn for column-based TRowSet generation #5810

Closed

18 tasks

bowenliang123 closed this Dec 5, 2023

bowenliang123 deleted the rowset-timeformatter branch December 6, 2023 00:35

bowenliang123 restored the rowset-timeformatter branch December 6, 2023 13:59

bowenliang123 reopened this Dec 6, 2023

bowenliang123 changed the title ~~[SPARK] Skip creating time formatter for time-unrelated data types in value serialization of RowSet generation~~ [SPARK] Skip recreating time formatters in value serialization of RowSet generation Dec 6, 2023

bowenliang123 changed the title ~~[SPARK] Skip recreating time formatters in value serialization of RowSet generation~~ [SPARK] Skip recreating time formatters in value serialization of TRowSet generation Dec 6, 2023

bowenliang123 requested review from pan3793 and wForget December 6, 2023 14:30

bowenliang123 commented Dec 6, 2023

bowenliang123 changed the title ~~[SPARK] Skip recreating time formatters in value serialization of TRowSet generation~~ [SPARK] Reusing time formatters instance in value serialization of TRowSet generation Dec 6, 2023

wForget reviewed Dec 7, 2023

pan3793 changed the title ~~[SPARK] Reusing time formatters instance in value serialization of TRowSet generation~~ [SPARK] Reuse time formatters instance in value serialization of TRowSet generation Dec 7, 2023

pan3793 reviewed Dec 7, 2023

wForget reviewed Dec 7, 2023

pan3793 reviewed Dec 7, 2023

pan3793 approved these changes Dec 7, 2023

bowenliang123 requested review from cfmcgrady and wForget December 7, 2023 09:53

wForget approved these changes Dec 7, 2023

cfmcgrady approved these changes Dec 7, 2023

bowenliang123 assigned bowenliang123 and wForget Dec 7, 2023

bowenliang123 added this to the v1.8.1 milestone Dec 7, 2023

Reuse time formatters instance in value serialization of TRowSet

2270991

bowenliang123 force-pushed the rowset-timeformatter branch from a5d187a to 2270991 Compare December 7, 2023 10:29

pan3793 closed this in 4463cc8 Dec 7, 2023

bowenliang123 deleted the rowset-timeformatter branch December 7, 2023 13:06
