
[KYUUBI #6437] Fix Spark engine query result save to HDFS #6444

Closed · wants to merge 1 commit

Conversation

camper42
Contributor

@camper42 camper42 commented Jun 3, 2024

🔍 Description

Issue References 🔗

This pull request fixes #6437

Describe Your Solution 🔧

Use `org.apache.hadoop.fs.Path` instead of `java.nio.file.Paths` to avoid an unexpected scheme change in `OPERATION_RESULT_SAVE_TO_FILE_DIR`.
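A minimal sketch (pure JDK, no Hadoop dependency; `session-id` is a hypothetical child path used only for illustration) of why `java.nio.file.Paths` mangles a URI-style directory: it treats the string as a plain file path and collapses the `//` after the scheme, which is consistent with the `jfs:///` error below.

```java
import java.nio.file.Paths;

public class PathSchemeDemo {
    public static void main(String[] args) {
        String dir = "jfs://datalake/tmp";
        // java.nio.file.Paths sees a plain path, not a URI: on Unix it
        // normalizes consecutive slashes, so "jfs://..." becomes "jfs:/..."
        // and the filesystem scheme/authority are corrupted.
        String joined = Paths.get(dir, "session-id").toString();
        System.out.println(joined); // jfs:/datalake/tmp/session-id

        // org.apache.hadoop.fs.Path is URI-aware and preserves scheme and
        // authority when resolving children (sketch, requires hadoop-common):
        //   new Path(new Path(dir), "session-id")
        //   -> jfs://datalake/tmp/session-id
    }
}
```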

Types of changes 🔖

  • [x] Bugfix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Spark Job failed to start with error: `java.io.IOException: JuiceFS initialized failed for jfs:///` with conf `kyuubi.operation.result.saveToFile.dir=jfs://datalake/tmp`.

`hdfs://xxx:port/tmp` may encounter similar errors.

Behavior With This Pull Request 🎉

Users can use an HDFS dir as `kyuubi.operation.result.saveToFile.dir` without error.

Related Unit Tests

It seems no test suites were added in #5591 and #5986; I'll try to build a dist and test it with our internal cluster.


Checklist 📝

Be nice. Be informative.

@camper42
Contributor Author

camper42 commented Jun 3, 2024

@wForget I hit some issues with local testing, and since the changes were minimal, I just submitted the PR. Please review.

I'm building a dist to test on our internal cluster and will share the results later.

@pan3793 pan3793 changed the title [Kyuubi #6437] Fix Spark engine query result save to HDFS [KYUUBI #6437] Fix Spark engine query result save to HDFS Jun 3, 2024
@pan3793 pan3793 requested a review from cxzl25 June 3, 2024 08:09
@codecov-commenter

Codecov Report

Attention: Patch coverage is 20.00000%, with 16 lines in your changes missing coverage. Please review.

Project coverage is 58.30%. Comparing base (4d58b96) to head (990f0a7).

| Files | Patch % | Lines |
| --- | --- | --- |
| ...uubi/engine/spark/operation/ExecuteStatement.scala | 25.00% | 6 Missing ⚠️ |
| ...rg/apache/kyuubi/engine/spark/SparkSQLEngine.scala | 16.66% | 4 Missing and 1 partial ⚠️ |
| .../engine/spark/session/SparkSQLSessionManager.scala | 16.66% | 5 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master    #6444      +/-   ##
============================================
+ Coverage     58.26%   58.30%   +0.03%     
  Complexity       24       24              
============================================
  Files           656      656              
  Lines         40268    40264       -4     
  Branches       5498     5498              
============================================
+ Hits          23463    23476      +13     
+ Misses        14287    14283       -4     
+ Partials       2518     2505      -13     
```
☔ View full report in Codecov by Sentry.

Contributor

@cxzl25 cxzl25 left a comment


LGTM.
Although this feature currently works, it lacks test coverage, so previous changes may have caused regressions.

@cxzl25 cxzl25 added this to the v1.9.2 milestone Jun 3, 2024
@camper42
Contributor Author

camper42 commented Jun 4, 2024

Tested with 1.9.1 plus this PR; the problem is solved. Ready to merge.

@camper42
Contributor Author

camper42 commented Jun 4, 2024

Better performance and memory usage than `kyuubi.operation.incremental.collect=true`:

| conf | driver memory usage | time |
| --- | --- | --- |
| `kyuubi.operation.incremental.collect=true` | 3.74 GiB | 14 min |
| `kyuubi.operation.result.saveToFile.enabled` | 2.34 GiB | 10 min |

@pan3793 pan3793 closed this in 71649da Jun 4, 2024
pan3793 pushed a commit that referenced this pull request Jun 4, 2024
(commit message body duplicates the PR description above)

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

Closes #6444 from camper42/save-to-hdfs.

Closes #6437

990f0a7 [camper42] [Kyuubi #6437] Fix Spark engine query result save to HDFS

Authored-by: camper42 <camper.xlii@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 71649da)
Signed-off-by: Cheng Pan <chengpan@apache.org>
@pan3793
Member

pan3793 commented Jun 4, 2024

Thanks, merged to master/1.9

@camper42 camper42 deleted the save-to-hdfs branch June 4, 2024 03:36

Successfully merging this pull request may close these issues.

[Bug] Can't use a HDFS dir for kyuubi.operation.result.saveToFile.dir