Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TASK][MEDIUM] Spark engine query results support reading from HDFS #5377

Closed
2 of 3 tasks
pan3793 opened this issue Oct 8, 2023 · 3 comments
Closed
2 of 3 tasks

[TASK][MEDIUM] Spark engine query results support reading from HDFS #5377

pan3793 opened this issue Oct 8, 2023 · 3 comments
Assignees

Comments

@pan3793
Copy link
Member

pan3793 commented Oct 8, 2023

Code of Conduct

Search before creating

  • I have searched in the task list and found no similar tasks.

Mentor

  • I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.

Skill requirements

  • Basic knowledge on Scala Programing Language
  • Familiar with Apache Spark

Background and Goals

The client's SQL query may cause the Spark engine to fail because the query data result is too large, and the driver to fail due to OOM.
Although the number of query results can be limited by configuring kyuubi.operation.result.max.rows, if the amount of data in one row is too large, it will still cause OOM.
If it can support writing query results to HDFS or other storage systems, when the client needs to obtain the results, the engine will obtain the results from HDFS, which can avoid the OOM problem.

Implementation steps

  1. Distinguish execution plans with query results
  2. Estimate the output size of the execution plan
    org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils#getSizePerRow
  3. Use df.write api to write the query results of the execution plan to HDFS
  4. Implement an iterator to read data from HDFS and return it to the client

Additional context

Original reporter is @cxzl25

@pan3793
Copy link
Member Author

pan3793 commented Oct 8, 2023

@cxzl25 this task is missing "implementation steps" before adding into https://github.com/orgs/apache/projects/296

@lsm1
Copy link
Contributor

lsm1 commented Oct 10, 2023

Hi , I am interested in this task, can you assign it to me ? I expect to achieve it in a month Thanks.

@cfmcgrady
Copy link
Contributor

  1. Use df.write api to write the query results of the execution plan to HDFS

directly call df.write will introduce an extra shuffle for the outermost limit, and hurt performance

turboFei added a commit that referenced this issue Dec 21, 2023
# 🔍 Description
## Issue References 🔗

#5591 (comment)

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [ ] Pull request title is okay.
- [ ] No license issues.
- [ ] Milestone correctly set?
- [ ] Test coverage is ok
- [ ] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5895 from lsm1/branch-kyuubi-5377-followup.

Closes #5377

4219d28 [Fei Wang] nit
31d4fc1 [senmiaoliu] use zlib when SPARK version less than 3.2

Lead-authored-by: senmiaoliu <senmiaoliu@trip.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
turboFei added a commit that referenced this issue Feb 4, 2024
# 🔍 Description
## Issue References 🔗

We shall always try to clean session result path as `kyuubi.operation.result.saveToFile.enabled` can be enabled during runtime.
This pull request fixes #5377

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #5986 from turboFei/OPERATION_RESULT_SAVE_TO_FILE_runtime.

Closes #5377

5dfdf96 [Fei Wang] comments
895caec [Fei Wang] [KYUUBI #5377][FOLLOWUP] Always try to cleanup session result path

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
turboFei added a commit that referenced this issue Feb 4, 2024
…sult max rows

# 🔍 Description
Followup #5591
Support to get existing limit from more plan and regard the result max rows.
## Issue References 🔗

This pull request fixes #

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #5963 from turboFei/incremental_save.

Closes #5377

223d510 [Fei Wang] use optimized plan
ecefc2a [Fei Wang] use spark plan
57091e5 [Fei Wang] minor
2096144 [Fei Wang] for logical plan
0f734ee [Fei Wang] ut
fdc1155 [Fei Wang] save
f8e405a [Fei Wang] math.min

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Feb 5, 2024
# 🔍 Description
## Issue References 🔗

We shall always try to clean session result path as `kyuubi.operation.result.saveToFile.enabled` can be enabled during runtime.
This pull request fixes apache#5377

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#5986 from turboFei/OPERATION_RESULT_SAVE_TO_FILE_runtime.

Closes apache#5377

5dfdf96 [Fei Wang] comments
895caec [Fei Wang] [KYUUBI apache#5377][FOLLOWUP] Always try to cleanup session result path

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Feb 5, 2024
…ard result max rows

# 🔍 Description
Followup apache#5591
Support to get existing limit from more plan and regard the result max rows.
## Issue References 🔗

This pull request fixes #

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#5963 from turboFei/incremental_save.

Closes apache#5377

223d510 [Fei Wang] use optimized plan
ecefc2a [Fei Wang] use spark plan
57091e5 [Fei Wang] minor
2096144 [Fei Wang] for logical plan
0f734ee [Fei Wang] ut
fdc1155 [Fei Wang] save
f8e405a [Fei Wang] math.min

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Mar 21, 2024
### _Why are the changes needed?_

close apache#5377

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/master/contributing/code/testing.html#running-tests) locally before make a pull request

### _Was this patch authored or co-authored using generative AI tooling?_

NO

Closes apache#5591 from lsm1/branch-kyuubi-5377.

Closes apache#5377

9d1a18c [senmiaoliu] ignore empty file
3c70a1e [LSM] fix doc
73d3c3a [senmiaoliu] fix style and add some comment
80e1f0d [senmiaoliu] Close orc fetchOrcStatement and remove result save file when ExecuteStatement close
42634a1 [senmiaoliu] fix style
979125d [senmiaoliu] fix style
1dc07a5 [senmiaoliu] spark engine save into hdfs file

Lead-authored-by: senmiaoliu <senmiaoliu@trip.com>
Co-authored-by: LSM <senmiaoliu@trip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Mar 21, 2024
# 🔍 Description
## Issue References 🔗

apache#5591 (comment)

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [ ] Pull request title is okay.
- [ ] No license issues.
- [ ] Milestone correctly set?
- [ ] Test coverage is ok
- [ ] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes apache#5895 from lsm1/branch-kyuubi-5377-followup.

Closes apache#5377

4219d28 [Fei Wang] nit
31d4fc1 [senmiaoliu] use zlib when SPARK version less than 3.2

Lead-authored-by: senmiaoliu <senmiaoliu@trip.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Mar 21, 2024
# 🔍 Description
## Issue References 🔗

We shall always try to clean session result path as `kyuubi.operation.result.saveToFile.enabled` can be enabled during runtime.
This pull request fixes apache#5377

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#5986 from turboFei/OPERATION_RESULT_SAVE_TO_FILE_runtime.

Closes apache#5377

5dfdf96 [Fei Wang] comments
895caec [Fei Wang] [KYUUBI apache#5377][FOLLOWUP] Always try to cleanup session result path

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this issue Mar 21, 2024
…ard result max rows

# 🔍 Description
Followup apache#5591
Support to get existing limit from more plan and regard the result max rows.
## Issue References 🔗

This pull request fixes #

## Describe Your Solution 🔧

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#5963 from turboFei/incremental_save.

Closes apache#5377

223d510 [Fei Wang] use optimized plan
ecefc2a [Fei Wang] use spark plan
57091e5 [Fei Wang] minor
2096144 [Fei Wang] for logical plan
0f734ee [Fei Wang] ut
fdc1155 [Fei Wang] save
f8e405a [Fei Wang] math.min

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
3 participants