[Improvement] Improve performance on converting spark rows to column-based thrift row set #6661

Closed
3 of 4 tasks
hh-cn opened this issue Sep 3, 2024 · 1 comment

Comments

hh-cn commented Sep 3, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

SPARK-47085 (https://issues.apache.org/jira/browse/SPARK-47085) reported that Scala's Seq.apply has O(n) complexity on a non-IndexedSeq, so an indexed lookup such as val row = rows(idx) inside a loop makes the whole loop quadratic. PR 6077 fixed this for the row-based TRowSet, but TColumnGenerator.getColumnToList was missed.

In a local test, iterating a Hive JDBC statement resultSet (100000 rows, 20+ columns) took 150s with statement.setFetchSize(10000), but only 3s with statement.setFetchSize(100). This is a serious performance issue.

How should we improve?

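In trait TColumnGenerator.getColumnToList, replace the index-based while loop

    while (idx < rowSize) {
      val row = rows(idx)
      ...
    }

with a foreach:

    rows.foreach { row =>
      ...
    }

foreach traverses the sequence once, so this removes the O(n) Seq.apply lookup from each iteration and resolves the issue.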
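
To make the difference concrete, here is a minimal sketch of the two loop shapes. It is not the Kyuubi source: the object name ColumnExtractSketch, the Row alias, and both method names are hypothetical, and a plain Seq[Any] stands in for a Spark row. Both methods return the same column values; only the traversal cost differs when rows is a linear sequence such as a List.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the column conversion; not the actual TColumnGenerator code.
object ColumnExtractSketch {
  // Assumption: a row is modelled as a plain Seq[Any] instead of a Spark Row.
  type Row = Seq[Any]

  // Index-based loop: on a List, rows(idx) walks idx nodes every time,
  // so extracting one column costs O(n^2) in the number of rows.
  def columnByIndex(rows: Seq[Row], ordinal: Int): List[Any] = {
    val out = ArrayBuffer.empty[Any]
    val rowSize = rows.length
    var idx = 0
    while (idx < rowSize) {
      val row = rows(idx)
      out += row(ordinal)
      idx += 1
    }
    out.toList
  }

  // foreach visits each row exactly once, so the cost is O(n)
  // whether rows is a List, a Vector, or another Seq.
  def columnByForeach(rows: Seq[Row], ordinal: Int): List[Any] = {
    val out = ArrayBuffer.empty[Any]
    rows.foreach { row =>
      out += row(ordinal)
    }
    out.toList
  }
}
```

Calling both methods on a large List (for example one built with List.tabulate) and timing them should show the gap widen as the row count grows, which would be consistent with the slowdown observed at the larger fetch size.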

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.

github-actions bot commented Sep 3, 2024

Hello @hh-cn,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

hh-cn changed the title from "[Improvement] Impreve performance on converting spark rows to column-based thrift row set" to "[Improvement] Improve performance on converting spark rows to column-based thrift row set" Sep 3, 2024
bowenliang123 pushed a commit that referenced this issue Sep 4, 2024
# 🔍 Description
## Issue References 🔗

This pull request fixes #6661

## Describe Your Solution 🔧

TColumnGenerator.getColumnToList should not access a non-IndexedSeq with Seq.apply(i), because each lookup is O(n) and degrades performance. Converting the loop to foreach fixes this. See https://issues.apache.org/jira/browse/SPARK-47085 for more details.

## Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6662 from hh-cn/KYUUBI-6661.

Closes #6661

4597e88 [hang.huang] improve column-based TRowSet generation

Authored-by: hang.huang <hang.huang@advancegroup.com>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
(cherry picked from commit 14e07ea)
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>