I have searched in the issues and found no similar issues.
What would you like to be improved?
https://issues.apache.org/jira/browse/SPARK-47085 reported that Scala's Seq.apply has O(n) complexity when indexing into a non-IndexedSeq, as in val row = rows(idx). PR 6077 fixed this for the row-based TRowSet, but TColumnGenerator.getColumnToList was missed.
In my local test, iterating a Hive JDBC statement resultSet (100000 rows, 20+ columns) took 150s with statement.setFetchSize(10000), but only 3s with statement.setFetchSize(100). This is a serious performance issue.
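To illustrate why the larger fetch size is so much slower, here is a rough cost model. It assumes each batch of rows arrives as a linked list, so that reading index i walks about i nodes; the object and method names are hypothetical and only count node hops, they do not measure real time.

```scala
// Hypothetical cost model, not Kyuubi code: counts linked-list node hops
// when every row of a batch is read with rows(idx).
object IndexedAccessCost {
  // Reading indices 0 until n costs 0 + 1 + ... + (n - 1) hops.
  def hopsPerBatch(batchSize: Long): Long = batchSize * (batchSize - 1) / 2

  // Total hops for a result set fetched in equal-sized batches.
  def totalHops(totalRows: Long, fetchSize: Long): Long =
    (totalRows / fetchSize) * hopsPerBatch(fetchSize)

  def main(args: Array[String]): Unit = {
    println(s"fetchSize=10000: ${totalHops(100000L, 10000L)} hops")
    println(s"fetchSize=100:   ${totalHops(100000L, 100L)} hops")
  }
}
```

Under this model the 10000-row batches need roughly 100 times as many hops as the 100-row batches for the same 100000 rows, which is the same direction and order of magnitude as the 150s versus 3s measurement.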
How should we improve?
In trait TColumnGenerator.getColumnToList, converting the while loop
    while (idx < rowSize) {
      val row = rows(idx)
      ...
    }
to a foreach like this
    rows.foreach { row =>
      ...
    }
will resolve this.
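A minimal, self-contained sketch of the two loop shapes follows. It is not the actual getColumnToList implementation: the Row alias, the object name, and both method names are hypothetical stand-ins used only to show the pattern.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the real code; shows the loop shapes only.
object ColumnExtraction {
  type Row = Seq[String] // assumption: a row is a sequence of column values

  // Index-based loop: rows(idx) costs O(idx) when rows is a linked List,
  // so the whole loop is quadratic in the number of rows.
  def columnByIndex(rows: Seq[Row], ordinal: Int): ArrayBuffer[String] = {
    val ret = new ArrayBuffer[String]()
    val rowSize = rows.length
    var idx = 0
    while (idx < rowSize) {
      val row = rows(idx)
      ret += row(ordinal)
      idx += 1
    }
    ret
  }

  // foreach visits each element once, so the loop is linear for any Seq.
  def columnByForeach(rows: Seq[Row], ordinal: Int): ArrayBuffer[String] = {
    val ret = new ArrayBuffer[String]()
    rows.foreach { row => ret += row(ordinal) }
    ret
  }
}
```

Both methods return the same values in the same order; only the traversal cost differs when rows is not an IndexedSeq.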
Are you willing to submit PR?
Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
No. I cannot submit a PR at this time.
Hello @hh-cn,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.
hh-cn changed the title from "[Improvement] Impreve performance on converting spark rows to column-based thrift row set" to "[Improvement] Improve performance on converting spark rows to column-based thrift row set" on Sep 3, 2024
# 🔍 Description
## Issue References 🔗
This pull request fixes #6661
## Describe Your Solution 🔧
TColumnGenerator.getColumnToList should not index into a non-IndexedSeq with Seq.apply(i), because each such access is O(n) and degrades performance. Converting the loop to foreach avoids this. See https://issues.apache.org/jira/browse/SPARK-47085 for more details.
## Types of changes 🔖
- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes #6662 from hh-cn/KYUUBI-6661.
Closes #6661
4597e88 [hang.huang] improve column-based TRowSet generation
Authored-by: hang.huang <hang.huang@advancegroup.com>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
(cherry picked from commit 14e07ea)
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>