
[Improvement] Improve performance on converting spark rows to column-based thrift row set #6661

Closed
@hh-cn

Description

Code of Conduct

  • I agree to follow this project's Code of Conduct.

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

https://issues.apache.org/jira/browse/SPARK-47085 reported that Scala's Seq.apply has O(n) complexity when accessing a non-IndexedSeq by position, as in val row = rows(idx). PR #6077 fixed this for the row-based TRowSet, but TColumnGenerator.getColumnToList was missed.

In my local test, iterating a Hive JDBC statement ResultSet of 100,000 rows and 20+ columns took 150s with statement.setFetchSize(10000) but only 3s with statement.setFetchSize(100): the larger the fetch size, the larger the Seq of rows per batch, so the quadratic indexing cost dominates. This is a serious performance issue.
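For reference, here is a minimal standalone sketch (illustrative only, not Kyuubi code; the object name and row count are made up) that reproduces the asymmetry on a plain Scala List:

    object SeqIndexingDemo {
      def main(args: Array[String]): Unit = {
        // List is a linked list, not an IndexedSeq: apply(i) costs O(i)
        val rows: Seq[Int] = List.tabulate(100000)(identity)
        val rowSize = rows.size

        // O(n^2) total: rows(idx) walks from the head on every access
        var t0 = System.nanoTime()
        var idx = 0
        var sum = 0L
        while (idx < rowSize) {
          sum += rows(idx)
          idx += 1
        }
        println(s"while + apply: ${(System.nanoTime() - t0) / 1e6} ms")

        // O(n) total: foreach traverses the list exactly once
        t0 = System.nanoTime()
        sum = 0L
        rows.foreach(sum += _)
        println(s"foreach: ${(System.nanoTime() - t0) / 1e6} ms")
      }
    }

The while loop performs roughly n²/2 list-node traversals in total, while the foreach performs exactly n.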

How should we improve?

In the trait method TColumnGenerator.getColumnToList, converting the while loop

    while (idx < rowSize) {
      val row = rows(idx)
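      // Seq.apply walks from the head of a non-IndexedSeq: O(idx) per call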
      ...
    }

into a foreach like this

    rows.foreach { row =>
      ...
    }

will resolve this: foreach traverses the underlying Seq sequentially, so the whole conversion is O(n) instead of O(n²).
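For concreteness, here is a rough sketch of the shape of the change. The real getColumnToList lives in Kyuubi's TColumnGenerator trait and also deals with Spark Row values and null masks; the Seq[Seq[Any]] row type, the object name, and the method names below are simplified placeholders, not the actual Kyuubi signatures:

    import java.util.{ArrayList => JArrayList, List => JList}

    object ColumnConversionSketch {
      // Before: positional access; rows(idx) is O(idx) when rows is
      // e.g. a List, so the whole loop degrades to O(n^2)
      def getColumnToListBefore[T](rows: Seq[Seq[Any]], ordinal: Int): JList[T] = {
        val ret = new JArrayList[T](rows.size)
        val rowSize = rows.size
        var idx = 0
        while (idx < rowSize) {
          val row = rows(idx)
          ret.add(row(ordinal).asInstanceOf[T])
          idx += 1
        }
        ret
      }

      // After: one sequential traversal; O(n) for any Seq implementation
      def getColumnToListAfter[T](rows: Seq[Seq[Any]], ordinal: Int): JList[T] = {
        val ret = new JArrayList[T](rows.size)
        rows.foreach { row =>
          ret.add(row(ordinal).asInstanceOf[T])
        }
        ret
      }
    }

An alternative would be to call rows.toIndexedSeq up front, but foreach avoids the extra copy and keeps the change local to the loop.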

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
