Code of Conduct
Search before asking
What would you like to be improved?
SPARK-47085 (https://issues.apache.org/jira/browse/SPARK-47085) reported that Scala's Seq.apply has O(n) complexity on a non-IndexedSeq, so a loop that reads every row with val row = rows(idx) costs O(n²) overall. PR 6077 fixed this for the row-based TRowSet, but TColumnGenerator.getColumnToList was missed.
In my local test, iterating over a Hive JDBC statement result set (100000 rows, 20+ columns) took 150s with statement.setFetchSize(10000), but only 3s with statement.setFetchSize(100). This is a serious performance issue.
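A rough cost model may help explain why the larger fetch size is slower. This is only a back-of-the-envelope sketch, not a measurement: it assumes each fetched batch reaches getColumnToList as one linked-list-backed Seq, and that rows(idx) walks idx cells from the head. The object and method names are made up for illustration.

```scala
object FetchSizeCostModel {
  // Cells walked when every index in 0 until batchSize is looked up on a
  // linked list: 0 + 1 + ... + (batchSize - 1).
  def stepsPerBatchIndexed(batchSize: Long): Long =
    batchSize * (batchSize - 1) / 2

  // Total cells walked for one column when totalRows arrive in batches of
  // fetchSize rows (assumption: one index-based loop per batch).
  def totalStepsIndexed(totalRows: Long, fetchSize: Long): Long = {
    val fullBatches = totalRows / fetchSize
    val remainder = totalRows % fetchSize
    fullBatches * stepsPerBatchIndexed(fetchSize) + stepsPerBatchIndexed(remainder)
  }

  def main(args: Array[String]): Unit = {
    val large = totalStepsIndexed(100000L, 10000L)
    val small = totalStepsIndexed(100000L, 100L)
    println(s"fetchSize=10000: $large steps per column")
    println(s"fetchSize=100:   $small steps per column")
    println(s"ratio: ${large / small}")
  }
}
```

Under these assumptions the model gives about 100x more list traversal for a fetch size of 10000 than for 100, which is the same order of magnitude as the 150s vs 3s gap above. A foreach walks each batch once, so its cost would not depend on the fetch size in this way.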
How should we improve?
In TColumnGenerator.getColumnToList, replace the index-based while loop
while (idx < rowSize) {
  val row = rows(idx)
  ...
}
with a foreach, like this:
rows.foreach { row =>
  ...
}
This visits each row once without a per-index lookup, which resolves the issue.
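To make the difference concrete, here is a minimal, self-contained sketch of the two loop shapes. It is not the actual Kyuubi code: the Row type, the method names, and the simplified signature (no null handling) are assumptions for illustration only.

```scala
import scala.collection.mutable.ArrayBuffer

object ColumnLoopSketch {
  // Hypothetical stand-in for the real row type.
  type Row = Seq[Any]

  // Index-based loop: rows(idx) walks idx cells when rows is a List,
  // so the whole loop is O(n^2).
  def columnIndexed(rows: Seq[Row], ordinal: Int): List[Any] = {
    val out = new ArrayBuffer[Any](rows.size)
    val rowSize = rows.length
    var idx = 0
    while (idx < rowSize) {
      val row = rows(idx)
      out += row(ordinal)
      idx += 1
    }
    out.toList
  }

  // foreach visits each row exactly once, so the loop is O(n)
  // regardless of the concrete Seq implementation.
  def columnForeach(rows: Seq[Row], ordinal: Int): List[Any] = {
    val out = new ArrayBuffer[Any](rows.size)
    rows.foreach { row => out += row(ordinal) }
    out.toList
  }
}
```

Both versions return the same values; only the traversal cost differs. If the loop body also needs the row position, keeping a separate counter inside the foreach (or using rows.iterator) would preserve the single pass.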
Are you willing to submit PR?