
[TASK][MEDIUM] Spark engine query results support reading from HDFS #5377

Closed
@pan3793

Description


Code of Conduct

  • I agree to follow this project's Code of Conduct.

Search before creating

  • I have searched in the task list and found no similar tasks.

Mentor

  • I have sufficient knowledge of and experience with this task, and I volunteer to mentor contributors to complete it.

Skill requirements

  • Basic knowledge of the Scala programming language
  • Familiarity with Apache Spark

Background and Goals

A client's SQL query may return a result set so large that the Spark engine fails, with the driver crashing due to OOM.
Although the number of result rows can be limited by configuring kyuubi.operation.result.max.rows, a single row that is too large can still cause an OOM.
If the engine supported writing query results to HDFS or another storage system, it could fetch the results back from HDFS when the client requests them, avoiding the OOM problem.
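
For reference, the row-count limit mentioned above is an ordinary Kyuubi configuration entry; the value here is illustrative:

    kyuubi.operation.result.max.rows=10000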

Implementation steps

  1. Identify execution plans that actually produce query results
  2. Estimate the output size of the execution plan (see the first sketch after this list)
    org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils#getSizePerRow
  3. Use the df.write API to write the execution plan's query results to HDFS (see the second sketch)
  4. Implement an iterator that reads the data back from HDFS and returns it to the client (also covered by the second sketch)
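
A minimal sketch of step 2, assuming the result is available as a DataFrame. EstimationUtils.getSizePerRow and the plan statistics are Spark Catalyst internals (they may be package-private, so real code typically lives under an org.apache.spark.sql.* package); the helper name estimatedResultSize is hypothetical:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils

    object ResultSizeSketch {
      // Hypothetical helper: estimate total output size by multiplying
      // Catalyst's per-row size estimate by the estimated row count.
      def estimatedResultSize(df: DataFrame): BigInt = {
        val plan = df.queryExecution.optimizedPlan
        // Derive a per-row byte size from the plan's output attributes
        val sizePerRow = EstimationUtils.getSizePerRow(plan.output)
        // rowCount is only present when statistics are available; fall back
        // to the plan's overall sizeInBytes estimate otherwise
        plan.stats.rowCount match {
          case Some(rows) => sizePerRow * rows
          case None       => plan.stats.sizeInBytes
        }
      }
    }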
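
And a sketch of steps 3 and 4, assuming Parquet as the intermediate format; the staging path and the spillAndRead helper are hypothetical. Dataset.toLocalIterator pulls one partition at a time to the driver, so the full result never has to fit in driver memory:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}

    object ResultSpillSketch {
      // Hypothetical staging location; a real implementation would derive a
      // unique path per operation and clean it up when the operation closes.
      val stagingDir = "hdfs:///tmp/kyuubi/engine-results/op-12345"

      def spillAndRead(spark: SparkSession, df: DataFrame): Iterator[Row] = {
        // Step 3: persist the query result to HDFS instead of collecting it
        df.write.parquet(stagingDir)
        // Step 4: read the result back lazily, one partition at a time
        spark.read.parquet(stagingDir).toLocalIterator().asScala
      }
    }

Note that round-tripping through files does not guarantee the original row order, and error handling plus cleanup of the staging directory are omitted here.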

Additional context

Original reporter is @cxzl25
