Code of Conduct
- I agree to follow this project's Code of Conduct
Search before creating
- I have searched in the task list and found no similar tasks.
Mentor
- I have sufficient knowledge of and experience with this task, and I volunteer to mentor contributors through completing it.
Skill requirements
- Basic knowledge of the Scala programming language
- Familiarity with Apache Spark
Background and Goals
A client's SQL query may cause the Spark engine to fail: if the result set is too large, the driver runs out of memory (OOM). Although the number of result rows can be capped by configuring `kyuubi.operation.result.max.rows`, a single row that is too large can still cause an OOM. If the engine supported writing query results to HDFS or another storage system, and fetched them back from there when the client requests the results, the OOM problem could be avoided.
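For context, the existing limit is set in `kyuubi-defaults.conf` (the value here is an arbitrary example; the default of `0` disables the limit):

```
# Cap the number of result rows returned per query; 0 disables the limit
kyuubi.operation.result.max.rows=10000
```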
Implementation steps
- Distinguish execution plans that produce query results from those that do not
- Estimate the output size of the execution plan, e.g. with `org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils#getSizePerRow`
- Use the `df.write` API to write the query results of the execution plan to HDFS
- Implement an iterator that reads the data back from HDFS and returns it to the client (all four steps are sketched below)
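A minimal Scala sketch of these steps, assuming the logic lives inside the Kyuubi Spark SQL engine. `ResultSpiller`, `producesResult`, and `resultPath` are hypothetical names introduced for illustration; `EstimationUtils` is a Catalyst internal rather than a stable public API, and the step-1 heuristic is only one possible approach:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils

// Hypothetical helper sketching the four implementation steps.
object ResultSpiller {

  // Step 1 (one possible heuristic): plans with no output columns
  // (e.g. many commands) produce no result rows worth spilling.
  def producesResult(df: DataFrame): Boolean =
    df.queryExecution.analyzed.output.nonEmpty

  // Step 2: estimate the result size from the optimized plan.
  // getSizePerRow gives an estimated per-row size in bytes; the row
  // count is only available when statistics have been collected.
  def estimatedSizeInBytes(df: DataFrame): Option[BigInt] = {
    val plan = df.queryExecution.optimizedPlan
    val sizePerRow = EstimationUtils.getSizePerRow(plan.output)
    plan.stats.rowCount.map(_ * sizePerRow)
  }

  // Step 3: persist the query result to HDFS via the DataFrame writer API.
  def saveToHdfs(df: DataFrame, resultPath: String): Unit =
    df.write.parquet(resultPath)

  // Step 4: read the result back lazily. toLocalIterator pulls one
  // partition at a time, so the driver never materializes the whole
  // result set in memory.
  def readFromHdfs(spark: SparkSession, resultPath: String): Iterator[Row] =
    spark.read.parquet(resultPath).toLocalIterator().asScala
}
```

Reading back through `toLocalIterator` is what keeps driver memory bounded: only one partition is deserialized at a time, which is the point of spilling results to HDFS in the first place.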
Additional context
Original reporter: @cxzl25