[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format #7268

Open
wants to merge 11 commits into base: main

@KevinyhZou (Contributor) commented Sep 18, 2024

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #7267)

How was this patch tested?

By unit tests (UT).

@KevinyhZou KevinyhZou marked this pull request as draft September 18, 2024 12:35
@github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels Sep 18, 2024

#7267

Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou changed the title [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format Sep 18, 2024
@KevinyhZou KevinyhZou force-pushed the support_nested_project_push_down_json branch from 5ba3026 to 4c202a6 Compare September 19, 2024 07:08

@KevinyhZou KevinyhZou changed the title [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format Sep 26, 2024

@KevinyhZou KevinyhZou marked this pull request as ready for review September 26, 2024 06:33

"select id, d1.c, d1.d[0].x, d2.d['m124'].y from %s where day = '2024-09-26' and hour = '12'"
.format(pq_table_name)
withSQLConf(
("spark.sql.hive.convertMetastoreParquet" -> "false"),
Contributor

In which usage scenarios are these two ORC and Parquet switches set to false?

Contributor Author

These two configs need to be set to false when the table should be read through the Hive parquet/orc SerDe rather than through Spark's built-in parquet/orc reader. @taiyang-li
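
For readers who want to try the setting outside the test suite, here is a minimal sketch. It assumes a local Spark build with the spark-hive module on the classpath; the table name `sketch_tbl` and its columns are hypothetical and are not taken from this PR. With both switches set to false, Spark is expected to plan a Hive table scan for metastore Parquet/ORC tables instead of converting them to its built-in file source scan, which is the code path this PR targets.

```scala
import org.apache.spark.sql.SparkSession

object HiveSerdeScanSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with an embedded Derby metastore created in the working directory.
    val spark = SparkSession.builder()
      .appName("hive-serde-scan-sketch")
      .master("local[1]")
      .enableHiveSupport()
      // Read metastore Parquet/ORC tables through the Hive SerDe path
      // instead of Spark's built-in readers.
      .config("spark.sql.hive.convertMetastoreParquet", "false")
      .config("spark.sql.hive.convertMetastoreOrc", "false")
      .getOrCreate()

    // Hypothetical table with one nested column, for illustration only.
    spark.sql(
      "CREATE TABLE IF NOT EXISTS sketch_tbl (id INT, d1 STRUCT<c: STRING, e: STRING>) " +
        "STORED AS PARQUET")

    // Print the physical plan so the scan operator can be inspected.
    spark.sql("SELECT id, d1.c FROM sketch_tbl").explain()

    spark.stop()
  }
}
```

Flipping either switch back to true and rerunning `explain()` is a quick way to compare the two scan operators.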

@KevinyhZou
Contributor Author

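Performance test

Table schema: test_tbl (a STRING, b STRUCT<x1: STRING, x2: STRING, x3: STRING, x4: STRING, x5: STRING>)
Test SQL: select count(b.x1) from test_tbl
Data volume: 12 million rows
The data was stored in each of the three formats (json, parquet, orc), and the end-to-end latency of this query was measured for each.

Average latency before the optimization:
json: 16.52s
parquet: 2.02s
orc: 1.25s

Average latency after the optimization:
json: 12.71s
parquet: 0.63s
orc: 0.36s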
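
The relative gains implied by these numbers can be worked out with a few lines of Scala. This is only arithmetic on the averages reported above; the object and function names are made up for the illustration.

```scala
import java.util.Locale

object PruningSpeedup {
  // (format, average seconds before, average seconds after), copied from the comment above.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("json", 16.52, 12.71),
    ("parquet", 2.02, 0.63),
    ("orc", 1.25, 0.36)
  )

  // How many times faster the query runs after the change.
  def speedup(before: Double, after: Double): Double = before / after

  // Percentage of the original latency that was removed.
  def reductionPct(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  def main(args: Array[String]): Unit = {
    timings.foreach { case (fmt, before, after) =>
      // Locale.ROOT keeps the decimal separator stable across machines.
      println("%-8s speedup %.2fx, latency reduced by %.1f%%".formatLocal(
        Locale.ROOT, fmt, speedup(before, after), reductionPct(before, after)))
    }
  }
}
```

On these figures the speedup is roughly 1.3x for json and a little over 3x for parquet and orc, which is consistent with columnar formats being able to skip the unread struct fields entirely.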

Labels
CLICKHOUSE CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH] Support nested colum pruning for HiveTableScanExec
2 participants