HIVE-29473: prevent combining stats between SELECT and LV fields #6331

Draft
konstantinb wants to merge 3 commits into apache:master from konstantinb:HIVE-29473
@konstantinb (Contributor)

What changes were proposed in this pull request?

HIVE-29473: preventing stats override of select columns with 2+ LVs

Why are the changes needed?

TBD

Does this PR introduce any user-facing change?

No

How was this patch tested?

TBD

@konstantinb konstantinb changed the title HIVE-29473: preventing stats override of select columns with 2+ LVs HIVE-29473: prevent combining stats between SELECT and LV fields Feb 23, 2026
expressions: _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
- Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
@konstantinb (Contributor, Author) commented:

This is a typical example of LV column stats impacting the data size estimations of SELECT columns:

Column Naming

| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | |
| UDTF internal stats | _col0 | array expression input | 56.0 |

The UDTF branch's column-name generator restarts at 0, so its internal stats use _col0 for the array expression. That name collides with SELECT's _col0, even though the two refer to different expressions.
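The collision can be reproduced in miniature. The sketch below is a hypothetical Python model, not Hive code: `ColumnNamer` and the two dictionaries are illustrative assumptions, and only the `_colN` naming scheme and the avgColLen values are taken from the table above.

```python
class ColumnNamer:
    """Hypothetical stand-in for a per-branch internal column-name generator."""

    def __init__(self):
        self._next = 0

    def next_name(self):
        name = f"_col{self._next}"
        self._next += 1
        return name


# SELECT branch: names its two pass-through columns.
select_namer = ColumnNamer()
select_stats = {
    select_namer.next_name(): 2.812,  # key
    select_namer.next_name(): 6.812,  # value
}

# UDTF branch: a fresh generator, so numbering restarts at 0.
udtf_namer = ColumnNamer()
udtf_stats = {udtf_namer.next_name(): 56.0}  # array expression input

# Names present in both branches, although they denote different expressions.
colliding = sorted(set(select_stats) & set(udtf_stats))
print(colliding)
```

Any lookup keyed purely by column name cannot tell these two `_col0` entries apart, which is the situation the comparison below walks through.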


Processing Comparison

| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
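The two behaviours in the comparison can be sketched side by side. This is a simplified model under stated assumptions, not the patch itself: `merge_original` and `merge_split` are hypothetical names, stats are reduced to a single avgColLen per column, and `num_sel_columns` mirrors the `numSelColumns` split point named in the table.

```python
def merge_original(select_stats, udtf_stats, output_columns):
    """Shared lookup: each output column is looked up by name in both branches."""
    merged = {}
    for col in output_columns:
        candidates = [s[col] for s in (select_stats, udtf_stats) if col in s]
        if candidates:
            merged[col] = max(candidates)  # MAX merge, as in the table
    return merged


def merge_split(select_stats, udtf_stats, output_columns, num_sel_columns):
    """Split lookup: the first num_sel_columns outputs consult SELECT stats only."""
    merged = {}
    for col in output_columns[:num_sel_columns]:
        if col in select_stats:
            merged[col] = select_stats[col]
    for col in output_columns[num_sel_columns:]:
        if col in udtf_stats:
            merged[col] = udtf_stats[col]
    return merged


select_stats = {"_col0": 2.812, "_col1": 6.812}
udtf_stats = {"_col0": 56.0}  # array expression, named by the restarted generator
schema = ["_col0", "_col1", "_col8"]

before = merge_original(select_stats, udtf_stats, schema)
after = merge_split(select_stats, udtf_stats, schema, num_sel_columns=2)
print(before, after)
```

In both variants `_col8` ends up without a stat, matching the "not found in udtfStats" row; the only difference is whether SELECT's `_col0` is exposed to the UDTF branch's unrelated `_col0`.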

Final Column Statistics

| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |

Data Size: LVJ Debug Output (500 rows)

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |
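The totals above are just the per-row sum of avgColLen multiplied by the row count. A minimal check of that arithmetic, assuming nothing beyond the figures in the tables (`estimated_data_size` is an illustrative helper, not a Hive function):

```python
def estimated_data_size(avg_col_lens, num_rows):
    """Sum the per-column average lengths and scale by the row count."""
    return round(sum(avg_col_lens) * num_rows)


NUM_ROWS = 500

original_total = estimated_data_size([56.0, 6.812], NUM_ROWS)   # _col0 overridden
proposed_total = estimated_data_size([2.812, 6.812], NUM_ROWS)  # _col0 preserved

print(original_total, proposed_total)
```

The result is rounded because the inputs are floating-point averages; the rounded values agree with the byte counts reported in the table.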

Data Size: EXPLAIN Output (500 rows)

| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
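The EXPLAIN figures are larger than the debug figures, but they appear to follow from the same avgColLen values. The sketch below assumes each string column is costed as its average length rounded up plus a fixed 84 bytes; that constant is inferred purely from the numbers in these tables and is not taken from Hive's source, so treat it as a hypothesis that happens to fit.

```python
import math

# Assumption: fixed per-string overhead inferred from the table (87 - ceil(2.812)).
ASSUMED_STRING_OVERHEAD = 84


def explain_col_len(avg_col_len):
    """Per-row size of one string column under the assumed overhead model."""
    return math.ceil(avg_col_len) + ASSUMED_STRING_OVERHEAD


NUM_ROWS = 500

key_original = explain_col_len(56.0)   # overridden by the UDTF's _col0
key_proposed = explain_col_len(2.812)  # SELECT's own _col0
value_len = explain_col_len(6.812)

original_total = (key_original + value_len) * NUM_ROWS
proposed_total = (key_proposed + value_len) * NUM_ROWS

print(key_original, key_proposed, value_len, original_total, proposed_total)
```

Under this model the 26,500-byte gap between the two EXPLAIN totals comes entirely from the key column's avgColLen being 56.0 instead of 2.812.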

minReductionHashAggr: 0.4
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
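To compare the patched and original `Statistics:` lines quoted in these review comments, it can help to reduce each to bytes per row. The helper below is a small illustrative parser (its name and regular expression are assumptions of this sketch, not part of Hive or its test tooling):

```python
import re

# Matches the row-count and data-size fields of an EXPLAIN "Statistics:" line.
STATS_RE = re.compile(r"Num rows: (\d+) Data size: (\d+)")


def bytes_per_row(stats_line):
    """Return Data size divided by Num rows for one Statistics line."""
    match = STATS_RE.search(stats_line)
    if match is None:
        raise ValueError(f"no row/size figures in: {stats_line!r}")
    rows, size = (int(group) for group in match.groups())
    return size / rows


patched = "Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE"
original = "Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE"

print(bytes_per_row(patched), bytes_per_row(original))
```

For this operator the row count is unchanged at 6, so the whole difference is in the estimated width of each row.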

null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

minReductionHashAggr: 0.4
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

keys: KEY._col0 (type: string), KEY._col1 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

Group By Operator
aggregations: count()
keys: _col0 (type: string)
minReductionHashAggr: 0.99

original code output:
minReductionHashAggr: 0.4

minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

null sort order: z
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

Group By Operator
aggregations: count()
keys: _col0 (type: string)
minReductionHashAggr: 0.99

original code output:
minReductionHashAggr: 0.4

minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

null sort order: z
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

@konstantinb (Contributor, Author) commented:

Original output for the new .q file: lvj_stats_isolation.q.out.txt
