HIVE-29473: prevent combining stats between SELECT and LV fields #6331

Draft

konstantinb wants to merge 3 commits into apache:master from konstantinb:HIVE-29473

Contributor

konstantinb commented Feb 21, 2026

What changes were proposed in this pull request?

HIVE-29473: preventing stats override of select columns with 2+ LVs

Why are the changes needed?

TBD

Does this PR introduce any user-facing change?

No

How was this patch tested?

TBD


          HIVE-29473: preventing stats override of select columns with 2+ LVs

533bce5

asf-ci-hive added tests pending tests unstable and removed tests pending labels


          HIVE-29473: better use of existing methods/libraries, unit testing and corrected.out files

9f854a3

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels

konstantinb changed the title ~~HIVE-29473: preventing stats override of select columns with 2+ LVs~~ HIVE-29473: prevent combining stats between SELECT and LV fields

konstantinb commented

ql/src/test/results/clientpositive/llap/union26.q.out

                                           expressions: _col0 (type: string), _col1 (type: string)
                                           outputColumnNames: _col0, _col1
-                                          Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
+                                          Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

This is a typical example of lateral view (LV) column stats inflating the data size estimates of SELECT columns:

Column Naming

| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | n/a |
| UDTF internal stats | _col0 | array expression input | 56.0 |

The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression — colliding with SELECT's _col0.
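
The collision can be reproduced with a small model. This is a minimal sketch, not Hive's actual code: the dictionaries and the `merge_by_name` helper are hypothetical stand-ins, and only the avgColLen values come from the comment.

```python
# Simplified model of the name collision; not Hive's actual data structures.
# avgColLen values are taken from the comment; all names are hypothetical.

# Stats for the SELECT branch, keyed by LVJ output column name.
select_stats = {"_col0": 2.812, "_col1": 6.812}

# Stats inside the UDTF branch. Its column generator restarts at 0,
# so the array expression input is also called _col0.
udtf_stats = {"_col0": 56.0}


def merge_by_name(base, other):
    """Merge two avgColLen maps by column name, keeping the larger value."""
    merged = dict(base)
    for name, avg_len in other.items():
        merged[name] = max(merged.get(name, 0.0), avg_len)
    return merged


merged = merge_by_name(select_stats, udtf_stats)
print(merged["_col0"])  # 56.0: the array's length replaces the key's 2.812
```

Because both branches use the same generated name for unrelated columns, any merge keyed on the name alone cannot tell them apart.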

Processing Comparison

| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
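
The "Proposed Fix" column can be illustrated roughly as follows. This is a sketch under stated assumptions, not the patch itself: `expr_map`, `num_sel_columns`, and `lvj_column_stats` are hypothetical names that mirror the expression map, numSelColumns, and the per-branch lookup described in the comment, and each expression is reduced to the single input column it reads.

```python
# Simplified model of the proposed split; names are hypothetical, not Hive's API.

# LVJ output column -> input column it reads, in schema order.
# The first num_sel_columns entries come from the SELECT branch.
expr_map = {"_col0": "_col0", "_col1": "_col1", "_col8": "col"}
num_sel_columns = 2

select_stats = {"_col0": 2.812, "_col1": 6.812}
udtf_stats = {"_col0": 56.0}  # array expression input inside the UDTF branch


def lvj_column_stats(expr_map, num_sel_columns, select_stats, udtf_stats):
    """Resolve each output column only against the branch that produced it."""
    names = list(expr_map)
    select_names = names[:num_sel_columns]
    udtf_names = names[num_sel_columns:]

    result = {}
    for name in select_names:
        source = expr_map[name]
        if source in select_stats:
            result[name] = select_stats[source]
    for name in udtf_names:
        source = expr_map[name]
        if source in udtf_stats:  # "col" is absent, so _col8 gets no entry
            result[name] = udtf_stats[source]
    return result


stats = lvj_column_stats(expr_map, num_sel_columns, select_stats, udtf_stats)
print(stats)  # {'_col0': 2.812, '_col1': 6.812}
```

With the schema split by position, the UDTF stats are never consulted for `_col0`, so the SELECT key keeps its own avgColLen.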

Final Column Statistics

| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |

Data Size — LVJ Debug Output (500 rows)

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |

Data Size — EXPLAIN Output (500 rows)

| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
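
The totals in both data size sections can be checked with a few lines of arithmetic. The sketch assumes, as the "Calculation" rows state, that the data size is the per-row total multiplied by the row count. The `data_size` helper is hypothetical, and Hive's estimator may apply its own rounding.

```python
# Data size = per-row total of column sizes x number of rows.
# All figures are copied from the tables in the comment.

def data_size(avg_col_lens, num_rows):
    """Sum per-column average lengths and scale by the row count."""
    return round(sum(avg_col_lens) * num_rows)


NUM_ROWS = 500

# LVJ debug output (raw avgColLen values)
lvj_original = data_size([56.0, 6.812], NUM_ROWS)
lvj_fixed = data_size([2.812, 6.812], NUM_ROWS)

# EXPLAIN output (per-column sizes as reported there)
explain_original = data_size([140, 91], NUM_ROWS)
explain_fixed = data_size([87, 91], NUM_ROWS)

print(lvj_original, lvj_fixed)          # 31406 4812
print(explain_original, explain_fixed)  # 115500 89000
```

The EXPLAIN figures match the `Data size` change from 115500 to 89000 in union26.q.out.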


          HIVE-29473: further code optimizations + bug-specific test file

7f48c9d

asf-ci-hive added tests pending and removed tests passed labels

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                        minReductionHashAggr: 0.4
+                                        mode: hash
+                                        outputColumnNames: _col0, _col1, _col2
+                                        Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                          null sort order: zz
+                                          sort order: ++
+                                          Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
+                                          Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                          minReductionHashAggr: 0.4
+                                          mode: hash
+                                          outputColumnNames: _col0, _col1, _col2
+                                          Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                            null sort order: zz
+                                            sort order: ++
+                                            Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
+                                            Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              keys: KEY._col0 (type: string), KEY._col1 (type: string)
+                              mode: mergepartial
+                              outputColumnNames: _col0, _col1, _col2
+                              Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE
+                              File Output Operator
+                                compressed: false
+                                Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                      Group By Operator
+                                        aggregations: count()
+                                        keys: _col0 (type: string)
+                                        minReductionHashAggr: 0.99

Contributor Author

konstantinb Feb 23, 2026

original code output:
minReductionHashAggr: 0.4

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                        minReductionHashAggr: 0.99
+                                        mode: hash
+                                        outputColumnNames: _col0, _col1
+                                        Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                          null sort order: z
+                                          sort order: +
+                                          Map-reduce partition columns: _col0 (type: string)
+                                          Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                        Group By Operator
+                                          aggregations: count()
+                                          keys: _col0 (type: string)
+                                          minReductionHashAggr: 0.99

Contributor Author

konstantinb Feb 23, 2026

original code output:
minReductionHashAggr: 0.4

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                          minReductionHashAggr: 0.99
+                                          mode: hash
+                                          outputColumnNames: _col0, _col1
+                                          Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                                            null sort order: z
+                                            sort order: +
+                                            Map-reduce partition columns: _col0 (type: string)
+                                            Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              keys: KEY._col0 (type: string)
+                              mode: mergepartial
+                              outputColumnNames: _col0, _col1
+                              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb commented

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
+                              File Output Operator
+                                compressed: false
+                                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb Feb 23, 2026

original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

Contributor Author

konstantinb commented Feb 23, 2026

Original output for the new .q file: lvj_stats_isolation.q.out.txt

sonarqubecloud bot commented Feb 23, 2026

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

asf-ci-hive added tests passed and removed tests pending labels
