HIVE-29473: prevent combining stats between SELECT and LV fields #6331
konstantinb wants to merge 3 commits into apache:master
…d corrected.out files
| expressions: _col0 (type: string), _col1 (type: string) | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE |
This is a typical example of lateral view (LV) column stats affecting the data size estimates of SELECT columns:
Column Naming
| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | — |
| UDTF internal stats | _col0 | array expression input | 56.0 |
The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression, which collides with SELECT's _col0.
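To make the naming collision concrete, here is a minimal sketch. `generateNames` and `collisions` are hypothetical helpers written for illustration, not Hive APIs; the only assumption taken from the tables above is that each branch numbers its internal columns independently, starting from 0.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnNameCollisionSketch {

    // Hypothetical stand-in for a branch-local generator of internal column names.
    static List<String> generateNames(int start, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add("_col" + (start + i));
        }
        return names;
    }

    // Names present in both branches, which makes them ambiguous as lookup keys.
    static List<String> collisions(List<String> a, List<String> b) {
        List<String> shared = new ArrayList<>(a);
        shared.retainAll(b);
        return shared;
    }

    public static void main(String[] args) {
        List<String> selectNames = generateNames(0, 2); // SELECT's key and value
        List<String> udtfNames = generateNames(0, 1);   // array expression; numbering restarts at 0
        System.out.println("SELECT branch: " + selectNames);
        System.out.println("UDTF branch:   " + udtfNames);
        System.out.println("Colliding:     " + collisions(selectNames, udtfNames));
    }
}
```

Because both branches produce `_col0`, any stats lookup keyed only by name cannot tell the SELECT key apart from the UDTF's array input.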
Processing Comparison
| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
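The comparison above can be sketched roughly as follows. This is a simplified model, not the actual Hive code: the class, the method names, and the use of plain maps from column name to avgColLen are all assumptions made for illustration. The expression map is reduced to "output column -> referenced column name", with `_col8` referring to `col` as in the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LvjStatsMergeSketch {

    // Simplified expression map: output column -> name of the column its expression refers to.
    static Map<String, String> sampleExprMap() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("_col0", "_col0"); // SELECT's key
        m.put("_col1", "_col1"); // SELECT's value
        m.put("_col8", "col");   // UDTF's exploded element
        return m;
    }

    static Map<String, Double> sampleSelectStats() {
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("_col0", 2.812);
        m.put("_col1", 6.812);
        return m;
    }

    static Map<String, Double> sampleUdtfStats() {
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("_col0", 56.0); // array expression input, named by the restarted generator
        return m;
    }

    // Shared map: every output column is looked up in both branches; hits are merged with MAX.
    static Map<String, Double> mergeShared(Map<String, String> exprMap,
            Map<String, Double> selectStats, Map<String, Double> udtfStats) {
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Double s = selectStats.get(e.getValue());
            Double u = udtfStats.get(e.getValue());
            if (s != null && u != null) {
                merged.put(e.getKey(), Math.max(s, u));
            } else if (s != null) {
                merged.put(e.getKey(), s);
            } else if (u != null) {
                merged.put(e.getKey(), u);
            }
        }
        return merged;
    }

    // Split map: the first numSelColumns consult only SELECT stats, the rest only UDTF stats.
    static Map<String, Double> mergeSplit(Map<String, String> exprMap, int numSelColumns,
            Map<String, Double> selectStats, Map<String, Double> udtfStats) {
        Map<String, Double> merged = new LinkedHashMap<>();
        int index = 0;
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Map<String, Double> source = index < numSelColumns ? selectStats : udtfStats;
            Double stat = source.get(e.getValue());
            if (stat != null) {
                merged.put(e.getKey(), stat);
            }
            index++;
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("shared: "
                + mergeShared(sampleExprMap(), sampleSelectStats(), sampleUdtfStats()));
        System.out.println("split:  "
                + mergeSplit(sampleExprMap(), 2, sampleSelectStats(), sampleUdtfStats()));
    }
}
```

With the sample data, `mergeShared` reports 56.0 for `_col0`, while `mergeSplit` keeps 2.812. In both cases `_col8` gets no entry, because `col` is not a key in the UDTF stats.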
Final Column Statistics
| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |
Data Size — LVJ Debug Output (500 rows)
| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |
Data Size — EXPLAIN Output (500 rows)
| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |
| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
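The totals in the data size tables are the row count multiplied by the sum of the per-column average lengths. The sketch below only reproduces that arithmetic; `dataSize` is a hypothetical helper and is not meant to mirror Hive's actual estimator.

```java
public class DataSizeSketch {

    // Rows times the sum of per-column average lengths, rounded to whole bytes.
    static long dataSize(long numRows, double... avgColLens) {
        double perRow = 0.0;
        for (double len : avgColLens) {
            perRow += len;
        }
        return Math.round(perRow * numRows);
    }

    public static void main(String[] args) {
        // LVJ debug output, 500 rows
        System.out.println("LVJ original: " + dataSize(500, 56.0, 6.812));
        System.out.println("LVJ fixed:    " + dataSize(500, 2.812, 6.812));
        // EXPLAIN output, 500 rows
        System.out.println("EXPLAIN original: " + dataSize(500, 140, 91));
        System.out.println("EXPLAIN fixed:    " + dataSize(500, 87, 91));
    }
}
```

The single inflated `_col0` length accounts for the whole difference in both pairs of totals, since the `value` column contributes the same amount either way.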
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string), KEY._col1 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
Original output for the new .q file: lvj_stats_isolation.q.out.txt



What changes were proposed in this pull request?
HIVE-29473: prevent the stats of SELECT columns from being overridden when a query has 2+ LVs
Why are the changes needed?
TBD
Does this PR introduce any user-facing change?
No
How was this patch tested?
TBD