Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Run the fullstack test in a next-gen columnar environment:
cd tests/fullstack-test-next-gen
./compose.sh exec -T tiflash-cn0 bash -c \
'cd /tests && ENABLE_NEXT_GEN=true verbose=true ./run-test.sh fullstack-test/mpp/late_materialization_extra_table_id_column.test'
The test case (tests/fullstack-test/mpp/late_materialization_extra_table_id_column.test) does the following:
- Create a range-partitioned table test.t(id, age, t time, key(id)).
- Bulk-load data, set TiFlash replica, and run ANALYZE.
- In a single transaction:
insert into test.t values (11, 10, ...), (12, 11, ...);
select * from test.t where id > 10;
select hour(t) as hour, sum(age) from test.t where id > 10 group by hour;
Key conditions that trigger the bug:
- Partition table scan with dynamic partition pruning (tidb_partition_prune_mode='dynamic')
- Late materialization (select * includes virtual column _tidb_tid, column id -3)
- Uncommitted rows still in memtable (insert + select in the same transaction)
2. What did you expect to see? (Required)
Both queries in the transaction should succeed and return:
+------+------+-----------+
| id | age | t |
+------+------+-----------+
| 11 | 10 | 700:11:11 |
| 12 | 11 | 710:11:11 |
+------+------+-----------+
+------+----------+
| hour | sum(age) |
+------+----------+
| 710 | 11 |
| 700 | 10 |
+------+----------+
The virtual _tidb_tid column (EXTRA_PHYSICAL_TABLE_ID_COL_ID = -3) should be filled by TiFlash locally, not read from storage.
3. What did you see instead (Required)
The select * query fails. TiFlash logs:
read_block failed in tiflash-proxy
Read block from proxy failed
Proxy (kvengine) logs:
ffi_read_block failed: table error Schema out of date: tbl:2568 col:-3 read null for not null column
The failure happens on the partition region that contains the newly inserted rows (large memtable, e.g. region 1297 with ~2.5MB memtable). Other partition regions return 0 rows and do not hit the error. The follow-up group by query can succeed because it does not include column -3 in the scan.
4. What is your TiFlash version? (Required)
Observed on v9.0.0-beta.2.pre-170-g8ec02509ae (next-gen columnar / cloud-storage-engine path).
Root cause analysis (current understanding)
There is an inconsistency between TiFlash and kvengine on how virtual column -3 (_tidb_tid / EXTRA_PHYSICAL_TABLE_ID_COL_ID) is handled in the next-gen columnar read path.
TiFlash side (correct design):
- genColumnDefinesForDisaggregatedRead() excludes extra_table_id_col_id from the columns sent to the read path and only records extra_table_id_index.
- When deserializing proxy blocks, TiFlash skips column -3 and fills it locally via action.fill(block, physical_table_id).
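To illustrate the intended behaviour, here is a minimal, self-contained sketch of filling the virtual column locally at a recorded index. It is not TiFlash's actual code: Block, Column, and fillExtraTableIdColumn are hypothetical stand-ins, and the real action.fill() may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only: a block is a list of columns,
// and a column is a list of integer values.
using Column = std::vector<int64_t>;
using Block = std::vector<Column>;

// Insert a constant column holding physical_table_id at extra_table_id_index.
// The values are generated locally; nothing is read from storage for it.
void fillExtraTableIdColumn(Block & block, size_t extra_table_id_index, int64_t physical_table_id, size_t rows)
{
    assert(extra_table_id_index <= block.size());
    block.insert(
        block.begin() + static_cast<std::ptrdiff_t>(extra_table_id_index),
        Column(rows, physical_table_id));
}
```

In this model the storage layer only ever returns the real columns, and the table id column is appended afterwards from metadata the reader already holds.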
Bug:
- RNProxyReader::createProxyReader() builds table_info from table_scan_pb columns and only skips generated columns. It still passes column -3 to fn_get_columnar_reader().
- kvengine new_schema_from_columns() filters handle column -1 but not -3, so the columnar/row decoder tries to read a not-null column that does not exist in memtable/row data.
- Decoding fails with read null for not null column → SchemaOutOfDate.
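The failure mode above can be modelled in a few lines. This is a simplified sketch written in C++ for consistency with the TiFlash side. kvengine itself is not C++, and ColumnSpec, schemaFromColumns, and decodeRow are hypothetical names that only mimic the behaviour described in this report.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

constexpr int64_t HANDLE_COL_ID = -1;

// Hypothetical description of a requested column.
struct ColumnSpec
{
    int64_t id;
    bool not_null;
};

// Models the filtering described above: only the handle column (-1) is dropped,
// so the virtual column -3 stays in the decode schema.
std::vector<ColumnSpec> schemaFromColumns(const std::vector<ColumnSpec> & requested)
{
    std::vector<ColumnSpec> schema;
    for (const auto & c : requested)
        if (c.id != HANDLE_COL_ID)
            schema.push_back(c);
    return schema;
}

// Returns an error message if a not-null schema column has no stored value,
// otherwise std::nullopt. `stored` maps column id to the value found in the row.
std::optional<std::string> decodeRow(
    const std::vector<ColumnSpec> & schema,
    const std::map<int64_t, int64_t> & stored)
{
    for (const auto & c : schema)
        if (c.not_null && stored.find(c.id) == stored.end())
            return "col:" + std::to_string(c.id) + " read null for not null column";
    return std::nullopt;
}
```

Because column -3 is never stored in a row, any non-empty scan that keeps it in the schema hits the error path. This is consistent with only the region holding the newly inserted rows failing.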
Suggested fix:
- TiFlash (primary): In createProxyReader(), skip MutSup::extra_table_id_col_id when building table_info, consistent with genColumnDefinesForDisaggregatedRead().
- kvengine (defensive): In new_schema_from_columns() / CloudColumnarReader::new, filter EXTRA_PHYSICAL_TABLE_ID_COL_ID (-3) similar to HANDLE_COL_ID (-1).
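A rough sketch of the primary fix, assuming the column list is built by iterating over the scan columns. ScanColumn and columnsForProxyReader are hypothetical names used for illustration; the real createProxyReader() operates on table_scan_pb and table_info, and its structure may differ.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the value of EXTRA_PHYSICAL_TABLE_ID_COL_ID given in this report.
constexpr int64_t EXTRA_TABLE_ID_COL_ID = -3;

// Hypothetical view of one column in the table scan.
struct ScanColumn
{
    int64_t id;
    bool is_generated;
};

// Build the column list handed to the proxy reader, skipping generated columns
// (already skipped today, per the analysis above) and the virtual table id column.
std::vector<ScanColumn> columnsForProxyReader(const std::vector<ScanColumn> & scan_columns)
{
    std::vector<ScanColumn> result;
    for (const auto & col : scan_columns)
    {
        if (col.is_generated)
            continue;
        if (col.id == EXTRA_TABLE_ID_COL_ID)
            continue; // filled locally by TiFlash, never read from storage
        result.push_back(col);
    }
    return result;
}
```

With this filter the proxy is asked only for columns that exist in row data. The defensive kvengine change would apply the same check on the other side of the FFI boundary.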
The legacy DM read path does not have this issue because virtual columns are handled entirely on the TiFlash side.