Description
Apache Iceberg version
0.13.2
Query engine
Hive
Please describe the bug 🐞
In Iceberg upsert mode, create v2 table like this:
create table upsert_update_time_test(
id bigint comment 'pk',
data bigint comment 'data',
update_time string comment 'update_time'
)
comment 'upsert_update_time_test'
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES (
'engine.hive.enabled'='true',
'write.metadata.delete-after-commit.enabled'='true',
'write.target-file-size-bytes'='134217728',
'write.metadata.previous-versions-max'='5',
'write.metadata.metrics.default'='full',
'format-version'='2'
);
Write data to iceberg with Flink like the code below:
FlinkSink.forRow(rowDataStream, tableSchema)
.tableLoader(tableLoader)
.tableSchema(tableSchema)
.upsert(true)
.writeParallelism(1)
.equalityFieldColumns(ImmutableList.of("id"))
.append();
And send data to like this:
$ nc -lk 3287
I,1,101,2022-08-26 15:44:50
U,1,103,2022-08-26 15:45:23
Finally, using hive and spark both got the following query results:
select * from upsert_update_time_test;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.107 seconds, Fetched: 1 row(s)
hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:45:00';
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 101 2022-08-26 15:44:50
Time taken: 0.76 seconds, Fetched: 1 row(s)
hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:46:00';
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 1.26 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where data <= 102;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 101 2022-08-26 15:44:50
Time taken: 0.119 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where data <= 103;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where id = 1;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.134 seconds, Fetched: 1 row(s)
The above query results show that the v2 table can query the historical version of the data when it meets the historical data conditions. Is this a bug or is there something wrong with my operation? Anybody else met this?