Skip to content

The upsert mode can query the historical version of the data under certain conditions #5671

Closed as not planned
@Shane-Yu

Description

@Shane-Yu

Apache Iceberg version

0.13.2

Query engine

Hive

Please describe the bug 🐞

In Iceberg upsert mode, create v2 table like this:

create table upsert_update_time_test(
id bigint comment 'pk',
data bigint comment 'data',
update_time string comment 'update_time'
)
comment 'upsert_update_time_test'
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES (
'engine.hive.enabled'='true',
'write.metadata.delete-after-commit.enabled'='true',
'write.target-file-size-bytes'='134217728',
'write.metadata.previous-versions-max'='5',
'write.metadata.metrics.default'='full',
'format-version'='2'
);

Write data to iceberg with Flink like the code below:

FlinkSink.forRow(rowDataStream, tableSchema)
.tableLoader(tableLoader)
.tableSchema(tableSchema)
.upsert(true)
.writeParallelism(1)
.equalityFieldColumns(ImmutableList.of("id"))
.append();

And send data to like this:

$ nc -lk 3287
I,1,101,2022-08-26 15:44:50
U,1,103,2022-08-26 15:45:23
image

Finally, using hive and spark both got the following query results:

select * from upsert_update_time_test;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.107 seconds, Fetched: 1 row(s)
hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:45:00';
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 101 2022-08-26 15:44:50
Time taken: 0.76 seconds, Fetched: 1 row(s)
hive (iceberg_yx)> select * from upsert_update_time_test where update_time <= '2022-08-26 15:46:00';
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 1.26 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where data <= 102;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 101 2022-08-26 15:44:50
Time taken: 0.119 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where data <= 103;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive (iceberg_yx)>
> select * from upsert_update_time_test where id = 1;
OK
upsert_update_time_test.id upsert_update_time_test.data upsert_update_time_test.update_time
1 103 2022-08-26 15:45:23
Time taken: 0.134 seconds, Fetched: 1 row(s)

image

image

The above query results show that the v2 table can query the historical version of the data when it meets the historical data conditions. Is this a bug or is there something wrong with my operation? Anybody else met this?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions