Closed as not planned
Description
Apache Iceberg version
1.1.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
source structure example: s3://bucket/data/id=123/name=test/date=321/result.orc
CALL iceberg_catalog.system.add_files(
  table => 'test.test_name',
  source_table => '`orc`.`s3://bucket/data/`',
  partition_filter => map('id', '3'),
  check_duplicate_files => false
)
The partition_filter option does not take the order of the partition columns into account, so add_files also scans the nested folders of partitions that do not match. Should the filter be applied to each partition level in order, before the nested "Listing leaf files and directories" step runs?
Example of the current flow:
s3://bucket/data/id=1/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
s3://bucket/data/id=2/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
s3://bucket/data/id=3/name=test/date=321/result.orc -> Match needed partition_filter
s3://bucket/data/id=4/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
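To make the question concrete, here is a minimal pure-Python sketch of the two listing strategies over the layout above. It is not Iceberg's or Spark's code: `FILES`, `children`, `partition_values`, and `list_leaves` are hypothetical names, and the "list call" counter is only a stand-in for the S3 listing requests that a real run would issue.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the bucket: the four files from the example above.
FILES = [f"data/id={i}/name=test/date=321/result.orc" for i in range(1, 5)]


def children(prefix: str) -> List[str]:
    """Simulate one list call: immediate child names under prefix."""
    names: List[str] = []
    for f in FILES:
        if f.startswith(prefix + "/"):
            name = f[len(prefix) + 1:].split("/", 1)[0]
            if name not in names:
                names.append(name)
    return names


def partition_values(path: str) -> Dict[str, str]:
    """Parse key=value directory segments out of a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def list_leaves(prefix: str, flt: Dict[str, str], prune: bool) -> Tuple[List[str], int]:
    """Return (files matching flt, number of simulated list calls)."""
    calls, found = 1, []
    for name in children(prefix):
        path = f"{prefix}/{name}"
        if path in FILES:
            # Leaf file: keep it only if its partition values match the filter.
            values = partition_values(path)
            if all(values.get(k) == v for k, v in flt.items()):
                found.append(path)
            continue
        if prune and "=" in name:
            key, value = name.split("=", 1)
            if key in flt and flt[key] != value:
                continue  # skip the whole subtree without listing it
        sub, n = list_leaves(path, flt, prune)
        found += sub
        calls += n
    return found, calls


full = list_leaves("data", {"id": "3"}, prune=False)
pruned = list_leaves("data", {"id": "3"}, prune=True)
print("list everything, then filter:", full)
print("filter at each level:", pruned)
```

Both strategies return the same single file, but in this toy layout the unpruned walk makes 13 simulated list calls and the pruned walk makes 4, because the `id=1`, `id=2`, and `id=4` subtrees are never opened.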
Also, if the table is partitioned by id, name, date and I specify:
CALL iceberg_catalog.system.add_files(
  table => 'test.test_name',
  source_table => '`orc`.`s3://bucket/data/id=1/name=test/`',
  check_duplicate_files => false
)
Iceberg ignores these partitions and sets them to null in the table instead of deriving their values from the path. In Spark this is handled by setting basePath before the partitions are read, but here InMemoryFileIndex is used, apparently without any way to do that?
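The effect can be illustrated with a small pure-Python sketch of path-based partition discovery. This is an assumption about the general mechanism, not the actual InMemoryFileIndex logic, and `partition_values_from_path` is a hypothetical helper: only key=value directories below the chosen base path are visible, so starting the listing at a deeper folder hides the levels above it.

```python
from typing import Dict


def partition_values_from_path(file_path: str, base_path: str) -> Dict[str, str]:
    """Collect key=value directory segments that sit below base_path."""
    base = base_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path} is not under {base_path}")
    relative = file_path[len(base):]
    directories = relative.split("/")[:-1]  # drop the file name itself
    return dict(d.split("=", 1) for d in directories if "=" in d)


file_path = "s3://bucket/data/id=1/name=test/date=321/result.orc"

# Base at the table root: all three partition levels are discovered.
from_root = partition_values_from_path(file_path, "s3://bucket/data/")

# Base at the deeper folder passed as source_table: id and name are lost.
from_leaf = partition_values_from_path(file_path, "s3://bucket/data/id=1/name=test/")

print(from_root)
print(from_leaf)
```

With the base at `s3://bucket/data/` the sketch recovers id, name, and date. With the base at `s3://bucket/data/id=1/name=test/` only date remains, which mirrors the null id and name values described above.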