Iceberg add_files procedure with partition_filter scan non needed folders #7027

@sweetpythoncode

Apache Iceberg version

1.1.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Source structure example: s3://bucket/data/id=123/name=test/date=321/result.orc

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/`',
    partition_filter => map('id', '3'),
    check_duplicate_files => false
)

The partition_filter option does not take the order of the partitions into account, so the nested folders of every partition are scanned, not only those of the partition that matches the filter. Should we apply the filter to each partition level in order, before running the nested "Listing leaf files and directories" step?

Example of current flow:

s3://bucket/data/id=1/name=test/date=321/result.orc -> "Listing leaf files and directories" runs on each subfolder
s3://bucket/data/id=2/name=test/date=321/result.orc -> "Listing leaf files and directories" runs on each subfolder
s3://bucket/data/id=3/name=test/date=321/result.orc -> matches the partition_filter
s3://bucket/data/id=4/name=test/date=321/result.orc -> "Listing leaf files and directories" runs on each subfolder
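
To make the proposal concrete, here is a minimal sketch in plain Python of what "filter by partition in order" could look like. It is not Iceberg or Spark code: `pruned_listing`, `make_lister` and the in-memory file list are hypothetical names invented for illustration, and the lister stands in for whatever file system call does the real listing. The idea is only that a `key=value` directory contradicting the filter is skipped before anything beneath it is listed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lister type: directory path -> [(child name, is_directory)]
Lister = Callable[[str], List[Tuple[str, bool]]]


def pruned_listing(root: str, list_dir: Lister, partition_filter: Dict[str, str]) -> List[str]:
    """Collect leaf files, skipping partition directories that contradict the filter."""
    files: List[str] = []
    pending = [root.rstrip("/")]
    while pending:
        current = pending.pop()
        for name, is_dir in list_dir(current):
            path = f"{current}/{name}"
            if not is_dir:
                files.append(path)
                continue
            key, sep, value = name.partition("=")
            # Prune as soon as a key=value directory disagrees with the filter,
            # so nothing underneath it is ever listed.
            if sep and key in partition_filter and partition_filter[key] != value:
                continue
            pending.append(path)
    return sorted(files)


def make_lister(file_paths: List[str], calls: List[str]) -> Lister:
    """Build an in-memory lister over file_paths that records every listed directory."""
    def list_dir(path: str) -> List[Tuple[str, bool]]:
        calls.append(path)
        prefix = path + "/"
        children: Dict[str, bool] = {}
        for file_path in file_paths:
            if file_path.startswith(prefix):
                head, _, rest = file_path[len(prefix):].partition("/")
                children[head] = bool(rest)  # anything with a remainder is a directory
        return sorted(children.items())
    return list_dir


FILES = [f"s3://bucket/data/id={i}/name=test/date=321/result.orc" for i in (1, 2, 3, 4)]
calls: List[str] = []
result = pruned_listing("s3://bucket/data/", make_lister(FILES, calls), {"id": "3"})
print(result)
print(calls)
```

With the filter applied at the `id` level, only the root and the folders under `id=3` are listed in this sketch; `id=1`, `id=2` and `id=4` are never descended into.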

Also, if the table is partitioned by id, name, date and I specify

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/id=1/name=test/`',
    check_duplicate_files => false
)

Iceberg ignores the partitions that are already part of the source path (id and name) and sets them to null in the table, instead of deriving their values from the path. In Spark this is handled by setting basePath before the partitions are read, but here InMemoryFileIndex is used, apparently without the possibility to do that?
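
To illustrate why the base path matters, here is a small plain-Python sketch of Hive-style partition discovery relative to a base path. It is not the Spark or Iceberg implementation; `partition_values` is a hypothetical helper written for this example. It only shows that when discovery starts at the deeper folder, `id` and `name` are not seen at all, while starting at the table root recovers all three values.

```python
from typing import Dict


def partition_values(file_path: str, base_path: str) -> Dict[str, str]:
    """Parse key=value folder names found between base_path and the file name."""
    base = base_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path} is not under {base_path}")
    relative = file_path[len(base):]
    values: Dict[str, str] = {}
    # The last segment is the file name, so only the folders before it are inspected.
    for segment in relative.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


FILE = "s3://bucket/data/id=1/name=test/date=321/result.orc"

# Discovery rooted at the path passed as source_table: id and name are lost.
from_source_path = partition_values(FILE, "s3://bucket/data/id=1/name=test/")

# Discovery rooted at the table root (what a base path would provide).
from_base_path = partition_values(FILE, "s3://bucket/data/")

print(from_source_path)
print(from_base_path)
```

An option to pass such a base path to add_files, or falling back to the values parsed from the source path, would be one possible way to avoid the null partition columns.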

cc @RussellSpitzer @szehon-ho
