Cannot read partitions with special characters (including space) with pyarrow >= 11 #1393

Closed
@emanueledomingo

Environment

Delta-rs version:
Binding: Python
Environment: Python==3.10, deltalake==0.9.0, pyarrow==11.0.0

  • Cloud provider: N/A
  • OS: Ubuntu 22.04
  • Other:

Bug

What happened:

If a column's values contain special characters (including spaces), the writer percent-encodes them when that column is used as a partition. Reading the table back with a filter on the same partition column then finds nothing.

This bug occurs with pyarrow >= 11. It works with pyarrow 10.0.1, where special characters are not encoded.

What you expected to happen: A partition whose values contain special characters is read correctly, even if those values are encoded on disk.

How to reproduce it:

import deltalake as dl
import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle Stars", "Centipede"])
names = ["n_legs", "animals"]

pa_table = pa.Table.from_arrays([n_legs, animals], names=names)

dt_table_uri = "tmp"
dl.write_deltalake(dt_table_uri, pa_table, partition_by=["animals"], mode="overwrite")

dt_table = dl.DeltaTable(dt_table_uri)
dt_table.to_pyarrow_table(partitions=[("animals", "=", "Brittle Stars")]).num_rows

num_rows returns 0; 1 is expected.

More details:

The contents of the tmp/ folder are

$ ls -1 tmp/
'animals=Brittle%2520Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log

Note: even if I pass Brittle%2520Stars as the partition value, num_rows still returns 0.
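The directory name in the listing above looks like the partition value was percent-encoded twice: the space becomes %20, and then the % itself becomes %25. This is an inference from the folder name only, not a confirmed description of what deltalake or pyarrow do internally. A minimal sketch using only the standard library's urllib.parse (not the library's own code path) reproduces the same string:

```python
from urllib.parse import quote, unquote

value = "Brittle Stars"

# Percent-encoding once turns the space into %20.
once = quote(value, safe="")

# Encoding the already-encoded string escapes the "%" itself as %25.
twice = quote(once, safe="")

print(f"animals={once}")
print(f"animals={twice}")

# Decoding twice recovers the original partition value.
print(unquote(unquote(twice)))
```

If the double-encoding assumption holds, the second printed name matches the folder on disk, which would explain why neither the raw value nor a singly decoded one matches at read time.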

With pyarrow 10.0.1, the same script gives num_rows equal to 1 and the folder contents are

$ ls -1 tmp/
'animals=Brittle Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log

as expected.


Labels: bug
