Description
Environment
Delta-rs version:
Binding: Python
Environment: Python==3.10, deltalake==0.9.0, pyarrow==11.0.0
- Cloud provider: N/A
- OS: Ubuntu 22.04
- Other:
Bug
What happened:
If the values of a partition column contain special characters (including spaces), the writer encodes them in the partition directory names. If you then read the table with a partition filter on that column, it finds nothing.
This bug happens if the pyarrow version is >= 11. It works with pyarrow 10.0.1 (special characters not encoded).
What you expected to happen: Partitions whose values contain special characters are read correctly, even if the values are encoded on disk.
How to reproduce it:
import deltalake as dl
import pyarrow as pa
# Small table; one partition value ("Brittle Stars") contains a space
n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle Stars", "Centipede"])
names = ["n_legs", "animals"]
pa_table = pa.Table.from_arrays([n_legs, animals], names=names)
# Write to a local Delta table, partitioned by the string column
dt_table_uri = "tmp"
dl.write_deltalake(dt_table_uri, pa_table, partition_by=["animals"], mode="overwrite")
# Read back only the partition whose value contains the space
dt_table = dl.DeltaTable(dt_table_uri)
dt_table.to_pyarrow_table(partitions=[("animals", "=", "Brittle Stars")]).num_rows
num_rows returns 0, but 1 is expected.
More details:
The content of the tmp/ folder is
$ ls -1 tmp/
'animals=Brittle%2520Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log
Note: even if I use Brittle%2520Stars as the partition value, num_rows still returns 0.
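For what it's worth, the directory name above looks like the space was percent-encoded twice (space -> %20 -> %2520). The sketch below only illustrates that with the standard library's urllib.parse; it does not show where in deltalake or pyarrow the encoding actually happens, which I have not tracked down.

```python
from urllib.parse import quote, unquote

value = "Brittle Stars"

# First pass: the space becomes %20
once = quote(value, safe="")
# Second pass: the '%' itself is encoded as %25, giving %2520
twice = quote(once, safe="")

print(once)
print(twice)

# Decoding twice recovers the original partition value
assert unquote(unquote(twice)) == value
```

If that is what happens, the name on disk matches the second result, which would suggest the value is encoded once too often somewhere on the write path.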
With pyarrow 10.0.1, the same script gives num_rows equal to 1 and the folder is
$ ls -1 tmp/
'animals=Brittle Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log
as expected.