@wengh wengh commented Mar 12, 2025

Follow-up to #49961.

What changes were proposed in this pull request?

This PR adds the serialization and deserialization required to pass V1 data source filters from the JVM to Python. It also adds an equivalent Python dataclass representation of the filters.

To ensure that filter values are converted to Python values correctly, we serialize Catalyst values into binary with VariantVal, deserialize that binary into a Python VariantVal, and then convert it to Python values.

Examples

Supported filters

| SQL filter | Representation |
|---|---|
| a.b.c = 1 | EqualTo(("a", "b", "c"), 1) |
| a = 1 | EqualTo(("a",), 1) |
| a = 'hi' | EqualTo(("a",), "hi") |
| a = array(1, 2) | EqualTo(("a",), [1, 2]) |
| a | EqualTo(("a",), True) |
| not a | Not(EqualTo(("a",), True)) |
| a <> 1 | Not(EqualTo(("a",), 1)) |
| a > 1 | GreaterThan(("a",), 1) |
| a >= 1 | GreaterThanOrEqual(("a",), 1) |
| a < 1 | LessThan(("a",), 1) |
| a <= 1 | LessThanOrEqual(("a",), 1) |
| a in (1, 2, 3) | In(("a",), (1, 2, 3)) |
| a is null | IsNull(("a",)) |
| a is not null | IsNotNull(("a",)) |
| a like 'abc%' | StringStartsWith(("a",), "abc") |
| a like '%abc' | StringEndsWith(("a",), "abc") |
| a like '%abc%' | StringContains(("a",), "abc") |
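
To make the representation above concrete, here is a minimal, self-contained sketch of how a few of these filters could be modeled as frozen dataclasses and evaluated against a row. The classes are local stand-ins written for illustration, not the definitions added by this PR. The field names (`attribute`, `value`, `child`) and the `lookup` and `matches` helpers are assumptions, and null handling is simplified compared with SQL three-valued logic.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

# A column path is a tuple of names, e.g. ("a", "b", "c") for a.b.c.
ColumnPath = Tuple[str, ...]


# Stand-in filter classes (hypothetical; field names are assumptions).
@dataclass(frozen=True)
class EqualTo:
    attribute: ColumnPath
    value: Any


@dataclass(frozen=True)
class GreaterThan:
    attribute: ColumnPath
    value: Any


@dataclass(frozen=True)
class In:
    attribute: ColumnPath
    value: Tuple[Any, ...]


@dataclass(frozen=True)
class IsNull:
    attribute: ColumnPath


@dataclass(frozen=True)
class StringStartsWith:
    attribute: ColumnPath
    value: str


@dataclass(frozen=True)
class Not:
    child: Any


def lookup(row: Dict[str, Any], path: ColumnPath) -> Any:
    """Walk a nested dict along a column path; return None if a name is missing."""
    current: Any = row
    for name in path:
        if current is None:
            return None
        current = current.get(name)
    return current


def matches(f: Any, row: Dict[str, Any]) -> bool:
    """Evaluate one filter against one row (hypothetical helper)."""
    if isinstance(f, Not):
        return not matches(f.child, row)
    value = lookup(row, f.attribute)
    if isinstance(f, IsNull):
        return value is None
    if value is None:
        # Simplification: a comparison with null counts as "no match" here.
        return False
    if isinstance(f, EqualTo):
        return value == f.value
    if isinstance(f, GreaterThan):
        return value > f.value
    if isinstance(f, In):
        return value in f.value
    if isinstance(f, StringStartsWith):
        return value.startswith(f.value)
    raise ValueError(f"unsupported filter: {f!r}")


# Example: keep only the rows that satisfy every filter.
rows = [{"a": {"b": {"c": 1}}, "name": "abcdef"},
        {"a": {"b": {"c": 5}}, "name": "xyz"}]
filters = [EqualTo(("a", "b", "c"), 1), StringStartsWith(("name",), "abc")]
kept = [r for r in rows if all(matches(f, r) for f in filters)]
```

Frozen dataclasses compare by value and are hashable when their fields are, which is convenient when a data source needs to check which filters it can handle.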

Unsupported filters

  • a = b
  • f(a, b) = 1
  • a % 2 = 1
  • a[0] = 1
  • a < 0 or a > 1
  • a like 'c%c%'
  • a ilike 'hi'
  • a = 'hi' collate zh

Why are the changes needed?

The base PR #49961 supported only EqualTo with an int value. This PR adds support for many other filter types, which makes the Python Data Source filter pushdown API useful in practice.

Does this PR introduce any user-facing change?

Yes. Python Data Source now supports more pushdown filter types.

How was this patch tested?

End-to-end tests in test_python_datasource.py.

Was this patch authored or co-authored using generative AI tooling?

No

@wengh wengh changed the title [SPARK-51271][PYTHON] Filter serialization for Python Data Source filter pushdown [WIP][SPARK-51271][PYTHON] Filter serialization for Python Data Source filter pushdown Mar 12, 2025
@wengh wengh force-pushed the pyds-filter-serialization branch 5 times, most recently from 2bda8e3 to 0896723 Compare March 19, 2025 20:59
@wengh wengh force-pushed the pyds-filter-serialization branch from 0896723 to 8811dad Compare March 20, 2025 16:44
@wengh wengh changed the title [WIP][SPARK-51271][PYTHON] Filter serialization for Python Data Source filter pushdown [PYTHON] Filter serialization for Python Data Source filter pushdown Mar 20, 2025
@wengh wengh marked this pull request as ready for review March 20, 2025 18:58
@wengh wengh changed the title [PYTHON] Filter serialization for Python Data Source filter pushdown [PYTHON][SPARK-51574] Filter serialization for Python Data Source filter pushdown Mar 20, 2025
@wengh wengh changed the title [PYTHON][SPARK-51574] Filter serialization for Python Data Source filter pushdown [SPARK-51574][PYTHON] Filter serialization for Python Data Source filter pushdown Mar 20, 2025
@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 49295b3 Mar 26, 2025