[SPARK-51883][DOCS][PYTHON] Python Data Source user guide for filter pushdown #50684
Thanks for adding the docs!
Other methods such as DataSource.schema() and DataSourceStreamReader.latestOffset() can be stateful. Changes to the object state made in these methods are visible to future invocations.
Refer to the documentation of each method for more details.
Can we also link to the documentation here?
The method documentation doesn't actually mention whether each method can change state :(
most methods can change state so I guess we can just list the exceptions here
The following methods should not mutate internal state. Changes to the object state made in these methods may or may not be visible to future calls, so do not rely on either behavior.
- DataSourceReader.partitions()
- DataSourceReader.read()
- DataSourceStreamReader.read()
- SimpleDataSourceStreamReader.readBetweenOffsets()
- All writer methods
All other methods such as DataSource.schema() and DataSourceStreamReader.latestOffset() can be stateful. Changes to the object state made in these methods are visible to future calls.
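A minimal sketch of why state changed inside read() should not be relied on. It assumes the reader object is copied (serialized) before read() runs somewhere else, and it uses copy.deepcopy as a stand-in for that step. CountingReader is a hypothetical class for illustration, not part of PySpark.

```python
import copy


class CountingReader:
    """Hypothetical stand-in for a reader; not a PySpark class."""

    def __init__(self):
        self.calls = 0

    def read(self, partition):
        # Mutating state here is what the guidance above warns against.
        self.calls += 1
        return iter([(partition,)])


driver_reader = CountingReader()

# Assumption: the reader is copied before read() is invoked elsewhere.
# deepcopy only simulates that serialization round trip.
worker_reader = copy.deepcopy(driver_reader)
rows = list(worker_reader.read(0))

# The copy saw the mutation; the original object did not.
print(worker_reader.calls, driver_reader.calls)
```

If read() instead happened to run on the original object, the change would be visible, which is why the outcome is not guaranteed either way.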
from pyspark.sql.datasource import EqualTo, Filter, GreaterThan, LessThan
def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
Can we add a complete example here so that people can copy paste and try it out?
Changed to an example source that returns prime numbers sequentially
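For readers following the thread, here is a rough sketch of what a prime-number source with filter pushdown might look like. It is not the example from the PR. To run without Spark it defines small stand-in filter classes; in real code these would be imported from pyspark.sql.datasource, and their attribute/value shape here is an assumption. PrimesReader, its limit parameter, and the "value" column name are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Tuple


# Stand-ins so the sketch runs without Spark (assumed shape: column path + value).
@dataclass(frozen=True)
class Filter:
    attribute: Tuple[str, ...]
    value: Any


class EqualTo(Filter):
    pass


class GreaterThan(Filter):
    pass


class LessThan(Filter):
    pass


class PrimesReader:
    """Hypothetical reader that yields primes in [lower, upper)."""

    def __init__(self, limit: int = 100):
        self.lower = 2       # inclusive bound
        self.upper = limit   # exclusive bound

    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
        # Narrow the range for filters we understand; yield back the rest
        # so the engine still evaluates them after the scan.
        for f in filters:
            if f.attribute == ("value",) and isinstance(f.value, int):
                if isinstance(f, EqualTo):
                    self.lower = max(self.lower, f.value)
                    self.upper = min(self.upper, f.value + 1)
                    continue
                if isinstance(f, GreaterThan):
                    self.lower = max(self.lower, f.value + 1)
                    continue
                if isinstance(f, LessThan):
                    self.upper = min(self.upper, f.value)
                    continue
            yield f

    def read(self, partition=None) -> Iterator[Tuple[int]]:
        # Trial division is enough for a small illustrative range.
        for n in range(self.lower, self.upper):
            if all(n % d for d in range(2, int(n ** 0.5) + 1)):
                yield (n,)


reader = PrimesReader(limit=50)
# pushFilters is a generator here, so it must be consumed to take effect.
remaining = list(reader.pushFilters([
    GreaterThan(("value",), 10),
    LessThan(("value",), 30),
    EqualTo(("name",), "x"),   # not a supported column: returned unchanged
]))
primes = list(reader.read())
```

Storing the narrowed bounds on the reader fits the discussion above: pushFilters is not in the list of methods that should avoid mutating state, while read() only consumes what was stored.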
@wengh this looks good! Could you also attach a screenshot for the updated user guide to the PR description? Thanks!
done 🤗
Would you rebase onto the latest master? The CI test failure should have been fixed.
What changes were proposed in this pull request?
Update python_data_source.rst to add filter pushdown docs.
Why are the changes needed?
Feature was added but documentation was still missing.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Verified locally
Was this patch authored or co-authored using generative AI tooling?
Yes. Initial draft was generated using AI then manually edited.
Generated-by: GitHub Copilot with Claude 3.7 Sonnet