Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add details to the read operator stating that the filter applies to the pre-masked schema and must be fully satisfied #271

Merged

Conversation

westonpace
Copy link
Member

No description provided.

@westonpace
Copy link
Member Author

Closes #137

@westonpace westonpace force-pushed the feature/clarify-read-filter branch 3 times, most recently from afce35a to 83b579d Compare July 29, 2022 18:24
| Properties | A list of name/value pairs associated with the read. | Optional, defaults to empty |

### Read Filtering

Consumers can often take advantage of a ReadRel's filter property, combined with file metadata, statistics, and indices to reduce the amount of data the needs to be read. This technique is often referred to as "pushdown filtering". In many cases this is inexact and a filter can only be partially satisfied. The specification requires that this filter be exactly satisfied. This means that the consumer will often need to apply some kind of in-memory filtering operation to fully satisfy the filter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Consumers can often take advantage of a ReadRel's filter property, combined with file metadata, statistics, and indices to reduce the amount of data the needs to be read. This technique is often referred to as "pushdown filtering". In many cases this is inexact and a filter can only be partially satisfied. The specification requires that this filter be exactly satisfied. This means that the consumer will often need to apply some kind of in-memory filtering operation to fully satisfy the filter.
If a filter is defined, a consumer must guarantee that all records returned from the scan do not match the filter condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we either need to support the case of inexact or remove all the wording referring to it. I'm fine with either. Having a bunch of wording about the dynamics of how someone applies a filter feels out of place (in memory, footer stats, etc).

In general, I've come to the conclusion that the second filter field makes more sense than any kind of boolean (whenever we want to move forward on this). I agree with your producer choice complexity concern but also think that it has a lot to do with how physical/logical someone is working. It's trivial rewrite for a system to replace a scan with a required filter with a scan best-effort filter + filter rel as needed. The only reason you typically do that is right before physical execution (which thus means you need specific understanding of the underlying execution engines capabilities).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave the entire Read Filtering section out entirely and just change the property definition to:

A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema.

Do you still think the sparse Read Filtering section you suggested would add value or would it be redundant? I mainly added the explanatory paragraph since I had heard a few comments that it would be helpful if the documentation assumed a bit less SQL knowledge but perhaps there is a better place for such "informational" content.

I agree with your producer choice complexity concern but also think that it has a lot to do with how physical/logical someone is working

Indeed, I am coming at this from the consumer / pure-physical end of things and so the fact that the filter must be fully satisfied was actually the more surprising default to me.

It's trivial rewrite for a system to replace a scan with a required filter with a scan best-effort filter + filter rel as needed. The only reason you typically do that is right before physical execution

Yes, this is somewhat of an aside but I'm wondering how we want to tackle this in Acero. Currently we are sticking to a "Acero does no optimization" philosophy so the most faithful approach would be to reject plans that have filter specified and only allow plans that have best_effort_filter. I fear, however, that this will make Acero somewhat unusable by initial producers until optimization / rewrites come along (which I don't think will happen for a bit of time yet). The rewrite is easy enough for us to do internally and so we'll probably just do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Folks, what is the conclusion on this one? I like the idea that we can express a "best_effort_filter", I can see scenarios where we want to push down to computational storage type devices where only certain operators might be implemented (e.g., the "equalities" you list a few comments above).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added best_effort_filter and reworked this paragraph

@CLAassistant
Copy link

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@westonpace
Copy link
Member Author

I have taken the approach @jacques-n suggested and added a best_effort_filter to the ReadRel

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for writing all of this down @westonpace !

@westonpace westonpace merged commit 4beff87 into substrait-io:main Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants