docs: add details to the read operator stating that the filter applies to the pre-masked schema and must be fully satisfied #271
Conversation
Closes #137
| Properties | A list of name/value pairs associated with the read. | Optional, defaults to empty |

### Read Filtering

Consumers can often take advantage of a ReadRel's filter property, combined with file metadata, statistics, and indices, to reduce the amount of data that needs to be read. This technique is often referred to as "pushdown filtering". In many cases this is inexact and a filter can only be partially satisfied. The specification requires that this filter be exactly satisfied, which means the consumer will often need to apply some kind of in-memory filtering operation to fully satisfy the filter.
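The paragraph above can be illustrated with a small sketch. This is a hypothetical example, not Substrait's actual API: `GreaterThan`, `read_with_exact_filter`, and the `(stats, rows)` row-group shape are all made up for illustration. It shows an inexact pushdown step (skipping whole row groups via statistics) followed by the residual in-memory pass the specification's exact-filter requirement forces on the consumer.

```python
class GreaterThan:
    """Illustrative predicate: keep rows strictly greater than a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, row):
        return row > self.threshold

def read_with_exact_filter(row_groups, predicate):
    """row_groups is a list of (stats, rows) pairs, where stats carries a
    per-group "max". Groups whose max cannot exceed the threshold are
    skipped entirely (inexact pushdown); rows in surviving groups are then
    re-checked one by one (the exact residual filter)."""
    out = []
    for stats, rows in row_groups:
        # Inexact pushdown: skip a group only when the statistics *prove*
        # that no row in it can satisfy the predicate.
        if stats["max"] <= predicate.threshold:
            continue
        # Exact residual filter: the spec requires every returned row to
        # actually satisfy the filter, so each row is re-checked in memory.
        out.extend(r for r in rows if predicate(r))
    return out
```

The pushdown step alone would return the whole second group here (including the row `3`); only the residual pass makes the result exact.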
Suggested change: If a filter is defined, a consumer must guarantee that all records returned from the scan match the filter condition.
I think we either need to support the case of inexact filters or remove all the wording referring to it. I'm fine with either. Having a bunch of wording about the dynamics of how someone applies a filter (in-memory, footer stats, etc.) feels out of place.

In general, I've come to the conclusion that the second filter field makes more sense than any kind of boolean (whenever we want to move forward on this). I agree with your producer-choice complexity concern but also think that it has a lot to do with how physical/logical someone is working. It's a trivial rewrite for a system to replace a scan with a required filter with a scan with a best-effort filter plus a filter rel, as needed. The only reason you typically do that is right before physical execution (which thus means you need a specific understanding of the underlying execution engine's capabilities).
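The rewrite described above can be sketched as follows. This is only an illustration under assumed names: the dict-based plan shape and `rewrite_required_filter` are invented for the example and do not reflect the real Substrait protobuf representation. It shows a scan carrying a required `filter` being rewritten into a scan carrying the same expression as `best_effort_filter`, wrapped in an explicit Filter rel that restores exactness.

```python
def rewrite_required_filter(rel):
    """If `rel` is a read with a required filter, demote the filter to
    best-effort and wrap the read in a Filter rel with the same condition,
    so the overall result is still exact."""
    if rel.get("type") == "read" and "filter" in rel:
        expr = rel.pop("filter")
        rel["best_effort_filter"] = expr   # engine may satisfy this partially
        # The wrapping Filter rel guarantees any rows the scan failed to
        # filter out are still removed before the result is produced.
        return {"type": "filter", "condition": expr, "input": rel}
    return rel                             # anything else passes through unchanged
```

For example, `{"type": "read", "source": "t", "filter": "x > 10"}` becomes a `filter` rel with condition `"x > 10"` over a read whose `best_effort_filter` is `"x > 10"`. This is why the rewrite is cheap: the expression is reused verbatim in both places.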
We can leave the entire Read Filtering section out entirely and just change the property definition to:

A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema.

Do you still think the sparse Read Filtering section you suggested would add value, or would it be redundant? I mainly added the explanatory paragraph since I had heard a few comments that it would be helpful if the documentation assumed a bit less SQL knowledge, but perhaps there is a better place for such "informational" content.
> I agree with your producer choice complexity concern but also think that it has a lot to do with how physical/logical someone is working

Indeed, I am coming at this from the consumer / pure-physical end of things, and so the fact that the filter must be fully satisfied was actually the more surprising default to me.

> It's trivial rewrite for a system to replace a scan with a required filter with a scan best-effort filter + filter rel as needed. The only reason you typically do that is right before physical execution

Yes, this is somewhat of an aside, but I'm wondering how we want to tackle this in Acero. Currently we are sticking to an "Acero does no optimization" philosophy, so the most faithful approach would be to reject plans that have filter specified and only allow plans that have best_effort_filter. I fear, however, that this will make Acero somewhat unusable by initial producers until optimization / rewrites come along (which I don't think will happen for a bit of time yet). The rewrite is easy enough for us to do internally, and so we'll probably just do that.
Hey Folks, what is the conclusion on this one? I like the idea that we can express a "best_effort_filter", I can see scenarios where we want to push down to computational storage type devices where only certain operators might be implemented (e.g., the "equalities" you list a few comments above).
I added best_effort_filter and reworked this paragraph.
I have taken the approach @jacques-n suggested and added a best_effort_filter to the ReadRel.
LGTM. Thanks for writing all of this down @westonpace !