docs: add details to the read operator stating that the filter applies to the pre-masked schema and must be fully satisfied #271
Conversation
Closes #137
| Properties | A list of name/value pairs associated with the read. | Optional, defaults to empty |

### Read Filtering

Consumers can often take advantage of a ReadRel's filter property, combined with file metadata, statistics, and indices, to reduce the amount of data that needs to be read. This technique is often referred to as "pushdown filtering". In many cases this is inexact and a filter can only be partially satisfied. The specification requires that this filter be exactly satisfied, which means the consumer will often need to apply some kind of in-memory filtering operation to fully satisfy the filter.
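The paragraph above can be illustrated with a small sketch. This is a hypothetical example, not Substrait's actual API: `GreaterThan`, `read_with_exact_filter`, and the `(stats, rows)` row-group shape are all made up for illustration. It shows an inexact pushdown step (skipping whole row groups via statistics) followed by the residual in-memory pass the specification's exact-filter requirement forces on the consumer.

```python
class GreaterThan:
    """Illustrative predicate: keep rows strictly greater than a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, row):
        return row > self.threshold

def read_with_exact_filter(row_groups, predicate):
    """row_groups is a list of (stats, rows) pairs, where stats carries a
    per-group "max". Groups whose max cannot exceed the threshold are
    skipped entirely (inexact pushdown); rows in surviving groups are then
    re-checked one by one (the exact residual filter)."""
    out = []
    for stats, rows in row_groups:
        # Inexact pushdown: skip a group only when the statistics *prove*
        # that no row in it can satisfy the predicate.
        if stats["max"] <= predicate.threshold:
            continue
        # Exact residual filter: the spec requires every returned row to
        # actually satisfy the filter, so each row is re-checked in memory.
        out.extend(r for r in rows if predicate(r))
    return out
```

The pushdown step alone would return the whole second group here (including the row `3`); only the residual pass makes the result exact.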
Suggested change: If a filter is defined, a consumer must guarantee that all records returned from the scan match the filter condition.
I think we either need to support the case of inexact filters or remove all the wording referring to it. I'm fine with either. Having a bunch of wording about the dynamics of how someone applies a filter (in-memory, footer stats, etc.) feels out of place.

In general, I've come to the conclusion that the second filter field makes more sense than any kind of boolean (whenever we want to move forward on this). I agree with your producer-choice complexity concern but also think that it has a lot to do with how physical/logical someone is working. It's a trivial rewrite for a system to replace a scan with a required filter with a scan with a best-effort filter plus a filter rel, as needed. The only reason you typically do that is right before physical execution (which thus means you need a specific understanding of the underlying execution engine's capabilities).
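The rewrite described above can be sketched as follows. This is only an illustration under assumed names: the dict-based plan shape and `rewrite_required_filter` are invented for the example and do not reflect the real Substrait protobuf representation. It shows a scan carrying a required `filter` being rewritten into a scan carrying the same expression as `best_effort_filter`, wrapped in an explicit Filter rel that restores exactness.

```python
def rewrite_required_filter(rel):
    """If `rel` is a read with a required filter, demote the filter to
    best-effort and wrap the read in a Filter rel with the same condition,
    so the overall result is still exact."""
    if rel.get("type") == "read" and "filter" in rel:
        expr = rel.pop("filter")
        rel["best_effort_filter"] = expr   # engine may satisfy this partially
        # The wrapping Filter rel guarantees any rows the scan failed to
        # filter out are still removed before the result is produced.
        return {"type": "filter", "condition": expr, "input": rel}
    return rel                             # anything else passes through unchanged
```

For example, `{"type": "read", "source": "t", "filter": "x > 10"}` becomes a `filter` rel with condition `"x > 10"` over a read whose `best_effort_filter` is `"x > 10"`. This is why the rewrite is cheap: the expression is reused verbatim in both places.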
We can leave the entire Read Filtering section out entirely and just change the property definition to:

A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema.

Do you still think the sparse Read Filtering section you suggested would add value, or would it be redundant? I mainly added the explanatory paragraph since I had heard a few comments that it would be helpful if the documentation assumed a bit less SQL knowledge, but perhaps there is a better place for such "informational" content.
> I agree with your producer choice complexity concern but also think that it has a lot to do with how physical/logical someone is working

Indeed, I am coming at this from the consumer / pure-physical end of things, and so the fact that the filter must be fully satisfied was actually the more surprising default to me.

> It's trivial rewrite for a system to replace a scan with a required filter with a scan best-effort filter + filter rel as needed. The only reason you typically do that is right before physical execution

Yes, this is somewhat of an aside, but I'm wondering how we want to tackle this in Acero. Currently we are sticking to an "Acero does no optimization" philosophy, so the most faithful approach would be to reject plans that have filter specified and only allow plans that have best_effort_filter. I fear, however, that this will make Acero somewhat unusable by initial producers until optimization / rewrites come along (which I don't think will happen for a bit of time yet). The rewrite is easy enough for us to do internally, and so we'll probably just do that.
Hey Folks, what is the conclusion on this one? I like the idea that we can express a "best_effort_filter", I can see scenarios where we want to push down to computational storage type devices where only certain operators might be implemented (e.g., the "equalities" you list a few comments above).
I added best_effort_filter and reworked this paragraph.
I have taken the approach @jacques-n suggested and added a best_effort_filter to the ReadRel.
LGTM. Thanks for writing all of this down @westonpace !