[SPARK-12639] [SQL] Mark Filters Fully Handled By Sources with * #11317


Closed

wants to merge 1 commit into from

Conversation

@RussellSpitzer (Member) commented Feb 23, 2016

What changes were proposed in this pull request?

In order to make it clear which filters are fully handled by the underlying data source, we mark them with an asterisk (`*`). This gives users a clear visual cue that Catalyst treats such filters differently from filters which are merely passed down to the underlying DataSource.

Examples from the FilteredScanSuite; in this example `c IN (...)` is fully handled by the source, while `b < ...` is not.

Before

//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo') 
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]

After

== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
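For context on where the marker comes from: a source declares which pushed filters it cannot fully evaluate by overriding `unhandledFilters` on `BaseRelation`; everything else is fully handled and, with this change, printed with a `*`. A self-contained sketch with stand-in types (not Spark's real `org.apache.spark.sql.sources` classes):

```scala
// Stand-ins for Spark's org.apache.spark.sql.sources filter classes.
sealed trait Filter
case class LessThan(attribute: String, value: Any) extends Filter
case class In(attribute: String, values: Array[Any]) extends Filter {
  override def toString = s"In($attribute, [${values.mkString(",")}])"
}

// Mirrors the idea of BaseRelation.unhandledFilters: return the filters
// the source can only partially evaluate, so Catalyst must re-check them.
def unhandledFilters(filters: Seq[Filter]): Seq[Filter] =
  filters.filter {
    case In(_, _) => false // this toy source fully handles IN
    case _        => true  // everything else needs a Catalyst re-check
  }

val pushed = Seq(
  LessThan("b", 16),
  In("c", Array("bbbbbBBBBB", "cccccCCCCC", "dddddDDDDD", "foo")))
val unhandled = unhandledFilters(pushed).toSet

// Fully handled filters get the '*' marker in the explain output:
println(pushed.map(f => if (unhandled(f)) f.toString else s"*$f")
  .mkString("PushedFilters: [", ", ", "]"))
// PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo])]
```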

How was this patch tested?

Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Fully handled filters now appear with an * next to their names. I can add an automated test as well if requested.

Post 1.6.1: tested by modifying the FilteredScanSuite to run explains.

@rxin (Contributor) commented Feb 23, 2016

cc @yhuai

@yhuai (Contributor) commented Feb 23, 2016

ok to test

@yhuai (Contributor) commented Feb 23, 2016

@RussellSpitzer If you get a chance, can you add an example in the description showing the plan before and after this change?

  protected[sql] def selectFilters(
      relation: BaseRelation,
-     predicates: Seq[Expression]): (Seq[Expression], Seq[Filter]) = {
+     predicates: Seq[Expression]): (Seq[Expression], Seq[Filter], Seq[Filter]) = {

Maybe (Seq[Expression], Seq[Filter], Set[Filter])?
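A hypothetical caller of the widened signature might consume the extra sequence like this (a sketch with stand-in types, not the actual DataSourceStrategy code; the local `toSet` conversion reflects the membership-test concern behind the `Set[Filter]` suggestion):

```scala
// Stand-in for org.apache.spark.sql.sources.Filter.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

// Hypothetical shape of the widened result: the predicates Catalyst must
// still evaluate, the filters pushed to the source, and the subset of
// pushed filters the source handles entirely on its own.
case class SelectedFilters(
    postScan: Seq[String], // stand-in for Seq[Expression]
    pushed: Seq[Filter],
    handled: Seq[Filter])

// Render the PushedFilters section of the explain output, prefixing
// fully handled filters with '*'. A Set makes the membership test O(1)
// regardless of whether the third element is returned as Seq or Set.
def renderPushed(sel: SelectedFilters): String = {
  val handledSet = sel.handled.toSet
  sel.pushed
    .map(f => (if (handledSet(f)) "*" else "") + f)
    .mkString("PushedFilters: [", ", ", "]")
}

val sel = SelectedFilters(
  postScan = Seq("(a + b) > 9"),
  pushed = Seq(EqualTo("a", 1), EqualTo("c", "foo")),
  handled = Seq(EqualTo("c", "foo")))
println(renderPushed(sel)) // PushedFilters: [EqualTo(a,1), *EqualTo(c,foo)]
```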

@yhuai (Contributor) commented Feb 23, 2016

Thank you @RussellSpitzer! It looks good. Only one comment.

@SparkQA commented Feb 23, 2016

Test build #51781 has finished for PR 11317 at commit c33daae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Mar 16, 2016

ok to test

@SparkQA commented Mar 16, 2016

Test build #53374 has finished for PR 11317 at commit c33daae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:
@RussellSpitzer I saw you answered my ping before. Excuse my ping here again.

@RussellSpitzer (Member, Author) commented:
@HyukjinKwon + @yhuai Sorry it took so long! Things have been busy :)

@HyukjinKwon (Member) commented:
@RussellSpitzer Thanks for bearing with my pings here and there!

@SparkQA commented May 10, 2016

Test build #58192 has finished for PR 11317 at commit 117999f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@RussellSpitzer (Member, Author) commented:
I don't think this failure is because of my change:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:834), pid=27637, tid=140000737998592
#  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00007f54dd188231
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/SparkPullRequestBuilder@2/hs_err_pid27637.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:40970)
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 713, in start
    self.socket.connect((self.address, self.port))
  File "<string>", line 1, in connect
error: [Errno 111] Connection refused

@jacek-lewandowski (Contributor) commented:
test this please

@jacek-lewandowski (Contributor) commented:
jenkins, test this please

@SparkQA commented May 10, 2016

Test build #58268 has finished for PR 11317 at commit 117999f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Jul 6, 2016

@RussellSpitzer Sorry, I missed the last update. Would you please update the PR? I will review it and get it merged once it passes all tests.

@RussellSpitzer (Member, Author) commented:
Updated

@SparkQA commented Jul 9, 2016

Test build #3176 has finished for PR 11317 at commit 4c23cf1.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Jul 11, 2016

test this please

@yhuai (Contributor) commented Jul 11, 2016

test this please

@yhuai (Contributor) commented Jul 11, 2016

ok to test

@SparkQA commented Jul 12, 2016

Test build #62122 has finished for PR 11317 at commit f0051c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Jul 12, 2016

LGTM. Merging to master.
