[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error #16572

Status: Closed. This pull request wants to merge 26 commits.

Changes from all commits:
- b988651: [SPARK-16804][SQL] Correlated subqueries containing LIMIT return inco… (nsyca, Jul 29, 2016)
- 069ed8f: [SPARK-16804][SQL] Correlated subqueries containing LIMIT return inco… (nsyca, Jul 29, 2016)
- edca333: New positive test cases (nsyca, Jul 30, 2016)
- 64184fd: Fix unit test case failure (nsyca, Aug 1, 2016)
- 29f82b0: blocking TABLESAMPLE (nsyca, Aug 5, 2016)
- ac43ab4: Fixing code styling (nsyca, Aug 5, 2016)
- 631d396: Correcting Scala test style (nsyca, Aug 7, 2016)
- 7eb9b2d: One (last) attempt to correct the Scala style tests (nsyca, Aug 8, 2016)
- 1387cf5: Merge remote-tracking branch 'upstream/master' (nsyca, Aug 12, 2016)
- 3faa2d5: Merge remote-tracking branch 'upstream/master' (nsyca, Dec 14, 2016)
- a308634: Merge remote-tracking branch 'upstream/master' (nsyca, Dec 30, 2016)
- 2f463de: first fix (incomplete) (nsyca, Jan 1, 2017)
- 6e2f686: first attempt (nsyca, Jan 3, 2017)
- f1524b9: Merge remote-tracking branch 'upstream/master' (nsyca, Jan 3, 2017)
- 6dfa8e5: Merge branch 'master' into 18863 (nsyca, Jan 4, 2017)
- e9bdde6: New test cases (nsyca, Jan 5, 2017)
- deec874: Masking exprIDs (nsyca, Jan 5, 2017)
- 5c36dce: Merge remote-tracking branch 'upstream/master' (nsyca, Jan 13, 2017)
- 98cbd60: Merge branch 'master' into 18863 (nsyca, Jan 13, 2017)
- bcae336: reverse back accidental change (nsyca, Jan 13, 2017)
- 51f7fb9: port from SPARK-19017 (nsyca, Jan 13, 2017)
- 24397cf: remove unrelated comment (nsyca, Jan 13, 2017)
- ced19c7: address comment #1 (nsyca, Jan 25, 2017)
- 862b2b8: Merge remote-tracking branch 'upstream/master' (nsyca, Jan 25, 2017)
- 203ad7d: Merge branch 'master' into 18863 (nsyca, Jan 25, 2017)
- 010d27a: remove blank line (nsyca, Jan 25, 2017)
@@ -117,20 +117,20 @@ trait CheckAnalysis extends PredicateHelper {
             failAnalysis(s"Window specification $s is not valid because $m")
           case None => w
         }
-        case s @ ScalarSubquery(query, conditions, _)
-            // If no correlation, the output must be exactly one column
-            if (conditions.isEmpty && query.output.size != 1) =>
-          failAnalysis(
-            s"Scalar subquery must return only one column, but got ${query.output.size}")
-
-        case s @ ScalarSubquery(query, conditions, _) if conditions.nonEmpty =>
+        case s @ ScalarSubquery(query, conditions, _) =>
+          // If no correlation, the output must be exactly one column
+          if (conditions.isEmpty && query.output.size != 1) {
+            failAnalysis(
+              s"Scalar subquery must return only one column, but got ${query.output.size}")
+          }
+          else if (conditions.nonEmpty) {
           // Collect the columns from the subquery for further checking.
           var subqueryColumns = conditions.flatMap(_.references).filter(query.output.contains)

           def checkAggregate(agg: Aggregate): Unit = {
             // Make sure correlated scalar subqueries contain one row for every outer row by
-            // enforcing that they are aggregates which contain exactly one aggregate expressions.
+            // enforcing that they are aggregates containing exactly one aggregate expression.
             // The analyzer has already checked that subquery contained only one output column,
             // and added all the grouping expressions to the aggregate.
             val aggregates = agg.expressions.flatMap(_.collect {
@@ -177,6 +177,12 @@ trait CheckAnalysis extends PredicateHelper {
             case Filter(_, a: Aggregate) => checkAggregate(a)
             case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
           }
+          }
+          checkAnalysis(query)
           s
+
+        case s: SubqueryExpression =>
+          checkAnalysis(s.plan)
+          s
       }

Review comment by the author, on the checkAnalysis(query) line:

> The best way to view this block of code changes is using a diff with -b. The main part is to call checkAnalysis for both PredicateSubquery and ScalarSubquery.
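The shape of the refactor can be sketched outside Spark. The names below (`SubqueryInfo`, `check`) are illustrative assumptions for this sketch only, not Spark's actual API; the point is that one case now handles both the uncorrelated arity check and the correlated branch, with analysis recursing into the subquery plan afterwards:

```scala
// Illustrative model of the unified scalar-subquery check.
// SubqueryInfo is a stand-in for Spark's ScalarSubquery expression.
final case class SubqueryInfo(outputCols: Int, correlated: Boolean)

def check(s: SubqueryInfo): Option[String] =
  if (!s.correlated && s.outputCols != 1)
    // Uncorrelated scalar subqueries must produce exactly one column.
    Some(s"Scalar subquery must return only one column, but got ${s.outputCols}")
  else
    // Correlated case: the aggregate checks would run here; in the real
    // patch, checkAnalysis(query) then recurses in every case.
    None
```

Under this model, `check(SubqueryInfo(2, correlated = false))` reports the arity error, while a one-column uncorrelated subquery or any correlated subquery passes this first gate.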
@@ -0,0 +1,42 @@
-- The test file contains negative test cases
-- of invalid queries where error messages are expected.

create temporary view t1 as select * from values
(1, 2, 3)
as t1(t1a, t1b, t1c);

create temporary view t2 as select * from values
(1, 0, 1)
as t2(t2a, t2b, t2c);

create temporary view t3 as select * from values
(3, 1, 2)
as t3(t3a, t3b, t3c);

-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from t1, t2
where t1b = t2c
and t2b = (select max(avg)
from (select t2b, avg(t2b) avg
from t2
where t2a = t1.t1b
)
)
;

-- TC 01.02
-- Invalid because the column t2b is not part of the output from table t2.
select *
from t1
where t1a in (select min(t2a)
from t2
group by t2c
having t2c in (select max(t3c)
from t3
group by t3b
having t3b > t2b ))
;

@@ -0,0 +1,66 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 5


-- !query 0
create temporary view t1 as select * from values
(1, 2, 3)
as t1(t1a, t1b, t1c)
-- !query 0 schema
struct<>
-- !query 0 output



-- !query 1
create temporary view t2 as select * from values
(1, 0, 1)
as t2(t2a, t2b, t2c)
-- !query 1 schema
struct<>
-- !query 1 output



-- !query 2
create temporary view t3 as select * from values
(3, 1, 2)
as t3(t3a, t3b, t3c)
-- !query 2 schema
struct<>
-- !query 2 output



-- !query 3
select t1a, t2b
from t1, t2
where t1b = t2c
and t2b = (select max(avg)
from (select t2b, avg(t2b) avg
from t2
where t2a = t1.t1b
)
)
-- !query 3 schema
struct<>
-- !query 3 output
org.apache.spark.sql.AnalysisException
expression 't2.`t2b`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;


-- !query 4
select *
from t1
where t1a in (select min(t2a)
from t2
group by t2c
having t2c in (select max(t3c)
from t3
group by t3b
having t3b > t2b ))
-- !query 4 schema
struct<>
-- !query 4 output
org.apache.spark.sql.AnalysisException
resolved attribute(s) t2b#x missing from min(t2a)#x,t2c#x in operator !Filter predicate-subquery#x [(t2c#x = max(t3c)#x) && (t3b#x > t2b#x)];
@@ -228,7 +228,10 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
       } catch {
         case a: AnalysisException if a.plan.nonEmpty =>
           // Do not output the logical plan tree which contains expression IDs.
-          (StructType(Seq.empty), Seq(a.getClass.getName, a.getSimpleMessage))
+          // Also implement a crude way of masking expression IDs in the error message
+          // with a generic pattern "#x".
+          (StructType(Seq.empty),
+            Seq(a.getClass.getName, a.getSimpleMessage.replaceAll("#\\d+", "#x")))
         case NonFatal(e) =>
           // If there is an exception, put the exception class followed by the message.
           (StructType(Seq.empty), Seq(e.getClass.getName, e.getMessage))
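The masking change can be tried in isolation. The `maskExprIds` helper below is a hypothetical standalone version of the same one-line regex, not part of Spark's API:

```scala
// Expression IDs such as "t2b#42" vary from run to run, so golden-file
// comparison normalizes every "#<digits>" occurrence to the literal "#x".
def maskExprIds(msg: String): String = msg.replaceAll("#\\d+", "#x")

val raw = "resolved attribute(s) t2b#42 missing from min(t2a)#7,t2c#13"
println(maskExprIds(raw))
// prints: resolved attribute(s) t2b#x missing from min(t2a)#x,t2c#x
```

This is why the expected-output file above shows attributes as `t2b#x` rather than concrete, run-dependent IDs.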