
[SPARK-33045][SQL] Support built-in function like_all and fix StackOverflowError issue. #29999


Closed · wants to merge 60 commits
Changes from all commits (60 commits):
4a6f903
Reuse completeNextStageWithFetchFailure
beliefer Jun 19, 2020
96456e2
Merge remote-tracking branch 'upstream/master'
beliefer Jul 1, 2020
4314005
Merge remote-tracking branch 'upstream/master'
beliefer Jul 3, 2020
d6af4a7
Merge remote-tracking branch 'upstream/master'
beliefer Jul 9, 2020
f69094f
Merge remote-tracking branch 'upstream/master'
beliefer Jul 16, 2020
b86a42d
Merge remote-tracking branch 'upstream/master'
beliefer Jul 25, 2020
2ac5159
Merge branch 'master' of github.com:beliefer/spark
beliefer Jul 25, 2020
9021d6c
Merge remote-tracking branch 'upstream/master'
beliefer Jul 28, 2020
74a2ef4
Merge branch 'master' of github.com:beliefer/spark
beliefer Jul 28, 2020
9828158
Merge remote-tracking branch 'upstream/master'
beliefer Jul 31, 2020
9cd1aaf
Merge remote-tracking branch 'upstream/master'
beliefer Aug 5, 2020
abfcbb9
Merge remote-tracking branch 'upstream/master'
beliefer Aug 26, 2020
07c6c81
Merge remote-tracking branch 'upstream/master'
beliefer Sep 1, 2020
580130b
Merge remote-tracking branch 'upstream/master'
beliefer Sep 2, 2020
3712808
Merge branch 'master' of github.com:beliefer/spark
beliefer Sep 11, 2020
6107413
Merge remote-tracking branch 'upstream/master'
beliefer Sep 11, 2020
4b799b4
Merge remote-tracking branch 'upstream/master'
beliefer Sep 14, 2020
ee0ecbf
Merge remote-tracking branch 'upstream/master'
beliefer Sep 18, 2020
596bc61
Merge remote-tracking branch 'upstream/master'
beliefer Sep 24, 2020
0164e2f
Merge remote-tracking branch 'upstream/master'
beliefer Sep 27, 2020
90b79fc
Merge remote-tracking branch 'upstream/master'
beliefer Sep 29, 2020
4163382
Support build-in LIKE_ALL function
beliefer Oct 10, 2020
1909298
Fix schema issue.
beliefer Oct 12, 2020
054fc1b
Merge branch 'master' into SPARK-33045-like_all
beliefer Oct 12, 2020
a7cd416
Optimize code
beliefer Oct 12, 2020
0aa4e18
Optimize code
beliefer Oct 12, 2020
3e41cff
Add test cases.
beliefer Oct 12, 2020
2cef3a9
Merge remote-tracking branch 'upstream/master'
beliefer Oct 13, 2020
70b0843
Adjust the value
beliefer Oct 13, 2020
d841b54
Delete like_all and not_like_all
beliefer Oct 14, 2020
369959f
Optimize code
beliefer Oct 15, 2020
1f1f42c
Optimize code
beliefer Oct 15, 2020
de65829
Optimize code
beliefer Oct 15, 2020
1754f0d
Optimize code
beliefer Oct 15, 2020
60f01f4
Keep eval and codegen consistent
beliefer Oct 15, 2020
c32f89b
Keep eval and codegen consistent
beliefer Oct 15, 2020
ec53b83
Keep eval and codegen consistent
beliefer Oct 15, 2020
c52d004
Keep eval and codegen consistent
beliefer Oct 15, 2020
b770f92
Keep eval and codegen consistent
beliefer Oct 15, 2020
be5eb8a
Optimize code
beliefer Oct 16, 2020
fcab4e3
Cache foldable pattern and avoid re-evaluate
beliefer Oct 16, 2020
f657ff0
Cache foldable pattern and avoid re-evaluate
beliefer Oct 16, 2020
c26b64f
Merge remote-tracking branch 'upstream/master'
beliefer Oct 19, 2020
8df5231
Improve performance for codegen.
beliefer Oct 19, 2020
ad4d2d9
Fix bug
beliefer Oct 19, 2020
55465b8
iterator all patterns.
beliefer Oct 22, 2020
2e02cd2
Merge remote-tracking branch 'upstream/master'
beliefer Oct 22, 2020
f160c64
Fix conflict
beliefer Oct 22, 2020
391ba5d
Optimize code
beliefer Oct 23, 2020
7b7120f
Optimize code
beliefer Oct 23, 2020
1fc5214
Simplify code
beliefer Nov 10, 2020
53406d3
Optimize code
beliefer Nov 10, 2020
15bac5b
Add comments.
beliefer Nov 11, 2020
d039c33
Adjust code.
beliefer Nov 11, 2020
0c7785b
Optimize code
beliefer Nov 13, 2020
7af8ffe
Optimize code
beliefer Nov 13, 2020
97c1c73
Revert sql-expression-schema.md
beliefer Nov 17, 2020
1614933
Optimize code.
beliefer Nov 17, 2020
f0e3de1
Revert some code
beliefer Nov 17, 2020
001eb38
Optimize code.
beliefer Nov 19, 2020
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.objects.Invoke
import org.apache.spark.sql.catalyst.plans.{Inner, JoinType}
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

/**
* A collection of implicit conversions that create a DSL for constructing catalyst data structures.
@@ -102,6 +103,10 @@ package object dsl {
def like(other: Expression, escapeChar: Char = '\\'): Expression =
Like(expr, other, escapeChar)
def rlike(other: Expression): Expression = RLike(expr, other)
def likeAll(others: Expression*): Expression =
LikeAll(expr, others.map(_.eval(EmptyRow).asInstanceOf[UTF8String]))
def notLikeAll(others: Expression*): Expression =
NotLikeAll(expr, others.map(_.eval(EmptyRow).asInstanceOf[UTF8String]))
def contains(other: Expression): Expression = Contains(expr, other)
def startsWith(other: Expression): Expression = StartsWith(expr, other)
def endsWith(other: Expression): Expression = EndsWith(expr, other)
@@ -20,10 +20,12 @@ package org.apache.spark.sql.catalyst.expressions
import java.util.Locale
import java.util.regex.{Matcher, MatchResult, Pattern}

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import org.apache.commons.text.StringEscapeUtils

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
import org.apache.spark.sql.catalyst.expressions.codegen._
@@ -178,6 +180,88 @@ case class Like(left: Expression, right: Expression, escapeChar: Char)
}
}

/**
* Optimized version of LIKE ALL, when all pattern values are literal.
*/
abstract class LikeAllBase extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {

protected def patterns: Seq[UTF8String]

protected def isNotLikeAll: Boolean

override def inputTypes: Seq[DataType] = StringType :: Nil

override def dataType: DataType = BooleanType

override def nullable: Boolean = true

private lazy val hasNull: Boolean = patterns.contains(null)

private lazy val cache = patterns.filterNot(_ == null)
.map(s => Pattern.compile(StringUtils.escapeLikeRegex(s.toString, '\\')))

private lazy val matchFunc = if (isNotLikeAll) {
(p: Pattern, inputValue: String) => !p.matcher(inputValue).matches()
} else {
(p: Pattern, inputValue: String) => p.matcher(inputValue).matches()
}

override def eval(input: InternalRow): Any = {
val exprValue = child.eval(input)
if (exprValue == null) {
null
} else {
if (cache.forall(matchFunc(_, exprValue.toString))) {
if (hasNull) null else true
} else {
false
}
}
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val eval = child.genCode(ctx)
val patternClass = classOf[Pattern].getName
val javaDataType = CodeGenerator.javaType(child.dataType)
val pattern = ctx.freshName("pattern")
val valueArg = ctx.freshName("valueArg")
val patternCache = ctx.addReferenceObj("patternCache", cache.asJava)

val checkNotMatchCode = if (isNotLikeAll) {
s"$pattern.matcher($valueArg.toString()).matches()"
} else {
s"!$pattern.matcher($valueArg.toString()).matches()"
}

ev.copy(code =
code"""
|${eval.code}
|boolean ${ev.isNull} = false;
|boolean ${ev.value} = true;
|if (${eval.isNull}) {
| ${ev.isNull} = true;
|} else {
| $javaDataType $valueArg = ${eval.value};
| for ($patternClass $pattern: $patternCache) {
| if ($checkNotMatchCode) {
| ${ev.value} = false;
| break;
| }
| }
| if (${ev.value} && $hasNull) ${ev.isNull} = true;
|}
""".stripMargin)
}
}

case class LikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase {
override def isNotLikeAll: Boolean = false
}

case class NotLikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase {
override def isNotLikeAll: Boolean = true
}
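The three-valued null semantics that `eval` implements can be mirrored in plain Java to make them concrete. This is a sketch, not Spark code: `likeToRegex` is a deliberately simplified stand-in for `StringUtils.escapeLikeRegex` (it ignores ESCAPE handling), and the method names here are illustrative, not Spark's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LikeAllSketch {
    // Simplified stand-in for StringUtils.escapeLikeRegex: quote every literal
    // character, then translate the SQL wildcards % and _ into regex equivalents.
    static String likeToRegex(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') sb.append(".*");
            else if (c == '_') sb.append(".");
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return sb.toString();
    }

    // Mirrors LikeAllBase.eval: null input -> null; any failed predicate -> false;
    // all non-null patterns pass but a null pattern exists -> null; otherwise true.
    static Boolean likeAll(String input, List<String> patterns, boolean negate) {
        if (input == null) return null;
        boolean hasNull = patterns.contains(null);
        List<Pattern> compiled = new ArrayList<>();
        for (String p : patterns) {
            if (p != null) compiled.add(Pattern.compile(likeToRegex(p)));
        }
        for (Pattern p : compiled) {
            boolean matches = p.matcher(input).matches();
            if (negate == matches) return false; // one failed predicate decides
        }
        return hasNull ? null : true; // result stays unknown if a null pattern remains
    }

    public static void main(String[] args) {
        System.out.println(likeAll("foo", Arrays.asList("%foo%", "%oo"), false));   // true
        System.out.println(likeAll("foo", Arrays.asList("%foo%", "%bar%"), false)); // false
        System.out.println(likeAll("foo", Arrays.asList("%foo%", null), false));    // null
        System.out.println(likeAll("foo", Arrays.asList("%feo%", null), false));    // false
    }
}
```

Setting `negate` flips which match result short-circuits to false, matching how `matchFunc` is selected in `LikeAllBase` for the NOT LIKE ALL case.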

// scalastyle:off line.contains.tab
@ExpressionDescription(
usage = "str _FUNC_ regexp - Returns true if `str` matches `regexp`, or false otherwise.",
@@ -1408,7 +1408,20 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
case Some(SqlBaseParser.ANY) | Some(SqlBaseParser.SOME) =>
getLikeQuantifierExprs(ctx.expression).reduceLeft(Or)
case Some(SqlBaseParser.ALL) =>
getLikeQuantifierExprs(ctx.expression).reduceLeft(And)
validate(!ctx.expression.isEmpty, "Expected something between '(' and ')'.", ctx)
val expressions = ctx.expression.asScala.map(expression)
if (expressions.size > SQLConf.get.optimizerLikeAllConversionThreshold &&
expressions.forall(_.foldable) && expressions.forall(_.dataType == StringType)) {
// If there are many pattern expressions, reduceLeft(And) builds a deeply nested
// expression tree and throws StackOverflowError, so we use LikeAll or NotLikeAll instead.
val patterns = expressions.map(_.eval(EmptyRow).asInstanceOf[UTF8String])
ctx.NOT match {
case null => LikeAll(e, patterns)
case _ => NotLikeAll(e, patterns)
}
} else {
getLikeQuantifierExprs(ctx.expression).reduceLeft(And)
}
case _ =>
val escapeChar = Option(ctx.escapeChar).map(string).map { str =>
if (str.length != 1) {
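The reason the parser needs this conversion shows up if you sketch the two plan shapes side by side. This is an illustration under simplifying assumptions (Java 16+ for records; `Like` is reduced to a substring check rather than Spark's real expression): `reduceLeft(And)` yields a left-deep tree whose recursive evaluation consumes stack proportional to the number of patterns, while a single LikeAll-style expression walks the patterns in a loop at constant stack depth.

```java
import java.util.List;

public class AndReductionSketch {
    interface Expr { boolean eval(String input); }

    // Stand-in for a single LIKE predicate.
    record Like(String substring) implements Expr {
        public boolean eval(String input) { return input.contains(substring); }
    }

    // A left-deep And tree, as produced by reduceLeft(And) in the parser.
    // Recursive evaluation means stack depth grows with the number of
    // predicates, which is what triggers StackOverflowError for huge lists.
    record And(Expr left, Expr right) implements Expr {
        public boolean eval(String input) {
            return left.eval(input) && right.eval(input);
        }
    }

    static Expr reduceAnd(List<Expr> exprs) {
        Expr acc = exprs.get(0);
        for (int i = 1; i < exprs.size(); i++) acc = new And(acc, exprs.get(i));
        return acc;
    }

    // The LikeAll-style alternative: one expression holding all patterns,
    // evaluated with a flat loop regardless of how many patterns there are.
    static boolean likeAllLoop(String input, List<String> substrings) {
        for (String s : substrings) {
            if (!input.contains(s)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Expr> preds = List.of(new Like("f"), new Like("o"), new Like("oo"));
        System.out.println(reduceAnd(preds).eval("foo"));        // true
        System.out.println(likeAllLoop("foo", List.of("f", "oo"))); // true
    }
}
```

Both shapes compute the same boolean, so the conversion is purely about keeping the tree (and therefore the analyzer/codegen recursion) shallow.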
@@ -216,6 +216,18 @@ object SQLConf {
"for using switch statements in InSet must be non-negative and less than or equal to 600")
.createWithDefault(400)

val OPTIMIZER_LIKE_ALL_CONVERSION_THRESHOLD =
buildConf("spark.sql.optimizer.likeAllConversionThreshold")
.internal()
.doc("The maximum number of LIKE patterns combined directly with And in a LIKE ALL " +
"expression. Above this threshold, Spark converts the combination into a single " +
"LikeAll/NotLikeAll expression to avoid StackOverflowError. 200 is an empirical " +
"value that does not cause StackOverflowError.")
.version("3.1.0")
.intConf
.checkValue(threshold => threshold >= 0, "The maximum size of pattern sequence " +
"in like all must be non-negative")
.createWithDefault(200)
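While this config existed (the review thread below notes it was later removed), the conversion could be toggled per session, which is exactly what the CONFIG_DIM entries in the like-all.sql test exercise. A hedged config fragment:

```sql
-- Threshold 0: convert every all-literal LIKE ALL into LikeAll/NotLikeAll.
SET spark.sql.optimizer.likeAllConversionThreshold=0;
-- Default 200: only convert when more than 200 patterns are combined.
SET spark.sql.optimizer.likeAllConversionThreshold=200;
```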
Contributor commented:
A tree of 200 And-reduced expressions is already a huge expr tree.
I think this could be useful and helpful with a default threshold of 5 or so already.

Contributor @cloud-fan commented on Dec 17, 2020:
We have removed this config: beliefer@9273d42#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L219

We will always use the new expression for LIKE ALL if values are all literal.


val PLAN_CHANGE_LOG_LEVEL = buildConf("spark.sql.planChangeLog.level")
.internal()
.doc("Configures the log level for logging the change from the original plan to the new " +
@@ -2972,6 +2984,8 @@ class SQLConf extends Serializable with Logging {

def optimizerInSetSwitchThreshold: Int = getConf(OPTIMIZER_INSET_SWITCH_THRESHOLD)

def optimizerLikeAllConversionThreshold: Int = getConf(OPTIMIZER_LIKE_ALL_CONVERSION_THRESHOLD)

def planChangeLogLevel: String = getConf(PLAN_CHANGE_LOG_LEVEL)

def planChangeRules: Option[String] = getConf(PLAN_CHANGE_LOG_RULES)
@@ -48,6 +48,30 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
checkEvaluation(mkExpr(regex), expected, create_row(input)) // check row input
}

test("LIKE ALL") {
checkEvaluation(Literal.create(null, StringType).likeAll("%foo%", "%oo"), null)
checkEvaluation(Literal.create("foo", StringType).likeAll("%foo%", "%oo"), true)
checkEvaluation(Literal.create("foo", StringType).likeAll("%foo%", "%bar%"), false)
checkEvaluation(Literal.create("foo", StringType)
.likeAll("%foo%", Literal.create(null, StringType)), null)
checkEvaluation(Literal.create("foo", StringType)
.likeAll(Literal.create(null, StringType), "%foo%"), null)
checkEvaluation(Literal.create("foo", StringType)
.likeAll("%feo%", Literal.create(null, StringType)), false)
checkEvaluation(Literal.create("foo", StringType)
.likeAll(Literal.create(null, StringType), "%feo%"), false)
checkEvaluation(Literal.create("foo", StringType).notLikeAll("tee", "%yoo%"), true)
checkEvaluation(Literal.create("foo", StringType).notLikeAll("%oo%", "%yoo%"), false)
checkEvaluation(Literal.create("foo", StringType)
.notLikeAll("%foo%", Literal.create(null, StringType)), false)
checkEvaluation(Literal.create("foo", StringType)
.notLikeAll(Literal.create(null, StringType), "%foo%"), false)
checkEvaluation(Literal.create("foo", StringType)
.notLikeAll("%yoo%", Literal.create(null, StringType)), null)
checkEvaluation(Literal.create("foo", StringType)
.notLikeAll(Literal.create(null, StringType), "%yoo%"), null)
}

test("LIKE Pattern") {

// null handling
4 changes: 4 additions & 0 deletions sql/core/src/test/resources/sql-tests/inputs/like-all.sql
@@ -1,3 +1,7 @@
-- test cases for like all
--CONFIG_DIM1 spark.sql.optimizer.likeAllConversionThreshold=0
--CONFIG_DIM1 spark.sql.optimizer.likeAllConversionThreshold=200

CREATE OR REPLACE TEMPORARY VIEW like_all_table AS SELECT * FROM (VALUES
('google', '%oo%'),
('facebook', '%oo%'),