Filter push down #1641

Closed · Tracked by #3

penghuo opened this issue May 18, 2023 · 1 comment
Comments

@penghuo
Collaborator

penghuo commented May 18, 2023

Predicate Push Down

Spark provides predicates: Array[Predicate] to the DataSource, and the DataSource decides which predicates can be pushed down. The examples below describe what predicates Spark provides; a minimal sketch of how a DataSource receives them follows the examples.

  • value = 1
[
  (value IS NOT NULL),
  (value = 1)
]
  • value = 1 and id = 'a'
[
  (value IS NOT NULL),
  (id IS NOT NULL),
  (value = 1),
  (id = 'a')
]
  • value = 1 or value = 2
OR(
  value = 1,
  value = 2
)
  • NOT (value = 1 or value = 2).
💡 Spark does not optimize the expression to (value ≠ 1 AND value ≠ 2)
[
  (value IS NOT NULL),
  (NOT (value = 1)),
  (NOT (value = 2))
]
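
For reference, a minimal sketch (not the Flint implementation) of how a ScanBuilder receives this Array[Predicate] through Spark's SupportsPushDownV2Filters interface; the class name and the supported-predicate check are illustrative only:

import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownV2Filters}
import org.apache.spark.sql.types.StructType

class ExampleScanBuilder extends SupportsPushDownV2Filters {
  private var pushed: Array[Predicate] = Array.empty

  // Spark passes all candidate predicates; the source keeps the ones it can
  // evaluate and returns the rest so Spark re-applies them after the scan.
  override def pushPredicates(predicates: Array[Predicate]): Array[Predicate] = {
    val (supported, unsupported) =
      predicates.partition(p => p.name() == "=" || p.name() == "IS_NOT_NULL")
    pushed = supported
    unsupported
  }

  override def pushedPredicates(): Array[Predicate] = pushed

  // A real implementation would build a Scan that applies `pushed` at the source.
  override def build(): Scan = new Scan {
    override def readSchema(): StructType = StructType(Nil)
  }
}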

Spark SQL in ANSI mode

Spark SQL has an ANSI mode configuration, spark.sql.ansi.enabled, which is false by default.

spark.conf.set("spark.sql.ansi.enabled", "true")

ANSI mode can change the filter push down logic. The following examples explain it.

  • value + 1 = 2, ANSI mode = false
💡 Spark does not normalize the expression to value = 1, nor does it push down the expression
[
  (value IS NOT NULL)
]
  • value + 1 = 2, ANSI mode = true
[
  (value IS NOT NULL),
  (value + 1 = 2)
]

The V2ExpressionBuilder class decides which expressions can be pushed down. V2ExpressionBuilder is not stable yet; support for pushing down more expressions is still being added to it.
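
To see which predicates actually reached the source, the physical plan can be inspected. A quick sketch, assuming a DataFrame df over a V2 source with an integer column value (the exact label of the pushed predicates in the scan node varies by connector):

// Sketch only: toggle ANSI mode and compare the scan node in the plan.
spark.conf.set("spark.sql.ansi.enabled", "false")
df.filter("value + 1 = 2").explain(true)   // scan shows only: value IS NOT NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
df.filter("value + 1 = 2").explain(true)   // scan also shows: (value + 1) = 2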

@penghuo
Collaborator Author

penghuo commented May 18, 2023

How to support pushdown.

For instance, push down array_contains(aIntArray, 500).

Solutions

There are two possible solutions: (1) create a UDF and register it through FunctionCatalog, or (2) add a query optimization rule to rewrite the Filter. The PR below implements solution (1); a sketch of what such a UDF could look like follows this list.

  • PR penghuo@c64440f
    • The PR requires Spark 3.4.0, which is not yet supported by AWS EMR.
    • The flint.array_contains(aInt, 1) = 1 grammar is not ideal; we actually expected to use flint.array_contains(aInt, 1) directly, but Spark's V2ExpressionBuilder cannot rewrite it as a Predicate.
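
A minimal sketch of solution (1), assuming the Spark 3.4 FunctionCatalog API; FlintArrayContains is a hypothetical name, and the actual PR may differ. The function returns an integer so that the = 1 comparison can be rewritten as a Predicate:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
import org.apache.spark.sql.types._

// Hypothetical UDF: returns 1 if the array contains the value, 0 otherwise.
object FlintArrayContains extends UnboundFunction {
  override def name(): String = "array_contains"
  override def description(): String = "array_contains(array, value): 1 if value is in array, else 0"

  override def bind(inputType: StructType): BoundFunction = new ScalarFunction[Int] {
    override def inputTypes(): Array[DataType] = Array(ArrayType(IntegerType), IntegerType)
    override def resultType(): DataType = IntegerType
    override def name(): String = "array_contains"
    override def canonicalName(): String = "flint.array_contains"
    override def produceResult(input: InternalRow): Int = {
      val array = input.getArray(0)
      val value = input.getInt(1)
      val found = (0 until array.numElements())
        .exists(i => !array.isNullAt(i) && array.getInt(i) == value)
      if (found) 1 else 0
    }
  }
}

The FlintCatalog would then return this object from FunctionCatalog.loadFunction so that flint.array_contains(...) resolves to it.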

Demo

###
PUT {{baseUrl}}/t001
Content-Type: application/json

{
  "mappings": {
    "dynamic": false,
    "properties": {
      "aInt": {
        "type": "integer"
      },
      "aString": {
        "type": "keyword"
      },
      "aText": {
        "type": "text"
      }
    }
  }
}
###
POST {{baseUrl}}/t001/_bulk
Content-Type: application/x-ndjson

{ "create" : { "_id" : "1" } }
{"aInt": [1,2,3],"aString": "a","aText": "i am first"}
{ "create" : { "_id" : "2" } }
{"aInt": [4,5,6],"aString": "b","aText": "i am second"}


import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Register the Flint catalog so flint.* functions resolve through FunctionCatalog.
spark.conf.set("spark.sql.catalog.flint", classOf[org.apache.spark.sql.flint.FlintCatalog].getName)

// Read the t001 index created above through the Flint DataSource.
val schema = StructType(Seq(StructField("aInt", ArrayType(IntegerType), nullable = true)))
val openSearchOptions = Map("host" -> "localhost", "port" -> "9200")

val sql = new SQLContext(sc)
val df = sql.read.format("flint").options(openSearchOptions).schema(schema).load("t001")

// Filter with the UDF-based predicate; only documents whose aInt array contains 1 are returned.
df.filter("flint.array_contains(aInt, 1) = 1").show

+---------+
|     aInt|
+---------+
|[1, 2, 3]|
+---------+
