Conversation

YannByron (Contributor):

Purpose

Support pushing down LIMIT to the source to accelerate queries.

Linked issue: close #xxx

Tests

API and Format

Documentation

}

override def pushLimit(limit: Int): Boolean = {
  if (table.isInstanceOf[AppendOnlyFileStoreTable]) {
Contributor:

Is the append-table check needed here? The paimon-core scan should take care of this.

YannByron (Contributor Author):

A boolean return value is needed here, and Spark SQL uses it. Otherwise, paimon-core would need to provide an API for this to call.

Contributor:

Can we just return false here, as a best-effort pushdown?

YannByron (Contributor Author):

As a temporary solution, that's OK.
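To make the agreed best-effort behaviour concrete, here is a minimal sketch of the pushLimit override under discussion, assuming Spark's SupportsPushDownLimit contract; the exact field handling in the PR may differ:

override def pushLimit(limit: Int): Boolean = {
  // Record the limit only for append-only tables; merge-on-read tables
  // cannot rely on per-file row counts before merging.
  if (table.isInstanceOf[AppendOnlyFileStoreTable]) {
    pushDownLimit = Some(limit)
  }
  // Best effort: returning false tells Spark the limit was not fully
  // pushed down, so Spark still applies LIMIT itself and correctness
  // never depends on this optimization.
  false
}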

  this.projectedIndexes = Some(projected)
}

override def pushLimit(limit: Int): Boolean = {
Contributor:

Will Spark SQL avoid pushing down the limit when there is a filter?

YannByron (Contributor Author):

Yes. Spark SQL guarantees this.
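As a rough illustration of that guarantee (hypothetical queries, not taken from this PR):

// A bare LIMIT can reach the scan builder's pushLimit.
spark.sql("SELECT * FROM T LIMIT 5")
// With a filter in between, Spark SQL keeps the limit on its own side.
spark.sql("SELECT * FROM T WHERE a > 1 LIMIT 5")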


import scala.collection.JavaConverters._

class PaimonSQLPerformanceTest extends PaimonSparkTestBase {
Contributor:

What is this class for? It looks like we cannot get performance results from these tests.

YannByron (Contributor Author):

Maybe it's not a good class name. It aims to check whether certain optimizations take effect, for example that SupportsPushDownLimit reduces the number of splits scanned.
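A sketch of the kind of check such a test performs; scannedSplitCount is a hypothetical helper standing in for however the test inspects the scan, and the real assertions may differ:

test("limit push down reduces the number of scanned splits") {
  spark.sql("CREATE TABLE T (a INT, b STRING) USING paimon")
  spark.sql("INSERT INTO T VALUES (1, 'x')")
  spark.sql("INSERT INTO T VALUES (2, 'y')") // two commits, so at least two splits
  // scannedSplitCount is hypothetical: e.g. count the splits in the planned scan.
  assert(scannedSplitCount("SELECT * FROM T LIMIT 1") < scannedSplitCount("SELECT * FROM T"))
}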

@YannByron (Contributor Author):

These failed UTs are not related to this PR.

@YannByron force-pushed the master_limitpushdown branch from a5d6d48 to 65520d4 on November 28, 2023 at 09:48
@YannByron force-pushed the master_limitpushdown branch from 65520d4 to 6fcd11a on November 28, 2023 at 09:49
@YannByron (Contributor Author):

#2404

@YannByron closed this on Nov 28, 2023
@YannByron reopened this on Nov 28, 2023
@@ -199,6 +202,11 @@ public FileStoreScan withMetrics(ScanMetrics metrics) {
    return this;
}

@Override
public FileStoreScan withLimit(int limit) {
@JingsongLi (Contributor) commented Nov 29, 2023:

Can the limit implementation live just in AppendOnlyFileStoreScan? It could override the plan method to apply the limit after planning.

@JingsongLi (Contributor):

Another solution: we can make this limit pushdown generic. We can apply the limit in InnerTableScanImpl and just check Split.convertToRawFiles and its row count, because raw files don't need to be merged.
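A self-contained sketch of that generic idea, using simplified Scala stand-ins rather than Paimon's actual Java classes: accumulate raw-file row counts split by split, stop once the limit is covered, and fall back to scanning all splits whenever a row count cannot be trusted.

import scala.collection.mutable.ArrayBuffer

object LimitPushDownSketch {
  // Simplified stand-ins for DataSplit and RawFile.
  final case class RawFile(path: String, rowCount: Long)
  final case class Split(rawFiles: Option[Seq[RawFile]]) {
    // A row count is only trustworthy when the split exposes raw files,
    // because raw files need no merging.
    def rawRowCount: Option[Long] = rawFiles.map(_.map(_.rowCount).sum)
  }

  // Keep whole splits until the accumulated row count covers the limit.
  def pruneSplitsForLimit(splits: Seq[Split], limit: Long): Seq[Split] = {
    val kept = ArrayBuffer.empty[Split]
    var scanned = 0L
    val it = splits.iterator
    while (it.hasNext && scanned < limit) {
      val split = it.next()
      split.rawRowCount match {
        case Some(n) =>
          kept += split
          scanned += n
        case None =>
          return splits // unknown row count: scan everything (best effort)
      }
    }
    if (scanned >= limit) kept.toSeq else splits
  }
}

Note that the sketch only drops whole splits, which matches the review conclusion further down: there is no need to reduce the files within a split.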

public void serialize(DataOutputView out) throws IOException {
    out.writeUTF(path);
    out.writeLong(offset);
    out.writeLong(length);
    out.writeUTF(format);
    out.writeLong(schemaId);
    out.writeLong(rowCount);
Contributor:

Increment DataSplit.serialVersionUID too.
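For context on why the bump matters, a hedged sketch (the real DataSplit is Java and Paimon versions its format differently in detail): once rowCount joins the serialized layout, readers must be able to reject streams written with the old layout instead of misreading them.

import java.io.{DataInput, DataOutput}

object DataSplitFormatSketch {
  // Illustrative version constant: bumped because rowCount was added.
  val Version = 2

  def serialize(out: DataOutput, path: String, rowCount: Long): Unit = {
    out.writeInt(Version)
    out.writeUTF(path)
    out.writeLong(rowCount)
  }

  def deserialize(in: DataInput): (String, Long) = {
    val version = in.readInt()
    require(version == Version, s"unsupported DataSplit version: $version")
    (in.readUTF(), in.readLong())
  }
}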


import scala.collection.mutable

class SparkScanBuilder(table: Table)
Contributor:

Can we have a SparkBaseScanBuilder to reuse code?
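A hedged sketch of the suggested refactoring; apart from SparkBaseScanBuilder itself, the names and signatures below are assumptions, not the PR's actual code. The shared pushdown state lives in a base class, and each Spark version's builder mixes in SupportsPushDownLimit only where its Spark provides that interface:

import org.apache.paimon.table.Table
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit}

// Shared state for every Spark version's builder.
abstract class SparkBaseScanBuilder(protected val table: Table) extends ScanBuilder {
  protected var projectedIndexes: Option[Seq[Int]] = None
  protected var pushDownLimit: Option[Int] = None
}

// Version-specific builder for Spark versions that offer SupportsPushDownLimit.
class SparkScanBuilder(table: Table)
  extends SparkBaseScanBuilder(table)
  with SupportsPushDownLimit {

  override def pushLimit(limit: Int): Boolean = {
    pushDownLimit = Some(limit)
    false // best effort: Spark still applies the limit
  }

  // buildPaimonScan is a hypothetical stand-in for the real build() body.
  override def build(): Scan = buildPaimonScan(table, projectedIndexes, pushDownLimit)
  private def buildPaimonScan(t: Table, proj: Option[Seq[Int]], limit: Option[Int]): Scan = ???
}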


import scala.collection.mutable

class SparkScanBuilder(table: Table)
Contributor:

You need to add this class for spark-3.2 too, and add some ITCases for spark-3.2 as well (for the ITCases, we can also introduce some abstraction to reduce code duplication).

YannByron (Contributor Author):

For the UTs, just copy the code for now. Let's refine it in another PR.

long scannedRowCount = 0;

List<DataFileMeta> originalDataFiles = split.dataFiles();
List<RawFile> originalRawFiles = split.convertToRawFiles().get();
Contributor:

Just get()? Won't that throw an exception when it is empty?

    limitedSplits.add(split);
    scannedRowCount += splitRowCount;
} else {
    DataSplit newSplit = composeDataSplit(split, pushDownLimit - scannedRowCount);
Contributor:

I don't think we need to introduce composeDataSplit.

The reason we introduce limit pushdown is to reduce the number of splits; we don't need to reduce the files within a split.

@JingsongLi (Contributor) left a comment:

+1

@JingsongLi merged commit 9a19973 into apache:master on Nov 29, 2023