[SPARK-48770][SS] Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries #47167

Closed
wants to merge 3 commits into from

anishshri-db
Contributor

What changes were proposed in this pull request?

Read the operator metadata once on the driver to check whether it contains the numColsPrefixKey info used for session window aggregation queries.

Why are the changes needed?

To avoid reading the operator metadata file multiple times on the executors.
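
For readers unfamiliar with the pattern, here is a minimal sketch of the idea, not the actual Spark code. It is a plain Python simulation: `MetadataFile`, `PartitionReader`, and `plan_partition_readers` are hypothetical names, and the fallback to `0` when the info is not found is an assumption made only for illustration. The value is resolved once during planning (the "driver" side) and handed to every per-partition reader, so the readers never touch the file.

```python
from dataclasses import dataclass
from typing import List, Optional


class MetadataFile:
    """Hypothetical stand-in for the operator metadata file; counts reads."""

    def __init__(self, num_cols_prefix_key: Optional[int]) -> None:
        self._value = num_cols_prefix_key
        self.read_count = 0

    def read(self) -> Optional[int]:
        # Each call models one (potentially expensive) file read.
        self.read_count += 1
        return self._value


@dataclass(frozen=True)
class PartitionReader:
    """Hypothetical per-partition reader, as it would run on an executor."""

    partition_id: int
    num_cols_prefix_key: int  # already resolved, so no file access is needed


def plan_partition_readers(
    metadata: MetadataFile, num_partitions: int
) -> List[PartitionReader]:
    # "Driver" side: read the metadata exactly once.
    found = metadata.read()
    # Assumption for this sketch: fall back to 0 when the info is missing.
    num_cols_prefix_key = found if found is not None else 0
    # Pass the resolved value to every reader instead of the file itself.
    return [
        PartitionReader(partition_id, num_cols_prefix_key)
        for partition_id in range(num_partitions)
    ]
```

With this shape, the number of metadata reads stays at one regardless of how many partitions are planned.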

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.v2.state.RocksDBStateDataSourceReadSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), Idle Worker Monitor for python3 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=tru...
[info] Run completed in 1 minute, 39 seconds.
[info] Total number of tests run: 14
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

Was this patch authored or co-authored using generative AI tooling?

No

…k if we can find info for numColsPrefixKey used for session window agg queries
@github-actions github-actions bot added the SQL label Jul 1, 2024
@anishshri-db
Contributor Author

cc - @HeartSaVioR - PTAL, thx !

@anishshri-db
Contributor Author

I don't think the test failure is related: org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite

It seems other PRs are hitting this too.

@HyukjinKwon
Member

You can retrigger the failed test (https://github.com/anishshri-db/spark/runs/26909263548).

@anishshri-db
Contributor Author

@HyukjinKwon - yes, I did once. Let me try once again.

Contributor

@HeartSaVioR HeartSaVioR left a comment

Just one comment, thanks for the work!

@anishshri-db anishshri-db requested a review from HeartSaVioR July 2, 2024 06:31
Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to master.

eason-yuchen-liu added a commit to eason-yuchen-liu/spark that referenced this pull request Jul 2, 2024
commit 261c671
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jul 2 13:57:57 2024 -0700

    solve conflict

commit 39d0b17
Merge: 9af25f1 c2d59b0
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Tue Jul 2 13:45:12 2024 -0700

    rebase to master

commit c2d59b0
Merge: 9cf8b25 9af25f1
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Tue Jul 2 13:44:50 2024 -0700

    Merge branch 'skipSnapshotAtBatch' into state-cdc

commit 9af25f1
Merge: 8fa9ef5 fea930a
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Tue Jul 2 13:23:25 2024 -0700

    Merge branch 'apache:master' into skipSnapshotAtBatch

commit fea930a
Author: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Date:   Wed Jul 3 05:21:50 2024 +0900

    [SPARK-48770][SS] Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries

    ### What changes were proposed in this pull request?
    Change to read operator metadata once on driver to check if we can find info for numColsPrefixKey used for session window agg queries

    ### Why are the changes needed?
    Avoid reading the operator metadata file multiple times on the executors

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing unit tests

    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.v2.state.RocksDBStateDataSourceReadSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), Idle Worker Monitor for python3 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=tru...
    [info] Run completed in 1 minute, 39 seconds.
    [info] Total number of tests run: 14
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes apache#47167 from anishshri-db/task/SPARK-48770.

    Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
    Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>

commit 8fa9ef5
Merge: 9dbe295 ee0d306
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Tue Jul 2 13:21:01 2024 -0700

    Merge branch 'apache:master' into skipSnapshotAtBatch

commit 9cf8b25
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jul 2 10:53:53 2024 -0700

    add input error tests

commit 7354408
Merge: 6d6d511 9dbe295
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jul 2 10:17:34 2024 -0700

    Merge branch 'skipSnapshotAtBatch' into state-cdc

commit 9dbe295
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 21:54:33 2024 -0700

    minor

commit 6d6d511
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 15:53:04 2024 -0700

    move StateStoreChangeDataReader to other files and delete it

commit 104ba9c
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 15:36:08 2024 -0700

    rename PUT to update

commit 12298b2
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 13:09:02 2024 -0700

    minor

commit 75839ac
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 13:03:59 2024 -0700

    name all cdc to changeData

commit ace711c
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jul 1 12:49:07 2024 -0700

    check validity of input to options

commit 3834cc9
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 28 17:51:16 2024 -0700

    solve format issue

commit 337785d
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 28 17:07:18 2024 -0700

    address comments from Anish

commit 15a8316
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 28 16:46:57 2024 -0700

    refactor StateStoreChangeDataReader

commit b1eb8c4
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 28 15:03:09 2024 -0700

    add integration tests to the new features

commit 7c6cdad
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 16:35:46 2024 -0700

    unify the two traits

commit cd6a39b
Merge: 271b98e d140708
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 16:22:45 2024 -0700

    Merge branch 'skipSnapshotAtBatch' into state-cdc

commit d140708
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 15:17:06 2024 -0700

    provide the script to regenerate golden files

commit 4deb63e
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 14:22:00 2024 -0700

    throw the exception

commit 6f1425d
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 12:09:54 2024 -0700

    reflect more comments from Jungtaek

commit 42d952f
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 11:11:33 2024 -0700

    rename SupportsFineGrainedReplayFromSnapshot to SupportsFineGrainedReplay

commit e15213e
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 27 11:05:50 2024 -0700

    rename to startVersion to snapshotVersion to make its function clear

commit 271b98e
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 26 15:46:33 2024 -0700

    make sure StateStoreChangeData is used everywhere

commit ff5bff2
Merge: 6922595 40b6dc6
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 26 15:22:19 2024 -0700

    Merge branch 'skipSnapshotAtBatch' into state-cdc

commit 40b6dc6
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 26 10:59:17 2024 -0700

    move error to StateStoreErrors

commit 23639f4
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 26 10:44:22 2024 -0700

    create new error for SupportsFineGrainedReplayFromSnapshot

commit 97ee3ef
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 26 10:25:57 2024 -0700

    some naming and formatting comments from Anish and Jungtaek

commit 1a23abb
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 25 14:56:07 2024 -0700

    refactor the code to isolate from current state stores used by streaming queries

commit 876256e
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 25 12:29:40 2024 -0700

    reflect comments from Jungtaek

commit ef9b095
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 25 12:08:34 2024 -0700

    create integration test against golden files

commit 6922595
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jun 24 13:44:19 2024 -0700

    stage

commit 3ece6f2
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 21 21:22:50 2024 -0700

    resort error-conditions

commit be30817
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 21 17:30:12 2024 -0700

    Reflect more comments from Anish

commit cf84d50
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 21 14:02:58 2024 -0700

    support hdfs state store provider

commit 752cdc7
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 20 17:51:33 2024 -0700

    separate CDCPartitionReader from StatePartitionReader

commit bd87055
Merge: 2184396 2eb6646
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 20 17:29:31 2024 -0700

    Merge branch 'skipSnapshotAtBatch' into state-cdc

commit 2eb6646
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 20 17:10:45 2024 -0700

    also update the name of StateTable

commit 2184396
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 20 17:03:18 2024 -0700

    hdfs initial implementation

commit 3f266c1
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jun 17 09:46:07 2024 -0700

    style

commit fe9cea1
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 14 12:50:21 2024 -0700

    address more comments from Anish

commit 1870b35
Merge: 4d4cd70 9eb6c76
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 13 14:25:23 2024 -0700

    Merge branch 'skipSnapshotAtBatch' of https://github.com/eason-yuchen-liu/spark into skipSnapshotAtBatch

commit 4d4cd70
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 13 14:24:55 2024 -0700

    log StateSourceOptions optionally

commit 9eb6c76
Merge: 20e1b9c 08e741b
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Thu Jun 13 14:18:50 2024 -0700

    Merge branch 'master' into skipSnapshotAtBatch

commit 20e1b9c
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 13 14:16:14 2024 -0700

    address comments from Anish & Wei

commit 4825215
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 13 11:45:55 2024 -0700

    address reviews by Wei partially

commit 5229152
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Wed Jun 12 11:29:46 2024 -0700

    support reading join states

commit 61dea35
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 11 13:16:56 2024 -0700

    minor

commit 1656580
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 11 12:07:06 2024 -0700

    improve doc

commit 4ebd078
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 11 11:48:30 2024 -0700

    move partition error

commit dfa712e
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 11 11:42:09 2024 -0700

    clean up and format

commit aa337c1
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 11 10:22:59 2024 -0700

    add new test on partition not found error

commit 292ec5d
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jun 10 16:54:38 2024 -0700

    delete useless test files

commit 1a3d20a
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jun 10 16:52:22 2024 -0700

    make sure test is stable

commit eddb3c7
Merge: 9d902d7 5a2f374
Author: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Date:   Mon Jun 10 11:43:03 2024 -0700

    Merge branch 'apache:master' into skipSnapshotAtBatch

commit 9d902d7
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Mon Jun 10 11:13:02 2024 -0700

    test directly on the method instead of end to end

commit 07267b5
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Fri Jun 7 16:43:45 2024 -0700

    allow rocksdb to reconstruct state from a specific checkpoint

commit 2475173
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Thu Jun 6 10:32:56 2024 -0700

    add test cases for two options in HDFS state store

commit 7dad0c1
Merge: 6db0e3d 8a0927c
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 4 15:30:20 2024 -0700

    Merge branch 'skipSnapshotAtBatch' of https://github.com/eason-yuchen-liu/spark into skipSnapshotAtBatch

commit 6db0e3d
Author: Yuchen Liu <yuchen.liu@databricks.com>
Date:   Tue Jun 4 15:28:49 2024 -0700

    initial implementation
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024