@WweiL (Contributor) commented Jul 24, 2024

This reverts commit d067fc6, which itself reverted 042804a, so it effectively brings 042804a back. 042804a failed the 3.5 client <> 4.0 server test, but it was decided in #47468 to turn that test off for the cross-version test.

What changes were proposed in this pull request?

This PR was created after the discussion in this closed one: #46886
I was trying to fix a bug (in Connect, query.lastProgress doesn't have numInputRows, inputRowsPerSecond, and processedRowsPerSecond), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.

In Python, for both classic Spark and Spark Connect, the return type of lastProgress is Dict (and that of recentProgress is List[Dict]), but in Scala it's the actual StreamingQueryProgress object:

def recentProgress: Array[StreamingQueryProgress]

/**
 * Returns the most recent [[StreamingQueryProgress]] update of this streaming query.
 *
 * @since 2.1.0
 */
def lastProgress: StreamingQueryProgress

This API discrepancy causes confusion: in Scala, users can write query.lastProgress.batchId, while in Python they have to write query.lastProgress["batchId"].

This PR makes StreamingQuery.lastProgress return the actual StreamingQueryProgress (and StreamingQuery.recentProgress return List[StreamingQueryProgress]).

To avoid a breaking change, StreamingQueryProgress is made a subclass of dict, so existing code that uses dictionary access (e.g. query.lastProgress["id"]) still works.
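
As a rough illustration of the dict-subclass idea, here is a minimal, self-contained sketch. It is not the PySpark implementation: the class name ProgressSketch, its constructor, and the sample values are assumptions made up for this example, and only three of the progress fields are modeled.

```python
from typing import Any, Dict, List


class ProgressSketch(dict):
    """Hypothetical stand-in for StreamingQueryProgress (not the PySpark class)."""

    def __init__(self, fields: Dict[str, Any]) -> None:
        # Store every field in the underlying dict so key access keeps working.
        super().__init__(fields)

    @property
    def id(self) -> str:
        return self["id"]

    @property
    def batchId(self) -> int:
        return self["batchId"]

    @property
    def numInputRows(self) -> int:
        return self["numInputRows"]


# Made-up sample values standing in for query.recentProgress.
recent: List[ProgressSketch] = [
    ProgressSketch({"id": "q-1", "batchId": 0, "numInputRows": 10}),
    ProgressSketch({"id": "q-1", "batchId": 1, "numInputRows": 25}),
]
last = recent[-1]  # stands in for query.lastProgress

print(last.batchId)            # attribute style, as in Scala
print(last["batchId"])         # pre-existing dictionary style
print(isinstance(last, dict))  # code that expects a dict still accepts it
total_rows = sum(p.numInputRows for p in recent)
```

Because the object is still a dict, code that iterates over keys or indexes by name keeps working, while new code can use attribute access.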

Why are the changes needed?

API parity

Does this PR introduce any user-facing change?

Yes, now StreamingQuery.lastProgress returns the actual StreamingQueryProgress (and StreamingQuery.recentProgress returns List[StreamingQueryProgress]).

How was this patch tested?

Added unit test

Was this patch authored or co-authored using generative AI tooling?

No

…the actual StreamingQueryProgress"

This reverts commit d067fc6.
@WweiL (Contributor, Author) commented Jul 24, 2024

cc @HyukjinKwon

@HyukjinKwon (Member)

Merged to master.

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
… the actual StreamingQueryProgress

This reverts commit d067fc6, which itself reverted 042804a, so it effectively brings 042804a back. 042804a failed the 3.5 client <> 4.0 server test, but it was decided in apache#47468 to turn that test off for the cross-version test.

### What changes were proposed in this pull request?

This PR was created after the discussion in this closed one: apache#46886
I was trying to fix a bug (in Connect, `query.lastProgress` doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.

In Python, for both classic Spark and Spark Connect, the return type of `lastProgress` is `Dict` (and that of `recentProgress` is `List[Dict]`), but in Scala it's the actual `StreamingQueryProgress` object:
https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101

This API discrepancy causes confusion: in Scala, users can write `query.lastProgress.batchId`, while in Python they have to write `query.lastProgress["batchId"]`.

This PR makes `StreamingQuery.lastProgress` return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` return `List[StreamingQueryProgress]`).

To avoid a breaking change, `StreamingQueryProgress` is made a subclass of `dict`, so existing code that uses dictionary access (e.g. `query.lastProgress["id"]`) still works.

### Why are the changes needed?

API parity

### Does this PR introduce _any_ user-facing change?

Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`).

### How was this patch tested?

Added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47470 from WweiL/bring-back-lastProgress.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>