feat: Short circuit query for local scan #1618

Merged on Apr 23, 2025 (15 commits)

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

product-auto-label (bot) added the labels "size: m" (Pull request size is medium) and "api: bigquery" (Issues related to the googleapis/python-bigquery-dataframes API) on Apr 15, 2025
@TrevorBergeron marked this pull request as ready for review April 22, 2025 21:50
@TrevorBergeron requested review from a team as code owners April 22, 2025 21:50
@TrevorBergeron requested a review from ZehaoXU April 22, 2025 21:50
@TrevorBergeron requested review from tswast and chelsea-lin and removed request for ZehaoXU April 22, 2025 21:50
array.offsets, values, mask=array.is_null()
)
return new_value.fill_null([]), bigframes.dtypes.list_type(values_type)
if array.type == bigframes.dtypes.JSON_ARROW_TYPE:
Contributor

With _canonicalize_json in place, you can probably remove the _validate_content method in this module.

Contributor Author

Removed.

def _canonicalize_scalar(json_string):
    if json_string is None:
        return None
    return json.dumps(
Contributor

Please add a comment explaining why we need to canonicalize JSON here. For reference:
# sort_keys=True sorts dictionary keys before serialization, making
# JSON comparisons deterministic.
# separators=(',', ':') eliminate whitespace to get the most compact
# JSON representation.

Contributor Author

Added a comment.
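
To illustrate the canonicalization discussed in this thread, here is a minimal sketch. The helper name canonicalize_json and the parse-then-reserialize step are assumptions for illustration; the PR's actual _canonicalize_scalar is only partially visible in the diff above.

```python
import json


def canonicalize_json(json_string):
    """Hypothetical helper: return a canonical text form of a JSON string."""
    if json_string is None:
        # Nulls pass through unchanged.
        return None
    # sort_keys=True orders object keys so equal documents serialize identically.
    # separators=(",", ":") drops optional whitespace for the most compact form.
    return json.dumps(
        json.loads(json_string), sort_keys=True, separators=(",", ":")
    )


# Two spellings of the same document compare equal once canonicalized.
left = canonicalize_json('{"b": 1, "a": [1, 2]}')
right = canonicalize_json('{ "a":[1,2],  "b":1 }')
assert left == right == '{"a":[1,2],"b":1}'
```

Without this step, string comparison of JSON values would depend on incidental key order and whitespace.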

@@ -0,0 +1,68 @@
# Copyright 2025 Google LLC
Contributor

Rename the file to local_scan_executor to match bq_caching_executor?

Contributor Author

Done.

from bigframes.session import executor, semi_executor


class LocalScanExecutor(semi_executor.SemiExecutor):
Contributor

What would be the problem if it inherited from executor.Executor? The Polars executor and the SQL compiler executor (coming for the sqlglot unit tests) both inherit from executor.Executor. I'm wondering whether we can avoid introducing additional classes, for better readability and less maintenance.

Contributor Author

SemiExecutor is a much smaller interface with only a single method. The general executor interface supports exporting, caching, etc. Hopefully we can greatly simplify the general executor interface at some point, but until then, SemiExecutor provides the interface for hybrid execution components.
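
The single-method design described above might look roughly like the sketch below. Everything here is hypothetical and simplified for illustration: the plan classes, the execute signature, and the run helper are assumptions, not the actual bigframes API.

```python
import abc
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class LocalScan:
    """Hypothetical plan node: data already held in local memory."""
    rows: tuple


@dataclass(frozen=True)
class RemoteQuery:
    """Hypothetical plan node: work that must go to the query engine."""
    sql: str


class SemiExecutor(abc.ABC):
    """Single-method interface: handle the plan, or decline with None."""

    @abc.abstractmethod
    def execute(self, plan) -> Optional[list]:
        ...


class LocalScanExecutor(SemiExecutor):
    def execute(self, plan) -> Optional[list]:
        if isinstance(plan, LocalScan):
            # Short circuit: answer from local data without issuing a query.
            return list(plan.rows)
        return None  # Decline; the caller falls back to the full executor.


def run(plan, semi_executors: Sequence[SemiExecutor], fallback: Callable):
    """Try each semi-executor in order, then fall back to the general path."""
    for semi in semi_executors:
        result = semi.execute(plan)
        if result is not None:
            return result
    return fallback(plan)
```

Because a semi-executor may decline, it can stay small: exporting, caching, and the other responsibilities remain with the general executor that serves as the fallback.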

@@ -155,7 +155,7 @@ def test_json_extract_array_from_json_strings():
)
actual = bbq.json_extract_array(s, "$.a")
expected = bpd.Series(
- [['"ab"', '"2"', '"3 xy"'], [], ['"4"', '"5"'], None],
+ [['"ab"', '"2"', '"3 xy"'], [], ['"4"', '"5"'], []],
Contributor

As we discussed, can we please check whether the None -> [] change is required?

Contributor Author

Reverting; it works without this change.

@TrevorBergeron merged commit e84f232 into main Apr 23, 2025
23 of 24 checks passed
@TrevorBergeron deleted the local_semi_exec2 branch April 23, 2025 23:52
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
3 participants