@HyukjinKwon commented Dec 26, 2024

What changes were proposed in this pull request?

This PR proposes a Pythonic approach to getting and setting Spark SQL configurations, as shown below.

Get/set/unset configurations

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled = "true"
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
'true'
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> del spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'true'
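
A minimal sketch of how such dict- and attribute-style access could be layered over the get/set/unset methods that PySpark's RuntimeConfig already exposes. This is illustrative only, not the PR's actual implementation; FakeConf and ConfAccessor are hypothetical names:

# Hypothetical sketch, not the PR's actual code. FakeConf stands in
# for Spark's RuntimeConfig, which exposes get/set/unset.

class FakeConf:
    def __init__(self, defaults):
        self._defaults = dict(defaults)   # built-in default values
        self._overrides = {}              # user-set values

    def get(self, key):
        if key in self._overrides:
            return self._overrides[key]
        return self._defaults[key]        # KeyError for unknown keys

    def set(self, key, value):
        self._overrides[key] = value

    def unset(self, key):
        self._overrides.pop(key, None)    # fall back to the default


class ConfAccessor:
    def __init__(self, conf, prefix=""):
        # Bypass our own __setattr__, which would treat these
        # assignments as configuration writes.
        object.__setattr__(self, "_conf", conf)
        object.__setattr__(self, "_prefix", prefix)

    def _key(self, name):
        return f"{self._prefix}.{name}" if self._prefix else name

    def __getitem__(self, key):
        return self._conf.get(self._key(key))

    def __setitem__(self, key, value):
        self._conf.set(self._key(key), value)

    def __delitem__(self, key):
        self._conf.unset(self._key(key))

    def __getattr__(self, name):
        # Each attribute access extends the dotted key by one segment,
        # so conf.spark.sql... walks the key a piece at a time.
        return ConfAccessor(self._conf, self._key(name))

    def __setattr__(self, name, value):
        self._conf.set(self._key(name), value)

    def __repr__(self):
        # A leaf resolves to its current value; an intermediate
        # prefix just shows itself.
        try:
            return repr(self._conf.get(self._prefix))
        except KeyError:
            return f"<conf prefix {self._prefix!r}>"


conf = ConfAccessor(FakeConf({"spark.sql.x.enabled": "true"}))
conf["spark.sql.x.enabled"] = "false"
print(conf["spark.sql.x.enabled"])   # false
conf.spark.sql.x.enabled = "true"
print(conf.spark.sql.x.enabled)      # 'true' (resolved via __repr__)
del conf["spark.sql.x.enabled"]
print(conf["spark.sql.x.enabled"])   # true (default restored)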

List sub-configurations

>>> dir(spark.conf["spark.sql.optimizer"])
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
>>> dir(spark.conf.spark.sql.optimizer)
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
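
Under the hood, a listing like this could plausibly be a __dir__ override that filters every known configuration key by the accumulated prefix. A small self-contained sketch of that filtering (list_sub_configurations is a hypothetical helper, not part of the PR):

def list_sub_configurations(all_keys, prefix):
    # Return the key suffixes under `prefix`, as dir() would show them.
    prefix = prefix.rstrip(".") + "."
    return sorted(k[len(prefix):] for k in all_keys if k.startswith(prefix))

keys = [
    "spark.sql.optimizer.excludedRules",
    "spark.sql.optimizer.runtime.bloomFilter.enabled",
    "spark.sql.adaptive.enabled",
]
print(list_sub_configurations(keys, "spark.sql.optimizer"))
# ['excludedRules', 'runtime.bloomFilter.enabled']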

Get documentation from the configuration

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"].desc()
"Enables runtime group filtering for group-based row-level operations. Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten."
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled.version()
'3.4.0'
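
Note that a looked-up configuration both prints as a plain string and exposes desc()/version(). One way to get that behavior is a str subclass carrying the metadata; ConfValue below is a hypothetical sketch, and the mechanism may differ from the PR's actual implementation:

class ConfValue(str):
    # Hypothetical: behaves exactly like the underlying string value,
    # but carries the configuration's documentation and the Spark
    # version it was introduced in.
    def __new__(cls, value, desc="", version=""):
        obj = super().__new__(cls, value)
        obj._desc = desc
        obj._version = version
        return obj

    def desc(self):
        return self._desc

    def version(self):
        return self._version

v = ConfValue("true", desc="Enables runtime group filtering ...", version="3.4.0")
print(v == "true")   # True: ordinary string comparison still works
print(v.desc())      # Enables runtime group filtering ...
print(v.version())   # 3.4.0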

Why are the changes needed?

To provide a Pythonic way of setting options. pandas supports a similar approach (https://pandas.pydata.org/docs/user_guide/options.html), shown below for reference.
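
For reference, pandas exposes its options through both a functional API and attribute access:

import pandas as pd

# Functional API
pd.set_option("display.max_rows", 100)
pd.get_option("display.max_rows")        # 100

# Attribute-style access, analogous to spark.conf.spark.sql...
pd.options.display.max_rows = 50
pd.options.display.max_rows              # 50

pd.reset_option("display.max_rows")      # restore the default
pd.describe_option("display.max_rows")   # print the option's documentation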

This should be particularly useful for interactive shell users: they can discover configurations and read their documentation without leaving the shell to consult the SQL configuration documentation page.

Does this PR introduce any user-facing change?

Yes, it provides users with a more Pythonic way of setting SQL configurations, as demonstrated above.

How was this patch tested?

TBD

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon commented:

let me close this for now

@HyukjinKwon closed this Feb 5, 2025