[SPARK-31020][SPARK-31023][SPARK-31025][SPARK-31044][SQL] Support foldable args by from_csv/json and schema_of_csv/json #27804
@HyukjinKwon @cloud-fan Please review the PR.
Test build #119371 has finished for PR 27804 at commit
|
jenkins, retest this, please
As mentioned in other PRs, I am okay with this. I don't feel strongly. Let me know if you have preferences, @cloud-fan and @dongjoon-hyun.
@@ -24,7 +24,7 @@ select from_csv('1', 1)
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of 1;; line 1 pos 7
+The expression '1' must be evaluated to a valid string.;; line 1 pos 7
valid schema string?
And as an error message, it seems better to have: The expression '1' is not a valid schema string.
replaced
I'm fine as well, though I don't think it has a large impact.
Test build #119375 has finished for PR 27804 at commit
|
Test build #119408 has finished for PR 27804 at commit
|
Test build #119415 has finished for PR 27804 at commit
|
thanks, merging to master!
…dable args by `from_csv/json` and `schema_of_csv/json`

### What changes were proposed in this pull request?

In the PR, I propose:
1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` with a check of the foldable property of the `schema` expression.
2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` with a check of the foldable property of the `schema` expression.
3. To change the check of the input parameter in the `SchemaOfCsv` expression so that it allows a foldable `child` expression.
4. To change the check of the input parameter in the `SchemaOfJson` expression so that it allows a foldable `child` expression.

### Why are the changes needed?

This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
```
and only string literals are acceptable as CSV/JSON examples by `schema_of_csv`/`schema_of_json`:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7;
'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 as string))), None)]
+- OneRowRelation
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7;
'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)]
+- OneRowRelation
```

### Does this PR introduce any user-facing change?

Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
```
After the change, the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
struct<_c0:double,_c1:int>
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
struct<id:bigint,price:double>
```

### How was this patch tested?

Added new tests to `CsvFunctionsSuite` and to `JsonFunctionsSuite`.

Closes apache#27804 from MaxGekk/foldable-arg-csv-json-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
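
For readers unfamiliar with the term, the difference between "matching by `Literal`" and "checking the foldable property" can be illustrated with a toy expression tree. This is only a rough Python sketch, not Spark's code: the real logic lives in Scala in `ExprUtils`, and every class and function name below (`Expr`, `ColumnRef`, `Replace`, `eval_schema_literal_only`, `eval_schema_foldable`) is hypothetical and exists purely for illustration. The assumption is that an expression is foldable when it can be evaluated to a constant without reading any input row.

```python
from dataclasses import dataclass


class Expr:
    """Base class of a toy expression tree (hypothetical, not Spark's API)."""

    @property
    def foldable(self) -> bool:
        raise NotImplementedError

    def eval(self, row=None):
        raise NotImplementedError


@dataclass(frozen=True)
class Literal(Expr):
    value: object

    @property
    def foldable(self) -> bool:
        return True  # a literal is always a constant

    def eval(self, row=None):
        return self.value


@dataclass(frozen=True)
class ColumnRef(Expr):
    name: str

    @property
    def foldable(self) -> bool:
        return False  # depends on the input row

    def eval(self, row=None):
        return row[self.name]


@dataclass(frozen=True)
class Replace(Expr):
    src: Expr
    search: Expr
    replacement: Expr

    @property
    def foldable(self) -> bool:
        # Foldable only if every child is foldable.
        return all(c.foldable for c in (self.src, self.search, self.replacement))

    def eval(self, row=None):
        return self.src.eval(row).replace(
            self.search.eval(row), self.replacement.eval(row)
        )


def eval_schema_literal_only(expr: Expr) -> str:
    """Old-style check (as modelled here): accept string literals only."""
    if isinstance(expr, Literal) and isinstance(expr.value, str):
        return expr.value
    raise ValueError("Schema should be specified as a string literal")


def eval_schema_foldable(expr: Expr) -> str:
    """New-style check (as modelled here): accept any foldable string expression."""
    if expr.foldable:
        value = expr.eval()  # safe without a row: the expression is constant
        if isinstance(value, str):
            return value
    raise ValueError(f"The expression {expr!r} is not a valid schema string.")


schema = Replace(
    Literal("dpt_org_id INT, dpt_org_city STRING"), Literal("dpt_org_"), Literal("")
)
print(eval_schema_foldable(schema))
```

In this model the `Replace(...)` expression is rejected by the literal-only check but accepted by the foldable check, which mirrors the before/after behaviour of the `replace(...)` example in the description. A `ColumnRef` is still rejected by both, since its value is not known before a row is read.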