[SPARK-25672][SQL] schema_of_csv() - schema inference from an example #22666
@@ -7,3 +7,11 @@ select from_csv('1', 'a InvalidType');
 select from_csv('1', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_csv('1', 'a INT', map('mode', 1));
 select from_csv();
+-- infer schema of json literal
+select from_csv('1,abc', schema_of_csv('1,abc'));
+select schema_of_csv('1|abc', map('delimiter', '|'));
+select schema_of_csv(null);
+CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a');
+SELECT schema_of_csv(csvField) FROM csvTable;
+-- Clean up
+DROP VIEW IF EXISTS csvTable;
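To make the idea behind these test queries concrete, here is a minimal, Spark-free sketch of inferring a schema from a single CSV example. It is not Spark's implementation: the names `infer_field_type` and `schema_of_csv_sketch` are hypothetical, the type rules are deliberately coarse (int, double, string only), and the output string only approximates a struct description. The `_c0`, `_c1` column names follow Spark's default naming for header-less CSV. The sketch rejects `None`; what Spark itself does with a NULL argument is not asserted here.

```python
import csv
import io


def infer_field_type(value: str) -> str:
    """Return a coarse type name for one CSV field (toy rules, not Spark's)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def schema_of_csv_sketch(example: str, delimiter: str = ",") -> str:
    """Infer a struct-like schema string from one example CSV row."""
    if example is None:
        # Assumption of this sketch: a missing example is an error.
        raise ValueError("example must be a non-null string")
    # Parse only the first record; an empty string yields no fields.
    row = next(csv.reader(io.StringIO(example), delimiter=delimiter), [])
    fields = [f"_c{i}:{infer_field_type(v)}" for i, v in enumerate(row)]
    return "struct<" + ",".join(fields) + ">"


print(schema_of_csv_sketch("1,abc"))                  # struct<_c0:int,_c1:string>
print(schema_of_csv_sketch("1|abc", delimiter="|"))   # struct<_c0:int,_c1:string>
```

The `delimiter` argument plays the role of the `map('delimiter', '|')` option in the second query above.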
Actually we don't need to clean up temp views. The golden file test is run with a fresh session.

I see, but isn't it still better to explicitly clean tables up?

Yeah, we need to clean up tables, as they are permanent. Actually I'm fine with it, as we clean up temp views in a lot of golden files. We can have another PR to remove these temp view cleanups.

how about

We also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157 in case the constant folding rule is disabled.

Yup, that's what I initially thought: that we should allow constant-foldable expressions as well. But I just decided to follow the initial intent, which was literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspected it's usually copied and pasted from, for instance, a file.

For example, a column with a CSV string may be the result of string functions. So you could just invoke those functions with particular inputs. Currently, we force people to materialize an example and copy-paste it into schema_of_csv(). That could cause maintainability issues, because users have to keep the example in schema_of_csv() in sync with the code which forms the CSV column. I prepared the PR #27777 to avoid the restriction, which is not necessary from my point of view.
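The difference between "literal only" and "any foldable expression" discussed in this thread can be illustrated with a toy expression model. This is only a sketch: the classes `Literal`, `ColumnRef`, and `ConcatWs` and the helper `example_for_schema` are hypothetical stand-ins and do not reproduce Catalyst's actual classes or checks. The point is that an expression built purely from constants can be evaluated without any input row, so it can supply the example string just as a literal can.

```python
class Literal:
    """A constant string value."""
    foldable = True

    def __init__(self, value: str):
        self.value = value

    def eval(self) -> str:
        return self.value


class ColumnRef:
    """A reference to a table column; it has no value without a row."""
    foldable = False

    def __init__(self, name: str):
        self.name = name

    def eval(self) -> str:
        raise ValueError(f"column {self.name!r} needs an input row")


class ConcatWs:
    """Join child expressions with a separator (toy string function)."""

    def __init__(self, sep: str, *children):
        self.sep = sep
        self.children = children

    @property
    def foldable(self) -> bool:
        # Foldable only if every input is itself a constant expression.
        return all(child.foldable for child in self.children)

    def eval(self) -> str:
        return self.sep.join(child.eval() for child in self.children)


def example_for_schema(expr, literal_only: bool) -> str:
    """Return the example string, or raise if expr is not accepted."""
    if literal_only:
        if not isinstance(expr, Literal):
            raise TypeError("expected a string literal")
        return expr.value
    if not expr.foldable:
        raise TypeError("expected a foldable (constant) expression")
    return expr.eval()


built = ConcatWs(",", Literal("1"), Literal("abc"))
print(example_for_schema(built, literal_only=False))  # 1,abc
```

Under the literal-only rule, `built` is rejected even though its value is known up front, which is the copy-paste burden described above. A `ColumnRef` is rejected under both rules, since its value depends on the row being processed.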