Skip to content

[SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of the option in CSV data source #37327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

sadikovi
Copy link
Contributor

@sadikovi sadikovi commented Jul 28, 2022

What changes were proposed in this pull request?

This is a follow-up for #36871.

PR renames inferDate to prefersDate to avoid confusion when dates inference would change the column type and result in confusion when the user meant to infer timestamps.

The patch also updates semantics of the option: prefersDate is allowed to be used during schema inference (inferSchema) as well as user-provided schema where it could be used as a fallback mechanism when parsing timestamps.

Why are the changes needed?

Fixes ambiguity when setting prefersDate to true and clarifies semantics of the option.

Does this PR introduce any user-facing change?

Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release.

How was this patch tested?

I added a unit test for prefersDate = true with a user schema.

@sadikovi
Copy link
Contributor Author

cc @cloud-fan @HyukjinKwon @Jonathancui123

@HyukjinKwon HyukjinKwon changed the title [SPARK-39904] Rename inferDate to preferDate and add check for inferSchema = false [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false Jul 28, 2022
Copy link
Contributor

@Jonathancui123 Jonathancui123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out the gap! @sadikovi Based on the above I think we should:

  • Still name the option inferDate because the primary impact of the option is that columns with dates will be correctly inferred
  • Make the change to throw an error when inferSchema is false but inferDate is true. This will clarify that the primary purpose of the inferDate flag is for allowing dates in the inferred schema.

@sadikovi
Copy link
Contributor Author

sadikovi commented Jul 29, 2022

Should we keep requirement that inferDate = true needs inferSchema = true? I think we should clarify semantics.

@Jonathancui123
Copy link
Contributor

Jonathancui123 commented Jul 29, 2022

Should we keep requirement that inferDate = true needs inferSchema = true? I think we should clarify semantics.

@sadikovi I think we should keep the requirement and the new exception type in this PR. The exception will clarify that the primary purpose of the inferDate flag is for allowing dates in the inferred schema. The requirement that inferDate = true needs inferSchema = true makes sense. Without such a requirement, inferDate=true with inferSchema = false is modifying parsing fallback behavior for no reason.

@sadikovi
Copy link
Contributor Author

Yes, that was my thinking too. Okay, I will make a few changes to the PR.

@cloud-fan
Copy link
Contributor

The problem is, at the reading phase, we don't know if the schema is inferred or user specified. Even if inferSchema=true, people can still specify schema via spark.read.schema(...), right?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@sadikovi sadikovi changed the title [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false [SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of the option in CSV data source Aug 1, 2022
@sadikovi
Copy link
Contributor Author

sadikovi commented Aug 1, 2022

@Jonathancui123 @cloud-fan @HyukjinKwon I have renamed the option and clarified the option description. Can you review again? Thanks.

<td>false</td>
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td>
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contain dates or timestamps reads a bit confusing. How about just ... infer string columns as Date ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to understand failed to be parsed by the respective formatter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update.

<td>false</td>
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td>
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

@sadikovi sadikovi requested a review from Jonathancui123 August 2, 2022 01:55
<td>false</td>
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td>
<td>Attempts to infer string columns as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<td>Attempts to infer string columns as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
<td>During schema inference (<code>inferSchema</code>), attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy the <code>dateFormat</code> option and failed to be parsed by the respective formatter. Attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>. The parsed values will be cast to timestamp type afterwards.</td>
  1. Re-order sentence for readability
  2. Removed: "When used in conjunction with a user-provided schema" because parsing timestamp columns as dates happens regardless of whether there is a user-provided schema
  3. Split sentence at comma for readabilityWhen used in conjunction with a user-provided schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still prefer to keep the change for user-provided schema. There is a different behaviour when schema inference or providing one. During the former, string columns would be converted to dates but in the latter they would be kept as timestamp columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update and we can discuss further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the text, essentially kept your suggestions and just inserted "With a user-provided schema". Can you take another look? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense. In my opinion, it is more confusing for the reader. But LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks. I just wanted to clarify the behaviour around user-provided schema as opposed to schema inference to remove any ambiguity that would arise from users.

@HyukjinKwon
Copy link
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Feb 16, 2023
…TimestampNTZ

### What changes were proposed in this pull request?

Similar with #37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ`

### Why are the changes needed?

It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in #37327,

### Does this PR introduce _any_ user-facing change?

No, the TimestampNTZ project is not released yet.

### How was this patch tested?

UT

Closes #40042 from gengliangwang/inferNTZOption.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 16, 2023
…TimestampNTZ

Similar with #37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ`

It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in #37327,

No, the TimestampNTZ project is not released yet.

UT

Closes #40042 from gengliangwang/inferNTZOption.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8194522)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…TimestampNTZ

Similar with apache#37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ`

It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in apache#37327,

No, the TimestampNTZ project is not released yet.

UT

Closes apache#40042 from gengliangwang/inferNTZOption.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8194522)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants