-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of the option in CSV data source #37327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out the gap! @sadikovi Based on the above I think we should:
- Still name the option
inferDate
because the primary impact of the option is that columns with dates will be correctly inferred - Make the change to throw an error when
inferSchema
is false butinferDate
is true. This will clarify that the primary purpose of theinferDate
flag is for allowing dates in the inferred schema.
Should we keep requirement that |
@sadikovi I think we should keep the requirement and the new exception type in this PR. The exception will clarify that the primary purpose of the |
Yes, that was my thinking too. Okay, I will make a few changes to the PR. |
The problem is, at the reading phase, we don't know if the schema is inferred or user specified. Even if |
Can one of the admins verify this patch? |
@Jonathancui123 @cloud-fan @HyukjinKwon I have renamed the option and clarified the option description. Can you review again? Thanks. |
docs/sql-data-sources-csv.md
Outdated
<td>false</td> | ||
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td> | ||
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
contain dates or timestamps
reads a bit confusing. How about just ... infer string columns as Date ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to understand failed to be parsed by the respective formatter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me update.
docs/sql-data-sources-csv.md
Outdated
<td>false</td> | ||
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td> | ||
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> | |
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
docs/sql-data-sources-csv.md
Outdated
<td>false</td> | ||
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td> | ||
<td>Attempts to infer string columns as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<td>Attempts to infer string columns as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td> | |
<td>During schema inference (<code>inferSchema</code>), attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy the <code>dateFormat</code> option and failed to be parsed by the respective formatter. Attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>. The parsed values will be cast to timestamp type afterwards.</td> |
- Re-order sentence for readability
- Removed: "When used in conjunction with a user-provided schema" because parsing timestamp columns as dates happens regardless of whether there is a user-provided schema
- Split sentence at comma for readabilityWhen used in conjunction with a user-provided schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still prefer to keep the change for user-provided schema. There is a different behaviour when schema inference or providing one. During the former, string columns would be converted to dates but in the latter they would be kept as timestamp columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me update and we can discuss further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the text, essentially kept your suggestions and just inserted "With a user-provided schema". Can you take another look? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes sense. In my opinion, it is more confusing for the reader. But LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks. I just wanted to clarify the behaviour around user-provided schema as opposed to schema inference to remove any ambiguity that would arise from users.
Merged to master. |
…TimestampNTZ ### What changes were proposed in this pull request? Similar with #37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ` ### Why are the changes needed? It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in #37327, ### Does this PR introduce _any_ user-facing change? No, the TimestampNTZ project is not released yet. ### How was this patch tested? UT Closes #40042 from gengliangwang/inferNTZOption. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…TimestampNTZ Similar with #37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ` It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in #37327, No, the TimestampNTZ project is not released yet. UT Closes #40042 from gengliangwang/inferNTZOption. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8194522) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…TimestampNTZ Similar with apache#37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ` It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in apache#37327, No, the TimestampNTZ project is not released yet. UT Closes apache#40042 from gengliangwang/inferNTZOption. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8194522) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This is a follow-up for #36871.
PR renames
inferDate
toprefersDate
to avoid confusion when dates inference would change the column type and result in confusion when the user meant to infer timestamps.The patch also updates semantics of the option:
prefersDate
is allowed to be used during schema inference (inferSchema
) as well as user-provided schema where it could be used as a fallback mechanism when parsing timestamps.Why are the changes needed?
Fixes ambiguity when setting
prefersDate
to true and clarifies semantics of the option.Does this PR introduce any user-facing change?
Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release.
How was this patch tested?
I added a unit test for prefersDate = true with a user schema.