
Commit 31ab8bc

sadikovi authored and HyukjinKwon committed
[SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of the option in CSV data source
### What changes were proposed in this pull request? This is a follow-up for #36871. PR renames `inferDate` to `prefersDate` to avoid confusion when dates inference would change the column type and result in confusion when the user meant to infer timestamps. The patch also updates semantics of the option: `prefersDate` is allowed to be used during schema inference (`inferSchema`) as well as user-provided schema where it could be used as a fallback mechanism when parsing timestamps. ### Why are the changes needed? Fixes ambiguity when setting `prefersDate` to true and clarifies semantics of the option. ### Does this PR introduce _any_ user-facing change? Although it is an option rename, the original PR was merged a few days ago and the config option has not been included in a Spark release. ### How was this patch tested? I added a unit test for prefersDate = true with a user schema. Closes #37327 from sadikovi/rename_config. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent e9cc102 commit 31ab8bc

File tree

8 files changed: +72 −30 lines changed

docs/sql-data-sources-csv.md

Lines changed: 4 additions & 4 deletions

```diff
@@ -109,9 +109,9 @@ Data source options of CSV can be set via:
   <td>read</td>
 </tr>
 <tr>
-  <td><code>inferDate</code></td>
+  <td><code>prefersDate</code></td>
   <td>false</td>
-  <td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td>
+  <td>During schema inference (<code>inferSchema</code>), attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy the <code>dateFormat</code> option and failed to be parsed by the respective formatter. With a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, in this case the parsed values will be cast to timestamp type afterwards.</td>
   <td>read</td>
 </tr>
 <tr>
@@ -176,8 +176,8 @@ Data source options of CSV can be set via:
 </tr>
 <tr>
   <td><code>enableDateTimeParsingFallback</code></td>
-  <td>Enabled if the time parser policy is legacy or no custom date or timestamp pattern was provided</td>
-  <td>Allows to fall back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
+  <td>Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided.</td>
+  <td>Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
   <td>read</td>
 </tr>
 <tr>
```

docs/sql-data-sources-json.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -204,8 +204,8 @@ Data source options of JSON can be set via:
 </tr>
 <tr>
   <td><code>enableDateTimeParsingFallback</code></td>
-  <td>Enabled if the time parser policy is legacy or no custom date or timestamp pattern was provided</td>
-  <td>Allows to fall back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
+  <td>Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided.</td>
+  <td>Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
   <td>read</td>
 </tr>
 <tr>
```

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala

Lines changed: 3 additions & 3 deletions

```diff
@@ -124,9 +124,9 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
       case _: DecimalType => tryParseDecimal(field)
       case DoubleType => tryParseDouble(field)
       case DateType => tryParseDateTime(field)
-      case TimestampNTZType if options.inferDate => tryParseDateTime(field)
+      case TimestampNTZType if options.prefersDate => tryParseDateTime(field)
       case TimestampNTZType => tryParseTimestampNTZ(field)
-      case TimestampType if options.inferDate => tryParseDateTime(field)
+      case TimestampType if options.prefersDate => tryParseDateTime(field)
       case TimestampType => tryParseTimestamp(field)
       case BooleanType => tryParseBoolean(field)
       case StringType => StringType
@@ -178,7 +178,7 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   private def tryParseDouble(field: String): DataType = {
     if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field)) {
       DoubleType
-    } else if (options.inferDate) {
+    } else if (options.prefersDate) {
       tryParseDateTime(field)
     } else {
       tryParseTimestampNTZ(field)
```
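The date-before-timestamp ordering in the hunk above can be sketched in plain Scala, independent of Spark. This is a minimal illustration with hypothetical names (`InferSketch`, `inferField` are not Spark's API) and assumed `yyyy-MM-dd` / `yyyy-MM-dd'T'HH:mm:ss` patterns: with `prefersDate` enabled, the strict date pattern is tried first, and only on failure does the field fall through to the timestamp pattern.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Minimal sketch (not Spark's CSVInferSchema) of the prefersDate ordering:
// try the strict date pattern first, then the timestamp pattern.
object InferSketch {
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  // Returns true if the by-name parse expression succeeds.
  private def parses(p: => Any): Boolean =
    try { p; true } catch { case _: DateTimeParseException => false }

  def inferField(field: String, prefersDate: Boolean): String =
    if (prefersDate && parses(LocalDate.parse(field, dateFmt))) "date"
    else if (parses(LocalDateTime.parse(field, tsFmt))) "timestamp"
    else "string"
}
```

Because `DateTimeFormatter` requires the whole input to be consumed, a value like `2018-12-03T11:00:00` fails the date pattern and still resolves as a timestamp, which mirrors the expectations in `CSVInferSchemaSuite`.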

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

Lines changed: 15 additions & 9 deletions

```diff
@@ -149,23 +149,29 @@ class CSVOptions(
   val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
   /**
-   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
-   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
-   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
-   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type)
+   * if schema inference is enabled. When being used with user-provided schema, tries to parse
+   * timestamp values as dates if the values do not conform to the timestamp formatter before
+   * falling back to the backward compatible parsing - the parsed values will be cast to timestamp
+   * afterwards.
+   *
+   * Disabled by default for backwards compatibility and performance.
+   *
+   * Not compatible with legacyTimeParserPolicy == LEGACY since legacy date parser will accept
+   * extra trailing characters.
    */
-  val inferDate = {
-    val inferDateFlag = getBool("inferDate")
-    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {
+  val prefersDate = {
+    val inferDateFlag = getBool("prefersDate")
+    if (inferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
       throw QueryExecutionErrors.inferDateWithLegacyTimeParserError()
     }
     inferDateFlag
   }
 
-  // Provide a default value for dateFormatInRead when inferDate. This ensures that the
+  // Provide a default value for dateFormatInRead when prefersDate. This ensures that the
   // Iso8601DateFormatter (with strict date parsing) is used for date inference
   val dateFormatInRead: Option[String] =
-    if (inferDate) {
+    if (prefersDate) {
       Option(parameters.getOrElse("dateFormat", DateFormatter.defaultPattern))
     } else {
       parameters.get("dateFormat")
```
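As a standalone sketch of the option resolution above (names and the `"yyyy-MM-dd"` default pattern are assumptions for illustration; Spark's version reads `SQLConf` and throws via `QueryExecutionErrors`): the renamed option is read from the parameter map, rejected when the legacy time parser policy is in effect, and the date pattern is defaulted only when the option is enabled.

```scala
// Hypothetical standalone version of the CSVOptions logic above.
object OptionsSketch {
  // Read "prefersDate" and reject it under the legacy parser policy.
  def resolvePrefersDate(params: Map[String, String], legacyPolicy: Boolean): Boolean = {
    val flag = params.get("prefersDate").exists(_.toBoolean)
    if (flag && legacyPolicy) {
      // Stands in for QueryExecutionErrors.inferDateWithLegacyTimeParserError()
      throw new IllegalArgumentException(
        "CANNOT_INFER_DATE: prefersDate is not supported with the legacy time parser")
    }
    flag
  }

  // Default the date pattern only when prefersDate is enabled, so a strict
  // date formatter is always available for date inference.
  def dateFormatInRead(params: Map[String, String], prefersDate: Boolean): Option[String] =
    if (prefersDate) Some(params.getOrElse("dateFormat", "yyyy-MM-dd"))
    else params.get("dateFormat")
}
```

Checking the flag before consulting the policy (as the reordered `if` in the diff does) keeps the common disabled path free of any policy lookup.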

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

Lines changed: 2 additions & 2 deletions

```diff
@@ -235,7 +235,7 @@ class UnivocityParser(
       } catch {
         case NonFatal(e) =>
           // There may be date type entries in timestamp column due to schema inference
-          if (options.inferDate) {
+          if (options.prefersDate) {
             daysToMicros(dateFormatter.parse(datum), options.zoneId)
           } else {
             // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
@@ -254,7 +254,7 @@ class UnivocityParser(
       try {
         timestampNTZFormatter.parseWithoutTimeZone(datum, false)
       } catch {
-        case NonFatal(e) if (options.inferDate) =>
+        case NonFatal(e) if options.prefersDate =>
           daysToMicros(dateFormatter.parse(datum), TimeZoneUTC.toZoneId)
       }
     }
```
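The fallback in the hunk above can be illustrated with a self-contained java.time sketch (object and method names are hypothetical; Spark uses its own formatters and `daysToMicros`): a value in a timestamp column that fails the timestamp pattern is re-parsed with the date pattern, and the resulting day is promoted to a midnight timestamp.

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Sketch of the prefersDate parse fallback: timestamp pattern first, then
// date pattern, with the parsed date cast to a midnight UTC timestamp
// (microseconds since the epoch, mirroring daysToMicros).
object ParseSketch {
  private val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  def parseTimestampMicros(datum: String): Option[Long] =
    try {
      Some(LocalDateTime.parse(datum, tsFmt).toInstant(ZoneOffset.UTC).toEpochMilli * 1000L)
    } catch {
      case _: DateTimeParseException =>
        // prefersDate fallback: parse as a date, cast to timestamp afterwards
        try Some(LocalDate.parse(datum, dateFmt).atStartOfDay
          .toInstant(ZoneOffset.UTC).toEpochMilli * 1000L)
        catch { case _: DateTimeParseException => None }
    }
}
```

This is the behavior the new `CSVSuite` test below exercises end to end: `"2020-02-02"` in a `TimestampType` column becomes `2020-02-02 00:00:00`, while `"invalid"` becomes null.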

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala

Lines changed: 5 additions & 5 deletions

```diff
@@ -201,30 +201,30 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper {
 
   test("SPARK-39469: inferring date type") {
     // "yyyy/MM/dd" format
-    var options = new CSVOptions(Map("dateFormat" -> "yyyy/MM/dd", "inferDate" -> "true"),
+    var options = new CSVOptions(Map("dateFormat" -> "yyyy/MM/dd", "prefersDate" -> "true"),
       false, "UTC")
     var inferSchema = new CSVInferSchema(options)
     assert(inferSchema.inferField(NullType, "2018/12/02") == DateType)
     // "MMM yyyy" format
-    options = new CSVOptions(Map("dateFormat" -> "MMM yyyy", "inferDate" -> "true"),
+    options = new CSVOptions(Map("dateFormat" -> "MMM yyyy", "prefersDate" -> "true"),
       false, "GMT")
     inferSchema = new CSVInferSchema(options)
     assert(inferSchema.inferField(NullType, "Dec 2018") == DateType)
     // Field should strictly match date format to infer as date
     options = new CSVOptions(
       Map("dateFormat" -> "yyyy-MM-dd", "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
-        "inferDate" -> "true"),
+        "prefersDate" -> "true"),
       columnPruning = false,
       defaultTimeZoneId = "GMT")
     inferSchema = new CSVInferSchema(options)
     assert(inferSchema.inferField(NullType, "2018-12-03T11:00:00") == TimestampType)
     assert(inferSchema.inferField(NullType, "2018-12-03") == DateType)
   }
 
-  test("SPARK-39469: inferring date and timestamp types in a mixed column with inferDate=true") {
+  test("SPARK-39469: inferring date and timestamp types in a mixed column with prefersDate=true") {
     var options = new CSVOptions(
       Map("dateFormat" -> "yyyy_MM_dd", "timestampFormat" -> "yyyy|MM|dd",
-        "timestampNTZFormat" -> "yyyy/MM/dd", "inferDate" -> "true"),
+        "timestampNTZFormat" -> "yyyy/MM/dd", "prefersDate" -> "true"),
       columnPruning = false,
       defaultTimeZoneId = "UTC")
     var inferSchema = new CSVInferSchema(options)
```

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala

Lines changed: 2 additions & 2 deletions

```diff
@@ -373,10 +373,10 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
     assert(err.getMessage.contains("Illegal pattern character: n"))
   }
 
-  test("SPARK-39469: dates should be parsed correctly in a timestamp column when inferDate=true") {
+  test("SPARK-39469: dates should be parsed correctly in timestamp column when prefersDate=true") {
     def checkDate(dataType: DataType): Unit = {
       val timestampsOptions =
-        new CSVOptions(Map("inferDate" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm",
+        new CSVOptions(Map("prefersDate" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm",
           "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
           false, DateTimeUtils.getZoneId("-08:00").toString)
       // Use CSVOption ZoneId="-08:00" (PST) to test that Dates in TimestampNTZ column are always
```

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Lines changed: 39 additions & 3 deletions

```diff
@@ -2797,13 +2797,13 @@ abstract class CSVSuite
       "inferSchema" -> "true",
       "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
       "dateFormat" -> "yyyy-MM-dd",
-      "inferDate" -> "true")
+      "prefersDate" -> "true")
     val options2 = Map(
       "header" -> "true",
       "inferSchema" -> "true",
-      "inferDate" -> "true")
+      "prefersDate" -> "true")
 
-    // Error should be thrown when attempting to inferDate with Legacy parser
+    // Error should be thrown when attempting to prefersDate with Legacy parser
     if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
       val msg = intercept[IllegalArgumentException] {
         spark.read
@@ -2840,6 +2840,42 @@ abstract class CSVSuite
     }
   }
 
+  test("SPARK-39904: Parse incorrect timestamp values with prefersDate=true") {
+    withTempPath { path =>
+      Seq(
+        "2020-02-01 12:34:56",
+        "2020-02-02",
+        "invalid"
+      ).toDF()
+        .repartition(1)
+        .write.text(path.getAbsolutePath)
+
+      val schema = new StructType()
+        .add("ts", TimestampType)
+
+      val output = spark.read
+        .schema(schema)
+        .option("prefersDate", "true")
+        .csv(path.getAbsolutePath)
+
+      if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
+        val msg = intercept[IllegalArgumentException] {
+          output.collect()
+        }.getMessage
+        assert(msg.contains("CANNOT_INFER_DATE"))
+      } else {
+        checkAnswer(
+          output,
+          Seq(
+            Row(Timestamp.valueOf("2020-02-01 12:34:56")),
+            Row(Timestamp.valueOf("2020-02-02 00:00:00")),
+            Row(null)
+          )
+        )
+      }
+    }
+  }
+
   test("SPARK-39731: Correctly parse dates and timestamps with yyyyMMdd pattern") {
     withTempPath { path =>
       Seq(
```
0 commit comments
