[SPARK-26546][SQL] Caching of java.time.format.DateTimeFormatter #23462
Conversation
Test build #100782 has finished for PR 23462 at commit
@cloud-fan Please take a look at the PR.
object DateTimeFormatterHelper {
  private val cache = new ConcurrentHashMap[(String, Locale), DateTimeFormatter]()
Do we need to consider cleaning up old entries in this map?
Yeah, I was wondering about that too. The number of combinations looks huge.
In real life, the locale is constant (Locale.US) and the number of used date/timestamp patterns is small.
Also answering my own question: the formatter is thread-safe, so this is fine.
I agree that this cache won't grow large as I can't imagine an app using more than a handful of distinct patterns and locales.
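As a sketch of the approach under discussion (class and key names here are illustrative, not the actual Spark code), an unsynchronized get-then-put over a shared `ConcurrentHashMap` might look like this in Java:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FormatterCache {
    // (pattern, locale) -> formatter; DateTimeFormatter is immutable and
    // thread-safe, so sharing instances across threads is safe.
    private static final Map<Map.Entry<String, Locale>, DateTimeFormatter> CACHE =
        new ConcurrentHashMap<>();

    static DateTimeFormatter getFormatter(String pattern, Locale locale) {
        Map.Entry<String, Locale> key = new SimpleImmutableEntry<>(pattern, locale);
        DateTimeFormatter formatter = CACHE.get(key);
        if (formatter == null) {
            // Benign race: two threads may each build the formatter once;
            // whichever put() lands last wins, and both results are equivalent.
            formatter = DateTimeFormatter.ofPattern(pattern, locale);
            CACHE.put(key, formatter);
        }
        return formatter;
    }
}
```

Repeated lookups with the same pattern and locale then reuse the cached instance instead of re-parsing the pattern.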
Yeah, but this can be performed as a row-based operation in a corner case. For instance, an expression that allows a different format for each value could make this cache grow a lot. It's unlikely but quite possible. Also, most importantly, it currently doesn't look super useful, for the same reason as #23462 (comment).
True. This could wrap some weak-ref-based cache if we needed to.
Good point about whether this is really created a lot.
Shall we use com.google.common.cache.Cache? It's used in several places inside Spark.
If we have a significant concern about the cache growing infinitely, I will switch to a fixed-size LRU cache here.
I think it makes sense to restrict the growth of the cache. I replaced it with Guava's cache.
How helpful is it? You convinced me that the formatter is created per RDD partition, so the creation time doesn't matter too much.
This also looks able to execute per record, for instance, Lines 564 to 577 in e0054b8.
BTW, we never cached it in that code path, even before switching the library.
Look at Spark 2.4: it uses FastDateFormat in the JSON and CSV datasources (https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L83-L88), which take formatters from a cache internally implemented in the same way as in the PR.
@cloud-fan This is true for the current usage of the formatter and for PR #23391. This cache is for future usage, to create a formatter faster when it is hard to create it in advance.
@MaxGekk, I mean here: Lines 564 to 577 in e0054b8.
def getFormatter(pattern: String, locale: Locale): DateTimeFormatter = {
  val key = (pattern, locale)
  var formatter = cache.get(key)
computeIfAbsent ought to be simpler and more efficient here, computing the value only if it isn't already present.
But it can block other threads during the lambda computation: "...Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple..." This implementation does not block.
Sure, but that's necessary to avoid computing it more than once, right? And it's only an issue if multiple threads need the value at once, the first time. If it blocks for milliseconds, that seems OK. It would be an issue if every subsequent access slowed down or was unnecessarily contended.
I just followed the implementation of FastDateFormat from Apache Commons Lang3: https://github.com/apache/commons-lang/blob/8e8b8e05e4eb9aa009444c2fea3552d28b57aa98/src/main/java/org/apache/commons/lang3/time/FormatCache.java#L71-L91
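Paraphrasing the FormatCache pattern linked above (class name and single-string key encoding are illustrative): look up without locking, build on a miss, then publish with `putIfAbsent` so that racing threads all converge on the instance that won the race:

```java
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PutIfAbsentCache {
    private static final ConcurrentMap<String, DateTimeFormatter> CACHE =
        new ConcurrentHashMap<>();

    static DateTimeFormatter getFormatter(String pattern, Locale locale) {
        // A single string key stands in for the (pattern, locale) pair.
        String key = pattern + '\0' + locale.toLanguageTag();
        DateTimeFormatter formatter = CACHE.get(key);
        if (formatter == null) {
            formatter = DateTimeFormatter.ofPattern(pattern, locale);
            DateTimeFormatter previous = CACHE.putIfAbsent(key, formatter);
            if (previous != null) {
                formatter = previous; // another thread won the race; share its instance
            }
        }
        return formatter;
    }
}
```

Unlike computeIfAbsent, this never blocks other writers, at the cost of occasionally building a formatter that is immediately discarded.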
Commons Lang3 supports Java 7, so it couldn't use computeIfAbsent; I presume it would if it could. I don't feel super strongly about it, but I think we can take advantage of Java 8 here. It saves a second lookup and, in doing so, avoids the (fairly harmless) race condition here: multiple threads can find the instance isn't cached, then each compute and try to put the result. That is still correct, but not optimal.
I agree with @srowen. If the blocking occurs frequently (i.e. the key does not exist yet), this cache does not work effectively.
If a key usually exists, the blocking will not occur frequently.
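A sketch of the computeIfAbsent alternative discussed above (illustrative names; the counter exists only to demonstrate the at-most-once computation):

```java
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class ComputeIfAbsentCache {
    // Counts formatter constructions, to show the lambda runs at most once per key.
    static final AtomicInteger creations = new AtomicInteger();
    private static final ConcurrentMap<String, DateTimeFormatter> CACHE =
        new ConcurrentHashMap<>();

    static DateTimeFormatter getFormatter(String pattern, Locale locale) {
        String key = pattern + '\0' + locale.toLanguageTag();
        // The mapping function runs under the bin lock: the value is computed
        // exactly once per key, but other updates to the same bin may block briefly.
        return CACHE.computeIfAbsent(key, k -> {
            creations.incrementAndGet();
            return DateTimeFormatter.ofPattern(pattern, locale);
        });
    }
}
```

This trades a short critical section on the first miss for a single lookup and no duplicated construction afterwards.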
Test build #100838 has finished for PR 23462 at commit
@hvanhovell Could you look at this, please?
Test build #100904 has finished for PR 23462 at commit
Test build #100908 has finished for PR 23462 at commit
jenkins, retest this, please
Test build #100919 has finished for PR 23462 at commit
jenkins, retest this, please
Test build #100920 has finished for PR 23462 at commit
private val cache = CacheBuilder.newBuilder()
  .initialCapacity(8)
  .maximumSize(128)
  .expireAfterAccess(1, TimeUnit.HOURS)
do we really need the expire policy?
Agreed. I think it has to spawn a thread to deal with expiration, and it's not worth it. The minimum size doesn't really matter either.
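The size bound is the part that matters. For illustration only (this is not the Guava code the PR uses), the effect of maximumSize(128) can be reproduced with an access-ordered LinkedHashMap that evicts its eldest entry; all names here are hypothetical:

```java
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class LruFormatterCache {
    private static final int MAX_SIZE = 128;

    // Access-ordered LinkedHashMap evicting the least recently used entry once
    // MAX_SIZE is exceeded. Unlike ConcurrentHashMap, it needs external
    // synchronization, so the accessor below is synchronized.
    private static final Map<String, DateTimeFormatter> CACHE =
        new LinkedHashMap<String, DateTimeFormatter>(8, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, DateTimeFormatter> eldest) {
                return size() > MAX_SIZE;
            }
        };

    static synchronized DateTimeFormatter getFormatter(String pattern, Locale locale) {
        String key = pattern + '\0' + locale.toLanguageTag();
        DateTimeFormatter formatter = CACHE.get(key);
        if (formatter == null) {
            formatter = DateTimeFormatter.ofPattern(pattern, locale);
            CACHE.put(key, formatter); // may evict the least recently used entry
        }
        return formatter;
    }
}
```

No background thread is involved: eviction happens inline on insertion, which is the same reason a pure size bound is cheaper than a time-based expiry policy.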
LGTM
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala (outdated review threads, resolved)
Test build #100946 has finished for PR 23462 at commit
Test build #100943 has finished for PR 23462 at commit
Test build #100944 has finished for PR 23462 at commit
def getOrCreateFormatter(pattern: String, locale: Locale): DateTimeFormatter = {
  val key = (pattern, locale)
  var formatter = cache.getIfPresent(key)
  if (formatter == null) {
Let's add a comment to say that we intentionally drop the synchronized here, as the worst case is that we create the same formatter more than once, which doesn't matter. Without the comment, I'm afraid people may open PRs to add the synchronized later, as they don't know the context.
  Instant.from(zonedDateTime)
}

def getOrCreateFormatter(pattern: String, locale: Locale): DateTimeFormatter = {
protected
Test build #100967 has finished for PR 23462 at commit
Merged to master
## What changes were proposed in this pull request?

Added a cache for java.time.format.DateTimeFormatter instances, with keys consisting of pattern and locale. This should avoid re-parsing timestamp/date patterns each time a new instance of `TimestampFormatter`/`DateFormatter` is created.

## How was this patch tested?

By the existing test suites `TimestampFormatterSuite`/`DateFormatterSuite` and `JsonFunctionsSuite`/`JsonSuite`.

Closes apache#23462 from MaxGekk/time-formatter-caching.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>