[SPARK-30429][SQL] Optimize catalogString and usage in ValidateExternalType.errMsg to avoid OOM #27117

Closed
wants to merge 5 commits

Conversation

viirya
Member

@viirya viirya commented Jan 7, 2020

What changes were proposed in this pull request?

This patch proposes:

  1. Fix the OOM in WideSchemaBenchmark: make ValidateExternalType.errMsg a lazy val, i.e. do not initialize it in the constructor
  2. Truncate errMsg: replace catalogString with simpleString, which is truncated
  3. Optimize override def catalogString in StructType: generate the string more efficiently by using StringConcat

Why are the changes needed?

As reported in the JIRA ticket, WideSchemaBenchmark fails with an OOM like:

[error] Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: validateexternaltype(getexternalrowfield(input[0, org.apac
he.spark.sql.Row, true], 0, a), StructField(b,StructType(StructField(c,StructType(StructField(value_1,LongType,true), StructField(value_10,LongType,true), StructField(value_
100,LongType,true), StructField(value_1000,LongType,true), StructField(value_1001,LongType,true), StructField(value_1002,LongType,true), StructField(value_1003,LongType,true
), StructField(value_1004,LongType,true), StructField(value_1005,LongType,true), StructField(value_1006,LongType,true), StructField(value_1007,LongType,true), StructField(va
lue_1008,LongType,true), StructField(value_1009,LongType,true), StructField(value_101,LongType,true), StructField(value_1010,LongType,true), StructField(value_1011,LongType,
...
ue), StructField(value_99,LongType,true), StructField(value_990,LongType,true), StructField(value_991,LongType,true), StructField(value_992,LongType,true), StructField(value
_993,LongType,true), StructField(value_994,LongType,true), StructField(value_995,LongType,true), StructField(value_996,LongType,true), StructField(value_997,LongType,true), 
StructField(value_998,LongType,true), StructField(value_999,LongType,true)),true)) 
[error]         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)                                                                                
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:435)                                                                                 
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:408)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)           
....
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:404)                                                                   
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)                                                                       
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:307)                                                                   
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)                                                                   
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)                                                                       
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)                                                                              
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)                                                                              
[error]         at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:198)                                                              
[error]         at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:71)                                                                             
[error]         at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)                                                                                                    
[error]         at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:554)                                                                         
[error]         at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:476)                                                                                      
[error]         at org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark$.$anonfun$wideShallowlyNestedStructFieldReadAndWrite$1(WideSchemaBenchmark.scala:126) 
...
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error]         at java.util.Arrays.copyOf(Arrays.java:3332)
[error]         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
[error]         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
[error]         at java.lang.StringBuilder.append(StringBuilder.java:136)
[error]         at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:213)
[error]         at scala.collection.TraversableOnce.$anonfun$addString$1(TraversableOnce.scala:368)
[error]         at scala.collection.TraversableOnce$$Lambda$67/667447085.apply(Unknown Source)
[error]         at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[error]         at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[error]         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.addString(TraversableOnce.scala:362)
[error]         at scala.collection.TraversableOnce.addString$(TraversableOnce.scala:358)
[error]         at scala.collection.mutable.ArrayOps$ofRef.addString(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:328)
[error]         at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:327)
[error]         at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:198)
[error]         at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:330)
[error]         at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:330)
[error]         at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:198)
[error]         at org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
[error]         at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1695)
[error]         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[error]         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[error]         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$934/387827651.apply(Unknown Source)
[error]         at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$929/449240381.apply(Unknown Source)
[error]         at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:435)

This started after commit cb5ea20, which refactored ExpressionEncoder.

The stacktrace shows that it fails during transformUp on objSerializer in ExpressionEncoder. In particular, it fails while initializing ValidateExternalType.errMsg, which interpolates the catalogString of the expected data type into a string. WideSchemaBenchmark uses a very deeply nested data type, so when we transform the serializer containing ValidateExternalType, we build a redundant, very large errMsg string. Because we are only transforming the expression and do not use the message yet, this work is wasted and consumes a lot of memory.

After making ValidateExternalType.errMsg a lazy val, WideSchemaBenchmark works.
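The difference can be sketched outside Spark. During a tree transform, nodes are copied, and a plain val in a case class body is re-evaluated on every copy, while a lazy val is only evaluated on first access. The classes and counters below are illustrative stand-ins, not Spark's actual code:

```scala
object LazyDemo {
  var eagerBuilds = 0
  var lazyBuilds = 0

  // Illustrative stand-ins for an expression node; not Spark's ValidateExternalType.
  case class EagerNode(tpe: String) {
    val errMsg: String = { eagerBuilds += 1; s"... is not a valid external type for schema of $tpe" }
  }
  case class LazyNode(tpe: String) {
    lazy val errMsg: String = { lazyBuilds += 1; s"... is not a valid external type for schema of $tpe" }
  }

  def main(args: Array[String]): Unit = {
    val e = EagerNode("struct<a:bigint>")
    e.copy()                  // tree transforms copy nodes; the eager val is rebuilt on each copy
    assert(eagerBuilds == 2)

    val l = LazyNode("struct<a:bigint>")
    val copied = l.copy()     // no message has been built yet
    assert(lazyBuilds == 0)
    copied.errMsg             // built only on first access
    assert(lazyBuilds == 1)
    println("ok")
  }
}
```

With deeply nested schemas, each eager rebuild materializes a huge string, which is what the benchmark ran out of memory on.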

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test with WideSchemaBenchmark.

@viirya
Member Author

viirya commented Jan 7, 2020

Member

@MaxGekk MaxGekk left a comment

If we ever actually need to output the error message for a large expected.catalogString, it will still fail with an OOM, right? That is not good. To avoid this kind of issue, it would be nice to use StringConcat:

/**
* Concatenation of sequence of strings to final string with cheap append method
* and one memory allocation for the final string. Can also bound the final size of
* the string.
*/
class StringConcat(val maxLength: Int = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {

or truncate expected.catalogString somehow else.

@viirya Or OOM happens because we create a lot of ValidateExternalType.errMsg?
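For illustration, here is a minimal sketch of the bounded-concatenation idea behind StringConcat (the class below is a simplified stand-in, not Spark's actual implementation): appends are cheap, the accumulated length is capped, and the final string is built with a single allocation.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatDemo {
  // Simplified illustration of bounded concatenation; not Spark's actual StringConcat.
  class BoundedConcat(val maxLength: Int) {
    private val strings = ArrayBuffer.empty[String]
    private var length = 0

    // Cheap append: pieces are buffered, and anything past maxLength is dropped.
    def append(s: String): Unit = {
      if (s != null && length < maxLength) {
        val piece = if (s.length <= maxLength - length) s else s.substring(0, maxLength - length)
        strings += piece
        length += piece.length
      }
    }

    // One allocation for the final string, capped at maxLength characters.
    override def toString: String = {
      val sb = new java.lang.StringBuilder(length)
      strings.foreach(s => sb.append(s))
      sb.toString
    }
  }

  def main(args: Array[String]): Unit = {
    val concat = new BoundedConcat(maxLength = 10)
    Seq("struct<", "a:bigint", ",b:bigint", ">").foreach(concat.append)
    assert(concat.toString == "struct<a:b") // capped at 10 characters
    println("ok")
  }
}
```

Because the cap is enforced at append time, even a pathologically wide schema can never force an unbounded allocation for the message.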

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116234 has finished for PR 27117 at commit 5fea579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

+1 to truncate expected.catalogString

@viirya
Member Author

viirya commented Jan 7, 2020

> If we really need to output the error message for large expected.catalogString, it will fail with OOM, right? which is not good. To avoid such kind of issues, it would be nice to use StringConcat:
>
> /**
> * Concatenation of sequence of strings to final string with cheap append method
> * and one memory allocation for the final string. Can also bound the final size of
> * the string.
> */
> class StringConcat(val maxLength: Int = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
>
> or truncate expected.catalogString somehow else.
>
> @viirya Or OOM happens because we create a lot of ValidateExternalType.errMsg?

When we transform the serializer, we copy the expressions in it, which creates redundant ValidateExternalType.errMsg strings. So we currently initialize more errMsg strings than needed (for the transformed serializer). The OOM in WideSchemaBenchmark is due to this.

With the lazy val, we only initialize errMsg when ValidateExternalType is actually used. The nesting level currently used in WideSchemaBenchmark is then fine, but an even more deeply nested case could still cause an OOM if it needs more memory.

I agree with you and @cloud-fan: we should truncate expected.catalogString in ValidateExternalType to prevent that.
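For context, the truncation that change (2) relies on can be sketched like this; the helper name and exact output format below are illustrative, not Spark's actual simpleString:

```scala
object TruncateDemo {
  // Hypothetical helper mirroring the truncation idea behind simpleString;
  // the name and output format are illustrative, not Spark's.
  def truncatedStruct(fields: Seq[String], maxFields: Int): String = {
    val shown =
      if (fields.length > maxFields)
        fields.take(maxFields) :+ s"... ${fields.length - maxFields} more fields"
      else fields
    shown.mkString("struct<", ",", ">")
  }

  def main(args: Array[String]): Unit = {
    // A wide schema like the one in WideSchemaBenchmark stays bounded in the message.
    val wide = (1 to 1000).map(i => s"value_$i:bigint")
    assert(truncatedStruct(wide, 3) ==
      "struct<value_1:bigint,value_2:bigint,value_3:bigint,... 997 more fields>")
    println("ok")
  }
}
```

The message length then depends on the truncation limit, not on the schema width, so the error path itself can no longer blow up memory.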

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116251 has finished for PR 27117 at commit 3a6d3c2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116253 has finished for PR 27117 at commit 25a1f63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jan 7, 2020

Calling .init on an empty array doesn't return an empty array but throws java.lang.UnsupportedOperationException...
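This pitfall can be reproduced directly. In the Scala 2.12 collections Spark used at the time, Array ops went through IndexedSeqOptimized, where `.init` on an empty collection throws; the List version below shows the same behavior:

```scala
import scala.util.Try

object InitDemo {
  def main(args: Array[String]): Unit = {
    // .init (all elements but the last) throws on an empty collection
    // instead of returning an empty one.
    val result = Try(List.empty[Int].init)
    assert(result.isFailure)
    assert(result.failed.get.isInstanceOf[UnsupportedOperationException])

    // dropRight(1) is a safe alternative: empty in, empty out.
    assert(List.empty[Int].dropRight(1).isEmpty)
    println("ok")
  }
}
```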

@dongjoon-hyun
Member

Thank you for reporting, @MaxGekk . And, thank you for quick fix, @viirya !

Member

@dongjoon-hyun dongjoon-hyun left a comment

In this PR, there are three different themes. Could you update the PR title and description accordingly?

  1. Using lazy: with lazy, the OOM at the benchmark is gone
  2. Replacing catalogString with simpleString: this looks like an independent improvement for errMsg.
  3. Optimizing override def catalogString: this is no longer used by errMsg due to (2), so it becomes independent in this PR.

If the PR title and description are correctly updated, we don't need to split this PR since the code is simple. Otherwise, we may want to split this into multiple PRs.

@viirya viirya changed the title [SPARK-30429][SQL] ValidateExternalType should not initiate errMsg in the constructor [SPARK-30429][SQL] Optimize catalogString and usage in ValidateExternalType.errMsg to avoid OOM Jan 7, 2020
@viirya
Member Author

viirya commented Jan 7, 2020

@dongjoon-hyun Thanks! I've updated the PR description and title.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. (Pending Jenkins).
Thank you for updating, @viirya .

@viirya
Member Author

viirya commented Jan 7, 2020

Thanks! @dongjoon-hyun @MaxGekk

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116256 has finished for PR 27117 at commit 87eb2b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116258 has finished for PR 27117 at commit 5f25d79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Merged to master. Thank you all!

@viirya viirya deleted the SPARK-30429 branch December 27, 2023 18:38
5 participants